GIE-460: Add mcpchecker evals for obs-mcp tools#34
openshift-merge-bot[bot] merged 21 commits into rhobs:main from
|
Skipping CI for Draft Pull Request. |
| taskSets: | ||
| # Metric discovery | ||
| - path: tasks/metrics/list-metrics.yaml | ||
| assertions: |
I know it's not the fault of this PR, but the split between the task definition and the assertions is pretty terrible. Luckily, the mcpchecker folks seem to be aware of it: mcpchecker/mcpchecker#168
|
@slashpai: This pull request references GIE-460 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
I updated the branch with the latest changes in mcpchecker, https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.14 (previously it was using the v0.0.11 nightly), and tested one of the evals against an OpenShift cluster after port-forwarding. Result:
[
{
"taskName": "cpu-usage",
"taskPath": "/Users/pai/github.com/pai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml",
"taskPassed": true,
"taskOutput": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters.",
"taskJudgeReason": "The reference is the metric name 'container_cpu_usage_seconds_total'. The model output explicitly references this metric ('sum rate of container_cpu_usage_seconds_total') and describes CPU usage for pods, thereby semantically including the core information from the reference.",
"difficulty": "medium",
"parallel": true,
"totalRuns": 1,
"assertionResults": {
"toolsUsed": {
"passed": true
},
"minToolCalls": {
"passed": true
},
"maxToolCalls": {
"passed": true
},
"callOrder": {
"passed": true
}
},
"allAssertionsPassed": true,
"callHistory": {
"ToolCalls": [
{
"serverName": "obs",
"timestamp": "2026-04-02T11:49:30.662479+05:30",
"success": true,
"name": "list_metrics",
"result": {
"content": [
{
"type": "text",
"text": "{\"metrics\":[\"container_cpu_usage_seconds_total\"]}"
}
],
"structuredContent": {
"metrics": [
"container_cpu_usage_seconds_total"
]
}
},
"tokens": {
"inputTokens": 10,
"outputTokens": 22,
"totalTokens": 32
},
"request": {
"Session": {},
"Params": {
"name": "list_metrics",
"arguments": {
"name_regex": "container_cpu_usage_seconds_total"
}
},
"Extra": {
"TokenInfo": null,
"Header": {
"Accept": [
"application/json, text/event-stream"
],
"Accept-Encoding": [
"gzip"
],
"Content-Length": [
"142"
],
"Content-Type": [
"application/json"
],
"Mcp-Protocol-Version": [
"2025-06-18"
],
"Mcp-Session-Id": [
"6DDECKOCAFQDCLUPEIFPJNXLJP"
],
"User-Agent": [
"Go-http-client/1.1"
]
}
}
}
},
{
"serverName": "obs",
"timestamp": "2026-04-02T11:49:43.923378+05:30",
"success": true,
"name": "execute_instant_query",
"result": {
"content": [
{
"type": "text",
"text": "{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}"
}
],
"structuredContent": {
"result": [
{
"metric": {},
"value": [
1775110783.927,
"1.1882256674681153"
]
},
{
"metric": {
"namespace": "openshift-monitoring",
"pod": "prometheus-k8s-0"
},
"value": [
1775110783.927,
"0.07787675311958422"
]
},
{
"metric": {
"namespace": "openshift-monitoring",
"pod": "prometheus-k8s-1"
},
"value": [
1775110783.927,
"0.07034366453073795"
]
},
{
"metric": {
"namespace": "openshift-ovn-kubernetes",
"pod": "ovnkube-node-bm7cm"
},
"value": [
1775110783.927,
"0.028958297344501942"
]
},
{
"metric": {
"namespace": "openshift-ovn-kubernetes",
"pod": "ovnkube-node-qccp8"
},
"value": [
1775110783.927,
"0.025098961382484318"
]
}
],
"resultType": "vector"
}
},
"tokens": {
"inputTokens": 38,
"outputTokens": 222,
"totalTokens": 260
},
"request": {
"Session": {},
"Params": {
"name": "execute_instant_query",
"arguments": {
"query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
"time": "NOW"
}
},
"Extra": {
"TokenInfo": null,
"Header": {
"Accept": [
"application/json, text/event-stream"
],
"Accept-Encoding": [
"gzip"
],
"Content-Length": [
"224"
],
"Content-Type": [
"application/json"
],
"Mcp-Protocol-Version": [
"2025-06-18"
],
"Mcp-Session-Id": [
"6DDECKOCAFQDCLUPEIFPJNXLJP"
],
"User-Agent": [
"Go-http-client/1.1"
]
}
}
}
}
],
"ResourceReads": null,
"PromptGets": null
},
"tokenEstimate": {
"inputTokens": 4394,
"outputTokens": 3475,
"totalTokens": 9533,
"promptTokens": 8,
"messageTokens": 180,
"thinkingTokens": 0,
"toolInputTokens": 48,
"toolOutputTokens": 574,
"mcpSchemaTokens": 2277,
"promptGetInputTokens": 0,
"promptGetOutputTokens": 0,
"resourceInputTokens": 0,
"resourceOutputTokens": 0,
"source": "actual",
"actual": {
"inputTokens": 4394,
"outputTokens": 3475,
"totalTokens": 9533
}
},
"judgeTokenUsage": {
"inputTokens": 22553,
"outputTokens": 15927,
"totalTokens": 63184
},
"setupOutput": {
"Steps": [],
"Success": true,
"Error": ""
},
"agentOutput": {
"Steps": [
{
"type": "tool_call",
"success": true,
"message": "list_metrics"
},
{
"type": "tool_call",
"success": true,
"message": "execute_instant_query"
},
{
"type": "message",
"success": true,
"message": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters."
}
],
"Success": true,
"Error": "",
"agentDetails": {
"tokenEstimate": {
"inputTokens": 4394,
"outputTokens": 3475,
"totalTokens": 9533,
"promptTokens": 8,
"messageTokens": 180,
"thinkingTokens": 0,
"toolInputTokens": 48,
"toolOutputTokens": 574,
"mcpSchemaTokens": 2277,
"promptGetInputTokens": 0,
"promptGetOutputTokens": 0,
"resourceInputTokens": 0,
"resourceOutputTokens": 0,
"source": "actual",
"actual": {
"inputTokens": 4394,
"outputTokens": 3475,
"totalTokens": 9533
}
},
"toolCalls": [
{
"title": "list_metrics",
"status": "completed",
"rawInput": {
"name_regex": "container_cpu_usage_seconds_total"
},
"rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"metrics\\\":[\\\"container_cpu_usage_seconds_total\\\"]}\"}],\"structuredContent\":{\"metrics\":[\"container_cpu_usage_seconds_total\"]}}"
},
{
"title": "execute_instant_query",
"status": "completed",
"rawInput": {
"query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
"time": "NOW"
},
"rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"result\\\":[{\\\"metric\\\":{},\\\"value\\\":[1775110783.927,\\\"1.1882256674681153\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-0\\\"},\\\"value\\\":[1775110783.927,\\\"0.07787675311958422\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-1\\\"},\\\"value\\\":[1775110783.927,\\\"0.07034366453073795\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-bm7cm\\\"},\\\"value\\\":[1775110783.927,\\\"0.028958297344501942\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-qccp8\\\"},\\\"value\\\":[1775110783.927,\\\"0.025098961382484318\\\"]}],\\\"resultType\\\":\\\"vector\\\"}\"}],\"structuredContent\":{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}}"
}
],
"outputSteps": [
{
"type": "tool_call",
"toolCall": {
"title": "list_metrics",
"status": "completed",
"rawInput": {
"name_regex": "container_cpu_usage_seconds_total"
},
"rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"metrics\\\":[\\\"container_cpu_usage_seconds_total\\\"]}\"}],\"structuredContent\":{\"metrics\":[\"container_cpu_usage_seconds_total\"]}}"
}
},
{
"type": "tool_call",
"toolCall": {
"title": "execute_instant_query",
"status": "completed",
"rawInput": {
"query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
"time": "NOW"
},
"rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"result\\\":[{\\\"metric\\\":{},\\\"value\\\":[1775110783.927,\\\"1.1882256674681153\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-0\\\"},\\\"value\\\":[1775110783.927,\\\"0.07787675311958422\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-1\\\"},\\\"value\\\":[1775110783.927,\\\"0.07034366453073795\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-bm7cm\\\"},\\\"value\\\":[1775110783.927,\\\"0.028958297344501942\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-qccp8\\\"},\\\"value\\\":[1775110783.927,\\\"0.025098961382484318\\\"]}],\\\"resultType\\\":\\\"vector\\\"}\"}],\"structuredContent\":{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}}"
}
},
{
"type": "message",
"content": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters."
}
]
}
},
"verifyOutput": {
"Steps": [
{
"type": "llmJudge",
"success": true,
"message": "The reference is the metric name 'container_cpu_usage_seconds_total'. The model output explicitly references this metric ('sum rate of container_cpu_usage_seconds_total') and describes CPU usage for pods, thereby semantically including the core information from the reference.",
"usage": {
"inputTokens": 6393,
"outputTokens": 3862,
"totalTokens": 10255
}
},
{
"type": "llmJudge",
"success": true,
"message": "The reference answer is 'pod', indicating the core concept is pods. The model output lists specific pods using CPU and includes the pod names, thereby semantically including the same meaning (pods). No contradiction; the extra details are acceptable per the CONTAINS criterion.",
"usage": {
"inputTokens": 16160,
"outputTokens": 12065,
"totalTokens": 52929
}
}
],
"Success": true,
"Error": ""
},
"cleanupOutput": {
"Steps": [],
"Success": true,
"Error": ""
}
}
]
cc: @iNecas @saswatamcode #46 was based on |
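The full result is long, so a short script can help when comparing runs. The following is a minimal sketch, assuming the results file is a JSON array shaped like the one above; it only reads fields that appear in that output (`taskName`, `taskPassed`, `assertionResults`, `callHistory.ToolCalls`). The function name `summarize_results` and the trimmed `SAMPLE` are illustrative and not part of mcpchecker.

```python
import json
from typing import Any


def summarize_results(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reduce each task result to its name, verdict, failed assertions and tool-call order."""
    summary = []
    for task in results:
        assertions = task.get("assertionResults") or {}
        # Collect the names of assertions whose "passed" flag is false or missing.
        failed = sorted(
            name for name, outcome in assertions.items()
            if not outcome.get("passed", False)
        )
        # "ToolCalls" may be null when the agent called no tools.
        calls = (task.get("callHistory") or {}).get("ToolCalls") or []
        summary.append({
            "task": task.get("taskName"),
            "passed": bool(task.get("taskPassed")),
            "failed_assertions": failed,
            "tools": [call.get("name") for call in calls],
        })
    return summary


# Trimmed-down sample with the same shape as the result above (hypothetical excerpt).
SAMPLE = json.loads("""
[
  {
    "taskName": "cpu-usage",
    "taskPassed": true,
    "assertionResults": {
      "toolsUsed": {"passed": true},
      "callOrder": {"passed": true}
    },
    "callHistory": {
      "ToolCalls": [
        {"name": "list_metrics"},
        {"name": "execute_instant_query"}
      ]
    }
  }
]
""")

if __name__ == "__main__":
    for row in summarize_results(SAMPLE):
        print(row)
```

To use it on a real run, load the saved results file with `json.load` and pass the resulting list to `summarize_results`.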
|
| @@ -0,0 +1,46 @@ | |||
| # Metrics Reference | |||
This doc is for development reference
| @@ -0,0 +1,51 @@ | |||
| # Prompts You Can Try | |||
This doc is for development reference
|
This is now ready for an initial review of the cases covered in the evals. cc: @iNecas |
|
|
||
| ## Coverage | ||
|
|
||
| 17 eval tasks across 4 categories and 3 difficulty levels: |
I found the coverage part useless unless generated automatically.
f5af01c to 8448081
|
I've run the evals locally and got these results: Some of the failures were caused by a flaky network, and the run used gpt-5-nano, so the results are not bad. I'm not sure what the expected evals-per-tool-call ratio is, but the coverage looks good to me. The only tool I've not seen covered was the |
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
…pe targets Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- Add callOrder assertions to enforce list_metrics-first workflow
- Add runs metadata (2 for easy, 3 for medium/hard) and category labels
- Tighten maxToolCalls bounds and strengthen LLM judge criteria
- Replace duplicate get-active-alerts with multi-step alert-investigation task
- Add namespace-resource-usage and diagnose-cluster-health hard tasks
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Eval run showed agents legitimately use execute_range_query for prompts that were assumed to need execute_instant_query. Uses toolPattern to accept either, simplifies callOrder to list_metrics-first only, and increases maxToolCalls from 7 to 15. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
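The relaxed assertion described in this commit can be illustrated outside mcpchecker. The following is a minimal sketch, assuming that a call-order check means "the expected patterns match the recorded tool names as an ordered subsequence" and that each pattern is a regular expression matched against the whole tool name. Both semantics, and the helper names `satisfies_call_order` and `within_call_budget`, are assumptions made for illustration; this is not mcpchecker's implementation.

```python
import re


def satisfies_call_order(tool_calls: list[str], expected: list[str]) -> bool:
    """Return True if `expected` patterns match `tool_calls` as an ordered subsequence.

    Each entry in `expected` is a regular expression that must fully match one
    tool name; unrelated calls in between are allowed (assumed semantics).
    """
    position = 0
    for pattern in expected:
        compiled = re.compile(pattern)
        # Skip calls that do not match the pattern we are currently looking for.
        while position < len(tool_calls) and not compiled.fullmatch(tool_calls[position]):
            position += 1
        if position == len(tool_calls):
            return False  # Ran out of calls before matching this pattern.
        position += 1  # Consume the matching call and move to the next pattern.
    return True


def within_call_budget(tool_calls: list[str], min_calls: int, max_calls: int) -> bool:
    """Return True if the number of tool calls lies in [min_calls, max_calls]."""
    return min_calls <= len(tool_calls) <= max_calls


if __name__ == "__main__":
    # Accept either query tool after list_metrics, as the commit describes.
    expected = ["list_metrics", r"execute_(instant|range)_query"]
    print(satisfies_call_order(["list_metrics", "execute_range_query"], expected))
    print(within_call_budget(["list_metrics"] * 23, min_calls=1, max_calls=15))
```

Under these assumptions, an agent that calls `list_metrics` and then either query tool passes the order check, while a run with 23 calls against a limit of 15 fails the budget check.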
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Developer-facing references for eval authoring and debugging. Also adds query efficiency tips to METRICS_REFERENCE.md and a single-task run example to TESTING.md. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Migrate LLM judge from deprecated env-based config to ref-based agent format, simplify env var setup, and document new CLI flags and providers. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- nonexistent-metric: agent handles missing metric gracefully
- nonexistent-namespace: agent handles empty results for fake namespace
- high-cardinality-rejection: agent recovers from guardrail rejection
- namespace-pod-count: multi-step label exploration to query chaining
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Add prometheus-reachability smoke test to verify connectivity before running the full suite, avoiding wasted tokens on unreachable backends. Add time-range-query task to test range query with correct time window. Fix nonexistent-namespace contains check from 'no' to 'no data' to reduce false positives. Document smoke test as step 3 in Quick Start. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
…e task Set all task runs to 1 to reduce token cost while iterating on evals. Add visualize-cpu-usage task to cover the show_timeseries tool. Fix eval assertions: require query tool for high-cardinality-rejection, add list_metrics to diagnose-cluster-health, lower minToolCalls for namespace-resource-usage. Signed-off-by: Jayapriya Pai <janantha@redhat.com>
|
@slashpai: This pull request references GIE-460 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/hold for more reviews |
|
@slashpai I used the in-cluster OpenShift Thanos Querier with guardrails off. |
|
@iNecas I addressed the comment. Can you review again? I think we can iterate and add more cases as needed after this one is completed. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: iNecas, slashpai
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold cancel |
|
Results against the Prometheus metrics backend (OpenShift 4.21 ROSA) with mcpchecker v0.0.15:
=== Results Summary ===
Task: label-values
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-values.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: list-node-metrics
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (3/3)
Task: pending-pods
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pending-pods.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: get-silences
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-silences.yaml
Difficulty: easy
Task Status: FAILED (Verification failed, but assertions passed)
Error: one or more verification steps failed
Assertions: PASSED (3/3)
Task: prometheus-head-series
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-head-series.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (4/4)
Task: get-series-cardinality
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/get-series.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: prometheus-requests
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-requests.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: prometheus-wal-size
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-wal-size.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (4/4)
Task: get-alerts
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-alerts.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (3/3)
Task: high-cardinality-rejection
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/high-cardinality-rejection.yaml
Difficulty: medium
Task Status: PASSED
Assertions: FAILED (2/3)
- MaxToolCalls: Too many tool calls: expected <= 15, got 23
Task: label-names
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-names.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (4/4)
Task: alert-investigation
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/alert-investigation.yaml
Difficulty: medium
Task Status: FAILED (Agent execution error)
Error details saved to: github.com/slashpai/obs-mcp/evals/mcpchecker/alert-investigation-error.txt
Assertions: FAILED (2/3)
- MaxToolCalls: Too many tool calls: expected <= 15, got 19
Task: time-range-query
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/time-range-query.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: crashlooping-pods
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/crashlooping-pods.yaml
Difficulty: medium
Task Status: PASSED
Assertions: FAILED (3/4)
- MaxToolCalls: Too many tool calls: expected <= 7, got 14
Task: prometheus-reachability
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-reachability.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (3/3)
Task: pods-created
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pods-created.yaml
Difficulty: medium
Task Status: FAILED
Error: one or more verification steps failed
Assertions: FAILED (1/4)
- ToolsUsed: Required tool not called: server=obs, tool=execute_range_query, pattern=
- MaxToolCalls: Too many tool calls: expected <= 7, got 9
- CallOrder: Expected call order not satisfied. Got to 1/2
Task: nonexistent-namespace
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-namespace.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (4/4)
Task: list-kube-metrics
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-metrics.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (3/3)
Task: diagnose-cluster-health
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/diagnose-cluster-health.yaml
Difficulty: hard
Task Status: PASSED
Assertions: PASSED (3/3)
Task: namespace-resource-usage
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-resource-usage.yaml
Difficulty: hard
Task Status: PASSED
Assertions: PASSED (4/4)
Task: nonexistent-metric
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-metric.yaml
Difficulty: easy
Task Status: PASSED
Assertions: PASSED (3/3)
Task: visualize-cpu-usage
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: namespace-pod-count
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
Task: network-traffic
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/network-traffic.yaml
Difficulty: medium
Task Status: FAILED (Verification failed, but assertions passed)
Error: one or more verification steps failed
Assertions: PASSED (4/4)
Task: cpu-usage
Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
Difficulty: medium
Task Status: PASSED
Assertions: PASSED (4/4)
=== Overall Statistics ===
Total Tasks: 25
Tasks Passed: 21/25
Assertions Passed: 85/91
Tasks where verification failed but assertions passed: 2
Assertions in these tasks: 8/11
Tokens: ~1423666 (estimate - excludes system prompt & cache)
MCP schemas: ~74300 (included in token total)
=== Statistics by Difficulty ===
easy:
Tasks: 9/10
Assertions: 34/34
medium:
Tasks: 10/13
Assertions: 44/50
hard:
Tasks: 2/2
Assertions: 7/7
⏱️ Completed in 18m46s |
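The per-difficulty totals above can be recomputed from the task list itself, which is handy for cross-checking a pasted summary. The following is a minimal sketch, assuming the text layout shown in this comment (`Task:`, `Difficulty:`, `Task Status:` and `Assertions: ... (n/m)` lines, with `===` banners between sections). The helper names `parse_summary` and `stats_by_difficulty` are hypothetical, and the sample contains only three of the tasks listed above.

```python
import re
from collections import defaultdict

# Matches the "(passed/total)" counter on an "Assertions:" line.
ASSERTION_RE = re.compile(r"\((\d+)/(\d+)\)")


def parse_summary(text: str) -> list[dict]:
    """Parse the per-task section of a results summary into a list of dicts."""
    tasks: list[dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("===") and tasks:
            break  # A banner after the task list starts the statistics section.
        if line.startswith("Task:"):
            tasks.append({"name": line[len("Task:"):].strip(),
                          "difficulty": None, "passed": None, "assertions": (0, 0)})
        elif not tasks:
            continue  # Ignore anything before the first task.
        elif line.startswith("Difficulty:"):
            tasks[-1]["difficulty"] = line.split(":", 1)[1].strip()
        elif line.startswith("Task Status:"):
            tasks[-1]["passed"] = line.split(":", 1)[1].strip().startswith("PASSED")
        elif line.startswith("Assertions:"):
            match = ASSERTION_RE.search(line)
            if match:
                tasks[-1]["assertions"] = (int(match.group(1)), int(match.group(2)))
    return tasks


def stats_by_difficulty(tasks: list[dict]) -> dict[str, dict[str, int]]:
    """Aggregate passed/total counts for tasks and assertions per difficulty."""
    stats = defaultdict(lambda: {"tasks_passed": 0, "tasks": 0,
                                 "assertions_passed": 0, "assertions": 0})
    for task in tasks:
        bucket = stats[task["difficulty"]]
        bucket["tasks"] += 1
        bucket["tasks_passed"] += 1 if task["passed"] else 0
        bucket["assertions_passed"] += task["assertions"][0]
        bucket["assertions"] += task["assertions"][1]
    return dict(stats)


# Three tasks copied from the summary above, for illustration only.
SAMPLE = """=== Results Summary ===
Task: get-silences
Difficulty: easy
Task Status: FAILED (Verification failed, but assertions passed)
Assertions: PASSED (3/3)
Task: crashlooping-pods
Difficulty: medium
Task Status: PASSED
Assertions: FAILED (3/4)
Task: diagnose-cluster-health
Difficulty: hard
Task Status: PASSED
Assertions: PASSED (3/3)
=== Overall Statistics ===
Total Tasks: 3
"""

if __name__ == "__main__":
    for difficulty, counts in stats_by_difficulty(parse_summary(SAMPLE)).items():
        print(difficulty, counts)
```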
Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.
mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.15