
GIE-460: Add mcpchecker evals for obs-mcp tools#34

Merged
openshift-merge-bot[bot] merged 21 commits into rhobs:main from slashpai:mcp-evals
Apr 15, 2026

Conversation

@slashpai
Member

@slashpai slashpai commented Feb 25, 2026

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.15

@openshift-ci

openshift-ci Bot commented Feb 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

taskSets:
  # Metric discovery
  - path: tasks/metrics/list-metrics.yaml
    assertions:
Contributor


I know it's not the fault of this PR, but the split between the task definition and assertions is pretty terrible. Luckily the mcpchecker folks seem to be aware: mcpchecker/mcpchecker#168
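For readers following along, the split being criticized looks roughly like this. Field names come from the v1alpha2 fragment quoted in this thread; everything else is an assumed sketch, not the exact schema:

```yaml
# eval.yaml -- assertions are declared here, in the top-level suite file ...
taskSets:
  - path: tasks/metrics/list-metrics.yaml  # ... while the prompt lives in this separate task file
    assertions:
      toolsUsed:
        - list_metrics   # agent must select the right tool
      maxToolCalls: 3    # call count stays within bounds
```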

@slashpai slashpai changed the title WIP feat: Add mcpchecker evals for obs-mcp tools WIP GIE-460: Add mcpchecker evals for obs-mcp tools Mar 11, 2026
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

16 eval tasks across 4 categories:

| Category | Tasks | Tools tested |
| --- | --- | --- |
| Metrics discovery | list kube metrics, list node metrics | list_metrics |
| Label exploration | label names, label values, series cardinality | get_label_names, get_label_values, get_series |
| PromQL queries | CPU usage, pending pods, crashlooping pods, network traffic, Prometheus internals (head series, requests, WAL size) | execute_instant_query, execute_range_query |
| Alertmanager | firing alerts, active alerts, silences | get_alerts, get_silences |

Each task verifies:

  • The agent selects the correct tool(s)
  • Tool call count stays within bounds
  • Response contains expected content (via LLM judge)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

16 eval tasks across 4 categories:

| Category | Tasks | Tools tested |
| --- | --- | --- |
| Metrics discovery | list kube metrics, list node metrics | list_metrics |
| Label exploration | label names, label values, series cardinality | get_label_names, get_label_values, get_series |
| PromQL queries | CPU usage, pending pods, crashlooping pods, network traffic, Prometheus internals (head series, requests, WAL size) | execute_instant_query, execute_range_query |
| Alertmanager | firing alerts, active alerts, silences | get_alerts, get_silences |

Each task verifies:

  • The agent selects the correct tool(s)
  • Tool call count stays within bounds
  • Response contains expected content (via LLM judge)

Note: This is a smoke-test level evaluation covering basic tool discovery and usage. We need to add:

  • Multi-step reasoning — tasks requiring 3+ chained tools (e.g., discover metric → query → analyze trend)
  • Error handling — agent recovery from invalid queries or missing metrics
  • Guardrail behavior — agent response when dangerous queries are blocked
  • Parameter coverage — testing less-used params like silenced, inhibited, receiver, filter, time ranges
  • Ambiguous prompts — vague diagnostic questions (e.g., "Why is my app slow?") requiring the agent to choose the right tools
  • Hard difficulty tasks — complex multi-tool diagnostic scenarios
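As one illustration of the error-handling direction, a task along these lines could be added. This is a hedged sketch: the nonexistent-metric task name appears in this PR's later commits, but the fields shown here are assumed, not confirmed schema:

```yaml
- path: tasks/errors/nonexistent-metric.yaml
  assertions:
    toolsUsed:
      - list_metrics
    maxToolCalls: 4
    llmJudge:        # field name assumed
      criteria: >
        The agent reports that the metric does not exist rather than
        fabricating values.
```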

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/nightly

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slashpai
Member Author

slashpai commented Apr 2, 2026

I updated the branch with the latest changes in mcpchecker https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.14 (previously I was using the v0.0.11 nightly).

I tested one of the evals, cpu-usage, with gpt-5-nano, which gave better results than gpt-4o-mini, so I have switched to it as the default model in the config.

This was tested against an OpenShift cluster after port-forwarding:

PROMETHEUS_URL=http://localhost:9090 ALERTMANAGER_URL=http://localhost:9093 make run
mcpchecker check eval.yaml --run "cpu-usage" --runs 1 --verbose


=== Starting Evaluation ===

[cpu-usage] Starting (parallel, medium)
[cpu-usage] → Setting up task environment...
[cpu-usage] → Running agent...
  → Agent 'llm-agent-openai-gpt-5-nano' is working…
2026/04/02 11:50:00 INFO peer connection closed
2026/04/02 11:50:00 INFO peer connection closed
[cpu-usage] → Verifying results...
  → LLM judge 'llm-agent-openai-gpt-5-nano' is evaluating…
2026/04/02 11:50:40 ERROR failed to handle notification method=session/update err="{\"code\":-32603,\"message\":\"Internal error\",\"data\":{\"error\":\"no matching session on client\"}}"
2026/04/02 11:50:40 INFO peer connection closed
2026/04/02 11:50:40 INFO peer connection closed
  → LLM judge 'llm-agent-openai-gpt-5-nano' is evaluating…
2026/04/02 11:52:54 ERROR failed to handle notification method=session/update err="{\"code\":-32603,\"message\":\"Internal error\",\"data\":{\"error\":\"no matching session on client\"}}"
2026/04/02 11:52:54 INFO peer connection closed
2026/04/02 11:52:54 INFO peer connection closed
[cpu-usage] → Evaluating assertions...
[cpu-usage] ✓ Task passed

=== Evaluation Complete ===

📄 Results saved to: mcpchecker-obs-mcp-tools-out.json

=== Results Summary ===

Task: cpu-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

=== Overall Statistics ===
Total Tasks: 1
Tasks Passed: 1/1
Assertions Passed: 4/4
Tokens:     ~9533 (estimate - excludes system prompt & cache)
MCP schemas: ~2277 (included in token total)

=== Statistics by Difficulty ===

medium:
  Tasks: 1/1
  Assertions: 4/4
⏱️  Completed in 3m33s

Result

[
  {
    "taskName": "cpu-usage",
    "taskPath": "/Users/pai/github.com/pai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml",
    "taskPassed": true,
    "taskOutput": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters.",
    "taskJudgeReason": "The reference is the metric name 'container_cpu_usage_seconds_total'. The model output explicitly references this metric ('sum rate of container_cpu_usage_seconds_total') and describes CPU usage for pods, thereby semantically including the core information from the reference.",
    "difficulty": "medium",
    "parallel": true,
    "totalRuns": 1,
    "assertionResults": {
      "toolsUsed": {
        "passed": true
      },
      "minToolCalls": {
        "passed": true
      },
      "maxToolCalls": {
        "passed": true
      },
      "callOrder": {
        "passed": true
      }
    },
    "allAssertionsPassed": true,
    "callHistory": {
      "ToolCalls": [
        {
          "serverName": "obs",
          "timestamp": "2026-04-02T11:49:30.662479+05:30",
          "success": true,
          "name": "list_metrics",
          "result": {
            "content": [
              {
                "type": "text",
                "text": "{\"metrics\":[\"container_cpu_usage_seconds_total\"]}"
              }
            ],
            "structuredContent": {
              "metrics": [
                "container_cpu_usage_seconds_total"
              ]
            }
          },
          "tokens": {
            "inputTokens": 10,
            "outputTokens": 22,
            "totalTokens": 32
          },
          "request": {
            "Session": {},
            "Params": {
              "name": "list_metrics",
              "arguments": {
                "name_regex": "container_cpu_usage_seconds_total"
              }
            },
            "Extra": {
              "TokenInfo": null,
              "Header": {
                "Accept": [
                  "application/json, text/event-stream"
                ],
                "Accept-Encoding": [
                  "gzip"
                ],
                "Content-Length": [
                  "142"
                ],
                "Content-Type": [
                  "application/json"
                ],
                "Mcp-Protocol-Version": [
                  "2025-06-18"
                ],
                "Mcp-Session-Id": [
                  "6DDECKOCAFQDCLUPEIFPJNXLJP"
                ],
                "User-Agent": [
                  "Go-http-client/1.1"
                ]
              }
            }
          }
        },
        {
          "serverName": "obs",
          "timestamp": "2026-04-02T11:49:43.923378+05:30",
          "success": true,
          "name": "execute_instant_query",
          "result": {
            "content": [
              {
                "type": "text",
                "text": "{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}"
              }
            ],
            "structuredContent": {
              "result": [
                {
                  "metric": {},
                  "value": [
                    1775110783.927,
                    "1.1882256674681153"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-monitoring",
                    "pod": "prometheus-k8s-0"
                  },
                  "value": [
                    1775110783.927,
                    "0.07787675311958422"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-monitoring",
                    "pod": "prometheus-k8s-1"
                  },
                  "value": [
                    1775110783.927,
                    "0.07034366453073795"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-ovn-kubernetes",
                    "pod": "ovnkube-node-bm7cm"
                  },
                  "value": [
                    1775110783.927,
                    "0.028958297344501942"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-ovn-kubernetes",
                    "pod": "ovnkube-node-qccp8"
                  },
                  "value": [
                    1775110783.927,
                    "0.025098961382484318"
                  ]
                }
              ],
              "resultType": "vector"
            }
          },
          "tokens": {
            "inputTokens": 38,
            "outputTokens": 222,
            "totalTokens": 260
          },
          "request": {
            "Session": {},
            "Params": {
              "name": "execute_instant_query",
              "arguments": {
                "query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
                "time": "NOW"
              }
            },
            "Extra": {
              "TokenInfo": null,
              "Header": {
                "Accept": [
                  "application/json, text/event-stream"
                ],
                "Accept-Encoding": [
                  "gzip"
                ],
                "Content-Length": [
                  "224"
                ],
                "Content-Type": [
                  "application/json"
                ],
                "Mcp-Protocol-Version": [
                  "2025-06-18"
                ],
                "Mcp-Session-Id": [
                  "6DDECKOCAFQDCLUPEIFPJNXLJP"
                ],
                "User-Agent": [
                  "Go-http-client/1.1"
                ]
              }
            }
          }
        }
      ],
      "ResourceReads": null,
      "PromptGets": null
    },
    "tokenEstimate": {
      "inputTokens": 4394,
      "outputTokens": 3475,
      "totalTokens": 9533,
      "promptTokens": 8,
      "messageTokens": 180,
      "thinkingTokens": 0,
      "toolInputTokens": 48,
      "toolOutputTokens": 574,
      "mcpSchemaTokens": 2277,
      "promptGetInputTokens": 0,
      "promptGetOutputTokens": 0,
      "resourceInputTokens": 0,
      "resourceOutputTokens": 0,
      "source": "actual",
      "actual": {
        "inputTokens": 4394,
        "outputTokens": 3475,
        "totalTokens": 9533
      }
    },
    "judgeTokenUsage": {
      "inputTokens": 22553,
      "outputTokens": 15927,
      "totalTokens": 63184
    },
    "setupOutput": {
      "Steps": [],
      "Success": true,
      "Error": ""
    },
    "agentOutput": {
      "Steps": [
        {
          "type": "tool_call",
          "success": true,
          "message": "list_metrics"
        },
        {
          "type": "tool_call",
          "success": true,
          "message": "execute_instant_query"
        },
        {
          "type": "message",
          "success": true,
          "message": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters."
        }
      ],
      "Success": true,
      "Error": "",
      "agentDetails": {
        "tokenEstimate": {
          "inputTokens": 4394,
          "outputTokens": 3475,
          "totalTokens": 9533,
          "promptTokens": 8,
          "messageTokens": 180,
          "thinkingTokens": 0,
          "toolInputTokens": 48,
          "toolOutputTokens": 574,
          "mcpSchemaTokens": 2277,
          "promptGetInputTokens": 0,
          "promptGetOutputTokens": 0,
          "resourceInputTokens": 0,
          "resourceOutputTokens": 0,
          "source": "actual",
          "actual": {
            "inputTokens": 4394,
            "outputTokens": 3475,
            "totalTokens": 9533
          }
        },
        "toolCalls": [
          {
            "title": "list_metrics",
            "status": "completed",
            "rawInput": {
              "name_regex": "container_cpu_usage_seconds_total"
            },
            "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"metrics\\\":[\\\"container_cpu_usage_seconds_total\\\"]}\"}],\"structuredContent\":{\"metrics\":[\"container_cpu_usage_seconds_total\"]}}"
          },
          {
            "title": "execute_instant_query",
            "status": "completed",
            "rawInput": {
              "query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
              "time": "NOW"
            },
            "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"result\\\":[{\\\"metric\\\":{},\\\"value\\\":[1775110783.927,\\\"1.1882256674681153\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-0\\\"},\\\"value\\\":[1775110783.927,\\\"0.07787675311958422\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-1\\\"},\\\"value\\\":[1775110783.927,\\\"0.07034366453073795\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-bm7cm\\\"},\\\"value\\\":[1775110783.927,\\\"0.028958297344501942\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-qccp8\\\"},\\\"value\\\":[1775110783.927,\\\"0.025098961382484318\\\"]}],\\\"resultType\\\":\\\"vector\\\"}\"}],\"structuredContent\":{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}}"
          }
        ],
        "outputSteps": [
          {
            "type": "tool_call",
            "toolCall": {
              "title": "list_metrics",
              "status": "completed",
              "rawInput": {
                "name_regex": "container_cpu_usage_seconds_total"
              },
              "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"metrics\\\":[\\\"container_cpu_usage_seconds_total\\\"]}\"}],\"structuredContent\":{\"metrics\":[\"container_cpu_usage_seconds_total\"]}}"
            }
          },
          {
            "type": "tool_call",
            "toolCall": {
              "title": "execute_instant_query",
              "status": "completed",
              "rawInput": {
                "query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
                "time": "NOW"
              },
              "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"result\\\":[{\\\"metric\\\":{},\\\"value\\\":[1775110783.927,\\\"1.1882256674681153\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-0\\\"},\\\"value\\\":[1775110783.927,\\\"0.07787675311958422\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-1\\\"},\\\"value\\\":[1775110783.927,\\\"0.07034366453073795\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-bm7cm\\\"},\\\"value\\\":[1775110783.927,\\\"0.028958297344501942\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-qccp8\\\"},\\\"value\\\":[1775110783.927,\\\"0.025098961382484318\\\"]}],\\\"resultType\\\":\\\"vector\\\"}\"}],\"structuredContent\":{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}}"
            }
          },
          {
            "type": "message",
            "content": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters."
          }
        ]
      }
    },
    "verifyOutput": {
      "Steps": [
        {
          "type": "llmJudge",
          "success": true,
          "message": "The reference is the metric name 'container_cpu_usage_seconds_total'. The model output explicitly references this metric ('sum rate of container_cpu_usage_seconds_total') and describes CPU usage for pods, thereby semantically including the core information from the reference.",
          "usage": {
            "inputTokens": 6393,
            "outputTokens": 3862,
            "totalTokens": 10255
          }
        },
        {
          "type": "llmJudge",
          "success": true,
          "message": "The reference answer is 'pod', indicating the core concept is pods. The model output lists specific pods using CPU and includes the pod names, thereby semantically including the same meaning (pods). No contradiction; the extra details are acceptable per the CONTAINS criterion.",
          "usage": {
            "inputTokens": 16160,
            "outputTokens": 12065,
            "totalTokens": 52929
          }
        }
      ],
      "Success": true,
      "Error": ""
    },
    "cleanupOutput": {
      "Steps": [],
      "Success": true,
      "Error": ""
    }
  }
]
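Since the raw JSON above is long, a small script can reduce it to the summary numbers. This is an illustrative helper (summarize is not part of mcpchecker); it relies only on the taskPassed and assertionResults fields visible in the output above:

```python
import json

def summarize(results):
    """Tally task and assertion pass counts from mcpchecker result JSON."""
    tasks_passed = sum(1 for r in results if r.get("taskPassed"))
    assertions = [a for r in results
                  for a in r.get("assertionResults", {}).values()]
    assertions_passed = sum(1 for a in assertions if a.get("passed"))
    return tasks_passed, len(results), assertions_passed, len(assertions)

# Minimal inline sample shaped like the output above; in practice, load
# mcpchecker-obs-mcp-tools-out.json instead.
sample = json.loads('[{"taskName": "cpu-usage", "taskPassed": true, '
                    '"assertionResults": {"toolsUsed": {"passed": true}, '
                    '"callOrder": {"passed": true}}}]')
print(summarize(sample))  # (1, 1, 2, 2)
```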

cc: @iNecas @saswatamcode

#46 was based on gpt-4o-mini results, so I'm OK with closing it out or updating it to work with lower-end models as well.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Apr 2, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@@ -0,0 +1,46 @@
# Metrics Reference
Member Author


This doc is for development reference

Comment thread docs/dev/PROMPTS.md
@@ -0,0 +1,51 @@
# Prompts You Can Try
Member Author


This doc is for development reference

@slashpai slashpai marked this pull request as ready for review April 2, 2026 08:32
@slashpai slashpai requested a review from a team April 2, 2026 08:32
@openshift-ci openshift-ci Bot requested review from saswatamcode and xiormeesh April 2, 2026 08:32
@slashpai
Member Author

slashpai commented Apr 2, 2026

This is now ready for an initial review of the cases covered in the evals.

cc: @iNecas

Comment thread evals/mcpchecker/README.md Outdated

## Coverage

17 eval tasks across 4 categories and 3 difficulty levels:
Contributor


I found the coverage part useless unless it's generated automatically.
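One way to act on that: generate the README coverage table from task metadata instead of maintaining it by hand. The sketch below assumes each task carries "category" and "difficulty" metadata (the commit messages in this PR mention such labels); parsing the YAML files is stubbed with inline dicts so the idea stays self-contained:

```python
from collections import Counter

def coverage_rows(tasks):
    """Tally tasks by category and difficulty for a generated coverage table."""
    by_category = Counter(t["category"] for t in tasks)
    by_difficulty = Counter(t["difficulty"] for t in tasks)
    return by_category, by_difficulty

# Inline stand-ins for parsed task files.
tasks = [
    {"category": "metrics", "difficulty": "easy"},
    {"category": "queries", "difficulty": "medium"},
    {"category": "queries", "difficulty": "hard"},
]
cats, diffs = coverage_rows(tasks)
print(dict(cats))  # {'metrics': 1, 'queries': 2}
```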

@iNecas
Contributor

iNecas commented Apr 14, 2026

I've run the evals locally and got these results:

Total Tasks: 47
Tasks Passed: 34/47
Assertions Passed: 146/171

Tasks where verification failed but assertions passed: 5
  Assertions in these tasks: 30/38
Tokens:     ~1513448 (estimate - excludes system prompt & cache)
MCP schemas: ~107019 (included in token total)

=== Statistics by Difficulty ===

easy:
  Tasks: 13/14
  Assertions: 46/48

medium:
  Tasks: 20/27
  Assertions: 89/105

hard:
  Tasks: 1/6
  Assertions: 11/18

=== Consistency Summary ===
Task                                     Pass Rate
-------------------------------------------------------
label-names                              2/2 (100.0%)
pending-pods                             3/3 (100.0%)
prometheus-head-series                   1/2 (50.0%)
cpu-usage                                1/3 (33.3%)
network-traffic                          2/3 (66.7%)
list-kube-metrics                        2/2 (100.0%)
label-values                             3/3 (100.0%)
get-alerts                               2/2 (100.0%)
namespace-resource-usage                 1/3 (33.3%)
list-node-metrics                        2/2 (100.0%)
get-series-cardinality                   3/3 (100.0%)
prometheus-wal-size                      2/2 (100.0%)
alert-investigation                      3/3 (100.0%)
get-silences                             2/2 (100.0%)
crashlooping-pods                        2/3 (66.7%)
prometheus-requests                      2/3 (66.7%)
diagnose-cluster-health                  0/3 (0.0%)
pods-created                             1/3 (33.3%)

Some of the failures were caused by a flaky network, and this was with gpt-5-nano, so the results are not bad. I'm not sure what the expected evals-per-tool-call ratio is, but the coverage looks good to me.

The only tool I've not seen covered is show_timeseries. @slashpai mind adding some small coverage for it as well? Just adding "Visualize" to the cpu-usage eval seems to do the trick.
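A hypothetical sketch of that coverage (the later visualize-cpu-usage commit in this PR covers show_timeseries; the pairing with execute_range_query and the field names are assumed):

```yaml
- path: tasks/queries/visualize-cpu-usage.yaml
  assertions:
    toolsUsed:
      - execute_range_query   # fetch the CPU series first (assumed)
      - show_timeseries       # then render it
```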

slashpai added 14 commits April 15, 2026 12:58
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
…pe targets

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- Add callOrder assertions to enforce list_metrics-first workflow
- Add runs metadata (2 for easy, 3 for medium/hard) and category labels
- Tighten maxToolCalls bounds and strengthen LLM judge criteria
- Replace duplicate get-active-alerts with multi-step alert-investigation task
- Add namespace-resource-usage and diagnose-cluster-health hard tasks

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Eval run showed agents legitimately use execute_range_query for prompts
that were assumed to need execute_instant_query. Uses toolPattern to
accept either, simplifies callOrder to list_metrics-first only, and
increases maxToolCalls from 7 to 15.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Developer-facing references for eval authoring and debugging. Also adds
query efficiency tips to METRICS_REFERENCE.md and a single-task run
example to TESTING.md.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Migrate LLM judge from deprecated env-based config to ref-based agent
format, simplify env var setup, and document new CLI flags and providers.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- nonexistent-metric: agent handles missing metric gracefully
- nonexistent-namespace: agent handles empty results for fake namespace
- high-cardinality-rejection: agent recovers from guardrail rejection
- namespace-pod-count: multi-step label exploration to query chaining

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Add prometheus-reachability smoke test to verify connectivity before
running the full suite, avoiding wasted tokens on unreachable backends.
Add time-range-query task to test range query with correct time window.
Fix nonexistent-namespace contains check from 'no' to 'no data' to
reduce false positives. Document smoke test as step 3 in Quick Start.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
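For reference, the tightened contains check could be sketched like this (assertion schema inferred from the names in this PR; values illustrative only):

```yaml
# Hypothetical fragment for nonexistent-namespace: the agent should report "no data".
assertions:
  - type: contains
    value: "no data"    # was "no", which matched far too many answers
```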
…e task

Set all task runs to 1 to reduce token cost while iterating on evals.
Add visualize-cpu-usage task to cover the show_timeseries tool.
Fix eval assertions: require query tool for high-cardinality-rejection,
add list_metrics to diagnose-cluster-health, lower minToolCalls for
namespace-resource-usage.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
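A minimal sketch of what the new visualize-cpu-usage task could look like, assuming the field names used elsewhere in this PR (the actual visualize-cpu-usage.yaml may differ):

```yaml
# Hypothetical task sketch covering the previously untested show_timeseries tool.
prompt: "Visualize CPU usage for the cluster over the last hour"
metadata:
  difficulty: medium
  runs: 1               # all runs set to 1 while iterating, per this commit
assertions:
  - type: toolsUsed
    server: obs
    tool: show_timeseries
```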
@slashpai slashpai changed the title WIP GIE-460: Add mcpchecker evals for obs-mcp tools GIE-460: Add mcpchecker evals for obs-mcp tools Apr 15, 2026
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Apr 15, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slashpai
Member Author

/hold for more reviews

@iNecas
Contributor

iNecas commented Apr 15, 2026

@slashpai I've used the in-cluster OpenShift Thanos querier with guardrails off.

@slashpai
Member Author

@iNecas addressed the comment.

Can you review again?

I think we can iterate and add more cases as needed after this one is merged.

@openshift-ci

openshift-ci Bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iNecas, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slashpai
Member Author

/hold cancel

@openshift-merge-bot merged commit 15374bf into rhobs:main Apr 15, 2026
7 checks passed
@slashpai
Member Author

Results against a Prometheus metrics backend (OpenShift 4.21 ROSA) with mcpchecker v0.0.15:

=== Results Summary ===

Task: label-values
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-values.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: list-node-metrics
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: pending-pods
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pending-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-silences
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-silences.yaml
  Difficulty: easy
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (3/3)

Task: prometheus-head-series
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-head-series.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-series-cardinality
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/get-series.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: prometheus-requests
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-requests.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: prometheus-wal-size
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-wal-size.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-alerts
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-alerts.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: high-cardinality-rejection
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/high-cardinality-rejection.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: FAILED (2/3)
    - MaxToolCalls: Too many tool calls: expected <= 15, got 23

Task: label-names
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-names.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: alert-investigation
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/alert-investigation.yaml
  Difficulty: medium
  Task Status: FAILED (Agent execution error)
  Error details saved to: github.com/slashpai/obs-mcp/evals/mcpchecker/alert-investigation-error.txt
  Assertions: FAILED (2/3)
    - MaxToolCalls: Too many tool calls: expected <= 15, got 19

Task: time-range-query
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/time-range-query.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: crashlooping-pods
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/crashlooping-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: FAILED (3/4)
    - MaxToolCalls: Too many tool calls: expected <= 7, got 14

Task: prometheus-reachability
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-reachability.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: pods-created
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pods-created.yaml
  Difficulty: medium
  Task Status: FAILED
  Error: one or more verification steps failed
  Assertions: FAILED (1/4)
    - ToolsUsed: Required tool not called: server=obs, tool=execute_range_query, pattern=
    - MaxToolCalls: Too many tool calls: expected <= 7, got 9
    - CallOrder: Expected call order not satisfied. Got to 1/2

Task: nonexistent-namespace
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-namespace.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: list-kube-metrics
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: diagnose-cluster-health
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/diagnose-cluster-health.yaml
  Difficulty: hard
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: namespace-resource-usage
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-resource-usage.yaml
  Difficulty: hard
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: nonexistent-metric
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-metric.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: visualize-cpu-usage
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: namespace-pod-count
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: network-traffic
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/network-traffic.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: cpu-usage
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

=== Overall Statistics ===
Total Tasks: 25
Tasks Passed: 21/25
Assertions Passed: 85/91

Tasks where verification failed but assertions passed: 2
  Assertions in these tasks: 8/11
Tokens:     ~1423666 (estimate - excludes system prompt & cache)
MCP schemas: ~74300 (included in token total)

=== Statistics by Difficulty ===

easy:
  Tasks: 9/10
  Assertions: 34/34

medium:
  Tasks: 10/13
  Assertions: 44/50

hard:
  Tasks: 2/2
  Assertions: 7/7
⏱️  Completed in 18m46s
