
GIE-460: Add mcpchecker evals for obs-mcp tools#34

Merged
openshift-merge-bot[bot] merged 21 commits into rhobs:main from slashpai:mcp-evals
Apr 15, 2026

Conversation

@slashpai
Member

@slashpai slashpai commented Feb 25, 2026

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.15

@openshift-ci

openshift-ci Bot commented Feb 25, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

taskSets:
  # Metric discovery
  - path: tasks/metrics/list-metrics.yaml
    assertions:
Contributor


I know it's not the fault of this PR, but the split between the task definition and assertions is pretty terrible. Luckily the mcpchecker folks seem to be aware: mcpchecker/mcpchecker#168
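For readers following along, the split being criticized looks roughly like this. Field names come from the v1alpha2 fragment quoted in this thread; everything else is an assumed sketch, not the exact schema:

```yaml
# eval.yaml -- assertions are declared here, in the top-level suite file ...
taskSets:
  - path: tasks/metrics/list-metrics.yaml  # ... while the prompt lives in this separate task file
    assertions:
      toolsUsed:
        - list_metrics   # agent must select the right tool
      maxToolCalls: 3    # call count stays within bounds
```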

@slashpai slashpai changed the title WIP feat: Add mcpchecker evals for obs-mcp tools WIP GIE-460: Add mcpchecker evals for obs-mcp tools Mar 11, 2026
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add evaluation tasks using the mcpchecker framework (v1alpha2) to test that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

16 eval tasks across 4 categories:

| Category | Tasks | Tools tested |
| --- | --- | --- |
| Metrics discovery | list kube metrics, list node metrics | list_metrics |
| Label exploration | label names, label values, series cardinality | get_label_names, get_label_values, get_series |
| PromQL queries | CPU usage, pending pods, crashlooping pods, network traffic, Prometheus internals (head series, requests, WAL size) | execute_instant_query, execute_range_query |
| Alertmanager | firing alerts, active alerts, silences | get_alerts, get_silences |

Each task verifies:

  • The agent selects the correct tool(s)
  • Tool call count stays within bounds
  • Response contains expected content (via LLM judge)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

16 eval tasks across 4 categories:

| Category | Tasks | Tools tested |
| --- | --- | --- |
| Metrics discovery | list kube metrics, list node metrics | list_metrics |
| Label exploration | label names, label values, series cardinality | get_label_names, get_label_values, get_series |
| PromQL queries | CPU usage, pending pods, crashlooping pods, network traffic, Prometheus internals (head series, requests, WAL size) | execute_instant_query, execute_range_query |
| Alertmanager | firing alerts, active alerts, silences | get_alerts, get_silences |

Each task verifies:

  • The agent selects the correct tool(s)
  • Tool call count stays within bounds
  • Response contains expected content (via LLM judge)

Note: This is a smoke-test level evaluation covering basic tool discovery and usage. We need to add:

  • Multi-step reasoning — tasks requiring 3+ chained tools (e.g., discover metric → query → analyze trend)
  • Error handling — agent recovery from invalid queries or missing metrics
  • Guardrail behavior — agent response when dangerous queries are blocked
  • Parameter coverage — testing less-used params like silenced, inhibited, receiver, filter, time ranges
  • Ambiguous prompts — vague diagnostic questions (e.g., "Why is my app slow?") requiring the agent to choose the right tools
  • Hard difficulty tasks — complex multi-tool diagnostic scenarios
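As one illustration of the error-handling direction, a task along these lines could be added. This is a hedged sketch: the nonexistent-metric task name appears in this PR's later commits, but the fields shown here are assumed, not confirmed schema:

```yaml
- path: tasks/errors/nonexistent-metric.yaml
  assertions:
    toolsUsed:
      - list_metrics
    maxToolCalls: 4
    llmJudge:        # field name assumed
      criteria: >
        The agent reports that the metric does not exist rather than
        fabricating values.
```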

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Mar 11, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/nightly

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slashpai
Member Author

slashpai commented Apr 2, 2026

I updated the branch with the latest changes in mcpchecker https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.14 (previously I was using the v0.0.11 nightly).

I tested one of the evals, cpu-usage, with gpt-5-nano, which gave better results than gpt-4o-mini, so I have switched to it as the default model in the config.

This was tested against an OpenShift cluster after port-forwarding:

PROMETHEUS_URL=http://localhost:9090 ALERTMANAGER_URL=http://localhost:9093 make run
mcpchecker check eval.yaml --run "cpu-usage" --runs 1 --verbose


=== Starting Evaluation ===

[cpu-usage] Starting (parallel, medium)
[cpu-usage] → Setting up task environment...
[cpu-usage] → Running agent...
  → Agent 'llm-agent-openai-gpt-5-nano' is working…
2026/04/02 11:50:00 INFO peer connection closed
2026/04/02 11:50:00 INFO peer connection closed
[cpu-usage] → Verifying results...
  → LLM judge 'llm-agent-openai-gpt-5-nano' is evaluating…
2026/04/02 11:50:40 ERROR failed to handle notification method=session/update err="{\"code\":-32603,\"message\":\"Internal error\",\"data\":{\"error\":\"no matching session on client\"}}"
2026/04/02 11:50:40 INFO peer connection closed
2026/04/02 11:50:40 INFO peer connection closed
  → LLM judge 'llm-agent-openai-gpt-5-nano' is evaluating…
2026/04/02 11:52:54 ERROR failed to handle notification method=session/update err="{\"code\":-32603,\"message\":\"Internal error\",\"data\":{\"error\":\"no matching session on client\"}}"
2026/04/02 11:52:54 INFO peer connection closed
2026/04/02 11:52:54 INFO peer connection closed
[cpu-usage] → Evaluating assertions...
[cpu-usage] ✓ Task passed

=== Evaluation Complete ===

📄 Results saved to: mcpchecker-obs-mcp-tools-out.json

=== Results Summary ===

Task: cpu-usage
  Path: /Users/jayapriyapai/github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

=== Overall Statistics ===
Total Tasks: 1
Tasks Passed: 1/1
Assertions Passed: 4/4
Tokens:     ~9533 (estimate - excludes system prompt & cache)
MCP schemas: ~2277 (included in token total)

=== Statistics by Difficulty ===

medium:
  Tasks: 1/1
  Assertions: 4/4
⏱️  Completed in 3m33s

Result

[
  {
    "taskName": "cpu-usage",
    "taskPath": "/Users/pai/github.com/pai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml",
    "taskPassed": true,
    "taskOutput": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters.",
    "taskJudgeReason": "The reference is the metric name 'container_cpu_usage_seconds_total'. The model output explicitly references this metric ('sum rate of container_cpu_usage_seconds_total') and describes CPU usage for pods, thereby semantically including the core information from the reference.",
    "difficulty": "medium",
    "parallel": true,
    "totalRuns": 1,
    "assertionResults": {
      "toolsUsed": {
        "passed": true
      },
      "minToolCalls": {
        "passed": true
      },
      "maxToolCalls": {
        "passed": true
      },
      "callOrder": {
        "passed": true
      }
    },
    "allAssertionsPassed": true,
    "callHistory": {
      "ToolCalls": [
        {
          "serverName": "obs",
          "timestamp": "2026-04-02T11:49:30.662479+05:30",
          "success": true,
          "name": "list_metrics",
          "result": {
            "content": [
              {
                "type": "text",
                "text": "{\"metrics\":[\"container_cpu_usage_seconds_total\"]}"
              }
            ],
            "structuredContent": {
              "metrics": [
                "container_cpu_usage_seconds_total"
              ]
            }
          },
          "tokens": {
            "inputTokens": 10,
            "outputTokens": 22,
            "totalTokens": 32
          },
          "request": {
            "Session": {},
            "Params": {
              "name": "list_metrics",
              "arguments": {
                "name_regex": "container_cpu_usage_seconds_total"
              }
            },
            "Extra": {
              "TokenInfo": null,
              "Header": {
                "Accept": [
                  "application/json, text/event-stream"
                ],
                "Accept-Encoding": [
                  "gzip"
                ],
                "Content-Length": [
                  "142"
                ],
                "Content-Type": [
                  "application/json"
                ],
                "Mcp-Protocol-Version": [
                  "2025-06-18"
                ],
                "Mcp-Session-Id": [
                  "6DDECKOCAFQDCLUPEIFPJNXLJP"
                ],
                "User-Agent": [
                  "Go-http-client/1.1"
                ]
              }
            }
          }
        },
        {
          "serverName": "obs",
          "timestamp": "2026-04-02T11:49:43.923378+05:30",
          "success": true,
          "name": "execute_instant_query",
          "result": {
            "content": [
              {
                "type": "text",
                "text": "{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}"
              }
            ],
            "structuredContent": {
              "result": [
                {
                  "metric": {},
                  "value": [
                    1775110783.927,
                    "1.1882256674681153"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-monitoring",
                    "pod": "prometheus-k8s-0"
                  },
                  "value": [
                    1775110783.927,
                    "0.07787675311958422"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-monitoring",
                    "pod": "prometheus-k8s-1"
                  },
                  "value": [
                    1775110783.927,
                    "0.07034366453073795"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-ovn-kubernetes",
                    "pod": "ovnkube-node-bm7cm"
                  },
                  "value": [
                    1775110783.927,
                    "0.028958297344501942"
                  ]
                },
                {
                  "metric": {
                    "namespace": "openshift-ovn-kubernetes",
                    "pod": "ovnkube-node-qccp8"
                  },
                  "value": [
                    1775110783.927,
                    "0.025098961382484318"
                  ]
                }
              ],
              "resultType": "vector"
            }
          },
          "tokens": {
            "inputTokens": 38,
            "outputTokens": 222,
            "totalTokens": 260
          },
          "request": {
            "Session": {},
            "Params": {
              "name": "execute_instant_query",
              "arguments": {
                "query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
                "time": "NOW"
              }
            },
            "Extra": {
              "TokenInfo": null,
              "Header": {
                "Accept": [
                  "application/json, text/event-stream"
                ],
                "Accept-Encoding": [
                  "gzip"
                ],
                "Content-Length": [
                  "224"
                ],
                "Content-Type": [
                  "application/json"
                ],
                "Mcp-Protocol-Version": [
                  "2025-06-18"
                ],
                "Mcp-Session-Id": [
                  "6DDECKOCAFQDCLUPEIFPJNXLJP"
                ],
                "User-Agent": [
                  "Go-http-client/1.1"
                ]
              }
            }
          }
        }
      ],
      "ResourceReads": null,
      "PromptGets": null
    },
    "tokenEstimate": {
      "inputTokens": 4394,
      "outputTokens": 3475,
      "totalTokens": 9533,
      "promptTokens": 8,
      "messageTokens": 180,
      "thinkingTokens": 0,
      "toolInputTokens": 48,
      "toolOutputTokens": 574,
      "mcpSchemaTokens": 2277,
      "promptGetInputTokens": 0,
      "promptGetOutputTokens": 0,
      "resourceInputTokens": 0,
      "resourceOutputTokens": 0,
      "source": "actual",
      "actual": {
        "inputTokens": 4394,
        "outputTokens": 3475,
        "totalTokens": 9533
      }
    },
    "judgeTokenUsage": {
      "inputTokens": 22553,
      "outputTokens": 15927,
      "totalTokens": 63184
    },
    "setupOutput": {
      "Steps": [],
      "Success": true,
      "Error": ""
    },
    "agentOutput": {
      "Steps": [
        {
          "type": "tool_call",
          "success": true,
          "message": "list_metrics"
        },
        {
          "type": "tool_call",
          "success": true,
          "message": "execute_instant_query"
        },
        {
          "type": "message",
          "success": true,
          "message": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters."
        }
      ],
      "Success": true,
      "Error": "",
      "agentDetails": {
        "tokenEstimate": {
          "inputTokens": 4394,
          "outputTokens": 3475,
          "totalTokens": 9533,
          "promptTokens": 8,
          "messageTokens": 180,
          "thinkingTokens": 0,
          "toolInputTokens": 48,
          "toolOutputTokens": 574,
          "mcpSchemaTokens": 2277,
          "promptGetInputTokens": 0,
          "promptGetOutputTokens": 0,
          "resourceInputTokens": 0,
          "resourceOutputTokens": 0,
          "source": "actual",
          "actual": {
            "inputTokens": 4394,
            "outputTokens": 3475,
            "totalTokens": 9533
          }
        },
        "toolCalls": [
          {
            "title": "list_metrics",
            "status": "completed",
            "rawInput": {
              "name_regex": "container_cpu_usage_seconds_total"
            },
            "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"metrics\\\":[\\\"container_cpu_usage_seconds_total\\\"]}\"}],\"structuredContent\":{\"metrics\":[\"container_cpu_usage_seconds_total\"]}}"
          },
          {
            "title": "execute_instant_query",
            "status": "completed",
            "rawInput": {
              "query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
              "time": "NOW"
            },
            "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"result\\\":[{\\\"metric\\\":{},\\\"value\\\":[1775110783.927,\\\"1.1882256674681153\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-0\\\"},\\\"value\\\":[1775110783.927,\\\"0.07787675311958422\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-1\\\"},\\\"value\\\":[1775110783.927,\\\"0.07034366453073795\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-bm7cm\\\"},\\\"value\\\":[1775110783.927,\\\"0.028958297344501942\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-qccp8\\\"},\\\"value\\\":[1775110783.927,\\\"0.025098961382484318\\\"]}],\\\"resultType\\\":\\\"vector\\\"}\"}],\"structuredContent\":{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}}"
          }
        ],
        "outputSteps": [
          {
            "type": "tool_call",
            "toolCall": {
              "title": "list_metrics",
              "status": "completed",
              "rawInput": {
                "name_regex": "container_cpu_usage_seconds_total"
              },
              "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"metrics\\\":[\\\"container_cpu_usage_seconds_total\\\"]}\"}],\"structuredContent\":{\"metrics\":[\"container_cpu_usage_seconds_total\"]}}"
            }
          },
          {
            "type": "tool_call",
            "toolCall": {
              "title": "execute_instant_query",
              "status": "completed",
              "rawInput": {
                "query": "topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=\"POD\"}[5m])))",
                "time": "NOW"
              },
              "rawOutput": "{\"content\":[{\"type\":\"text\",\"text\":\"{\\\"result\\\":[{\\\"metric\\\":{},\\\"value\\\":[1775110783.927,\\\"1.1882256674681153\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-0\\\"},\\\"value\\\":[1775110783.927,\\\"0.07787675311958422\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-monitoring\\\",\\\"pod\\\":\\\"prometheus-k8s-1\\\"},\\\"value\\\":[1775110783.927,\\\"0.07034366453073795\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-bm7cm\\\"},\\\"value\\\":[1775110783.927,\\\"0.028958297344501942\\\"]},{\\\"metric\\\":{\\\"namespace\\\":\\\"openshift-ovn-kubernetes\\\",\\\"pod\\\":\\\"ovnkube-node-qccp8\\\"},\\\"value\\\":[1775110783.927,\\\"0.025098961382484318\\\"]}],\\\"resultType\\\":\\\"vector\\\"}\"}],\"structuredContent\":{\"result\":[{\"metric\":{},\"value\":[1775110783.927,\"1.1882256674681153\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-0\"},\"value\":[1775110783.927,\"0.07787675311958422\"]},{\"metric\":{\"namespace\":\"openshift-monitoring\",\"pod\":\"prometheus-k8s-1\"},\"value\":[1775110783.927,\"0.07034366453073795\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-bm7cm\"},\"value\":[1775110783.927,\"0.028958297344501942\"]},{\"metric\":{\"namespace\":\"openshift-ovn-kubernetes\",\"pod\":\"ovnkube-node-qccp8\"},\"value\":[1775110783.927,\"0.025098961382484318\"]}],\"resultType\":\"vector\"}}"
            }
          },
          {
            "type": "message",
            "content": "Here are the top CPU-using pods in the last 5 minutes (by sum rate of container_cpu_usage_seconds_total, excluding the “POD” infra containers):\n\n- prometheus-k8s-0 (openshift-monitoring): ~0.078 CPU cores\n- prometheus-k8s-1 (openshift-monitoring): ~0.070 CPU cores\n- ovnkube-node-bm7cm (openshift-ovn-kubernetes): ~0.029 CPU cores\n- ovnkube-node-qccp8 (openshift-ovn-kubernetes): ~0.025 CPU cores\n\nNote: There’s an overall total row at the top of the results (not a pod), which represents the aggregate rate. If you want the top 10 (or to filter differently), I can run a query with topk(10, ...) or adjust the filters."
          }
        ]
      }
    },
    "verifyOutput": {
      "Steps": [
        {
          "type": "llmJudge",
          "success": true,
          "message": "The reference is the metric name 'container_cpu_usage_seconds_total'. The model output explicitly references this metric ('sum rate of container_cpu_usage_seconds_total') and describes CPU usage for pods, thereby semantically including the core information from the reference.",
          "usage": {
            "inputTokens": 6393,
            "outputTokens": 3862,
            "totalTokens": 10255
          }
        },
        {
          "type": "llmJudge",
          "success": true,
          "message": "The reference answer is 'pod', indicating the core concept is pods. The model output lists specific pods using CPU and includes the pod names, thereby semantically including the same meaning (pods). No contradiction; the extra details are acceptable per the CONTAINS criterion.",
          "usage": {
            "inputTokens": 16160,
            "outputTokens": 12065,
            "totalTokens": 52929
          }
        }
      ],
      "Success": true,
      "Error": ""
    },
    "cleanupOutput": {
      "Steps": [],
      "Success": true,
      "Error": ""
    }
  }
]
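Since the raw JSON above is long, a small script can reduce it to the summary numbers. This is an illustrative helper (summarize is not part of mcpchecker); it relies only on the taskPassed and assertionResults fields visible in the output above:

```python
import json

def summarize(results):
    """Tally task and assertion pass counts from mcpchecker result JSON."""
    tasks_passed = sum(1 for r in results if r.get("taskPassed"))
    assertions = [a for r in results
                  for a in r.get("assertionResults", {}).values()]
    assertions_passed = sum(1 for a in assertions if a.get("passed"))
    return tasks_passed, len(results), assertions_passed, len(assertions)

# Minimal inline sample shaped like the output above; in practice, load
# mcpchecker-obs-mcp-tools-out.json instead.
sample = json.loads('[{"taskName": "cpu-usage", "taskPassed": true, '
                    '"assertionResults": {"toolsUsed": {"passed": true}, '
                    '"callOrder": {"passed": true}}}]')
print(summarize(sample))  # (1, 1, 2, 2)
```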

cc: @iNecas @saswatamcode

#46 was based on gpt-4o-mini results, so I'm OK with closing it out or updating it to work with lower-end models as well.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Apr 2, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@@ -0,0 +1,46 @@
# Metrics Reference
Member Author


This doc is for development reference

Comment thread docs/dev/PROMPTS.md
@@ -0,0 +1,51 @@
# Prompts You Can Try
Member Author


This doc is for development reference

@slashpai slashpai marked this pull request as ready for review April 2, 2026 08:32
@slashpai slashpai requested a review from a team April 2, 2026 08:32
@openshift-ci openshift-ci Bot requested review from saswatamcode and xiormeesh April 2, 2026 08:32
@slashpai
Member Author

slashpai commented Apr 2, 2026

This is now ready for an initial review of the cases covered in the evals.

cc: @iNecas

Comment thread evals/mcpchecker/README.md Outdated

## Coverage

17 eval tasks across 4 categories and 3 difficulty levels:
Contributor


I found the coverage part useless unless it's generated automatically.
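One way to act on that: generate the README coverage table from task metadata instead of maintaining it by hand. The sketch below assumes each task carries "category" and "difficulty" metadata (the commit messages in this PR mention such labels); parsing the YAML files is stubbed with inline dicts so the idea stays self-contained:

```python
from collections import Counter

def coverage_rows(tasks):
    """Tally tasks by category and difficulty for a generated coverage table."""
    by_category = Counter(t["category"] for t in tasks)
    by_difficulty = Counter(t["difficulty"] for t in tasks)
    return by_category, by_difficulty

# Inline stand-ins for parsed task files.
tasks = [
    {"category": "metrics", "difficulty": "easy"},
    {"category": "queries", "difficulty": "medium"},
    {"category": "queries", "difficulty": "hard"},
]
cats, diffs = coverage_rows(tasks)
print(dict(cats))  # {'metrics': 1, 'queries': 2}
```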

@iNecas
Contributor

iNecas commented Apr 14, 2026

I've run the evals locally and got these results:

Total Tasks: 47
Tasks Passed: 34/47
Assertions Passed: 146/171

Tasks where verification failed but assertions passed: 5
  Assertions in these tasks: 30/38
Tokens:     ~1513448 (estimate - excludes system prompt & cache)
MCP schemas: ~107019 (included in token total)

=== Statistics by Difficulty ===

easy:
  Tasks: 13/14
  Assertions: 46/48

medium:
  Tasks: 20/27
  Assertions: 89/105

hard:
  Tasks: 1/6
  Assertions: 11/18

=== Consistency Summary ===
Task                                     Pass Rate
-------------------------------------------------------
label-names                              2/2 (100.0%)
pending-pods                             3/3 (100.0%)
prometheus-head-series                   1/2 (50.0%)
cpu-usage                                1/3 (33.3%)
network-traffic                          2/3 (66.7%)
list-kube-metrics                        2/2 (100.0%)
label-values                             3/3 (100.0%)
get-alerts                               2/2 (100.0%)
namespace-resource-usage                 1/3 (33.3%)
list-node-metrics                        2/2 (100.0%)
get-series-cardinality                   3/3 (100.0%)
prometheus-wal-size                      2/2 (100.0%)
alert-investigation                      3/3 (100.0%)
get-silences                             2/2 (100.0%)
crashlooping-pods                        2/3 (66.7%)
prometheus-requests                      2/3 (66.7%)
diagnose-cluster-health                  0/3 (0.0%)
pods-created                             1/3 (33.3%)

Some of the failures were caused by a flaky network, and this was with gpt-5-nano, so the results are not bad. I'm not sure what the expected evals-per-tool-call ratio is, but the coverage looks good to me.

The only tool I've not seen covered is show_timeseries. @slashpai mind adding some small coverage for it as well? Just adding "Visualize" to the cpu-usage eval seems to do the trick.
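A hypothetical sketch of that coverage (the later visualize-cpu-usage commit in this PR covers show_timeseries; the pairing with execute_range_query and the field names are assumed):

```yaml
- path: tasks/queries/visualize-cpu-usage.yaml
  assertions:
    toolsUsed:
      - execute_range_query   # fetch the CPU series first (assumed)
      - show_timeseries       # then render it
```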

slashpai added 14 commits April 15, 2026 12:58
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
…pe targets

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- Add callOrder assertions to enforce list_metrics-first workflow
- Add runs metadata (2 for easy, 3 for medium/hard) and category labels
- Tighten maxToolCalls bounds and strengthen LLM judge criteria
- Replace duplicate get-active-alerts with multi-step alert-investigation task
- Add namespace-resource-usage and diagnose-cluster-health hard tasks

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Eval run showed agents legitimately use execute_range_query for prompts
that were assumed to need execute_instant_query. Uses toolPattern to
accept either, simplifies callOrder to list_metrics-first only, and
increases maxToolCalls from 7 to 15.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Developer-facing references for eval authoring and debugging. Also adds
query efficiency tips to METRICS_REFERENCE.md and a single-task run
example to TESTING.md.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Migrate LLM judge from deprecated env-based config to ref-based agent
format, simplify env var setup, and document new CLI flags and providers.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Signed-off-by: Jayapriya Pai <janantha@redhat.com>
- nonexistent-metric: agent handles missing metric gracefully
- nonexistent-namespace: agent handles empty results for fake namespace
- high-cardinality-rejection: agent recovers from guardrail rejection
- namespace-pod-count: multi-step label exploration to query chaining

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
Add prometheus-reachability smoke test to verify connectivity before
running the full suite, avoiding wasted tokens on unreachable backends.
Add time-range-query task to test range query with correct time window.
Fix nonexistent-namespace contains check from 'no' to 'no data' to
reduce false positives. Document smoke test as step 3 in Quick Start.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
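For reference, the tightened contains check could be sketched like this (assertion schema inferred from the names in this PR; values illustrative only):

```yaml
# Hypothetical fragment for nonexistent-namespace: the agent should report "no data".
assertions:
  - type: contains
    value: "no data"    # was "no", which matched far too many answers
```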
…e task

Set all task runs to 1 to reduce token cost while iterating on evals.
Add visualize-cpu-usage task to cover the show_timeseries tool.
Fix eval assertions: require query tool for high-cardinality-rejection,
add list_metrics to diagnose-cluster-health, lower minToolCalls for
namespace-resource-usage.

Signed-off-by: Jayapriya Pai <janantha@redhat.com>
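A minimal sketch of what the new visualize-cpu-usage task could look like, assuming the field names used elsewhere in this PR (the actual visualize-cpu-usage.yaml may differ):

```yaml
# Hypothetical task sketch covering the previously untested show_timeseries tool.
prompt: "Visualize CPU usage for the cluster over the last hour"
metadata:
  difficulty: medium
  runs: 1               # all runs set to 1 while iterating, per this commit
assertions:
  - type: toolsUsed
    server: obs
    tool: show_timeseries
```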
@slashpai slashpai changed the title WIP GIE-460: Add mcpchecker evals for obs-mcp tools GIE-460: Add mcpchecker evals for obs-mcp tools Apr 15, 2026
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Apr 15, 2026

@slashpai: This pull request references GIE-460 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Add mcpchecker evaluations to validate that AI agents can discover and correctly use all 8 obs-mcp tools against a live Prometheus/Alertmanager backend.

mcpchecker version used: https://github.com/mcpchecker/mcpchecker/releases/tag/v0.0.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@slashpai
Member Author

/hold for more reviews

@iNecas
Contributor

iNecas commented Apr 15, 2026

@slashpai I've used the in-cluster OpenShift Thanos querier with guardrails off.

@slashpai
Member Author

@iNecas addressed the comment.

Can you review again?

I think we can iterate and add more cases as needed after this one is merged.

@openshift-ci

openshift-ci Bot commented Apr 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iNecas, slashpai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slashpai
Member Author

/hold cancel

@openshift-merge-bot merged commit 15374bf into rhobs:main Apr 15, 2026
7 checks passed
@slashpai
Member Author

Results against a Prometheus metrics backend (OpenShift 4.21 ROSA) with mcpchecker v0.0.15:

=== Results Summary ===

Task: label-values
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-values.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: list-node-metrics
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-node-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: pending-pods
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pending-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-silences
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-silences.yaml
  Difficulty: easy
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (3/3)

Task: prometheus-head-series
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-head-series.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-series-cardinality
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/get-series.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: prometheus-requests
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-requests.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: prometheus-wal-size
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-wal-size.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: get-alerts
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/get-alerts.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: high-cardinality-rejection
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/high-cardinality-rejection.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: FAILED (2/3)
    - MaxToolCalls: Too many tool calls: expected <= 15, got 23

Task: label-names
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/labels/label-names.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: alert-investigation
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/alerts/alert-investigation.yaml
  Difficulty: medium
  Task Status: FAILED (Agent execution error)
  Error details saved to: github.com/slashpai/obs-mcp/evals/mcpchecker/alert-investigation-error.txt
  Assertions: FAILED (2/3)
    - MaxToolCalls: Too many tool calls: expected <= 15, got 19

Task: time-range-query
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/time-range-query.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: crashlooping-pods
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/crashlooping-pods.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: FAILED (3/4)
    - MaxToolCalls: Too many tool calls: expected <= 7, got 14

Task: prometheus-reachability
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/prometheus-reachability.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: pods-created
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/pods-created.yaml
  Difficulty: medium
  Task Status: FAILED
  Error: one or more verification steps failed
  Assertions: FAILED (1/4)
    - ToolsUsed: Required tool not called: server=obs, tool=execute_range_query, pattern=
    - MaxToolCalls: Too many tool calls: expected <= 7, got 9
    - CallOrder: Expected call order not satisfied. Got to 1/2

Task: nonexistent-namespace
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-namespace.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: list-kube-metrics
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/metrics/list-metrics.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: diagnose-cluster-health
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/diagnose-cluster-health.yaml
  Difficulty: hard
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: namespace-resource-usage
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-resource-usage.yaml
  Difficulty: hard
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: nonexistent-metric
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/nonexistent-metric.yaml
  Difficulty: easy
  Task Status: PASSED
  Assertions: PASSED (3/3)

Task: visualize-cpu-usage
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/visualize-cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: namespace-pod-count
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/namespace-pod-count.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

Task: network-traffic
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/network-traffic.yaml
  Difficulty: medium
  Task Status: FAILED (Verification failed, but assertions passed)
  Error: one or more verification steps failed
  Assertions: PASSED (4/4)

Task: cpu-usage
  Path: github.com/slashpai/obs-mcp/evals/mcpchecker/tasks/queries/cpu-usage.yaml
  Difficulty: medium
  Task Status: PASSED
  Assertions: PASSED (4/4)

=== Overall Statistics ===
Total Tasks: 25
Tasks Passed: 21/25
Assertions Passed: 85/91

Tasks where verification failed but assertions passed: 2
  Assertions in these tasks: 8/11
Tokens:     ~1423666 (estimate - excludes system prompt & cache)
MCP schemas: ~74300 (included in token total)

=== Statistics by Difficulty ===

easy:
  Tasks: 9/10
  Assertions: 34/34

medium:
  Tasks: 10/13
  Assertions: 44/50

hard:
  Tasks: 2/2
  Assertions: 7/7
⏱️  Completed in 18m46s
