mcp-data-check

Evaluate MCP server accuracy against known questions and answers.

Installation

pip install mcp-data-check

Or install from source:

git clone https://github.com/GSA-TTS/mcp-data-check.git
cd mcp-data-check
pip install -e .

Usage

Python API

Anthropic (default)

from mcp_data_check import run_evaluation

results = run_evaluation(
    questions_filepath="questions.csv",
    api_key="sk-ant-...",
    server_url="https://mcp.example.com/sse"
)

print(f"Pass rate: {results['summary']['pass_rate']:.1%}")
print(f"Passed: {results['summary']['passed']}/{results['summary']['total']}")

OpenAI

from mcp_data_check import run_evaluation

results = run_evaluation(
    questions_filepath="questions.csv",
    api_key="sk-...",
    server_url="https://mcp.example.com/sse",
    provider="openai",
    model="gpt-4o"
)

Command Line

Anthropic (default)

mcp-data-check https://mcp.example.com/sse -q questions.csv -k YOUR_API_KEY

OpenAI

mcp-data-check https://mcp.example.com/sse -q questions.csv -p openai -m gpt-4o -k YOUR_API_KEY

Baseline comparison (MCP vs no tools)

# Anthropic
mcp-data-check https://mcp.example.com/sse -q questions.csv --compare

# OpenAI
mcp-data-check https://mcp.example.com/sse -q questions.csv -p openai -m gpt-4o --compare

Each question is run --repeats times per mode (default: 5), and pass/fail is determined by majority vote across those runs (sketched below). Results are saved to ./results/comparison_<timestamp>.json.
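For reference, a minimal sketch of the majority-vote rule, assuming each run reduces to a boolean pass result (the helper name is illustrative, not part of the package's API):

def majority_vote(run_results: list[bool]) -> bool:
    """A question passes if more than half of its repeated runs passed."""
    return sum(run_results) > len(run_results) / 2

# With the default of 5 repeats, 3 or more passing runs counts as a pass.
assert majority_vote([True, True, True, False, False]) is True
assert majority_vote([True, True, False, False, False]) is False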

Options:

  • -q, --questions: Path to questions CSV file (required)
  • -p, --provider: LLM provider to use: anthropic (default) or openai
  • -k, --api-key: API key for the chosen provider (defaults to ANTHROPIC_API_KEY or OPENAI_API_KEY env var)
  • -o, --output: Output directory for results (default: ./results)
  • -m, --model: Model to use for evaluation (default: claude-sonnet-4-20250514; use e.g. gpt-4o for OpenAI)
  • -n, --server-name: Name for the MCP server (default: mcp-server)
  • -r, --repeats: Number of times to run each question (default: 5; majority vote determines pass/fail)
  • -v, --verbose: Print detailed progress
  • --compare: Run each question with and without the MCP server and report the delta

Baseline Comparison (Python API)

Anthropic

from mcp_data_check import Evaluator

evaluator = Evaluator(server_url="https://mcp.example.com/sse", api_key="sk-ant-...")
questions = evaluator.load_questions("questions.csv")

comparison = evaluator.run_comparison(questions, repeats=5, verbose=True)

print(f"MCP pass rate:      {comparison.mcp_pass_rate:.1%}")
print(f"Baseline pass rate: {comparison.baseline_pass_rate:.1%}")
print(f"Delta:              {comparison.mcp_pass_rate - comparison.baseline_pass_rate:+.1%}")

evaluator.save_comparison(comparison, "./results")

OpenAI

from mcp_data_check import Evaluator

evaluator = Evaluator(
    server_url="https://mcp.example.com/sse",
    api_key="sk-...",
    provider="openai",
    model="gpt-4o"
)
questions = evaluator.load_questions("questions.csv")

comparison = evaluator.run_comparison(questions, repeats=5, verbose=True)

print(f"MCP pass rate:      {comparison.mcp_pass_rate:.1%}")
print(f"Baseline pass rate: {comparison.baseline_pass_rate:.1%}")

Questions CSV Format

The questions CSV file must have three columns:

Column           Description
question         The question to ask the MCP server
expected_answer  The expected answer to compare against
eval_type        Evaluation method: numeric, string, or llm_judge

Example:

question,expected_answer,eval_type
How many grants were awarded in 2023?,1234,numeric
What organization received the most funding?,NIH,string
Explain the grant distribution,Most grants went to research institutions...,llm_judge
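
If you build the file programmatically, the standard-library csv module handles quoting for you. A short sketch that writes the example above:

import csv

rows = [
    ("How many grants were awarded in 2023?", "1234", "numeric"),
    ("What organization received the most funding?", "NIH", "string"),
    ("Explain the grant distribution",
     "Most grants went to research institutions...", "llm_judge"),
]

with open("questions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "expected_answer", "eval_type"])  # required header
    writer.writerows(rows)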

Evaluation Types

  • numeric: Extracts numbers from the response and compares them against the expected value with a 5% tolerance
  • string: Checks whether the expected string appears in the response (case-insensitive)
  • llm_judge: Uses the selected model to judge semantically whether the response is correct
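
As a rough illustration of the first two checks (a sketch of the documented behavior, not the package's actual implementation):

import re

def numeric_pass(response: str, expected: str, tolerance: float = 0.05) -> bool:
    """Pass if any number in the response is within 5% of the expected value."""
    expected_value = float(expected)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return any(
        abs(float(n) - expected_value) <= tolerance * abs(expected_value)
        for n in numbers
    )

def string_pass(response: str, expected: str) -> bool:
    """Pass if the expected string appears in the response, case-insensitively."""
    return expected.lower() in response.lower()

assert numeric_pass("There were 1,230 grants awarded.", "1234")  # within 5% of 1234
assert string_pass("The top recipient was nih.", "NIH")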

Return Value

The run_evaluation function returns a dictionary:

{
    "summary": {
        "total": 10,
        "passed": 8,
        "failed": 2,
        "pass_rate": 0.8,
        "by_eval_type": {
            "numeric": {"total": 5, "passed": 4},
            "string": {"total": 3, "passed": 3},
            "llm_judge": {"total": 2, "passed": 1}
        }
    },
    "results": [
        {
            "question": "...",
            "expected_answer": "...",
            "eval_type": "numeric",
            "model_response": "...",
            "passed": True,
            "details": {...},
            "error": None,
            "time_to_answer": 2.35,
            "tools_called": [
                {
                    "tool_name": "get_grants",
                    "server_name": "mcp-server",
                    "input": {"year": 2023}
                }
            ]
        },
        ...
    ],
    "metadata": {
        "server_url": "https://mcp.example.com/sse",
        "model": "claude-sonnet-4-20250514",
        "provider": "anthropic",
        "timestamp": "20250127_143022"
    }
}
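
For example, to list only the failing questions from the returned dictionary (field names as documented below):

from mcp_data_check import run_evaluation

results = run_evaluation(
    questions_filepath="questions.csv",
    api_key="sk-ant-...",
    server_url="https://mcp.example.com/sse",
)

for r in results["results"]:
    if not r["passed"]:
        print(f"FAIL [{r['eval_type']}] {r['question']}")
        print(f"  expected: {r['expected_answer']}")
        print(f"  error:    {r['error']}")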

Result Fields

Each result in the results array contains:

Field              Description
question           The original question asked
expected_answer    The expected answer from the CSV
eval_type          Evaluation method used
model_response     The model's full response text
passed             Whether the evaluation passed
details            Additional evaluation details
error              Error message if the evaluation failed
time_to_answer     Average response time in seconds across repeats
tools_called       List of MCP tools invoked during the response
repeat_count       Number of times the question was run
repeat_pass_count  Number of runs that passed (out of repeat_count)

The tools_called array contains objects with these fields (see the tally sketch after the list):

  • tool_name: Name of the MCP tool called
  • server_name: Name of the MCP server that provided the tool
  • input: Parameters passed to the tool
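
One way to see which tools the server actually exercised is to tally tools_called across results (reusing the results dictionary from the earlier example):

from collections import Counter

tool_counts = Counter(
    call["tool_name"]
    for r in results["results"]
    for call in r["tools_called"]
)

for tool_name, count in tool_counts.most_common():
    print(f"{tool_name}: {count} call(s)")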

Requirements

  • Python 3.10+
  • API key for your chosen provider (Anthropic or OpenAI)
