Evaluate MCP server accuracy against known questions and answers.
pip install mcp-data-checkOr install from source:
pip install -e .Anthropic (default)
from mcp_data_check import run_evaluation
results = run_evaluation(
questions_filepath="questions.csv",
api_key="sk-ant-...",
server_url="https://mcp.example.com/sse"
)
print(f"Pass rate: {results['summary']['pass_rate']:.1%}")
print(f"Passed: {results['summary']['passed']}/{results['summary']['total']}")OpenAI
from mcp_data_check import run_evaluation
results = run_evaluation(
questions_filepath="questions.csv",
api_key="sk-...",
server_url="https://mcp.example.com/sse",
provider="openai",
model="gpt-4o"
)Anthropic (default)
mcp-data-check https://mcp.example.com/sse -q questions.csv -k YOUR_API_KEYOpenAI
mcp-data-check https://mcp.example.com/sse -q questions.csv -p openai -m gpt-4o -k YOUR_API_KEYBaseline comparison (MCP vs no tools)
# Anthropic
mcp-data-check https://mcp.example.com/sse -q questions.csv --compare
# OpenAI
mcp-data-check https://mcp.example.com/sse -q questions.csv -p openai -m gpt-4o --compareEach question is run --repeats times per mode (default 5). Pass/fail is determined by majority vote. Results are saved to ./results/comparison_<timestamp>.json.
Options:
-q, --questions: Path to questions CSV file (required)-p, --provider: LLM provider to use:anthropic(default) oropenai-k, --api-key: API key for the chosen provider (defaults toANTHROPIC_API_KEYorOPENAI_API_KEYenv var)-o, --output: Output directory for results (default:./results)-m, --model: Model to use for evaluation (default:claude-sonnet-4-20250514; use e.g.gpt-4ofor OpenAI)-n, --server-name: Name for the MCP server (default:mcp-server)-r, --repeats: Number of times to run each question (default: 5; majority vote determines pass/fail)-v, --verbose: Print detailed progress--compare: Run each question with and without the MCP server and report the delta
Anthropic
from mcp_data_check import Evaluator
evaluator = Evaluator(server_url="https://mcp.example.com/sse", api_key="sk-ant-...")
questions = evaluator.load_questions("questions.csv")
comparison = evaluator.run_comparison(questions, repeats=5, verbose=True)
print(f"MCP pass rate: {comparison.mcp_pass_rate:.1%}")
print(f"Baseline pass rate: {comparison.baseline_pass_rate:.1%}")
print(f"Delta: {comparison.mcp_pass_rate - comparison.baseline_pass_rate:+.1%}")
evaluator.save_comparison(comparison, "./results")OpenAI
from mcp_data_check import Evaluator
evaluator = Evaluator(
server_url="https://mcp.example.com/sse",
api_key="sk-...",
provider="openai",
model="gpt-4o"
)
questions = evaluator.load_questions("questions.csv")
comparison = evaluator.run_comparison(questions, repeats=5, verbose=True)
print(f"MCP pass rate: {comparison.mcp_pass_rate:.1%}")
print(f"Baseline pass rate: {comparison.baseline_pass_rate:.1%}")The questions CSV file must have three columns:
| Column | Description |
|---|---|
question |
The question to ask the MCP server |
expected_answer |
The expected answer to compare against |
eval_type |
Evaluation method: numeric, string, or llm_judge |
Example:
question,expected_answer,eval_type
How many grants were awarded in 2023?,1234,numeric
What organization received the most funding?,NIH,string
Explain the grant distribution,Most grants went to research institutions...,llm_judge- numeric: Extracts numbers from responses and compares with 5% tolerance
- string: Checks if expected string appears in response (case-insensitive)
- llm_judge: Uses the selected model to semantically evaluate if the response is correct
The run_evaluation function returns a dictionary:
{
"summary": {
"total": 10,
"passed": 8,
"failed": 2,
"pass_rate": 0.8,
"by_eval_type": {
"numeric": {"total": 5, "passed": 4},
"string": {"total": 3, "passed": 3},
"llm_judge": {"total": 2, "passed": 1}
}
},
"results": [
{
"question": "...",
"expected_answer": "...",
"eval_type": "numeric",
"model_response": "...",
"passed": True,
"details": {...},
"error": None,
"time_to_answer": 2.35,
"tools_called": [
{
"tool_name": "get_grants",
"server_name": "mcp-server",
"input": {"year": 2023}
}
]
},
...
],
"metadata": {
"server_url": "https://mcp.example.com/sse",
"model": "claude-sonnet-4-20250514",
"provider": "anthropic",
"timestamp": "20250127_143022"
}
}Each result in the results array contains:
| Field | Description |
|---|---|
question |
The original question asked |
expected_answer |
The expected answer from the CSV |
eval_type |
Evaluation method used |
model_response |
The model's full response text |
passed |
Whether the evaluation passed |
details |
Additional evaluation details |
error |
Error message if the evaluation failed |
time_to_answer |
Average response time in seconds across repeats |
tools_called |
List of MCP tools invoked during the response |
repeat_count |
Number of times the question was run |
repeat_pass_count |
Number of runs that passed (out of repeat_count) |
The tools_called array contains objects with:
tool_name: Name of the MCP tool calledserver_name: Name of the MCP server that provided the toolinput: Parameters passed to the tool
- Python 3.10+
- API key for your chosen provider (Anthropic or OpenAI)