Affected Notebook/File
https://platform.claude.com/docs/en/test-and-evaluate/develop-tests#:~:text=Example%3A%20LLM%2Dbased%20grading
Bug Description
Description:
The grade_completion function in the documentation example has a string-matching bug that causes incorrect answers to be graded as correct.
The current logic:
return "correct" if "correct" in grader_response.lower() else "incorrect"
The problem: when the grader outputs "incorrect", the substring check "correct" in "incorrect" evaluates to True, so every response is graded as "correct" — even wrong answers.
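The failure can be shown without calling any API; this is a minimal demonstration of the substring check on the two possible grader verdicts:

```python
# The grader's check uses substring containment, so "correct" is found
# inside "incorrect" and every verdict collapses to "correct".
grader_outputs = ["correct", "incorrect"]
grades = ["correct" if "correct" in out.lower() else "incorrect" for out in grader_outputs]
print(grades)  # both verdicts are graded "correct"
```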
Steps to reproduce:
- Set a golden_answer to something intentionally wrong (e.g., "The capital of Japan is Madrid")
- Ask the model "What is the capital of France?"
- The model correctly answers "Paris"
- The grader correctly identifies the mismatch and outputs
<result>incorrect</result>
- But the function returns "correct" because "correct" is a substring of "incorrect"
Result: 100% score even with wrong golden answers.
Suggested fix:
Parse the <result> tags that the prompt already asks the model to produce:
def grade_completion(output, golden_answer):
    grader_response = (
        client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[
                {"role": "user", "content": build_grader_prompt(output, golden_answer)}
            ],
        )
        .content[0]
        .text
    )
    if "<result>" in grader_response and "</result>" in grader_response:
        result = grader_response.split("<result>")[1].split("</result>")[0].strip().lower()
        return "correct" if result == "correct" else "incorrect"
    return "incorrect"
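As an alternative sketch (not part of the docs example), a regular expression can enforce an exact verdict inside the tags; `parse_result` is a hypothetical helper name:

```python
import re

def parse_result(grader_response: str) -> str:
    # Match only an exact 'correct' or 'incorrect' verdict inside <result> tags;
    # anything else (missing tags, extra words) is treated as "incorrect".
    match = re.search(r"<result>\s*(correct|incorrect)\s*</result>",
                      grader_response, re.IGNORECASE)
    return match.group(1).lower() if match else "incorrect"
```

With this, parse_result("&lt;result&gt;incorrect&lt;/result&gt;") yields "incorrect", and a response with no tags at all defaults to "incorrect" instead of silently passing.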
This properly extracts the content from the XML tags that the prompt already requests, avoiding the substring false positive.
Steps to Reproduce
import anthropic

client = anthropic.Anthropic()

def build_grader_prompt(answer, rubric):
    return f"""Grade this answer based on the rubric:
<rubric>{rubric}</rubric>
<answer>{answer}</answer>
Think through your reasoning in <thinking> tags, then output 'correct' or 'incorrect' in <result> tags."""

def grade_completion(output, golden_answer):
    grader_response = (
        client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=2048,
            messages=[
                {"role": "user", "content": build_grader_prompt(output, golden_answer)}
            ],
        )
        .content[0]
        .text
    )
    return "correct" if "correct" in grader_response.lower() else "incorrect"
# Example usage
eval_data = [
    {
        "question": "Is 42 the answer to life, the universe, and everything?",
        "golden_answer": "Yes, according to 'The Hitchhiker's Guide to the Galaxy'.",
    },
    {
        "question": "What is the capital of France?",
        "golden_answer": "The capital of japan is madrid.",
    },
]

def get_completion(prompt: str):
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

outputs = [get_completion(q["question"]) for q in eval_data]
grades = [
    grade_completion(output, a["golden_answer"])
    for output, a in zip(outputs, eval_data)
]
print(f"Score: {grades.count('correct') / len(grades) * 100}%")
Error Message
Environment
No response
Would you be willing to submit a PR to fix this?
None