Skip to content

[BUG] False positive in LLM-based grading example #497

@thutav

Description

@thutav

Affected Notebook/File

https://platform.claude.com/docs/en/test-and-evaluate/develop-tests#:~:text=Example%3A%20LLM%2Dbased%20grading

Bug Description

Description:

The grade_completion function in the documentation example has a string matching bug that causes incorrect answers to be graded as correct.

The current logic:

return "correct" if "correct" in grader_response.lower() else "incorrect"

The problem: when the grader outputs "incorrect", the substring check "correct" in "incorrect" evaluates to True, so every response is graded as "correct" — even wrong answers.

Steps to reproduce:

  1. Set a golden_answer to something intentionally wrong (e.g., "The capital of Japan is Madrid")
  2. Ask the model "What is the capital of France?"
  3. The model correctly answers "Paris"
  4. The grader correctly identifies the mismatch and outputs <result>incorrect</result>
  5. But the function returns "correct" because "correct" is a substring of "incorrect"

Result: 100% score even with wrong golden answers.

Suggested fix:

Parse the <result> tags that the prompt already asks the model to produce:

def grade_completion(output, golden_answer):
    grader_response = (
        client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[
                {"role": "user", "content": build_grader_prompt(output, golden_answer)}
            ],
        )
        .content[0]
        .text
    )
    
    if "<result>" in grader_response and "</result>" in grader_response:
        result = grader_response.split("<result>")[1].split("</result>")[0].strip().lower()
        return "correct" if result == "correct" else "incorrect"
    
    return "incorrect"

This properly extracts the content from the XML tags that the prompt already requests, avoiding the sub

Steps to Reproduce

import anthropic

client = anthropic.Anthropic()

def build_grader_prompt(answer, rubric):
return f"""Grade this answer based on the rubric:
{rubric}
{answer}
Think through your reasoning in tags, then output 'correct' or 'incorrect' in tags."""

def grade_completion(output, golden_answer):
grader_response = (
client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=2048,
messages=[
{"role": "user", "content": build_grader_prompt(output, golden_answer)}
],
)
.content[0]
.text
)

return "correct" if "correct" in grader_response.lower() else "incorrect"

Example usage

eval_data = [
{
"question": "Is 42 the answer to life, the universe, and everything?",
"golden_answer": "Yes, according to 'The Hitchhiker's Guide to the Galaxy'.",
},
{
"question": "What is the capital of France?",
"golden_answer": "The capital of japan is madrid.",
},
]

def get_completion(prompt: str):
message = client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
messages=[{"role": "user", "content": prompt}],
)
return message.content[0].text

outputs = [get_completion(q["question"]) for q in eval_data]
grades = [
grade_completion(output, a["golden_answer"])
for output, a in zip(outputs, eval_data)
]
print(f"Score: {grades.count('correct') / len(grades) * 100}%")

Error Message

Environment

No response

Would you be willing to submit a PR to fix this?

None

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions