[BUG] False positive in LLM-based grading example

### Affected Notebook/File

https://platform.claude.com/docs/en/test-and-evaluate/develop-tests#:~:text=Example%3A%20LLM%2Dbased%20grading

### Bug Description

**Description:**

The `grade_completion` function in the documentation example has a string matching bug that causes incorrect answers to be graded as correct.

The current logic:
```python
return "correct" if "correct" in grader_response.lower() else "incorrect"
```

The problem: when the grader outputs "incorrect", the substring check `"correct" in "incorrect"` evaluates to `True`, so every response is graded as "correct" — even wrong answers.

**Steps to reproduce:**

1. Set a golden_answer to something intentionally wrong (e.g., "The capital of Japan is Madrid")
2. Ask the model "What is the capital of France?"
3. The model correctly answers "Paris"
4. The grader correctly identifies the mismatch and outputs `<result>incorrect</result>`
5. But the function returns "correct" because "correct" is a substring of "incorrect"

**Result:** 100% score even with wrong golden answers.

**Suggested fix:**

Parse the `<result>` tags that the prompt already asks the model to produce:

```python
def grade_completion(output, golden_answer):
    grader_response = (
        client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=[
                {"role": "user", "content": build_grader_prompt(output, golden_answer)}
            ],
        )
        .content[0]
        .text
    )
    
    if "<result>" in grader_response and "</result>" in grader_response:
        result = grader_response.split("<result>")[1].split("</result>")[0].strip().lower()
        return "correct" if result == "correct" else "incorrect"
    
    return "incorrect"
```

This properly extracts the content from the XML tags that the prompt already requests, avoiding the sub

### Steps to Reproduce

import anthropic

client = anthropic.Anthropic()


def build_grader_prompt(answer, rubric):
    return f"""Grade this answer based on the rubric:
    <rubric>{rubric}</rubric>
    <answer>{answer}</answer>
    Think through your reasoning in <thinking> tags, then output 'correct' or 'incorrect' in <result> tags."""


def grade_completion(output, golden_answer):
    grader_response = (
        client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=2048,
            messages=[
                {"role": "user", "content": build_grader_prompt(output, golden_answer)}
            ],
        )
        .content[0]
        .text
    )

    return "correct" if "correct" in grader_response.lower() else "incorrect"


# Example usage
eval_data = [
    {
        "question": "Is 42 the answer to life, the universe, and everything?",
        "golden_answer": "Yes, according to 'The Hitchhiker's Guide to the Galaxy'.",
    },
    {
        "question": "What is the capital of France?",
        "golden_answer": "The capital of japan is madrid.",
    },
]


def get_completion(prompt: str):
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text


outputs = [get_completion(q["question"]) for q in eval_data]
grades = [
    grade_completion(output, a["golden_answer"])
    for output, a in zip(outputs, eval_data)
]
print(f"Score: {grades.count('correct') / len(grades) * 100}%")

### Error Message

```shell

```

### Environment

_No response_

### Would you be willing to submit a PR to fix this?

None

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] False positive in LLM-based grading example #497

Affected Notebook/File

Bug Description

Steps to Reproduce

Example usage

Error Message

Environment

Would you be willing to submit a PR to fix this?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG] False positive in LLM-based grading example #497

Description

Affected Notebook/File

Bug Description

Steps to Reproduce

Example usage

Error Message

Environment

Would you be willing to submit a PR to fix this?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions