
Add comment-efficiency metrics to perturbation benchmark #84

Open
jingxuangu wants to merge 1 commit into ChicagoHAI:main from jingxuangu:add-comment-efficiency-metrics

@jingxuangu

Summary

This PR adds budget-aware comment-efficiency metrics to the perturbation benchmark scoring pipeline.

The current benchmark primarily reports seeded-error recall. This is useful, but it does not distinguish concise reviewers from noisy reviewers that find the same number of injected errors by producing many more comments.

This PR preserves the existing detection semantics:

  • quote must match the perturbed text
  • explanation must match the perturbation's why_wrong

It records the first comment index that detects each perturbation and adds:

  • n_detected_at_1, n_detected_at_3, n_detected_at_5, n_detected_at_10
  • recall_at_1, recall_at_3, recall_at_5, recall_at_10
  • comments_per_detected_error
  • detected_per_comment

These are comment-efficiency metrics, not true precision metrics, because unmatched comments may still identify real non-injected issues.
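To make the definitions concrete, here is a minimal sketch of how the metrics listed above could be derived from the recorded first-detection indices. It is not the code in benchmarks/perturbation/score.py. The function name `comment_efficiency`, the `first_hit` mapping, the 1-based comment indexing, and the choice of denominators (all injected perturbations for recall, all emitted comments for the two ratios) are assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def comment_efficiency(
    first_hit: Dict[str, Optional[int]],
    n_comments: int,
    ks: Sequence[int] = (1, 3, 5, 10),
) -> Dict[str, Optional[float]]:
    """Hypothetical helper: budget-aware metrics from first-detection indices.

    first_hit maps each perturbation id to the 1-based index of the first
    comment that detected it, or None if no comment detected it (assumed
    representation). n_comments is the total number of reviewer comments.
    """
    n_total = len(first_hit)
    detected = [idx for idx in first_hit.values() if idx is not None]
    metrics: Dict[str, Optional[float]] = {}

    for k in ks:
        # A perturbation counts at budget k if its first detecting comment
        # is among the reviewer's first k comments.
        n_at_k = sum(1 for idx in detected if idx <= k)
        metrics[f"n_detected_at_{k}"] = n_at_k
        metrics[f"recall_at_{k}"] = n_at_k / n_total if n_total else None

    n_detected = len(detected)
    # Both ratios are undefined (None) when their denominator is zero.
    metrics["comments_per_detected_error"] = (
        n_comments / n_detected if n_detected else None
    )
    metrics["detected_per_comment"] = (
        n_detected / n_comments if n_comments else None
    )
    return metrics


# Example: 4 injected errors, 12 comments, 3 errors detected overall.
example = comment_efficiency({"p1": 1, "p2": 4, "p3": None, "p4": 12}, n_comments=12)
```

In this example only p1 is found within the first three comments, p1 and p2 within the first five and the first ten, and the reviewer spends 4 comments per detected error overall. A reviewer that found the same three errors in 6 comments would have the same overall recall but half the comments_per_detected_error, which is the distinction the metrics are meant to surface.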

Testing

  • python -m pytest tests/test_perturbation_score.py -q
  • python -m py_compile benchmarks/perturbation/score.py benchmarks/perturbation/models.py benchmarks/perturbation/generate_report.py src/reviewer/cli.py tests/test_perturbation_score.py

Both passed locally.

Notes

I attempted a full local score/report smoke run, but the checked-in perturbation configs appear to use an older pipeline schema, and the repo does not include prepared/reviewed artifacts for the current unified runner. The failure appears unrelated to the scoring metric changes in this PR.
