Context
Once user-configurations become first-class leaderboard entries (#257), the obvious failure mode is overfitting: contributors tune their config against the public Pew/GlobalOpinionQA dataset, post a high score, and the score doesn't generalize.
This is the same problem Kaggle solved with its private leaderboard split.
What to build
Split the eval dataset into two cuts:
- Public leaderboard set (~70-80% of items): visible, included in all synthbench run invocations, used for the headline score.
- Held-out validation set (~20-30% of items): hidden, only run server-side at scheduled intervals or on request, used to flag configs whose held-out score diverges materially from their public score.
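One way to make the split reproducible without storing a separate list is to hash each item ID with a server-side secret. The sketch below is illustrative only: `assign_cut`, the `salt` argument, and the 25% default are assumptions, not existing synthbench code, and it assumes every eval item has a stable string ID.

```python
import hashlib


def assign_cut(item_id: str, salt: str, held_out_fraction: float = 0.25) -> str:
    """Deterministically assign an eval item to 'public' or 'held_out'.

    Hypothetical helper: the salt would stay server-side so contributors
    cannot recompute which items are held out.
    """
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "public"


if __name__ == "__main__":
    for item in ["q-001", "q-002", "q-003"]:
        print(item, assign_cut(item, salt="example-secret"))
```

Because the assignment depends only on the item ID and the salt, the same item always lands in the same cut, and adding new items does not reshuffle existing ones.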
Surface the divergence on each config row as a "trust badge":
- ✓ green: within expected divergence (Δ < some statistical threshold)
- ⚠ yellow: borderline (Δ is within the threshold, but the held-out sample is too small to be conclusive)
- ✗ red: held-out score shows significant degradation from the public score; flag for review
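A rough sketch of how the three badge states could be assigned. The issue leaves the statistical threshold open, so everything here is an assumption: scores are treated as means with known standard errors, the threshold is a one-sided z-style bound on the public-minus-held-out gap, and `min_items` stands in for "small sample". The function name and defaults are hypothetical.

```python
import math


def trust_badge(public_score: float, held_out_score: float,
                public_se: float, held_out_se: float,
                n_held_out: int, z: float = 1.96, min_items: int = 30) -> str:
    """Return 'green', 'yellow', or 'red' for one config row.

    Assumed rule (not from the issue): red when the drop from public to
    held-out exceeds z combined standard errors; yellow when the drop is
    within that bound but the held-out sample is small; green otherwise.
    """
    delta = public_score - held_out_score  # positive means degradation
    threshold = z * math.sqrt(public_se ** 2 + held_out_se ** 2)
    if delta > threshold:
        return "red"
    if n_held_out < min_items:
        return "yellow"
    return "green"


if __name__ == "__main__":
    print(trust_badge(0.80, 0.78, 0.01, 0.02, n_held_out=200))
    print(trust_badge(0.80, 0.78, 0.01, 0.02, n_held_out=10))
    print(trust_badge(0.80, 0.60, 0.01, 0.02, n_held_out=200))
```

Only degradation triggers red in this sketch; a config that scores higher on held-out than on public stays green or yellow.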
Done when
Leading indicator
After 60 days of operation, ≤10% of high-scoring configs (top quartile public) show ✗-red on held-out. Configs that DO show red get a transparent post-mortem that becomes part of the methodology paper (Move 3a).
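The indicator can be computed from the leaderboard rows alone. A minimal sketch, assuming each row is a `(public_score, badge)` pair and that "top quartile" means the top `len // 4` rows by public score (at least one); both the function name and that quartile convention are assumptions.

```python
def red_rate_in_top_quartile(rows):
    """Fraction of top-quartile configs (by public score) with a red badge.

    rows: list of (public_score, badge) tuples, badge in
    {'green', 'yellow', 'red'}. Hypothetical helper for illustration.
    """
    if not rows:
        return 0.0
    ranked = sorted(rows, key=lambda row: row[0], reverse=True)
    top_n = max(1, len(ranked) // 4)
    top = ranked[:top_n]
    return sum(1 for _, badge in top if badge == "red") / top_n


if __name__ == "__main__":
    rows = [(0.90, "red"), (0.85, "green"), (0.70, "green"), (0.65, "yellow"),
            (0.60, "green"), (0.55, "red"), (0.50, "green"), (0.45, "green")]
    rate = red_rate_in_top_quartile(rows)
    print(f"red rate in top quartile: {rate:.0%} (target: <= 10%)")
```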
Tension to resolve
Reproducibility commitment vs held-out secrecy: how do contributors verify their submission was scored correctly if they can't see the held-out items? Resolution likely: contributors get score + cell counts, not the raw held-out items. Sketch this in the issue thread before implementing.
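To seed the thread discussion, here is one possible shape for the contributor-facing report: an aggregate score plus per-cell counts, with item IDs and item text dropped. The field names, and the reading of "cell" as a grouping label attached to each item, are assumptions made for illustration.

```python
from collections import Counter


def summarize_held_out(per_item_results):
    """Collapse per-item held-out results into a report safe to share.

    per_item_results: list of dicts with 'item_id', 'cell', and 'score'
    keys (hypothetical schema). The returned dict deliberately omits
    anything that identifies individual held-out items.
    """
    if not per_item_results:
        raise ValueError("no held-out results to summarize")
    n_items = len(per_item_results)
    mean_score = sum(r["score"] for r in per_item_results) / n_items
    cell_counts = Counter(r["cell"] for r in per_item_results)
    return {
        "held_out_score": round(mean_score, 4),
        "n_items": n_items,
        "cell_counts": dict(cell_counts),
    }


if __name__ == "__main__":
    results = [
        {"item_id": "q1", "cell": "US", "score": 0.8},
        {"item_id": "q2", "cell": "US", "score": 0.6},
        {"item_id": "q3", "cell": "DE", "score": 0.7},
    ]
    print(summarize_held_out(results))
```

Contributors can check that the counts match the documented size of the held-out cut, which gives partial verifiability without revealing the items themselves.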