Held-out validation set + private leaderboard (Move 4 — integrity) #259

## Context

Once user configurations become first-class leaderboard entries (#257), the obvious failure mode is **overfitting**: contributors tune their config against the public Pew/GlobalOpinionQA dataset and post a high score that doesn't generalize.

This is the same problem Kaggle addresses with its public/private leaderboard split.

## What to build

Split the eval dataset into two cuts:

1. **Public leaderboard set** (~70-80% of items): visible, included in all `synthbench run` invocations, used for the headline score.
2. **Held-out validation set** (~20-30% of items): hidden, only run server-side at scheduled intervals or on request, used to flag configs whose held-out score diverges materially from their public score.
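One way the deterministic split could work is to hash each item ID together with a fixed seed, rather than shuffling a list. The sketch below is only an illustration: the seed string, the 25% held-out fraction, and the function names `assign_cut` and `split_items` are assumptions, not existing synthbench API.

```python
import hashlib

# Hypothetical seed and fraction; the real values would be decided in this issue.
SPLIT_SEED = "synthbench-split-v1"
HELD_OUT_FRACTION = 0.25


def assign_cut(item_id: str, seed: str = SPLIT_SEED,
               held_out_fraction: float = HELD_OUT_FRACTION) -> str:
    """Return 'held_out' or 'public' for one item, deterministically."""
    digest = hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "public"


def split_items(item_ids, seed: str = SPLIT_SEED,
                held_out_fraction: float = HELD_OUT_FRACTION):
    """Partition item IDs into (public, held_out) lists, preserving input order."""
    public, held_out = [], []
    for item_id in item_ids:
        cut = assign_cut(item_id, seed, held_out_fraction)
        (held_out if cut == "held_out" else public).append(item_id)
    return public, held_out
```

Because each item's assignment depends only on its own ID and the seed, adding or reordering items never moves an existing item between cuts. The resulting fraction is approximate, not exact.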

Surface the divergence on each config row as a "trust badge":
- ✓ green: the gap Δ between public and held-out score is below a statistical threshold (to be chosen)
- ⚠ yellow: borderline: Δ is within the threshold, but the held-out sample is too small to be confident
- ✗ red: held-out score is significantly worse than the public score; flag for review
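The badge logic could be sketched roughly as follows. Everything here is an assumption for illustration: that higher scores are better, that Δ is public minus held-out, and the placeholder values `max_delta=0.05` and `min_held_out_items=200`. The actual threshold should come from a proper statistical test, which this sketch does not implement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BadgeThresholds:
    # Placeholder values, not calibrated; assumptions for illustration only.
    max_delta: float = 0.05        # largest tolerated drop from public to held-out
    min_held_out_items: int = 200  # below this, the sample is treated as too small


def trust_badge(public_score: float, held_out_score: float, n_held_out: int,
                thresholds: BadgeThresholds = BadgeThresholds()) -> str:
    """Return 'green', 'yellow', or 'red' for one config row.

    Assumes higher scores are better, so a positive delta means the
    config did worse on the held-out cut than on the public cut.
    """
    delta = public_score - held_out_score
    if delta >= thresholds.max_delta:
        return "red"
    if n_held_out < thresholds.min_held_out_items:
        return "yellow"
    return "green"
```

A config that scores better on held-out than on public yields a negative delta and is never flagged red under this rule.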

## Done when

- [ ] Pew/GlobalOpinionQA datasets split into public + held-out cuts, deterministically (seeded)
- [ ] Held-out cut is not distributed with `pip install synthbench`
- [ ] Server-side runner periodically re-evaluates top-N configs against held-out + publishes the divergence badge
- [ ] Documentation explains the split to contributors (transparency = trust)

## Leading indicator

After 60 days of operation, ≤10% of high-scoring configs (top quartile public) show ✗-red on held-out. Configs that DO show red get a transparent post-mortem that becomes part of the methodology paper (Move 3a).
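The indicator above can be computed from the leaderboard rows directly. This is a minimal sketch under assumptions: each row is a hypothetical `(public_score, badge)` pair, and "top quartile" is taken as the top quarter of configs by count (at least one), with higher scores ranking first.

```python
def red_rate_in_top_quartile(configs):
    """Fraction of top-quartile configs (by public score) whose badge is 'red'.

    `configs` is a list of (public_score, badge) tuples; this shape is
    an assumption for illustration, not a synthbench data structure.
    """
    if not configs:
        return 0.0
    ranked = sorted(configs, key=lambda row: row[0], reverse=True)
    top_n = max(1, len(ranked) // 4)  # top quarter by count, at least one row
    top = ranked[:top_n]
    red = sum(1 for _, badge in top if badge == "red")
    return red / len(top)


def meets_leading_indicator(configs, max_red_rate: float = 0.10) -> bool:
    """True when at most 10% of top-quartile configs are red."""
    return red_rate_in_top_quartile(configs) <= max_red_rate
```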

## Tension to resolve

Reproducibility commitment vs held-out secrecy: how do contributors verify their submission was scored correctly if they can't see the held-out items? Resolution likely: contributors get score + cell counts, not the raw held-out items. Sketch this in the issue thread before implementing.
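As a starting point for that thread, the "score + cell counts" receipt might look something like the sketch below. It assumes a "cell" is a grouping label such as dataset/country, that the overall score is a plain mean of per-item scores, and that `held_out_receipt` is a hypothetical helper; none of this is settled.

```python
from collections import Counter
from statistics import mean


def held_out_receipt(item_results):
    """Summarize held-out results without exposing the items themselves.

    `item_results` is an iterable of (cell, score) pairs, where `cell`
    is an assumed grouping label such as "pew/US". No item IDs or item
    text appear in the returned receipt.
    """
    results = list(item_results)
    if not results:
        raise ValueError("no held-out results to summarize")
    cell_counts = Counter(cell for cell, _ in results)
    return {
        "held_out_score": mean(score for _, score in results),
        "n_items": len(results),
        "cell_counts": dict(cell_counts),
    }
```

A contributor can check that the cell counts match the documented held-out proportions, which gives some assurance about coverage without revealing which items were used.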