
Held-out validation set + private leaderboard (Move 4 — integrity) #259

@openclaw-dv


Context

Once user-configurations become first-class leaderboard entries (#257), the obvious failure mode is overfitting: contributors tune their config against the public Pew/GlobalOpinionQA dataset, post a high score, and the score doesn't generalize.

This is the same problem Kaggle solved with its private leaderboard split.

What to build

Split the eval dataset into two cuts:

  1. Public leaderboard set (~70-80% of items): visible, included in all synthbench run invocations, used for the headline score.
  2. Held-out validation set (~20-30% of items): hidden, only run server-side at scheduled intervals or on request, used to flag configs whose held-out score diverges materially from their public score.
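A minimal sketch of how the deterministic split could work, assuming each item has a stable string ID. The names `assign_cut`, `split_items`, the seed string, and the 25% default are hypothetical and not part of synthbench today. Hashing the seed together with the item ID keeps an item in the same cut even when other items are added or removed, which a shuffle-then-slice split would not guarantee.

```python
import hashlib


def assign_cut(item_id: str, seed: str = "synthbench-split-v1",
               held_out_fraction: float = 0.25) -> str:
    """Map an item ID to 'public' or 'held_out' using a seeded hash.

    The seed string and the 25% default are placeholders (assumptions).
    """
    digest = hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "held_out" if bucket < held_out_fraction else "public"


def split_items(item_ids, seed: str = "synthbench-split-v1",
                held_out_fraction: float = 0.25) -> dict:
    """Partition item IDs into the two cuts, preserving input order."""
    cuts = {"public": [], "held_out": []}
    for item_id in item_ids:
        cuts[assign_cut(item_id, seed, held_out_fraction)].append(item_id)
    return cuts
```

One design caveat: if the seed is published alongside the item IDs, anyone can recompute which items are held out, so the seed would probably need to stay server-side.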

Surface the divergence on each config row as a "trust badge":

  • ✓ green: within expected divergence (Δ below a statistical threshold, still to be defined)
  • ⚠ yellow: borderline (Δ is within the threshold, but the held-out sample is too small to be confident)
  • ✗ red: held-out shows significant degradation from public — flag for review
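One possible way to assign the badge, sketched under assumptions: per-item scores are available for both cuts, higher is better, and "significant degradation" is approximated by a simple two-sample z-style check on the difference in means. The function name `trust_badge`, the `z_threshold=2.0` cutoff, and the `min_held_out=30` sample floor are all hypothetical placeholders; the actual statistical threshold is still to be decided.

```python
import math
import statistics


def trust_badge(public_scores, held_out_scores,
                z_threshold: float = 2.0, min_held_out: int = 30) -> str:
    """Return 'green', 'yellow', or 'red' for one config.

    Thresholds are illustrative assumptions, not agreed values.
    """
    # Too few items to estimate a variance at all: treat as borderline.
    if len(public_scores) < 2 or len(held_out_scores) < 2:
        return "yellow"

    # Positive delta means the held-out score is worse than the public one.
    delta = statistics.fmean(public_scores) - statistics.fmean(held_out_scores)
    std_err = math.sqrt(
        statistics.variance(public_scores) / len(public_scores)
        + statistics.variance(held_out_scores) / len(held_out_scores)
    )

    if delta > z_threshold * std_err:
        return "red"      # held-out degrades significantly from public
    if len(held_out_scores) < min_held_out:
        return "yellow"   # within threshold, but small sample
    return "green"
```

The check is one-sided on purpose: a config that does better on held-out than on public is not evidence of overfitting.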

Done when

  • Pew/GlobalOpinionQA datasets split into public + held-out cuts, deterministically (seeded)
  • Held-out cut is not distributed with pip install synthbench
  • Server-side runner periodically re-evaluates top-N configs against held-out + publishes the divergence badge
  • Documentation explains the split to contributors (transparency = trust)

Leading indicator

After 60 days of operation, ≤10% of high-scoring configs (top quartile public) show ✗-red on held-out. Configs that DO show red get a transparent post-mortem that becomes part of the methodology paper (Move 3a).
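The leading indicator could be computed roughly as below. This is a sketch only: `red_rate_top_quartile` is a hypothetical helper, "top quartile" is taken to mean the top `len(configs) // 4` configs by public score (at least one), and ties at the boundary are ignored.

```python
def red_rate_top_quartile(configs) -> float:
    """Fraction of top-quartile configs (by public score) with a red badge.

    `configs` is a list of (public_score, badge) pairs, where badge is
    one of 'green', 'yellow', 'red'. Both the shape and the quartile
    rule are assumptions for illustration.
    """
    if not configs:
        raise ValueError("no configs to evaluate")
    ranked = sorted(configs, key=lambda c: c[0], reverse=True)
    top = ranked[:max(1, len(ranked) // 4)]
    red = sum(1 for _, badge in top if badge == "red")
    return red / len(top)
```

The target in this issue would then read as `red_rate_top_quartile(configs) <= 0.10` after 60 days.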

Tension to resolve

Reproducibility commitment vs held-out secrecy: how do contributors verify their submission was scored correctly if they can't see the held-out items? Resolution likely: contributors get score + cell counts, not the raw held-out items. Sketch this in the issue thread before implementing.

Metadata

    Labels

    enhancement (New feature or request), leaderboard (Leaderboard UX + data model), pmf-research (Surfaced from synthpanel-driven PMF research 2026-05-14)
