Verifier-as-selector: the gate runs single-pass; best-of-N would exploit cladding's strongest asset #209

@yuyu04

Problem

cladding owns something most harnesses lack: a strong, deterministic, execution-based verifier (the 15-stage gate + spec-conformance). But the autonomous drive loop (clad run) uses it only as a single-pass PASS/FAIL check on one attempt. The 2025–26 consensus is that the verifier, not the generator, is the bottleneck and the moat, and that solution coverage scales with the number of candidate attempts, provided a good verifier selects among them. cladding has exactly that verifier and isn't exploiting it.

Proposed shape

A best-of-N mode for clad run: generate N candidate implementations, gate each in isolation, rank the green candidates by a structural rubric and select the winner, keep the winner and discard the rest, and audit the selection. This makes the gate a selector, not just a judge.
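A minimal sketch of that loop, to make the shape concrete. Everything named here (`GateResult`, `select_best`, `best_of_n`, the toy `fake_generate` / `fake_gate`) is hypothetical and is not cladding's actual API; the rubric is assumed to be "fewest stub-fallbacks", with generation order as a deterministic tie-break.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class GateResult:
    """Hypothetical outcome of gating one candidate in isolation."""
    candidate_id: int
    green: bool           # True if the candidate passed every gate stage
    stub_fallbacks: int   # assumed structural-rubric signal; lower is better


def select_best(results: Sequence[GateResult]) -> Optional[GateResult]:
    """Return the best green candidate, or None if none is green."""
    green = [r for r in results if r.green]
    if not green:
        return None
    # Rank by rubric, then by generation order so the pick is deterministic.
    return min(green, key=lambda r: (r.stub_fallbacks, r.candidate_id))


def best_of_n(
    n: int,
    generate: Callable[[int], str],
    gate: Callable[[int, str], GateResult],
) -> Tuple[Optional[GateResult], dict]:
    """Generate n candidates, gate each, select a winner, record an audit."""
    results = [gate(i, generate(i)) for i in range(n)]
    winner = select_best(results)
    audit = {
        "considered": [r.candidate_id for r in results],
        "green": [r.candidate_id for r in results if r.green],
        "selected": winner.candidate_id if winner is not None else None,
    }
    return winner, audit


# Toy stand-ins: candidate 0 is red, 1 and 2 are green with differing quality.
OUTCOMES = [(False, 0), (True, 3), (True, 1)]


def fake_generate(i: int) -> str:
    return f"candidate-{i}"


def fake_gate(i: int, code: str) -> GateResult:
    green, stubs = OUTCOMES[i]
    return GateResult(candidate_id=i, green=green, stub_fallbacks=stubs)


single, _ = best_of_n(1, fake_generate, fake_gate)      # single-pass: no winner
winner, audit = best_of_n(3, fake_generate, fake_gate)  # best-of-3: candidate 2
print(single, winner, audit)
```

With these toy outcomes, single-pass sees only the red first candidate, while best-of-3 finds two green candidates and keeps the one with fewer stub-fallbacks. The `audit` dict stands in for whatever selection record the real mode would persist.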

Verified (independent A/B)

  • Mechanism (deterministic, real selectBest): where the first candidate is red but a later one is green, single-pass MISSES and best-of-N HITS; among several green candidates the selector keeps the higher-quality one (fewest stub-fallbacks).
  • Coverage lift (simulation, independent per-candidate pass ~ Bernoulli(p)): P(at least one green) tracks 1-(1-p)^N; e.g. at p=0.3, N=1 → 0.30 vs N=10 → 0.97.
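The coverage figures above can be reproduced with a few lines. This is a sketch under the stated assumption that candidates pass independently with the same probability p; the function names are illustrative and not taken from the A/B harness.

```python
import random


def p_green(p: float, n: int) -> float:
    """Closed form: probability that at least one of n attempts is green."""
    return 1.0 - (1.0 - p) ** n


def simulate_p_green(p: float, n: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, seeded for repeatability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One trial is a success if any of the n Bernoulli(p) attempts passes.
        if any(rng.random() < p for _ in range(n)):
            hits += 1
    return hits / trials


for n in (1, 3, 5, 10):
    print(f"N={n:2d}  analytic={p_green(0.3, n):.2f}  simulated={simulate_p_green(0.3, n):.2f}")
```

The analytic column gives 0.30 at N=1 and 0.97 at N=10 for p=0.3. Correlated failures (for example, every candidate misreading the same spec clause) would make the real curve flatter than this model suggests.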

Honest scope

  • clad run is the experimental autonomous surface (the supported path is host-delegated); best-of-N's reach is gated on autonomous-loop adoption.
  • N>1 trades N× generation + gate cost for higher P(green) — worth it when a human-required halt costs more than N× compute.
  • The real-generator pass-rate is not measured here (needs live, non-deterministic LLM runs); the A/B proves the selector mechanism + the coverage math it unlocks.
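One way to reason about the cost trade-off is a simple expected-cost model. This is an assumption layered on top of the issue, not something measured: each attempt costs `attempt_cost` (generation plus gate), all N attempts always run, attempts pass independently with probability p, and a run with no green candidate incurs `halt_cost` for the human-required halt. All names and numbers below are illustrative.

```python
def expected_cost(p: float, n: int, attempt_cost: float, halt_cost: float) -> float:
    """Expected cost of a best-of-n run under the assumed model."""
    compute = n * attempt_cost            # all n attempts are generated and gated
    p_no_green = (1.0 - p) ** n           # chance that every attempt is red
    return compute + p_no_green * halt_cost


def cheapest_n(p: float, attempt_cost: float, halt_cost: float, max_n: int) -> int:
    """Smallest n in 1..max_n that minimizes expected cost."""
    return min(
        range(1, max_n + 1),
        key=lambda n: expected_cost(p, n, attempt_cost, halt_cost),
    )


# Illustrative numbers only: a halt that costs 50x one attempt vs. 2x one attempt.
print(cheapest_n(0.3, 1.0, 50.0, 20))
print(cheapest_n(0.3, 1.0, 2.0, 20))
```

Under this model an expensive halt favors a larger N, while a cheap halt leaves single-pass as the cheapest option. Stopping at the first green candidate would lower the compute term but would also give up ranking among several green candidates.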

Implemented by F-ac92c812.
