
[feature] Best-of-N / verifier-in-the-loop selection over the gate #200

Description

@yuyu04

What problem does this solve

cladding owns something most harnesses lack: a strong, deterministic, execution-based verifier (the 15-stage gate plus spec-conformance). Today that verifier is used only as a single-pass PASS/FAIL check on one attempt. The 2025–26 consensus is that the verifier, not the generator, is now the bottleneck and the moat, and that solution coverage scales with the number of candidate attempts as long as a good verifier is available to select among them. cladding has exactly that verifier and isn't exploiting it.
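As a rough illustration of the coverage claim, here is a minimal sketch under a simplifying assumption that is not from this issue: each attempt passes the gate independently with the same probability `p`. Under that assumption, the chance that at least one of `k` attempts is green is `1 - (1 - p)**k`. The function name `coverage` and the value `p = 0.3` are illustrative only, not measurements from cladding.

```python
def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts passes the gate.

    Assumes every attempt passes with the same probability p, independently.
    Real candidates are correlated, so treat this as an optimistic estimate.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative per-attempt pass rate; not a measured number.
    for k in (1, 2, 4, 8):
        print(k, round(coverage(0.3, k), 3))
```

The gain per extra candidate shrinks as `k` grows, so in practice K would be a cost/benefit knob rather than "as large as possible".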

Proposed shape

A best-of-N mode for the drive loop / clad run:

  • Generate K candidate implementations of a feature (K being the N in best-of-N), varying seed, temperature, or persona framing.
  • Run the gate on each candidate (already isolated per-feature by modules).
  • Select the gate-green candidate; when several pass, rank them by a spec-conformance rubric (oracle coverage, fewest warn-level findings).
  • Keep the winner, discard the rest; record the selection in the audit log.

This makes cladding's gate a selector, not just a judge, and turns its verification rigor into higher first-pass conformance.
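The selection steps above could be sketched roughly as follows. This is a minimal illustration, not cladding's API: `GateResult`, `Candidate`, `select_best`, the `run_gate` callable, and the audit-log record shape are all hypothetical names introduced here. The ranking order (oracle coverage first, then fewest warn-level findings) is also an assumption; the issue lists the two criteria without fixing their priority.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class GateResult:
    """Hypothetical summary of one gate run; not cladding's real result type."""
    passed: bool            # True when every gate stage is green
    oracle_coverage: float  # fraction of spec oracles exercised, 0.0 to 1.0
    warn_findings: int      # number of warn-level findings


@dataclass(frozen=True)
class Candidate:
    """One generated implementation plus the settings that produced it."""
    candidate_id: str
    seed: int
    temperature: float


def select_best(
    candidates: Sequence[Candidate],
    run_gate: Callable[[Candidate], GateResult],
    audit_log: list,
) -> Optional[Candidate]:
    """Run the gate on every candidate and return the best green one, or None."""
    results = [(c, run_gate(c)) for c in candidates]
    green = [(c, r) for c, r in results if r.passed]
    # Assumed rubric order: higher oracle coverage first, then fewer warnings.
    green.sort(key=lambda cr: (-cr[1].oracle_coverage, cr[1].warn_findings))
    winner = green[0][0] if green else None
    # Record the selection so the choice is auditable afterwards.
    audit_log.append({
        "candidates": [c.candidate_id for c, _ in results],
        "green": [c.candidate_id for c, _ in green],
        "selected": winner.candidate_id if winner else None,
    })
    return winner


# Usage with a stubbed gate: "a" fails, "b" and "c" pass with equal coverage,
# and "c" has fewer warn-level findings.
fake_results = {
    "a": GateResult(passed=False, oracle_coverage=0.9, warn_findings=0),
    "b": GateResult(passed=True, oracle_coverage=0.8, warn_findings=2),
    "c": GateResult(passed=True, oracle_coverage=0.8, warn_findings=1),
}
candidates = [
    Candidate(candidate_id=cid, seed=i, temperature=0.7)
    for i, cid in enumerate(fake_results)
]
log: list = []
winner = select_best(candidates, lambda c: fake_results[c.candidate_id], log)
print(winner.candidate_id if winner else "no green candidate")
```

Passing the gate in as a callable keeps the selector independent of how the gate is invoked, and a failing candidate can never win regardless of its rubric score, which matches the intent that the gate decides and the rubric only ranks among green candidates.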

Versioning scope (GOVERNANCE.md §2)

  • Minor — new drive-loop mode
  • Possibly Major if it touches the clad run / drive-loop contract. Deferring to maintainer scoping, which is why this is filed as an issue first.

In-scope check (GOVERNANCE.md §4.1 / §4.2)

  • Not regressing Iron Law conformance — it raises selected-candidate quality
  • Not bypassing the anti-self-cert guard — selection is by the gate + an independent rubric, not self-judgment by the author persona
  • Not forking the Ironclad spec
  • Not cosmetic-only — ships with a test harness exercising K-candidate selection on a fixture feature

Alternatives considered

  • Single-attempt + reflect loop (retry on failure) — complementary, not a substitute. Reflect fixes a candidate; best-of-N explores several and selects. They compose.
  • LLM-judge selection — rejected as primary: cladding's deterministic gate is a better, cheaper, non-self-certifying selector. An LLM rubric only tie-breaks among gate-green candidates.

Willing to implement?

  • Yes
  • Open to either — would want maintainer agreement on scope/contract before coding (flagship idea, larger surface).

This is the strategic headline of a competitive-gap analysis: cladding's verifier is its differentiator, and best-of-N is how that advantage compounds.
