Problem
cladding owns something most harnesses lack: a strong, deterministic, execution-based verifier (the 15-stage gate + spec-conformance). But the autonomous drive loop (clad run) uses it as a single-pass PASS/FAIL on one attempt. The 2025–26 consensus is that the verifier — not the generator — is the bottleneck and the moat, and that solution coverage scales with the number of candidate attempts given a good verifier to select among them. cladding has exactly that verifier and isn't exploiting it.
Proposed shape
A best-of-N mode for clad run: generate N candidate implementations, gate each in isolation, select the green winner (ranking the green candidates by a structural rubric), keep the winner, discard the rest, and audit the selection. This makes the gate a selector, not just a judge.
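The selection step can be sketched in a few lines. This is an illustration only, not the real selectBest: the Candidate type, its field names, and the select_best signature are hypothetical, and the rubric is assumed to be "fewest stub-fallbacks", the one quality signal this issue names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One gated attempt. All field names here are hypothetical."""
    name: str
    green: bool          # True if the candidate passed the gate
    stub_fallbacks: int  # assumed structural rubric: lower is better


def select_best(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the green candidate with the fewest stub-fallbacks, or None.

    Ties keep the earliest candidate, because min() returns the first minimum.
    """
    green = [c for c in candidates if c.green]
    if not green:
        return None  # nothing passed the gate, so there is no winner
    return min(green, key=lambda c: c.stub_fallbacks)


attempts = [
    Candidate("a", green=False, stub_fallbacks=0),
    Candidate("b", green=True, stub_fallbacks=3),
    Candidate("c", green=True, stub_fallbacks=1),
]

# Single-pass only ever looks at the first attempt.
single_pass = attempts[0] if attempts[0].green else None
winner = select_best(attempts)
```

With these inputs single-pass finds nothing because the first attempt is red, while the selector skips it and prefers "c" over "b" on the rubric. Filtering on green before ranking keeps the gate as the hard requirement and the rubric as a tie-breaker only.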
Verified (independent A/B)
- Mechanism (deterministic, real selectBest): where the first candidate is red but a later one is green, single-pass MISSES and best-of-N HITS; among several green candidates the selector keeps the higher-quality one (fewest stub-fallbacks).
- Coverage lift (simulation, per-candidate pass ~ Bernoulli(p)): P(green) tracks 1-(1-p)^N, e.g. at p=0.3, N=1 → 0.30 vs N=10 → 0.97.
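The coverage figures can be reproduced with a short sketch. It assumes each attempt passes independently with the same probability p, which is the Bernoulli model stated above; attempts from a real generator may be correlated, in which case the lift would be smaller. The function names, trial count, and seed are illustrative choices, not part of cladding.

```python
import random


def coverage(p: float, n: int) -> float:
    """Closed-form P(at least one of n independent attempts is green)."""
    return 1.0 - (1.0 - p) ** n


def simulate(p: float, n: int, trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability with Bernoulli(p) attempts."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < p for _ in range(n))  # green if any attempt passes
        for _ in range(trials)
    )
    return hits / trials


for n in (1, 10):
    print(n, round(coverage(0.3, n), 2), round(simulate(0.3, n), 2))
```

At p=0.3 the closed form gives 0.30 for N=1 and about 0.97 for N=10, and the simulated estimate should land close to both.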
Honest scope
- clad run is the experimental autonomous surface (the supported path is host-delegated); best-of-N's reach is gated on autonomous-loop adoption.
- N>1 trades N× generation + gate cost for higher P(green) — worth it when a human-required halt costs more than N× compute.
- The real-generator pass-rate is not measured here (needs live, non-deterministic LLM runs); the A/B proves the selector mechanism + the coverage math it unlocks.
Implemented by F-ac92c812.