feat(bbh): resilient sweep — stream results, abort-on-exhaustion, auto-resume#5

Open
nickmeinhold wants to merge 3 commits into main from feat/bbh-resilient-resume

@nickmeinhold

Why

The BBH sweep wrote results only once, at the end of the loop, from an in-memory list. Any process death mid-run (Max usage exhausted, kill, laptop sleep) lost the entire run. Worse, a usage-limit error was caught per-call and recorded as passed=False, so an exhausted run limped to completion writing hundreds of failure rows that summarize() averaged into pass rates as if they were wrong answers — silent corruption of the headline metric.
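
To make the corruption concrete, here is a minimal sketch with invented numbers (not from any real run). It assumes `summarize()` takes a plain mean of the `passed` flags; the `error` field and the `pass_rate` helper are hypothetical.

```python
# Illustrative numbers only; not taken from any real sweep.
def pass_rate(rows):
    """Mean of the passed flags, as a naive summary would compute it."""
    return sum(r["passed"] for r in rows) / len(rows)

# 40 calls that really ran: 30 correct, 10 wrong.
real = [{"passed": True}] * 30 + [{"passed": False}] * 10
# 60 calls that hit the usage limit but were recorded as passed=False.
# The "error" field is a hypothetical name used for illustration.
exhausted = [{"passed": False, "error": "usage limit"}] * 60

true_rate = pass_rate(real)                    # 30 / 40
corrupted_rate = pass_rate(real + exhausted)   # 30 / 100
print(f"true: {true_rate:.2f}, corrupted: {corrupted_rate:.2f}")
```

In this toy case the reported rate drops from 0.75 to 0.30 even though no additional answer was wrong.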

This matters now: we're scaling to n=100 (~720 Claude calls) on the Max 20× plan, where brushing the 5-hour usage window is plausible.

What

  1. Stream + flush each result to disk — partial progress survives any process death (OS page cache outlives the process). Output file format is byte-identical (meta line + one row per result); analyze_sweep.py / RUN_FOR_NICK.md unaffected.
  2. Abort loudly on genuine usage exhaustion (usage limit / credit / 429) with the exact resume command, instead of poisoning the dataset. Transient signals (timeout / 529) still record-and-continue — one blip doesn't kill a run.
  3. --resume <jsonl> — skip already-completed (task_id, arm) pairs and append. Resume is clean because the task list is deterministic (load_bbh(subtasks, n_per_subtask, start)), so "what's done" is recoverable from the output file alone.
  4. run_bbh_resumable.sh — unattended wrapper: catches the exit-2 abort, backs off BACKOFF_SECONDS, re-invokes with --resume, up to MAX_RETRIES. A long sweep survives the usage window with no human in the loop.
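
Items 1 and 3 can be sketched roughly as follows. This is not the code in `run_bbh_pilot.py`: `load_done`, `run_sweep`, `run_one`, the arm names, and the shape of the meta line are all assumptions made for illustration.

```python
import json
import os
import tempfile

def load_done(path):
    """Return the (task_id, arm) pairs already present in a sweep file."""
    done = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # Assumption: the meta line carries no task_id/arm, so it is skipped.
            if "task_id" in row and "arm" in row:
                done.add((row["task_id"], row["arm"]))
    return done

def run_sweep(path, tasks, arms, run_one, resume=False):
    """Write one JSON row per (task, arm), flushing after every row."""
    done = load_done(path) if resume else set()
    written = 0
    with open(path, "a" if resume else "w", encoding="utf-8") as out:
        if not resume:
            # Hypothetical meta line; the real one may hold other fields.
            out.write(json.dumps({"meta": {"n_tasks": len(tasks)}}) + "\n")
            out.flush()
        for task_id in tasks:
            for arm in arms:
                if (task_id, arm) in done:
                    continue  # already recorded by an earlier run
                row = {"task_id": task_id, "arm": arm,
                       "passed": run_one(task_id, arm)}
                out.write(json.dumps(row) + "\n")
                out.flush()  # hand the row to the OS before the next call
                written += 1
    return written

# Usage: a fresh run writes every pair; resuming the same file writes none.
path = os.path.join(tempfile.mkdtemp(), "sweep.jsonl")
arms = ["baseline", "treatment"]  # hypothetical arm names
first = run_sweep(path, ["t1", "t2"], arms, lambda t, a: True)
again = run_sweep(path, ["t1", "t2"], arms, lambda t, a: True, resume=True)
print(first, again)
```

Note that `flush()` only moves data from the process into the OS. That is enough to survive the process dying, but surviving a power loss or kernel crash would also need `os.fsync()`.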

Verification (by running, not just reasoning)

  • Unit: usage-exhaustion classifier (separates exhaustion vs transient vs real wrong answer) + resume-parse round-trip.
  • Live n=1 end-to-end: fresh write lands on disk during the run; resume skips the done pair and writes nothing new.
  • Wrapper: syntax check, path-extraction for both success & abort lines, live n=1 run exits 0.
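
For reference, a classifier of the kind the unit test exercises might look like the sketch below. The patterns are assumptions modelled on the signals named in this description (usage limit / credit / 429 versus timeout / 529); the real classifier's patterns and return values may differ.

```python
import re

# Assumed patterns; only the signal names come from the PR description.
EXHAUSTION = re.compile(r"usage limit|credit|\b429\b", re.IGNORECASE)
TRANSIENT = re.compile(r"timeout|timed out|\b529\b", re.IGNORECASE)

def classify(error_text):
    """Map a per-call error string to 'answered', 'exhausted', 'transient' or 'other'."""
    if not error_text:
        return "answered"    # no error: the model answered, grade it normally
    if EXHAUSTION.search(error_text):
        return "exhausted"   # abort the sweep and print the resume command
    if TRANSIENT.search(error_text):
        return "transient"   # record the row and continue
    return "other"           # unknown error: policy left to the caller

for msg in ["Claude usage limit reached", "HTTP 529 overloaded", None]:
    print(repr(msg), "->", classify(msg))
```

Checking exhaustion before transient means a message that mentions both is treated as exhaustion, which errs on the side of aborting rather than recording a false failure.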

🤖 Generated with Claude Code

nickmeinhold and others added 3 commits June 25, 2026 11:13
The BBH sweep wrote results only once, at the end of the loop, from an
in-memory list — so any process death mid-run (Max usage exhausted,
kill, laptop sleep) lost the entire run. Worse, a usage-limit error was
caught per-call and recorded as passed=False, so an exhausted run limped
to completion writing hundreds of failure rows that summarize() averaged
into pass rates as if they were wrong answers (silent corruption).

Three fixes:
- Stream each TaskResult to disk with a flush; partial progress now
  survives process death (OS page cache outlives the process).
- Detect genuine Max-usage exhaustion (usage limit / credit / 429) and
  abort loudly with the exact resume command, instead of poisoning the
  dataset. Transient signals (timeout / 529) still record-and-continue.
- Add --resume <jsonl>: skip already-completed (task_id, arm) pairs and
  append, so an interrupted sweep continues where it stopped.

Output file format is byte-identical (meta line + one row per result);
existing analyze_sweep.py / RUN_FOR_NICK.md commands are unaffected.
Verified: unit tests for the classifier + resume parse, and a live
n=1 end-to-end (fresh write lands on disk, resume skips the done pair).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

run_bbh_resumable.sh wraps run_bbh_pilot.py: on a usage-exhaustion abort
(exit 2) it extracts the partial-results path, waits BACKOFF_SECONDS for
the Max window to recover, and re-invokes with --resume — looping up to
MAX_RETRIES until the sweep completes or a hard error occurs. Lets a
long n=100+ run survive hitting the 5-hour usage window with no human
in the loop. Verified: syntax check, path-extraction unit, live n=1 run.
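
The wrapper's control flow, sketched in Python rather than shell so it can be run in isolation. The `run` callable, its `(exit_code, results_path)` return shape, and the reading of MAX_RETRIES as "retries after the first attempt" are assumptions; only exit code 2, the backoff, and the resume step come from the description above.

```python
import time

EXIT_EXHAUSTED = 2  # exit code used for a usage-exhaustion abort

def run_with_resume(run, max_retries=3, backoff_seconds=0.0, sleep=time.sleep):
    """Call run(resume_path) until it exits 0, hits a hard error, or retries run out.

    `run` is a hypothetical callable returning (exit_code, results_path).
    """
    resume_path = None
    for attempt in range(max_retries + 1):
        code, path = run(resume_path)
        if code == 0:
            return 0              # sweep completed
        if code != EXIT_EXHAUSTED:
            return code           # hard error: do not retry
        resume_path = path        # continue from the partial results
        if attempt < max_retries:
            sleep(backoff_seconds)  # wait for the usage window to recover
    return EXIT_EXHAUSTED

# Usage with a fake runner: two exhaustion aborts, then success.
calls = []
def fake_run(resume_path):
    calls.append(resume_path)
    return (0, "out.jsonl") if len(calls) == 3 else (2, "out.jsonl")

result = run_with_resume(fake_run, max_retries=5, sleep=lambda s: None)
print(result, calls)
```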

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gen_progress.py renders the streaming sweep JSONL into a self-contained
auto-refreshing HTML status page (overall + per-arm progress, pass rates,
cost units, ETA). publish_progress.sh force-pushes it to the gh-pages
branch every 5min (under GitHub Pages' ~10 builds/hr soft limit) from an
isolated worktree until the run PID exits. Live at enspyrco.github.io/echo.
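
A rough sketch of the kind of aggregation such a status page needs, assuming rows with `arm` and `passed` fields and a simple linear ETA. The function name, its arguments, and the ETA formula are illustrative and not taken from `gen_progress.py`.

```python
import json
from collections import defaultdict

def summarize_progress(lines, total_per_arm, elapsed_seconds):
    """Count done/passed rows per arm and extrapolate a linear ETA in seconds."""
    per_arm = defaultdict(lambda: {"done": 0, "passed": 0})
    for line in lines:
        row = json.loads(line)
        if "arm" not in row:
            continue  # assumed meta line, not a result row
        stats = per_arm[row["arm"]]
        stats["done"] += 1
        stats["passed"] += bool(row.get("passed"))
    done = sum(s["done"] for s in per_arm.values())
    total = total_per_arm * len(per_arm)
    # Linear extrapolation: remaining rows at the average pace so far.
    eta = elapsed_seconds * (total - done) / done if done else None
    return dict(per_arm), done, total, eta

# Usage with three hypothetical result rows after 30 seconds.
lines = [
    json.dumps({"meta": {}}),
    json.dumps({"task_id": "t1", "arm": "A", "passed": True}),
    json.dumps({"task_id": "t2", "arm": "A", "passed": False}),
    json.dumps({"task_id": "t1", "arm": "B", "passed": True}),
]
per_arm, done, total, eta = summarize_progress(lines, total_per_arm=4, elapsed_seconds=30)
print(per_arm, done, total, eta)
```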

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>