feat(bbh): resilient sweep — stream results, abort-on-exhaustion, auto-resume by nickmeinhold · Pull Request #5 · enspyrco/echo

nickmeinhold · 2026-06-25T01:22:39Z

Why

The BBH sweep wrote results only once, at the end of the loop, from an in-memory list. Any process death mid-run (Max usage exhausted, kill, laptop sleep) lost the entire run. Worse, a usage-limit error was caught per-call and recorded as passed=False, so an exhausted run limped to completion writing hundreds of failure rows that summarize() averaged into pass rates as if they were wrong answers — silent corruption of the headline metric.

This matters now: we're scaling to n=100 (~720 Claude calls) on the Max 20× plan, where brushing the 5-hour usage window is plausible.

What

- Stream + flush each result to disk: partial progress survives any process death (OS page cache outlives the process). Output file format is byte-identical (meta line + one row per result); analyze_sweep.py / RUN_FOR_NICK.md unaffected.
- Abort loudly on genuine usage exhaustion (usage limit / credit / 429) with the exact resume command, instead of poisoning the dataset. Transient signals (timeout / 529) still record-and-continue, so one blip doesn't kill a run.
- --resume <jsonl>: skip already-completed (task_id, arm) pairs and append. Resume is clean because the task list is deterministic (load_bbh(subtasks, n_per_subtask, start)), so "what's done" is recoverable from the output file alone.
- run_bbh_resumable.sh: an unattended wrapper that catches the exit-2 abort, backs off BACKOFF_SECONDS, and re-invokes with --resume, up to MAX_RETRIES. A long sweep survives the usage window with no human in the loop.
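As a rough illustration of the stream-and-resume idea described above, here is a minimal sketch. It is not the code in run_bbh_pilot.py: the function names (`load_done_pairs`, `append_result`, `run_sweep`), the `run_one` callback, and the shape of the meta line are all assumptions made for the example. Only the `task_id`, `arm`, and `passed` field names come from the description.

```python
import json
from pathlib import Path


def load_done_pairs(path):
    """Return the set of (task_id, arm) pairs already recorded in a sweep file."""
    done = set()
    p = Path(path)
    if not p.exists():
        return done
    with p.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                # A partially written last line is ignored in this sketch;
                # real code may also need to repair it before appending.
                continue
            # The meta line (assumed to carry no task_id/arm) is skipped here.
            if "task_id" in row and "arm" in row:
                done.add((row["task_id"], row["arm"]))
    return done


def append_result(f, row):
    """Write one JSON row and flush so it reaches the OS before the next call."""
    f.write(json.dumps(row) + "\n")
    f.flush()


def run_sweep(path, tasks, arms, run_one, meta=None, resume=False):
    """Run every (task, arm) pair not yet on disk; return how many rows were written."""
    done = load_done_pairs(path) if resume else set()
    written = 0
    with open(path, "a" if resume else "w") as f:
        if not resume:
            append_result(f, {"meta": meta or {}})  # hypothetical meta shape
        for task_id in tasks:
            for arm in arms:
                if (task_id, arm) in done:
                    continue  # already completed in an earlier run
                append_result(f, run_one(task_id, arm))
                written += 1
    return written
```

Because the task list is deterministic, rerunning with `resume=True` and the same `tasks` and `arms` only fills in the missing pairs. Note that `flush()` hands data to the OS but does not `fsync`, so this protects against process death rather than power loss.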

Verification (by running, not just reasoning)

- Unit: usage-exhaustion classifier (separates exhaustion vs transient vs real wrong answer) + resume-parse round-trip.
- Live n=1 end-to-end: fresh write lands on disk during the run; resume skips the done pair and writes nothing new.
- Wrapper: syntax check, path-extraction for both success & abort lines, live n=1 run exits 0.
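For readers who want a concrete picture of the three-way split the classifier tests cover, a minimal sketch follows. The enum, the function name `classify_error`, and the plain substring matching are assumptions for illustration; the marker strings are taken from the description (usage limit / credit / 429 versus timeout / 529). Treating unrecognised error text as transient is also an assumption.

```python
from enum import Enum


class Outcome(Enum):
    EXHAUSTED = "exhausted"   # abort the sweep, keep the dataset clean
    TRANSIENT = "transient"   # record the row and continue
    ANSWERED = "answered"     # no error: grade the answer normally


# Marker strings from the PR description; substring matching is a crude heuristic.
EXHAUSTION_MARKERS = ("usage limit", "credit", "429")
TRANSIENT_MARKERS = ("timeout", "529")


def classify_error(message):
    """Classify an error message (or None when the call succeeded)."""
    if not message:
        return Outcome.ANSWERED
    text = message.lower()
    if any(marker in text for marker in EXHAUSTION_MARKERS):
        return Outcome.EXHAUSTED
    # Assumption: anything else that errored is treated as a one-off blip.
    return Outcome.TRANSIENT
```

A caller would exit with status 2 on `Outcome.EXHAUSTED` (the abort code named above) instead of writing a `passed=False` row, which keeps usage-limit failures out of the pass rates.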

🤖 Generated with Claude Code

The BBH sweep wrote results only once, at the end of the loop, from an in-memory list, so any process death mid-run (Max usage exhausted, kill, laptop sleep) lost the entire run. Worse, a usage-limit error was caught per-call and recorded as passed=False, so an exhausted run limped to completion writing hundreds of failure rows that summarize() averaged into pass rates as if they were wrong answers (silent corruption).

Three fixes:

- Stream each TaskResult to disk with a flush; partial progress now survives process death (OS page cache outlives the process).
- Detect genuine Max-usage exhaustion (usage limit / credit / 429) and abort loudly with the exact resume command, instead of poisoning the dataset. Transient signals (timeout / 529) still record-and-continue.
- Add --resume <jsonl>: skip already-completed (task_id, arm) pairs and append, so an interrupted sweep continues where it stopped.

Output file format is byte-identical (meta line + one row per result); existing analyze_sweep.py / RUN_FOR_NICK.md commands are unaffected.

Verified: unit tests for the classifier + resume parse, and a live n=1 end-to-end (fresh write lands on disk, resume skips the done pair).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

run_bbh_resumable.sh wraps run_bbh_pilot.py: on a usage-exhaustion abort (exit 2) it extracts the partial-results path, waits BACKOFF_SECONDS for the Max window to recover, and re-invokes with --resume, looping up to MAX_RETRIES until the sweep completes or a hard error occurs. This lets a long n=100+ run survive hitting the 5-hour usage window with no human in the loop.

Verified: syntax check, path-extraction unit test, live n=1 run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
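The wrapper itself is a shell script, but its control flow can be sketched in Python for clarity. Everything named here is hypothetical: `run_once` stands in for invoking run_bbh_pilot.py and returns `(exit_code, output)`, the regex assumes the results path is the last `.jsonl` token in the output, and the default retry and backoff values are placeholders rather than the script's real settings.

```python
import re
import time

EXIT_OK = 0
EXIT_EXHAUSTED = 2  # the usage-exhaustion abort code named in the commit message


def extract_results_path(output):
    """Return the last *.jsonl token in the run's output, or None if absent."""
    matches = re.findall(r"\S+\.jsonl", output)
    return matches[-1] if matches else None


def run_until_done(run_once, max_retries=5, backoff_seconds=3600, sleep=time.sleep):
    """Re-invoke run_once with a resume path until it succeeds or hard-fails."""
    resume_path = None
    for attempt in range(max_retries + 1):
        code, output = run_once(resume_path)
        if code != EXIT_EXHAUSTED:
            return code  # success (0) or a hard error: stop either way
        resume_path = extract_results_path(output) or resume_path
        if resume_path is None:
            return code  # nothing to resume from
        if attempt < max_retries:
            sleep(backoff_seconds)  # wait for the usage window to recover
    return EXIT_EXHAUSTED
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; the shell version would simply call `sleep "$BACKOFF_SECONDS"`.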

gen_progress.py renders the streaming sweep JSONL into a self-contained auto-refreshing HTML status page (overall + per-arm progress, pass rates, cost units, ETA). publish_progress.sh force-pushes it to the gh-pages branch every 5 min (under GitHub Pages' ~10 builds/hr soft limit) from an isolated worktree until the run PID exits. Live at enspyrco.github.io/echo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
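The per-arm numbers on such a status page can be derived directly from the streamed rows. The sketch below is an assumption about how that aggregation might look, not the contents of gen_progress.py: `per_arm_progress` and `expected_per_arm` are hypothetical names, and rows without an `arm` field are taken to be the meta line.

```python
import json
from collections import defaultdict


def per_arm_progress(lines, expected_per_arm):
    """Summarise completed rows per arm from an iterable of JSONL lines."""
    counts = defaultdict(lambda: {"done": 0, "passed": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if "arm" not in row:
            continue  # assumed to be the meta line
        counts[row["arm"]]["done"] += 1
        counts[row["arm"]]["passed"] += 1 if row.get("passed") else 0
    return {
        arm: {
            "done": c["done"],
            "pct_done": 100 * c["done"] / expected_per_arm,
            "pass_rate": c["passed"] / c["done"],
        }
        for arm, c in counts.items()
    }
```

Because the sweep flushes after every row, this can be run against the file while the sweep is still in progress.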

nickmeinhold and others added 3 commits June 25, 2026 11:13

nickmeinhold wants to merge 3 commits into main from feat/bbh-resilient-resume
