feat(bbh): resilient sweep (stream results, abort-on-exhaustion, auto-resume) #5
Open
nickmeinhold wants to merge 3 commits into
The BBH sweep wrote results only once, at the end of the loop, from an in-memory list, so any process death mid-run (Max usage exhausted, kill, laptop sleep) lost the entire run. Worse, a usage-limit error was caught per-call and recorded as passed=False, so an exhausted run limped to completion writing hundreds of failure rows that summarize() averaged into pass rates as if they were wrong answers (silent corruption).

Three fixes:

- Stream each TaskResult to disk with a flush; partial progress now survives process death (OS page cache outlives the process).
- Detect genuine Max-usage exhaustion (usage limit / credit / 429) and abort loudly with the exact resume command, instead of poisoning the dataset. Transient signals (timeout / 529) still record-and-continue.
- Add --resume <jsonl>: skip already-completed (task_id, arm) pairs and append, so an interrupted sweep continues where it stopped.

Output file format is byte-identical (meta line + one row per result); existing analyze_sweep.py / RUN_FOR_NICK.md commands are unaffected.

Verified: unit tests for the classifier + resume parse, and a live n=1 end-to-end (fresh write lands on disk, resume skips the done pair).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
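The streaming write and the resume parse described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in run_bbh_pilot.py: the TaskResult fields beyond task_id, arm, and passed, the helper names append_result and load_done_pairs, and the assumption that the meta line is a JSON object without a task_id key are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class TaskResult:
    # Hypothetical shape: only task_id, arm, and passed are named in the PR.
    task_id: str
    arm: str
    passed: bool


def append_result(fh, result: TaskResult) -> None:
    """Write one JSONL row and flush it to the OS before the next call starts."""
    fh.write(json.dumps(asdict(result)) + "\n")
    # flush() hands the row to the OS, so it survives the process dying.
    # Surviving power loss would additionally need os.fsync(fh.fileno()).
    fh.flush()


def load_done_pairs(path: Path) -> set:
    """Collect the (task_id, arm) pairs already present in a partial file."""
    done = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a torn final line left by an abrupt kill
            # Assumed: the meta line is an object with no task_id/arm keys.
            if isinstance(row, dict) and "task_id" in row and "arm" in row:
                done.add((row["task_id"], row["arm"]))
    return done
```

On resume, a loop over the deterministic task list would skip any (task_id, arm) found in the returned set and open the same file in append mode for the rest.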
run_bbh_resumable.sh wraps run_bbh_pilot.py: on a usage-exhaustion abort (exit 2) it extracts the partial-results path, waits BACKOFF_SECONDS for the Max window to recover, and re-invokes with --resume, looping up to MAX_RETRIES times until the sweep completes or a hard error occurs. This lets a long n=100+ run survive hitting the 5-hour usage window with no human in the loop.

Verified: syntax check, path-extraction unit test, live n=1 run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
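The wrapper itself is a shell script that is not shown here. Its control flow can be sketched in Python as below. The sketch rests on assumptions: run_once is a hypothetical callable standing in for one invocation of run_bbh_pilot.py, and "up to MAX_RETRIES" is read as one initial run plus at most that many re-invocations.

```python
import time
from typing import Callable, Optional, Tuple

EXIT_EXHAUSTED = 2  # exit code the PR describes for a usage-exhaustion abort


def run_until_done(
    run_once: Callable[[Optional[str]], Tuple[int, Optional[str]]],
    max_retries: int,
    backoff_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run with the partial-results path until the run stops exiting 2.

    run_once(resume_path) is hypothetical: it returns
    (exit_code, partial_results_path_or_None).
    """
    resume_path: Optional[str] = None
    for attempt in range(max_retries + 1):
        code, partial = run_once(resume_path)
        if code != EXIT_EXHAUSTED:
            return code  # finished, or a hard error: stop either way
        if partial is None or attempt == max_retries:
            break  # nothing to resume from, or retries used up
        resume_path = partial
        sleep(backoff_seconds)  # wait for the usage window to recover
    return EXIT_EXHAUSTED
```

Passing sleep as a parameter keeps the loop testable without waiting out a real backoff.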
gen_progress.py renders the streaming sweep JSONL into a self-contained auto-refreshing HTML status page (overall + per-arm progress, pass rates, cost units, ETA). publish_progress.sh force-pushes it to the gh-pages branch every 5 min (under GitHub Pages' ~10 builds/hr soft limit) from an isolated worktree until the run PID exits. Live at enspyrco.github.io/echo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
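The per-arm numbers on such a status page could be derived from the streamed rows along these lines. This is a sketch under assumptions, not gen_progress.py itself: the function name per_arm_progress and the total_per_arm parameter are hypothetical, rows are assumed to carry arm and passed keys, and cost units and ETA are left out.

```python
import json
from collections import defaultdict


def per_arm_progress(lines, total_per_arm):
    """Summarize completed rows per arm from an iterable of JSONL lines."""
    counts = defaultdict(lambda: {"done": 0, "passed": 0})
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # a partially written last line is expected mid-run
        if not isinstance(row, dict) or "arm" not in row:
            continue  # skips the meta line (assumed to have no arm key)
        counts[row["arm"]]["done"] += 1
        counts[row["arm"]]["passed"] += bool(row.get("passed"))
    return {
        arm: {
            "done": c["done"],
            "progress": c["done"] / total_per_arm,
            "pass_rate": c["passed"] / c["done"],
        }
        for arm, c in counts.items()
    }
```

Because the sweep appends and flushes one row at a time, this can be re-run against the live file at every refresh.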
Why
The BBH sweep wrote results only once, at the end of the loop, from an in-memory list. Any process death mid-run (Max usage exhausted, kill, laptop sleep) lost the entire run. Worse, a usage-limit error was caught per-call and recorded as passed=False, so an exhausted run limped to completion writing hundreds of failure rows that summarize() averaged into pass rates as if they were wrong answers: silent corruption of the headline metric.

This matters now: we're scaling to n=100 (~720 Claude calls) on the Max 20× plan, where brushing the 5-hour usage window is plausible.
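To make the corruption concrete, here is a small arithmetic sketch. The counts are invented purely for illustration and are not results from any run, and pass_rate is a hypothetical stand-in for whatever summarize() actually computes.

```python
def pass_rate(rows):
    """Fraction of rows with passed=True."""
    return sum(r["passed"] for r in rows) / len(rows)


# Illustrative numbers only: 80 calls that really produced an answer...
answered = [{"passed": True}] * 60 + [{"passed": False}] * 20
# ...and 120 calls that hit the usage limit but were logged as failures.
exhausted = [{"passed": False}] * 120

true_rate = pass_rate(answered)                  # 60 / 80
corrupted_rate = pass_rate(answered + exhausted)  # 60 / 200

print(f"answered only: {true_rate:.2f}, with exhaustion rows: {corrupted_rate:.2f}")
```

Nothing in the output file distinguishes the two kinds of failure row, which is why the skew would go unnoticed downstream.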
What
- Stream each TaskResult to disk with a flush, so partial progress survives process death. Output file format is byte-identical; analyze_sweep.py / RUN_FOR_NICK.md unaffected.
- Abort loudly on genuine Max-usage exhaustion (usage limit / credit / 429) with the exact resume command, instead of poisoning the dataset. Transient signals (timeout / 529) still record-and-continue: one blip doesn't kill a run.
- --resume <jsonl>: skip already-completed (task_id, arm) pairs and append. Resume is clean because the task list is deterministic (load_bbh(subtasks, n_per_subtask, start)), so "what's done" is recoverable from the output file alone.
- run_bbh_resumable.sh: unattended wrapper that catches the exit-2 abort, backs off BACKOFF_SECONDS, and re-invokes with --resume, up to MAX_RETRIES. A long sweep survives the usage window with no human in the loop.

Verification (by running, not just reasoning)
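The commit message lists unit tests for the classifier among the checks. A minimal sketch of what such a classifier and its assertions might look like follows, assuming it matches on the error text. The function name classify_error, the category labels, and the exact patterns are hypothetical; only the signal names (usage limit, credit, 429, timeout, 529) come from the PR.

```python
import re

# Signals the PR treats as genuine exhaustion versus transient blips.
EXHAUSTION_PATTERNS = [r"usage limit", r"credit", r"\b429\b"]
TRANSIENT_PATTERNS = [r"timeout", r"\b529\b"]


def classify_error(message: str) -> str:
    """Return 'exhausted', 'transient', or 'other' for an error message."""
    text = message.lower()
    if any(re.search(p, text) for p in EXHAUSTION_PATTERNS):
        return "exhausted"  # caller should abort and print the resume command
    if any(re.search(p, text) for p in TRANSIENT_PATTERNS):
        return "transient"  # caller records the failure and continues
    return "other"


# Example checks in the spirit of the classifier unit tests.
assert classify_error("Claude usage limit reached") == "exhausted"
assert classify_error("HTTP 429 Too Many Requests") == "exhausted"
assert classify_error("request timeout after 120s") == "transient"
assert classify_error("HTTP 529 overloaded") == "transient"
assert classify_error("KeyError: 'answer'") == "other"
```

Checking exhaustion first means a message carrying both kinds of signal aborts the run, which errs on the side of protecting the dataset.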
🤖 Generated with Claude Code