ci: add gpu benchmarks #724
# Conflicts: # .github/workflows/benchmark-pr.yml
This comment was marked as resolved.
AI Review

PR #724 · 3 changed files

Findings

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-003: GPU workflow comment claims 'default 10' but the actual default is 14

- Claim: Doc says "N = pair count, default 10", but the workflow's workflow_dispatch input
- Evidence: benchmark-gpu.yml line 9 (header):
- Suggested fix: Change line 9 to say "default 14" so the docs match the actual default and the

AI-005: Missing pipefail masks GPU benchmark failures

- Claim: The
- Evidence: benchmark-gpu.yml line 309:
- Suggested fix: Add

AI-007: Unpinned `pip install --upgrade vastai` makes the workflow break silently if Vast CLI ships a breaking change

- Claim: The "Install Vast CLI" step runs
- Evidence: benchmark-gpu.yml line 113:
- Suggested fix: Pin a specific version, e.g.

AI-008: Hardcoded rustup toolchain date (nightly-2026-02-01) in onstart check will become stale

- Claim: The provisioning check looks for a specific nightly toolchain date that will become outdated, causing false provisioning failures.
- Evidence: benchmark-gpu.yml line 262:
- Suggested fix: Make the expected toolchain a workflow

AI-011: Teardown does not retry on transient vastai destroy failure; leaks paid GPU instances

- Claim: If
- Evidence: The destroy step (lines 372–380) has no retry loop; it just
- Suggested fix: Wrap the destroy call in a small retry loop (e.g. 3 attempts with 10s sleep) and, on final failure, also attempt to record the leaked instance id to a workflow artifact so a follow-up cleanup job can sweep it.

AI-012: BENCH_FEATURES changes do not invalidate cached binaries

- Claim: bench_abba.sh caches binaries keyed only by SHA. When the same refs are benchmarked with a different
- Evidence: Lines 94-102 compare only
- Suggested fix: Store a feature fingerprint alongside the SHA file (e.g.,

AI-023: `listComments` uses default 30-item page, so the GPU comment can be missed and duplicated on a chatty PR

- Claim: The "Comment ABBA result on PR" step uses
- Evidence: benchmark-gpu.yml lines 352-368:
- Suggested fix: Pass

Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (10): rejected by the verifier

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.
Addressed AI review findings

All 7 confirmed findings are resolved.

The 10 candidates the verifier rejected were spot-checked and need no action (e.g. Lint:
Review follow-ups on the GPU benchmark workflow:

- Teardown: fall back to destroying by the unique RUN_LABEL when no instance id was recorded. The id file is written only after `create` succeeds and its JSON parses, so a box created in that window (concurrency cancel, or a parse failure) could otherwise leak and bill indefinitely.
- Cap pairs at 32 (was 40) and round odd requests up to even (the AB/BA design wants even N); raise the job timeout to 210 min so a worst-case 32-pair run (64 proves + slow provisioning + dual CUDA build) fits without timing out after the expensive build.
- Fix the CUDARC_PIN comment: the boxes are ~CUDA 12.8 (matching cuda-12080 and the cuda_max_good>=12.8 offer floor), not 13.0; tie it to the MIN_DRIVER guard as the opposite end of the same compatibility window.
- Log only the needed fields of create.json instead of the full --raw response, so an unexpected sensitive field can't land in the run log.
- Validate the workflow_dispatch branch name before it is interpolated into the remote `bash -lc` command.
- Move the run-summary write into an always() step so workflow_dispatch failures are visible in the Actions summary rather than only the raw step log.
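Two of these follow-ups are easy to misread in prose, so here is a hedged Python sketch of the logic they describe. The bounds (2 and 32) and the round-up-to-even rule come from the commit message; the function names and the branch-name allow-list are assumptions for illustration, since the workflow's actual validation rule is not shown here.

```python
import re

MIN_PAIRS = 2
MAX_PAIRS = 32  # cap from the commit message; must stay even for the rounding below


def normalize_pairs(requested: int) -> int:
    """Clamp the requested pair count to [2, 32], then round odd values up to even."""
    clamped = max(MIN_PAIRS, min(MAX_PAIRS, requested))
    return clamped + (clamped % 2)


# Hypothetical allow-list: letters, digits, '.', '_', '/', '-'. The first
# character must be alphanumeric so the name cannot look like an option.
_SAFE_BRANCH = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/-]*")


def is_safe_branch_name(name: str) -> bool:
    """Reject names that could alter a shell command when interpolated into it."""
    return bool(_SAFE_BRANCH.fullmatch(name)) and ".." not in name


if __name__ == "__main__":
    for n in (1, 7, 14, 31, 40):
        print(n, "->", normalize_pairs(n))
    for b in ("feature/gpu-bench", "main; rm -rf /", "$(whoami)"):
        print(repr(b), "->", is_safe_branch_name(b))
```

Because the upper bound is even, rounding an odd clamped value up can never exceed the cap. An allow-list is used for the branch name rather than escaping, on the assumption that rejecting unusual names is cheaper than quoting them correctly across two shell layers.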
GPU benchmark workflow (`/bench-gpu`)

This PR adds a per-PR GPU prover benchmark that rents a GPU on Vast.ai on demand, runs the prover on it, and posts the result back to the PR, then always tears the box down.
What it does

- `$1/hr` (with retry if the pool is momentarily empty).
- `/bench-abba`, but with the CUDA prover path enabled (`--features jemalloc-stats,prover/cuda`).
- `cli` at the PR head and at `main`, interleaves N prove pairs of the headline ethrex 20-transfer block, and reports a paired-t 95% CI + Wilcoxon verdict (PR vs main; `+` = PR faster).

How to use it
Comment on any PR:

- `/bench-gpu`: run with the default 14 pairs (sized to resolve a ~2% change given GPU noise).
- `/bench-gpu N`: run `N` pairs (clamped to `[2, 40]`; more pairs → finer resolution, longer/€€).

It can also be launched manually from the Actions tab via workflow_dispatch (with a `pairs` input); those runs post the result to the run summary instead of a PR comment.

Reading the result
- Trust the verdict when the paired-t CI and Wilcoxon agree.
- `INCONCLUSIVE` on a small-looking delta → re-run with more pairs.

Requirements (repo secrets)
- `VAST_API_KEY`: Vast.ai API key.
- `VAST_TEMPLATE_HASH`: the "NVIDIA CUDA Lambda VM" template hash.
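For readers unfamiliar with the statistics under "Reading the result", the sketch below shows one way a paired-t 95% CI and an "agree or `INCONCLUSIVE`" rule could be computed. It is not taken from the benchmark script: the function names, the "PR faster" / "PR slower" labels, and the choice of `main - PR` seconds as the delta are assumptions (only the `+` = PR faster sign convention and the `INCONCLUSIVE` label come from the description). The Wilcoxon p-value is passed in rather than computed; `scipy.stats.wilcoxon` is one way to obtain it.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_ci(
    pr_times: Sequence[float],
    main_times: Sequence[float],
    t_crit: float,
) -> Tuple[float, float]:
    """Confidence interval for the mean of (main - PR), so positive means PR is faster.

    `t_crit` is the two-sided Student-t critical value for n - 1 degrees of
    freedom, supplied by the caller (about 2.160 for 14 pairs at 95%).
    """
    if len(pr_times) != len(main_times) or len(pr_times) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [m - p for p, m in zip(pr_times, main_times)]
    mean = statistics.fmean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean
    return mean - t_crit * sem, mean + t_crit * sem


def verdict(ci: Tuple[float, float], wilcoxon_p: float, alpha: float = 0.05) -> str:
    """Report a direction only when the CI excludes zero and Wilcoxon is significant."""
    lo, hi = ci
    t_significant = lo > 0 or hi < 0
    w_significant = wilcoxon_p < alpha
    if t_significant and w_significant:
        return "PR faster" if lo > 0 else "PR slower"  # hypothetical labels
    return "INCONCLUSIVE"
```

Requiring both tests to agree trades some sensitivity for robustness: the paired-t interval assumes roughly normal differences, while the rank-based Wilcoxon test is less affected by a single outlier prove.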