Skip to content

ci: add gpu benchmarks#724

Merged
MauroToscano merged 20 commits into
mainfrom
gpu_benchmarks
Jun 29, 2026
Merged

ci: add gpu benchmarks#724
MauroToscano merged 20 commits into
mainfrom
gpu_benchmarks

Conversation

@JuArce

@JuArce JuArce commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

GPU benchmark workflow (/bench-gpu)

This PR adds a per-PR GPU prover benchmark that rents a GPU on Vast.ai on demand, runs the prover on it, and posts the result back to the PR — then always tears the box down.

What it does

  • Rents an RTX 5090 on Vast.ai (hourly) — datacenter-grade host, ≥16 cores, ≥64 GB RAM, ≥64 GB disk, driver ≥ 580, ≤ $1/hr (with retry if the pool is momentarily empty).
  • Provisions via our Vast template (Rust + LLVM + sysroot + CUDA), checks out the PR head, and runs the drift-free A/B/B/A (ABBA) paired benchmark — the same method as the CPU /bench-abba, but with the CUDA prover path enabled (--features jemalloc-stats,prover/cuda).
  • Builds the cli at the PR head and at main, interleaves N prove pairs of the headline ethrex 20-transfer block, and reports a paired-t 95% CI + Wilcoxon verdict (PR vs main; + = PR faster).
  • Always destroys the instance at the end (success, failure, or cancel) so there's no lingering hourly cost.

How to use it

Comment on any PR:

/bench-gpu
  • /bench-gpu — run with the default 14 pairs (sized to resolve a ~2% change given GPU noise).
  • /bench-gpu N — run N pairs (clamped to [2, 40]; more pairs → finer resolution, longer/€€).
  • Restricted to repo members/owners/collaborators. The bot reacts 👀, posts a "started" notice (the run takes ~1 hr: dual CUDA build + the proves), then edits in the result.

It can also be launched manually from the Actions tab via workflow_dispatch (with a pairs input); those runs post the result to the run summary instead of a PR comment.

Reading the result

=== ABBA paired result  (improvement: + = PR faster) ===
  pairs: 14   mean A (PR): 95.49s   mean B (base): 96.38s
  [parametric] paired-t   mean +0.88%   95% CI: [-0.90%, +2.65%]
  [robust]     median +1.11%   Wilcoxon ... p=0.32
  VERDICT: INCONCLUSIVE / REAL IMPROVEMENT / REAL REGRESSION

Trust the verdict when the paired-t CI and Wilcoxon agree. INCONCLUSIVE on a small-looking delta → re-run with more pairs.

Requirements (repo secrets)

  • VAST_API_KEY — Vast.ai API key.
  • VAST_TEMPLATE_HASH — the "NVIDIA CUDA Lambda VM" template hash.

Note: /bench-gpu comments only take effect once this is merged to main (GitHub runs issue_comment workflows from the default branch).

@JuArce JuArce self-assigned this Jun 26, 2026
@JuArce JuArce marked this pull request as ready for review June 29, 2026 15:09
@JuArce JuArce requested a review from MauroToscano June 29, 2026 15:10
@JuArce JuArce added the ai-review Trigger the AI review label Jun 29, 2026
Comment thread .github/workflows/benchmark-gpu.yml
Comment thread .github/workflows/benchmark-gpu.yml Outdated
Comment thread .github/workflows/benchmark-gpu.yml Outdated
Comment thread .github/workflows/benchmark-gpu.yml Outdated
@github-actions

This comment was marked as resolved.

@claude

This comment was marked as resolved.

@github-actions

Copy link
Copy Markdown

AI Review

PR #724 · 3 changed files

Findings

Status Sev Location Finding Found by
confirmed high .github/workflows/benchmark-gpu.yml:9 GPU workflow comment claims 'default 10' but the actual default is 14 minimax
minimax/MiniMax-M3
glm
openrouter/z-ai/glm-5.2
kimi
openrouter/moonshotai/kimi-k2.7-code
moonmath
zro/minimax-m3
confirmed high .github/workflows/benchmark-gpu.yml:309 Missing pipefail masks GPU benchmark failures kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed medium .github/workflows/benchmark-gpu.yml:113 Unpinned pip install --upgrade vastai makes the workflow break silently if Vast CLI ships a breaking change moonmath
zro/minimax-m3
confirmed medium .github/workflows/benchmark-gpu.yml:142 Hardcoded rustup toolchain date (nightly-2026-02-01) in onstart check will become stale nemotron
openrouter/nvidia/nemotron-3-ultra-550b-a55b
moonmath
zro/minimax-m3
confirmed medium .github/workflows/benchmark-gpu.yml:372 Teardown does not retry on transient vastai destroy failure — leaks paid GPU instances minimax
minimax/MiniMax-M3
confirmed medium scripts/bench_abba.sh:94 BENCH_FEATURES changes do not invalidate cached binaries kimi
openrouter/moonshotai/kimi-k2.7-code
confirmed low .github/workflows/benchmark-gpu.yml:352 listComments uses default 30-item page, so the GPU comment can be missed and duplicated on a chatty PR moonmath
zro/minimax-m3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-003: GPU workflow comment claims 'default 10' but the actual default is 14
  • Status: confirmed
  • Severity: high
  • Location: .github/workflows/benchmark-gpu.yml:9
  • Found by: minimax:minimax/MiniMax-M3, glm:openrouter/z-ai/glm-5.2, kimi:openrouter/moonshotai/kimi-k2.7-code, moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

Doc says "N = pair count, default 10", but the workflow's workflow_dispatch input pairs has default: "14" and the issue_comment parser falls back to ${N:-14}.

Evidence

benchmark-gpu.yml line 9 (header): Triggered by a "/bench-gpu [N]" comment on a PR (N = pair count, default 10) or via workflow_dispatch. Line 22: default: "14". Line 73: PAIRS=${N:-14}. Line 77: PAIRS=${DISPATCH_PAIRS:-14}.

Suggested fix

Change line 9 to say "default 14" so the docs match the actual default and the 14 ~ resolves a 2% delta rationale on line 79.

AI-005: Missing pipefail masks GPU benchmark failures
  • Status: confirmed
  • Severity: high
  • Location: .github/workflows/benchmark-gpu.yml:309
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The Run GPU ABBA benchmark step pipes the SSH command through tee without set -o pipefail. GitHub Actions' default shell is bash -e, so the step succeeds if tee succeeds even when the remote SSH benchmark command fails. This can produce a misleading "success" PR comment with empty or partial results.

Evidence

benchmark-gpu.yml line 309: $SSH "bash -lc \"$REMOTE\"" | tee "$RUNNER_TEMP/abba_out.txt". No set -o pipefail is set in the step, unlike bench-abba.yml line 90 which explicitly uses it.

Suggested fix

Add set -euo pipefail (or at least set -o pipefail) at the top of the Run GPU ABBA benchmark step's run block.

AI-007: Unpinned `pip install --upgrade vastai` makes the workflow break silently if Vast CLI ships a breaking change
  • Status: confirmed
  • Severity: medium
  • Location: .github/workflows/benchmark-gpu.yml:113
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The "Install Vast CLI" step runs pip install --quiet --upgrade vastai with no version pin. Any future Vast CLI release that changes the vastai search offers --raw, vastai show instance --raw, vastai create instance --raw, or vastai attach ssh interface (all of which this workflow depends on) will silently break the workflow on the next run, and the failure will surface as a Vast API error rather than a build error, with a Vast box potentially already rented by then.

Evidence

benchmark-gpu.yml line 113: pip install --quiet --upgrade vastai. No version specifier. The workflow parses Vast JSON output (jq on .new_contract, .public_ipaddr, .ports["22/tcp"][0].HostPort, .driver_version, etc.) and calls multiple Vast subcommands — every one of which is a candidate for breakage across releases.

Suggested fix

Pin a specific version, e.g. pip install --quiet vastai==<tested_version>, and bump intentionally after re-testing the workflow.

AI-008: Hardcoded rustup toolchain date (nightly-2026-02-01) in onstart check will become stale
  • Status: confirmed
  • Severity: medium
  • Location: .github/workflows/benchmark-gpu.yml:142
  • Found by: nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The provisioning check looks for a specific nightly toolchain date that will become outdated, causing false provisioning failures.

Evidence

benchmark-gpu.yml line 262: "$HOME/.cargo/bin/rustup" toolchain list 2>/dev/null | grep -q nightly-2026-02-01. Hard-coded literal. The same template name appears in the GPU benchmark comment on line 54 ("~15-30 min") but not as a configurable knob.

Suggested fix

Make the expected toolchain a workflow env variable (e.g. EXPECTED_TOOLCHAIN: nightly-2026-02-01) so the check can be updated centrally. Even better, drop the toolchain-version check entirely and rely on the === done === marker in /var/log/onstart.log, which is the primary check anyway.

AI-011: Teardown does not retry on transient vastai destroy failure — leaks paid GPU instances
  • Status: confirmed
  • Severity: medium
  • Location: .github/workflows/benchmark-gpu.yml:372
  • Found by: minimax:minimax/MiniMax-M3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

If vastai destroy instance $IID --yes fails for any reason (transient network/auth), the workflow emits a ::warning:: and continues. The instance is paid hourly until a human notices.

Evidence

The destroy step (lines 372–380) has no retry loop; it just || swallows the error and prints a warning. Compare with the SSH-attach step (line 193) which loops 12 times.

Suggested fix

Wrap the destroy call in a small retry loop (e.g. 3 attempts with 10s sleep) and, on final failure, also attempt to record the leaked instance id to a workflow artifact so a follow-up cleanup job can sweep it.

AI-012: BENCH_FEATURES changes do not invalidate cached binaries
  • Status: confirmed
  • Severity: medium
  • Location: scripts/bench_abba.sh:94
  • Found by: kimi:openrouter/moonshotai/kimi-k2.7-code
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

bench_abba.sh caches binaries keyed only by SHA. When the same refs are benchmarked with a different BENCH_FEATURES value, the existing cached binaries are reused and the new features are ignored.

Evidence

Lines 94-102 compare only $WORK/cli_A.sha/$WORK/cli_B.sha against SHA_A/SHA_B. Line 52 introduces BENCH_FEATURES and line 112 uses it in cargo build, but no feature hash is stored or compared.

Suggested fix

Store a feature fingerprint alongside the SHA file (e.g., echo "$BENCH_FEATURES" > "$WORK/$2.features") and compare it before reusing cached binaries.

AI-023: `listComments` uses default 30-item page, so the GPU comment can be missed and duplicated on a chatty PR
  • Status: confirmed
  • Severity: low
  • Location: .github/workflows/benchmark-gpu.yml:352
  • Found by: moonmath:zro/minimax-m3
  • Verified by: deepseek-verifier:openrouter/deepseek/deepseek-v4-pro
  • Rejected by: -

Claim

The "Comment ABBA result on PR" step uses github.rest.issues.listComments(...) without paginate: true, so it only sees the first 30 comments. On a PR with >30 comments where an earlier /bench-gpu run posted the marker comment past page 1, a subsequent run will not find the existing comment and will post a duplicate — leaving stale result comments behind instead of updating the in-place one.

Evidence

benchmark-gpu.yml lines 352-368: const { data: comments } = await github.rest.issues.listComments({...}) with no pagination options; const existing = comments.find(c => c.user.type === 'Bot' && c.body.includes(marker)). Default GitHub REST page size is 30.

Suggested fix

Pass paginate: true to listComments so the call fetches every page before searching for the existing marker comment.

Reviewer Lanes

Lane Model Prompt Status Findings
glm openrouter/z-ai/glm-5.2 general success 3
kimi openrouter/moonshotai/kimi-k2.7-code general success 4
minimax minimax/MiniMax-M3 general success 6
moonmath zro/minimax-m3 general success 7
nemotron openrouter/nvidia/nemotron-3-ultra-550b-a55b general success 7

Verification Lanes

Lane Model Status Confirmed Rejected Uncertain
deepseek-verifier openrouter/deepseek/deepseek-v4-pro success 7 10 0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Discarded candidates (10) — rejected by the verifier
  • BENCH_FEATURES not passed to remote Vast instance - GPU benchmark runs with CPU features only (.github/workflows/benchmark-gpu.yml:170, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — BENCH_FEATURES is set at the job level (line 42). The step-level env at lines 272-279 does NOT override BENCH_FEATURES. In GitHub Actions, job-level env vars not overridden by a step-level env block remain available in the step's run block. The $BENCH_FEATURES on line 306 is expanded client-side by the runner's shell before being sent via SSH — it's inside double quotes in the REMOTE variable assignment, so it expands to the literal string 'jemalloc-stats,prover/cuda' in the remote command. The variable is correctly passed.
  • prover/cuda feature does not exist on the cli crate — GPU build will fail (.github/workflows/benchmark-gpu.yml:307, found by minimax:minimax/MiniMax-M3) — The prover dependency in bin/cli/Cargo.toml is named prover (line 9), not optional, and refers to package lambda-vm-prover. In --features prover/cuda on the cargo CLI, the dep_name/feature_name syntax enables the cuda feature of the dependency named prover. This works for non-optional dependencies on the command line. The prover crate has a cuda feature (prover/Cargo.toml:10). The --features prover/cuda passed via bench_abba.sh line 112 will resolve correctly — Cargo maps prover to the dependency and cuda to that dependency's feature.
  • Chicken-and-egg: GPU workflow requires bench_abba.sh change to be on main before it can be tested (.github/workflows/benchmark-gpu.yml:190, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, kimi:openrouter/moonshotai/kimi-k2.7-code, moonmath:zro/minimax-m3, glm:openrouter/z-ai/glm-5.2) — The workflow explicitly documents this limitation in its own comments (lines 297-299: 'NOTE: requires this PR's bench_abba.sh change... to be on main — i.e. it only takes effect after merge'). The BENCH_FEATURES env variable is still set on the remote command (line 306) — it just won't be consumed by the older bench_abba.sh on main. The benchmark will still run with the old hardcoded features (jemalloc-stats only), so it's not a blocker. This is a known, documented design tradeoff, not an undiscovered bug.
  • Driver version filter may be overly restrictive (requires >= 580, CUDA 12.8 needs >= 550) (.github/workflows/benchmark-gpu.yml:75, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The comment on lines 131-134 explicitly documents the rationale for requiring driver >= 580: 'Older drivers (e.g. 575) lack newer symbols like cuCtxGetDevice_v2 and the GPU path falls back to CPU.' This is an intentional engineering choice to avoid GPU availability issues at runtime, not a mistake about CUDA version requirements. The cuda_max_good>=12.8 filter works at a different abstraction level (Vast API capability) vs. the driver filter (runtime symbol availability).
  • Instance id persists only to the ephemeral runner tmp — a runner crash leaks the box (.github/workflows/benchmark-gpu.yml:180, found by minimax:minimax/MiniMax-M3, nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b, glm:openrouter/z-ai/glm-5.2, moonmath:zro/minimax-m3) — The workflow immediately persists the instance ID to $RUNNER_TEMP (line 180) and the destroy step uses 'if: always()' (line 372). This is the best practice available within GitHub-hosted runner constraints — there is no persistent storage accessible across runner instances. Runner crashes/terminations are an inherent platform limitation, not a code defect. The workflow already mitigates with concurrency control (lines 31-33) and immediate persistence.
  • workflow_dispatch triggered on main performs a no-op main-vs-main compare while still renting a GPU (.github/workflows/benchmark-gpu.yml:76, found by moonmath:zro/minimax-m3) — On workflow_dispatch triggered on main, BRANCH='main' (line 76), REF_A='origin/main' (line 289), and REF_B='origin/main' (line 307), resulting in a main-vs-main comparison. However, this requires a human to explicitly choose 'main' in the workflow_dispatch UI — the default ref for workflow_dispatch isn't 'main' (it's whatever branch is selected). The workflow is designed for PR branches; dispatching on main is a user error, not a code bug. The code cannot distinguish 'user intentionally tested main' from 'user made a mistake'.
  • SSH key attachment retry lacks exponential backoff (.github/workflows/benchmark-gpu.yml:113, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — The SSH attach step (lines 193-199) retries 12 times with fixed 10s intervals. Fixed-interval retries are a standard and widely-used pattern in CI workflows — they are simple, predictable, and work well for transient failures. Exponential backoff is a reasonable improvement suggestion but calling this a 'bug' or an issue is speculative. The fixed delay was chosen to handle the specific timing of Vast's instance initialization (keys take a few seconds to propagate).
  • Command injection risk: PR_NUM used directly in git fetch refspec (.github/workflows/benchmark-gpu.yml:183, found by nemotron:openrouter/nvidia/nemotron-3-ultra-550b-a55b) — PR_NUM is set from github.event.issue.number (line 63), which is guaranteed by the GitHub API to be an integer. In the git fetch refspec on line 285, this integer is interpolated into a refspec path — there is no shell injection vector from a numeric value. GitHub's event payload is trusted data in the GitHub Actions security model. Defense-in-depth validation is reasonable but the claim of 'command injection risk' is unfounded for this data source.
  • Comment-updater uses a generic 'GPU Benchmark (ABBA)' marker — collides across runs and across bots (.github/workflows/benchmark-gpu.yml:356, found by minimax:minimax/MiniMax-M3) — The marker 'GPU Benchmark (ABBA)' on line 356 is specific. The existing CPU ABBA workflow (bench-abba.yml line 109) uses a different comment header: '## ABBA tiebreaker —' which does not contain the GPU marker string. There is no current collision. The finding is speculative ('it is plausible a future cross-link could collide') and assumes a future change where another bot comment happens to mention the GPU benchmark marker — this is too hypothetical to qualify as a real issue.
  • BENCH_FEATURES default in bench_abba.sh is jemalloc-stats, but cli also requires the heap feature to print 'Peak heap' (scripts/bench_abba.sh:52, found by minimax:minimax/MiniMax-M3) — The finding itself states 'this is OK' and 'The ABBA stats step does not actually use the heap value.' It explicitly acknowledges there is no actual bug — it only warns about a potential 'maintenance hazard' if future code changes read heap from prove output. The bench_abba.sh script (line 134) only parses 'Proving time:', not 'Peak heap:'. The jemalloc-stats feature is included for the ABBA run but its output is not consumed. This is not an actual issue in the current code.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

@JuArce

JuArce commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Addressed AI review findings

All 7 confirmed findings are resolved.

# Sev Finding Fix
AI-003 high Header said "default 10" but default is 14 Header now says default 14
AI-005 high Missing pipefail could mask a failed GPU bench as success set -o pipefail before the tee in the bench step
AI-007 med Unpinned pip install vastai Pinned to immutable commit 28494d92… (git install) + moved VAST_API_KEY out of the install step so install-time code can't read it
AI-008 med Hardcoded nightly-2026-02-01 provisioning check goes stale Dropped the toolchain-date check; readiness now relies on the === done === onstart marker plus cargo/sysroot/repo artifacts
AI-011 med Teardown didn't retry transient vastai destroy failures Destroy now retries 3× (10s apart); on final failure it warns with the instance label for manual cleanup
AI-012 med BENCH_FEATURES change didn't invalidate the cached binaries bench_abba.sh cache marker now stores "<sha> <features>", so a different feature set forces a rebuild
AI-023 low listComments (30-item page) could duplicate the result comment on chatty PRs Both the "started" and result comment steps use github.paginate(..., per_page: 100)

The 10 candidates the verifier rejected were spot-checked and need no action (e.g. prover/cuda does resolve on the prover dep; BENCH_FEATURES is passed correctly via the job-level env expanded into the SSH command; the runner-crash/chicken-and-egg items are documented platform tradeoffs).

Lint: actionlint + shellcheck clean.

Review follow-ups on the GPU benchmark workflow:

- Teardown: fall back to destroying by the unique RUN_LABEL when no instance
  id was recorded. The id file is written only after `create` succeeds and its
  JSON parses, so a box created in that window (concurrency cancel, or a parse
  failure) could otherwise leak and bill indefinitely.
- Cap pairs at 32 (was 40) and round odd requests up to even (the AB/BA design
  wants even N); raise the job timeout to 210 min so a worst-case 32-pair run
  (64 proves + slow provisioning + dual CUDA build) fits without timing out
  after the expensive build.
- Fix the CUDARC_PIN comment: the boxes are ~CUDA 12.8 (matching cuda-12080 and
  the cuda_max_good>=12.8 offer floor), not 13.0; tie it to the MIN_DRIVER guard
  as the opposite end of the same compatibility window.
- Log only the needed fields of create.json instead of the full --raw response,
  so an unexpected sensitive field can't land in the run log.
- Validate the workflow_dispatch branch name before it is interpolated into the
  remote `bash -lc` command.
- Move the run-summary write into an always() step so workflow_dispatch failures
  are visible in the Actions summary rather than only the raw step log.
@MauroToscano MauroToscano enabled auto-merge June 29, 2026 19:24
@MauroToscano MauroToscano added this pull request to the merge queue Jun 29, 2026
Merged via the queue into main with commit ae858b8 Jun 29, 2026
12 checks passed
@MauroToscano MauroToscano deleted the gpu_benchmarks branch June 29, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-review Trigger the AI review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants