perf(gpu): keep composition-LDE on device so R4 DEEP avoids the re-H2D #738

Open
diegokingston wants to merge 3 commits into main from perf/gpu-fused-composition-lde
@diegokingston

Collaborator

Summary

After #699/#700 every VM AIR's composition poly decomposes into exactly 2 parts, so round_2 routes through decompose_and_extend_d2. The GPU path there did a device→host LDE only; R4 DEEP then re-uploaded the entire composition LDE (num_parts * 3 * lde_size * 8 bytes) per table.

This retains the de-interleaved composition-LDE device buffer as a GpuLdeExt3 handle and lets R4 DEEP read it on device, eliminating that re-H2D.
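For a sense of scale, the re-upload size quoted above can be computed directly. This is a minimal sketch: the function name is hypothetical, and reading the factor 3 as the ext3 extension degree and the factor 8 as bytes per base-field limb is an assumption drawn from the formula, not from the code.

```rust
/// Bytes that R4 DEEP re-uploaded per table before this change,
/// following the formula `num_parts * 3 * lde_size * 8`.
/// Assumption: 3 = ext3 limbs per element, 8 = bytes per limb.
pub fn composition_lde_bytes(num_parts: usize, lde_size: usize) -> usize {
    const EXT_DEGREE: usize = 3;
    const LIMB_BYTES: usize = 8;
    num_parts * EXT_DEGREE * lde_size * LIMB_BYTES
}
```

With `num_parts = 2` and an illustrative `lde_size` of `1 << 21` (a made-up value, not a measurement), this gives 100,663,296 bytes, i.e. 96 MiB per table.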

  • math-cuda: refactor coset_lde_batch_ext3_into into a shared _inner(.., keep_device_buf) and add coset_lde_batch_ext3_into_keep — runs the ext3 coset LDE and retains the device buffer (no Merkle tree).
  • stark: try_extend_two_halves_gpu_keep<F,E> -> Option<GpuLdeExt3> (self-guards via check_ext3_layout, builds the g^(-k)/N half-extension weights, reserves capacity to lde_size, calls the no-tree keep). round_2 stores the handle in gpu_composition_parts.
  • The composition Merkle tree is still built by the proven try_build_comp_poly_tree_gpu. The keep pipeline's on-device tree is not reused for the composition: it is built for the aux-trace leaf order, and a CUDA bisect showed it produces a commitment the verifier rejects (even though the kernel/args match build_comp_poly_tree_from_evals_ext3). The GPU composition evals and the 2-part DEEP-handle path were both confirmed correct in that bisect.
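The shape of the math-cuda refactor (one shared inner routine, two thin wrappers selected by a `keep_device_buf` flag) can be sketched as follows. Everything here is hypothetical: `DeviceBuf` is a host-memory stand-in for a device allocation, the doubling loop stands in for the real LDE kernel, and none of the signatures are the actual math-cuda API.

```rust
/// Hypothetical stand-in for a device allocation (host memory here).
#[derive(Debug, PartialEq)]
pub struct DeviceBuf(Vec<u64>);

impl DeviceBuf {
    pub fn as_slice(&self) -> &[u64] {
        &self.0
    }
}

/// Shared body: compute into a "device" buffer, copy the result back to
/// the host, and hand the buffer to the caller only when asked to keep it.
fn lde_inner(input: &[u64], host_out: &mut Vec<u64>, keep_device_buf: bool) -> Option<DeviceBuf> {
    // Placeholder for the upload + kernel launch; NOT a real LDE.
    let dev = DeviceBuf(input.iter().map(|x| x.wrapping_mul(2)).collect());
    // Device-to-host copy of the evaluations.
    host_out.clear();
    host_out.extend_from_slice(dev.as_slice());
    if keep_device_buf {
        Some(dev)
    } else {
        None // buffer is dropped (freed) here
    }
}

/// Original behaviour: host output only.
pub fn lde_into(input: &[u64], host_out: &mut Vec<u64>) {
    let _ = lde_inner(input, host_out, false);
}

/// Keep variant: same host output, plus the retained device buffer.
pub fn lde_into_keep(input: &[u64], host_out: &mut Vec<u64>) -> DeviceBuf {
    lde_inner(input, host_out, true).expect("keep_device_buf = true always returns the buffer")
}
```

The point of routing both entry points through one inner function is that the host-visible output cannot diverge between the keep and non-keep paths.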

Note: test now reaches verify()

The cuda integration test still asserted on the >2-part dispatch counters (gpu_parts_lde_calls/gpu_comp_poly_tree_calls). Those paths have not fired since #699/#700 made all AIRs number_of_parts == 2, so gpu_path_fires_end_to_end panicked on the stale asserts before reaching verify(), and GPU prove+verify was effectively unvalidated for this ELF. The asserts are replaced with gpu_extend_halves_calls() > 0, and the test now exercises the full prove→verify.
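A call counter of the kind the new assert relies on could look like the sketch below. Only the getter name `gpu_extend_halves_calls` comes from this PR; the static, the stub function, and the `gpu_available` parameter are hypothetical illustrations, not the stark crate's implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical process-wide counter bumped each time the GPU path runs.
static GPU_EXTEND_HALVES_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Number of times the GPU two-halves extension has fired so far.
pub fn gpu_extend_halves_calls() -> usize {
    GPU_EXTEND_HALVES_CALLS.load(Ordering::Relaxed)
}

/// Stub for the GPU entry point: returns `None` (CPU fallback) without
/// touching the counter when the GPU path is not taken.
pub fn try_extend_two_halves_stub(gpu_available: bool) -> Option<()> {
    if !gpu_available {
        return None;
    }
    GPU_EXTEND_HALVES_CALLS.fetch_add(1, Ordering::Relaxed);
    Some(())
}
```

Asserting `gpu_extend_halves_calls() > 0` after proving then distinguishes "the GPU path actually ran" from "the prover silently fell back to CPU".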

Test plan

  • Non-cuda: CPU composition path byte-identical; 130 stark lib tests pass.
  • CUDA server: cargo build -p stark --features cuda compiles; gpu_path_fires_end_to_end proves a 1M-row ELF with GPU on, the keep path fires (gpu_extend_halves_calls() > 0), and the proof verifies.
  • (optional, follow-up) bench vs main to quantify the saved re-H2D.

Follow-up

The composition Merkle tree's re-H2D (in try_build_comp_poly_tree_gpu) is unchanged from main; building it from the kept device buffer is blocked by the keep-tree leaf-order issue noted above.

diegokingston and others added 3 commits June 29, 2026 16:44
After #699/#700 every VM AIR's composition poly decomposes into exactly 2 parts,
so round_2 routes through `decompose_and_extend_d2`. The GPU path there did a
device-to-host LDE only; R4 DEEP then re-uploaded the whole composition LDE
(`num_parts * 3 * lde_size * 8` bytes) per table.

Retain the de-interleaved composition-LDE device buffer as a `GpuLdeExt3` handle
and let R4 DEEP read it on device, eliminating that re-H2D.

- crypto/math-cuda: refactor `coset_lde_batch_ext3_into` into a shared
  `_inner(.., keep_device_buf)` and add `coset_lde_batch_ext3_into_keep`, which
  runs the ext3 coset LDE and retains the device buffer (no Merkle tree).
- crypto/stark `try_extend_two_halves_gpu_keep<F,E>` returns `Option<GpuLdeExt3>`:
  self-guards via `check_ext3_layout`, builds the `g^(-k)/N` half-extension
  weights (identical to `CompositionLdeTwiddles::weights`), reserves capacity to
  `lde_size`, and calls the no-tree keep.
- round_2 stores the handle in `gpu_composition_parts` so DEEP reads the parts on
  device. The composition Merkle tree is still built by the proven
  `try_build_comp_poly_tree_gpu`.
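Reading the `g^(-k)/N` notation above literally, the weights are `w_k = g^(-k) * N^(-1)` for `k = 0..N`. The sketch below computes them over a toy prime field using plain `u64` arithmetic; the function names, the explicit modulus parameter, and the toy field are all assumptions for illustration and are not `CompositionLdeTwiddles::weights`.

```rust
/// Modular multiplication via a u128 intermediate to avoid overflow.
fn mod_mul(a: u64, b: u64, p: u64) -> u64 {
    ((a as u128 * b as u128) % p as u128) as u64
}

/// Square-and-multiply exponentiation modulo p.
fn mod_pow(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1 % p;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mod_mul(acc, base, p);
        }
        base = mod_mul(base, base, p);
        exp >>= 1;
    }
    acc
}

/// Inverse via Fermat's little theorem; requires p prime and a != 0 mod p.
fn mod_inv(a: u64, p: u64) -> u64 {
    mod_pow(a, p - 2, p)
}

/// Weights w_k = g^(-k) / N (mod p) for k in 0..n, built incrementally.
pub fn half_extension_weights(g: u64, n: usize, p: u64) -> Vec<u64> {
    let g_inv = mod_inv(g, p);
    let mut w = mod_inv(n as u64 % p, p); // w_0 = 1/N
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        out.push(w);
        w = mod_mul(w, g_inv, p); // w_{k+1} = w_k * g^(-1)
    }
    out
}
```

Each weight satisfies `w_k * g^k * N == 1 (mod p)`, which gives a cheap sanity check against an independently computed table.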

The keep pipeline's on-device Merkle tree is intentionally NOT reused for the
composition: it is built for the aux-trace leaf order, and bisecting an initial
verification failure on a real CUDA run showed it produces a commitment the
verifier rejects (even though the kernel/args match `build_comp_poly_tree_from_evals_ext3`).
The GPU composition LDE evals and the 2-part DEEP-handle path were both confirmed
correct in the same bisect.

cuda integration test: replace the dead asserts on the >2-part dispatch counters
(`gpu_parts_lde_calls`/`gpu_comp_poly_tree_calls`, whose paths have not fired since
#699/#700 made all AIRs `number_of_parts == 2`) with `gpu_extend_halves_calls() > 0`.
This is also the first time `gpu_path_fires_end_to_end` reaches `verify()`: the
stale R2 assert previously panicked before it, leaving GPU prove+verify
unvalidated for this ELF.

Validated on a CUDA server: builds with `--features cuda`, the keep path fires,
and the produced proof verifies. Non-cuda: CPU composition path byte-identical;
130 stark lib tests pass.