perf(gpu): keep composition-LDE on device so R4 DEEP avoids the re-H2D by diegokingston · Pull Request #738 · yetanotherco/lambda_vm

diegokingston · 2026-06-29T19:47:29Z

Summary

After #699/#700 every VM AIR's composition poly decomposes into exactly 2 parts, so round_2 routes through `decompose_and_extend_d2`. The GPU path there did a device→host LDE only; R4 DEEP then re-uploaded the entire composition LDE (`num_parts * 3 * lde_size * 8` bytes) per table.

This retains the de-interleaved composition-LDE device buffer as a `GpuLdeExt3` handle and lets R4 DEEP read it on device, eliminating that re-H2D.
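To get a feel for the size of the transfer being removed, the byte formula above can be evaluated directly. This is a minimal sketch: the function name is hypothetical, and the reading of the constants (3 limbs per ext3 element, 8 bytes per base-field limb) is an assumption based on the formula itself.

```rust
/// Bytes of composition LDE that the old path re-uploaded per table,
/// following the PR's formula `num_parts * 3 * lde_size * 8`.
/// Assumption: 3 = limbs per degree-3 extension element, 8 = bytes per limb.
/// `composition_reupload_bytes` is a hypothetical helper, not a repo function.
pub const fn composition_reupload_bytes(num_parts: usize, lde_size: usize) -> usize {
    const EXT_DEGREE: usize = 3;
    const LIMB_BYTES: usize = 8;
    num_parts * EXT_DEGREE * lde_size * LIMB_BYTES
}
```

For example, with `num_parts = 2` and a hypothetical `lde_size` of `1 << 21`, this gives 100,663,296 bytes (96 MiB) per table. The actual `lde_size` depends on the trace length and blowup factor, which the PR does not state.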

- math-cuda: refactor `coset_lde_batch_ext3_into` into a shared `_inner(.., keep_device_buf)` and add `coset_lde_batch_ext3_into_keep`, which runs the ext3 coset LDE and retains the device buffer (no Merkle tree).
- stark: `try_extend_two_halves_gpu_keep<F,E>` -> `Option<GpuLdeExt3>` (self-guards via `check_ext3_layout`, builds the `g^(-k)/N` half-extension weights, reserves capacity to `lde_size`, calls the no-tree keep). round_2 stores the handle in `gpu_composition_parts`.

The composition Merkle tree is still built by the proven `try_build_comp_poly_tree_gpu`. The keep pipeline's on-device tree is not reused for the composition: it is built for the aux-trace leaf order, and a CUDA bisect showed it produces a commitment the verifier rejects (even though the kernel/args match `build_comp_poly_tree_from_evals_ext3`). The GPU composition evals and the 2-part DEEP-handle path were both confirmed correct in that bisect.
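The shape of the refactor (one shared inner routine, a keep flag, and an `Option`-returning wrapper that guards itself) can be sketched without any GPU code. Everything below is a stand-in: `DeviceBuf`, `lde_inner`, `lde_into`, `lde_into_keep`, and `try_lde_keep` are hypothetical names, the "kernel" just doubles values, and the power-of-two check only imitates the role of `check_ext3_layout`. It is not the actual math-cuda or stark code.

```rust
/// Stand-in for a device-resident buffer; a real handle would own GPU memory.
#[derive(Debug, PartialEq)]
pub struct DeviceBuf(pub Vec<u64>);

/// Hypothetical shared implementation: runs the work once, always copies the
/// result to the host, and returns the "device" buffer only when asked to.
fn lde_inner(input: &[u64], host_out: &mut Vec<u64>, keep_device_buf: bool) -> Option<DeviceBuf> {
    // Placeholder for the real LDE kernel: double every value.
    let device: Vec<u64> = input.iter().map(|x| x.wrapping_mul(2)).collect();
    // Simulated device-to-host copy.
    host_out.clear();
    host_out.extend_from_slice(&device);
    if keep_device_buf {
        Some(DeviceBuf(device))
    } else {
        None // buffer is dropped here, as in the original non-keep path
    }
}

/// Original entry point: host result only.
pub fn lde_into(input: &[u64], host_out: &mut Vec<u64>) {
    let _ = lde_inner(input, host_out, false);
}

/// Keep variant: same work, but the device buffer outlives the call.
pub fn lde_into_keep(input: &[u64], host_out: &mut Vec<u64>) -> DeviceBuf {
    lde_inner(input, host_out, true).expect("keep_device_buf = true always yields a buffer")
}

/// Self-guarding wrapper: returns `None` so the caller can fall back to the
/// CPU path. The power-of-two test is an illustrative layout check only.
pub fn try_lde_keep(input: &[u64], host_out: &mut Vec<u64>) -> Option<DeviceBuf> {
    if !input.len().is_power_of_two() {
        return None;
    }
    Some(lde_into_keep(input, host_out))
}
```

A caller in the position of round_2 would store the `Some(handle)` for a later stage to read, and treat `None` as "use the existing path".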

Note: test now reaches `verify()`

The cuda integration test still asserted on the >2-part dispatch counters (`gpu_parts_lde_calls`/`gpu_comp_poly_tree_calls`), which have not incremented since #699/#700 made all AIRs `number_of_parts == 2`. `gpu_path_fires_end_to_end` panicked on those stale asserts before reaching `verify()`, so GPU prove+verify was effectively unvalidated for this ELF. The asserts are replaced with `gpu_extend_halves_calls() > 0`, and the test now exercises the full prove→verify.
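The ordering problem described above (an assert on a dispatch counter that sits in front of `verify()`) can be illustrated with a mock. This is a sketch under assumptions: only the name `gpu_extend_halves_calls` comes from the PR, and its implementation as an atomic counter, along with `mock_prove`, `mock_verify`, and `check_gpu_path_end_to_end`, is hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical backing store for the dispatch counter.
static GPU_EXTEND_HALVES_CALLS: AtomicU64 = AtomicU64::new(0);

/// Number of times the 2-part GPU path has been taken (mock).
pub fn gpu_extend_halves_calls() -> u64 {
    GPU_EXTEND_HALVES_CALLS.load(Ordering::Relaxed)
}

/// Mock prover: records that the 2-part GPU path fired and returns a "proof".
pub fn mock_prove() -> Vec<u8> {
    GPU_EXTEND_HALVES_CALLS.fetch_add(1, Ordering::Relaxed);
    vec![0xAB]
}

/// Mock verifier: accepts exactly the proof produced by `mock_prove`.
pub fn mock_verify(proof: &[u8]) -> bool {
    proof == [0xABu8].as_slice()
}

/// Prove, assert on the counter that actually fires for 2-part AIRs,
/// then continue to verification instead of panicking before it.
pub fn check_gpu_path_end_to_end() -> bool {
    let before = gpu_extend_halves_calls();
    let proof = mock_prove();
    assert!(gpu_extend_halves_calls() > before, "2-part GPU path did not fire");
    mock_verify(&proof)
}
```

Comparing against a snapshot taken before proving, rather than against zero, keeps the check meaningful if other tests in the same process have already bumped the counter.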

Test plan

- Non-cuda: CPU composition path byte-identical; 130 stark lib tests pass.
- CUDA server: `cargo build -p stark --features cuda` compiles; `gpu_path_fires_end_to_end` proves a 1M-row ELF with GPU on, the keep path fires (`gpu_extend_halves_calls() > 0`), and the proof verifies.
- (optional, follow-up) bench vs main to quantify the saved re-H2D.

Follow-up

The composition Merkle tree's re-H2D (in try_build_comp_poly_tree_gpu) is unchanged from main; building it from the kept device buffer is blocked by the keep-tree leaf-order issue noted above.

After #699/#700 every VM AIR's composition poly decomposes into exactly 2 parts, so round_2 routes through `decompose_and_extend_d2`. The GPU path there did a device-to-host LDE only; R4 DEEP then re-uploaded the whole composition LDE (`num_parts * 3 * lde_size * 8` bytes) per table.

Retain the de-interleaved composition-LDE device buffer as a `GpuLdeExt3` handle and let R4 DEEP read it on device, eliminating that re-H2D.

- crypto/math-cuda: refactor `coset_lde_batch_ext3_into` into a shared `_inner(.., keep_device_buf)` and add `coset_lde_batch_ext3_into_keep`, which runs the ext3 coset LDE and retains the device buffer (no Merkle tree).
- crypto/stark: `try_extend_two_halves_gpu_keep<F,E>` returns `Option<GpuLdeExt3>`: self-guards via `check_ext3_layout`, builds the `g^(-k)/N` half-extension weights (identical to `CompositionLdeTwiddles::weights`), reserves capacity to `lde_size`, and calls the no-tree keep.
- round_2 stores the handle in `gpu_composition_parts` so DEEP reads the parts on device.

The composition Merkle tree is still built by the proven `try_build_comp_poly_tree_gpu`. The keep pipeline's on-device Merkle tree is intentionally NOT reused for the composition: it is built for the aux-trace leaf order, and bisecting an initial verification failure on a real CUDA run showed it produces a commitment the verifier rejects (even though the kernel/args match `build_comp_poly_tree_from_evals_ext3`). The GPU composition LDE evals and the 2-part DEEP-handle path were both confirmed correct in the same bisect.

cuda integration test: replace the dead >2-part dispatch asserts (`gpu_parts_lde_calls`/`gpu_comp_poly_tree_calls`, never fire since #699/#700 made all AIRs `number_of_parts == 2`) with `gpu_extend_halves_calls() > 0`. This is also the first time `gpu_path_fires_end_to_end` reaches `verify()`: the stale R2 assert previously panicked before it, leaving GPU prove+verify unvalidated for this ELF.

Validated on a CUDA server: builds with `--features cuda`, the keep path fires, and the produced proof verifies. Non-cuda: CPU composition path byte-identical; 130 stark lib tests pass.

diegokingston and others added 3 commits June 29, 2026 16:44

Merge branch 'main' into perf/gpu-fused-composition-lde

313214c

Merge branch 'main' into perf/gpu-fused-composition-lde

af9a3c8

diegokingston wants to merge 3 commits into main from perf/gpu-fused-composition-lde
