perf(gpu): keep composition-LDE on device so R4 DEEP avoids the re-H2D #738
Open
diegokingston wants to merge 3 commits into
After #699/#700 every VM AIR's composition poly decomposes into exactly 2 parts, so round_2 routes through `decompose_and_extend_d2`. The GPU path there did a device-to-host LDE only; R4 DEEP then re-uploaded the whole composition LDE (`num_parts * 3 * lde_size * 8` bytes) per table.

Retain the de-interleaved composition-LDE device buffer as a `GpuLdeExt3` handle and let R4 DEEP read it on device, eliminating that re-H2D.

- crypto/math-cuda: refactor `coset_lde_batch_ext3_into` into a shared `_inner(.., keep_device_buf)` and add `coset_lde_batch_ext3_into_keep`, which runs the ext3 coset LDE and retains the device buffer (no Merkle tree).
- crypto/stark: `try_extend_two_halves_gpu_keep<F,E>` returns `Option<GpuLdeExt3>`: it self-guards via `check_ext3_layout`, builds the `g^(-k)/N` half-extension weights (identical to `CompositionLdeTwiddles::weights`), reserves capacity to `lde_size`, and calls the no-tree keep.
- round_2 stores the handle in `gpu_composition_parts` so DEEP reads the parts on device. The composition Merkle tree is still built by the proven `try_build_comp_poly_tree_gpu`.

The keep pipeline's on-device Merkle tree is intentionally NOT reused for the composition: it is built for the aux-trace leaf order, and bisecting an initial verification failure on a real CUDA run showed it produces a commitment the verifier rejects (even though the kernel/args match `build_comp_poly_tree_from_evals_ext3`). The GPU composition LDE evals and the 2-part DEEP-handle path were both confirmed correct in the same bisect.

cuda integration test: replace the dead >2-part dispatch asserts (`gpu_parts_lde_calls`/`gpu_comp_poly_tree_calls`, which never fire since #699/#700 made all AIRs `number_of_parts == 2`) with `gpu_extend_halves_calls() > 0`. This is also the first time `gpu_path_fires_end_to_end` reaches `verify()`: the stale R2 assert previously panicked before it, leaving GPU prove+verify unvalidated for this ELF.
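To make the `g^(-k)/N` weights in the crypto/stark bullet concrete, here is a minimal sketch that computes them in a 64-bit prime field. Everything in it is an assumption for illustration: the Goldilocks modulus, the value `g = 7`, the index range `k = 0..N`, and the helper names `mul`, `pow`, `inv` and `half_extension_weights` do not come from the PR, and the actual `CompositionLdeTwiddles::weights` may differ in field, ordering, and range.

```rust
// Assumed field for illustration only: Goldilocks, p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Modular multiplication via a 128-bit intermediate.
fn mul(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) % (P as u128)) as u64
}

/// Square-and-multiply exponentiation mod P.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Inverse by Fermat's little theorem; `a` must be nonzero mod P.
fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

/// Hypothetical helper: weights[k] = g^(-k) / n for k in 0..n.
fn half_extension_weights(g: u64, n: usize) -> Vec<u64> {
    let g_inv = inv(g);
    let mut cur = inv(n as u64); // g^0 / n
    let mut weights = Vec::with_capacity(n);
    for _ in 0..n {
        weights.push(cur);
        cur = mul(cur, g_inv); // step from g^(-k)/n to g^(-(k+1))/n
    }
    weights
}

fn main() {
    let (g, n) = (7u64, 8usize); // illustrative values, not from the PR
    let weights = half_extension_weights(g, n);
    for (k, w) in weights.iter().enumerate() {
        // Defining identity: w_k * g^k * n == 1 in the field.
        assert_eq!(mul(mul(*w, pow(g, k as u64)), n as u64), 1);
    }
    println!("checked {} weights", weights.len());
}
```

Folding the `1/N` normalisation into the per-index weight means a single multiplication per element covers both the inverse-transform scaling and the coset shift, which is presumably why the two appear together in the weight.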
Validated on a CUDA server: builds with `--features cuda`, the keep path fires, and the produced proof verifies. Non-cuda: CPU composition path byte-identical; 130 stark lib tests pass.
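As a rough sense of scale for the re-upload this change removes, the formula `num_parts * 3 * lde_size * 8` can be evaluated directly. The sketch below reads the factor 3 as the ext3 component count and the factor 8 as the byte width of one 64-bit limb; that reading, the function name `composition_lde_bytes`, and the `lde_size` of 2^21 are assumptions, not figures from the PR.

```rust
/// Bytes in the composition LDE for one table, following the formula
/// quoted in the description: num_parts * 3 * lde_size * 8.
fn composition_lde_bytes(num_parts: usize, lde_size: usize) -> usize {
    const EXT3_COMPONENTS: usize = 3; // assumed meaning of the factor 3
    const BYTES_PER_LIMB: usize = 8; // assumed meaning of the factor 8
    num_parts * EXT3_COMPONENTS * lde_size * BYTES_PER_LIMB
}

fn main() {
    // Hypothetical table: 2 composition parts, LDE domain of 2^21 points.
    let bytes = composition_lde_bytes(2, 1 << 21);
    println!("{} bytes = {} MiB", bytes, bytes / (1 << 20));
    // prints "100663296 bytes = 96 MiB"
}
```

Under those assumed sizes, each table would have cost a 96 MiB host-to-device copy in R4 DEEP, which the retained `GpuLdeExt3` handle avoids.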
Summary
After #699/#700 every VM AIR's composition poly decomposes into exactly 2 parts, so `round_2` routes through `decompose_and_extend_d2`. The GPU path there did a device→host LDE only; R4 DEEP then re-uploaded the entire composition LDE (`num_parts * 3 * lde_size * 8` bytes) per table.

This retains the de-interleaved composition-LDE device buffer as a `GpuLdeExt3` handle and lets R4 DEEP read it on device, eliminating that re-H2D.

- crypto/math-cuda: refactor `coset_lde_batch_ext3_into` into a shared `_inner(.., keep_device_buf)` and add `coset_lde_batch_ext3_into_keep`, which runs the ext3 coset LDE and retains the device buffer (no Merkle tree).
- crypto/stark: `try_extend_two_halves_gpu_keep<F,E> -> Option<GpuLdeExt3>` (self-guards via `check_ext3_layout`, builds the `g^(-k)/N` half-extension weights, reserves capacity to `lde_size`, calls the no-tree keep).
- `round_2` stores the handle in `gpu_composition_parts`.
- The composition Merkle tree is still built by `try_build_comp_poly_tree_gpu`. The keep pipeline's on-device tree is not reused for the composition: it is built for the aux-trace leaf order, and a CUDA bisect showed it produces a commitment the verifier rejects (even though the kernel/args match `build_comp_poly_tree_from_evals_ext3`). The GPU composition evals and the 2-part DEEP-handle path were both confirmed correct in that bisect.

Note: test now reaches `verify()`

The cuda integration test's stale >2-part dispatch asserts (`gpu_parts_lde_calls`/`gpu_comp_poly_tree_calls`) have not fired since #699/#700 made all AIRs `number_of_parts == 2`; `gpu_path_fires_end_to_end` panicked on them before reaching `verify()`, so GPU prove+verify was effectively unvalidated for this ELF. They are replaced with `gpu_extend_halves_calls() > 0`, and the test now exercises the full prove→verify.

Test plan

- `stark` lib tests pass.
- `cargo build -p stark --features cuda` compiles; `gpu_path_fires_end_to_end` proves a 1M-row ELF with GPU on, the keep path fires (`gpu_extend_halves_calls() > 0`), and the proof verifies.

Follow-up

The composition Merkle tree's re-H2D (in `try_build_comp_poly_tree_gpu`) is unchanged from main; building it from the kept device buffer is blocked by the keep-tree leaf-order issue noted above.
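The refactor shape described in the Summary (one shared `_inner(.., keep_device_buf)` body behind two thin entry points) can be sketched without any CUDA dependency. This is a mock under stated assumptions: `MockDeviceBuf`, `lde_inner`, `lde_into` and `lde_into_keep` are hypothetical names, the "kernel" is a placeholder doubling, and the real `coset_lde_batch_ext3_into` signatures are not shown in the PR text.

```rust
/// Stand-in for a device allocation (hypothetical; real code would own a CUDA buffer).
#[derive(Debug, Clone, PartialEq)]
struct MockDeviceBuf(Vec<u64>);

/// Shared body: "uploads" the input, runs a placeholder kernel, copies the
/// result into `host_out`, and hands the device buffer back only if asked.
fn lde_inner(input: &[u64], host_out: &mut Vec<u64>, keep_device_buf: bool) -> Option<MockDeviceBuf> {
    // Placeholder for H2D copy + LDE kernel: doubles every element.
    let device = MockDeviceBuf(input.iter().map(|x| x.wrapping_mul(2)).collect());

    // Placeholder for the D2H copy that both entry points need.
    host_out.clear();
    host_out.extend_from_slice(&device.0);

    if keep_device_buf {
        Some(device) // caller now owns the buffer; no re-upload needed later
    } else {
        None // buffer is dropped (freed) here
    }
}

/// Existing behaviour: host result only.
fn lde_into(input: &[u64], host_out: &mut Vec<u64>) {
    let _ = lde_inner(input, host_out, false);
}

/// New behaviour: host result plus the retained device buffer.
fn lde_into_keep(input: &[u64], host_out: &mut Vec<u64>) -> Option<MockDeviceBuf> {
    lde_inner(input, host_out, true)
}

fn main() {
    let input = vec![1u64, 2, 3];
    let (mut plain, mut kept_host) = (Vec::new(), Vec::new());

    lde_into(&input, &mut plain);
    let handle = lde_into_keep(&input, &mut kept_host);

    // Both paths must produce identical host output.
    assert_eq!(plain, kept_host);
    // The keep path additionally returns the device-side data.
    assert_eq!(handle, Some(MockDeviceBuf(vec![2, 4, 6])));
    println!("host outputs match; device handle retained");
}
```

Routing both entry points through one body keeps the non-keep path's host output unchanged by construction, which matches the claim that the existing path is unaffected.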