Skip to content

Improve CUDA synchronization and make memset asynchronous#40

Merged
souls-syntax merged 5 commits into
mainfrom
perf/Some_sync_issue
Apr 30, 2026
Merged

Improve CUDA synchronization and make memset asynchronous#40
souls-syntax merged 5 commits into
mainfrom
perf/Some_sync_issue

Conversation

@souls-syntax

Copy link
Copy Markdown
Member

Due to cudaDeviceSynchronize we were suffering from performance loss as it was blocking process hence removed it.

@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR removes cudaDeviceSynchronize() from all GPU math and backprop paths to avoid blocking the host thread, replacing it with a #ifdef SC_DEBUG guard that restores synchronization in debug builds while using cudaGetLastError() in release. It also converts cudaMemset to cudaMemsetAsync in the bridge and fixes a uint32_t overflow in the benchmark's FLOP counter.

Confidence Score: 4/5

Safe to merge with awareness that kernel execution errors are now silent in release builds — a trade-off the team has been made aware of through prior review comments.

Previously flagged P1 findings (silent in-kernel errors via cudaGetLastError, tensor_sgd_gpu return value semantics, and cudaMemsetAsync on failure) remain unresolved in the PR. The benchmark overflow fix and the #ifdef SC_DEBUG approach are both correct, and the cudaMemsetAsync path is safe given all ops share the NULL stream. Score is 4 rather than 5 because the unresolved P1s represent real, unaddressed reliability concerns raised in prior threads.

src/backend_gpu/backprop/backprop_gpu.cu and all src/backend_gpu/math/*.cu files — these are where in-kernel execution errors will go silently undetected in release builds.

Important Files Changed

Filename Overview
benchmarks/bench_softcuda.cpp Fixes total_flops integer overflow by casting to size_t before the multiply — the arithmetic is now correct for 4096³ inputs.
src/backend_cpu/backprop/backprop_cuda_bridge.cu Converts cudaMemset to cudaMemsetAsync with a null-pointer guard and error logging; error is logged but execution continues silently on failure, and the stream defaults to NULL (consistent with the rest of the codebase).
src/backend_gpu/backprop/backprop_gpu.cu Wraps cudaDeviceSynchronize in #ifdef SC_DEBUG, replacing it with cudaGetLastError() in release builds; indentation of new err declaration is incorrect (column 0).
src/backend_gpu/math/add.cu Replaces cudaDeviceSynchronize() with #ifdef SC_DEBUG guard; release builds only check kernel launch errors, not in-kernel execution errors.
src/backend_gpu/math/matmul.cu Same cudaDeviceSynchronizecudaGetLastError substitution applied after cuBLAS cublasSgemm; cuBLAS execution errors will go undetected in release builds.
src/backend_gpu/math/broadcast_add.cu Same async substitution pattern; adds an extra trailing blank line that was not present before.
src/backend_gpu/math/mean.cu Async substitution applied after cudaFree calls for intermediate partials; same release-build error-detection limitation.
src/backend_gpu/math/relu.cu Same async substitution pattern as other math ops.
src/backend_gpu/math/scalar_mul.cu Same async substitution pattern as other math ops.
src/backend_gpu/math/square.cu Same async substitution pattern as other math ops.
src/backend_gpu/math/sub.cu Same async substitution pattern as other math ops.

Sequence Diagram

sequenceDiagram
    participant CPU as Host (CPU)
    participant NS as NULL Stream (GPU)
    participant Kernel as GPU Kernel

    Note over CPU,Kernel: gradInitializer — zero all gradient buffers
    CPU->>NS: cudaMemsetAsync(grad_buf, 0, bytes) [returns immediately]
    CPU->>NS: cudaMemcpy H2D root grad = 1.0f [blocks until queued ops complete]
    NS-->>CPU: sync point (cudaMemcpy completes)

    Note over CPU,Kernel: Training forward pass
    CPU->>NS: kernel_launch<<<grid,block>>>() [returns immediately]
    Note over NS: SC_DEBUG=off → cudaGetLastError() only
    NS-->>CPU: launch error check (exec errors NOT detected)
    Kernel->>NS: executes asynchronously

    Note over CPU,Kernel: Backprop dispatch
    CPU->>NS: backprop_kernel<<<...>>>() [returns immediately]
    Note over NS: SC_DEBUG=off → cudaGetLastError()
    NS-->>CPU: launch error check only
    Kernel->>NS: executes asynchronously

    Note over CPU,Kernel: SGD weight update
    CPU->>NS: gpu_sgd_k<<<grid,block>>>() [returns immediately]
    Note over NS: SC_DEBUG=off → cudaGetLastError()
    NS-->>CPU: returns true (exec errors silent)
Loading

Reviews (4): Last reviewed commit: "Update benchmarks/bench_softcuda.cpp" | Re-trigger Greptile

Comment thread src/backend_gpu/math/add.cu Outdated
Comment thread src/backend_cpu/backprop/backprop_cuda_bridge.cu Outdated
Comment thread src/backend_gpu/backprop/backprop_gpu.cu Outdated
souls-syntax and others added 2 commits May 1, 2026 02:47
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread benchmarks/bench_softcuda.cpp Outdated
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@souls-syntax souls-syntax merged commit 41ad978 into main Apr 30, 2026
1 check passed
@souls-syntax souls-syntax deleted the perf/Some_sync_issue branch April 30, 2026 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant