Improve CUDA synchronization and make memset asynchronous by souls-syntax · Pull Request #40 · builders-lab/soft-cuda

souls-syntax · 2026-04-30T20:55:56Z

Due to cudaDeviceSynchronize we were suffering from performance loss as it was blocking process hence removed it.

…sync

…ll be written next i guess

greptile-apps · 2026-04-30T20:59:58Z

Greptile Summary

This PR removes cudaDeviceSynchronize() from all GPU math and backprop paths to avoid blocking the host thread, replacing it with a #ifdef SC_DEBUG guard that restores synchronization in debug builds while using cudaGetLastError() in release. It also converts cudaMemset to cudaMemsetAsync in the bridge and fixes a uint32_t overflow in the benchmark's FLOP counter.

Confidence Score: 4/5

Safe to merge with awareness that kernel execution errors are now silent in release builds — a trade-off the team has been made aware of through prior review comments.

Previously flagged P1 findings (silent in-kernel errors via cudaGetLastError, tensor_sgd_gpu return value semantics, and cudaMemsetAsync on failure) remain unresolved in the PR. The benchmark overflow fix and the #ifdef SC_DEBUG approach are both correct, and the cudaMemsetAsync path is safe given all ops share the NULL stream. Score is 4 rather than 5 because the unresolved P1s represent real, unaddressed reliability concerns raised in prior threads.

src/backend_gpu/backprop/backprop_gpu.cu and all src/backend_gpu/math/*.cu files — these are where in-kernel execution errors will go silently undetected in release builds.

Important Files Changed

Filename	Overview
benchmarks/bench_softcuda.cpp	Fixes `total_flops` integer overflow by casting to `size_t` before the multiply — the arithmetic is now correct for 4096³ inputs.
src/backend_cpu/backprop/backprop_cuda_bridge.cu	Converts `cudaMemset` to `cudaMemsetAsync` with a null-pointer guard and error logging; error is logged but execution continues silently on failure, and the stream defaults to NULL (consistent with the rest of the codebase).
src/backend_gpu/backprop/backprop_gpu.cu	Wraps `cudaDeviceSynchronize` in `#ifdef SC_DEBUG`, replacing it with `cudaGetLastError()` in release builds; indentation of new `err` declaration is incorrect (column 0).
src/backend_gpu/math/add.cu	Replaces `cudaDeviceSynchronize()` with `#ifdef SC_DEBUG` guard; release builds only check kernel launch errors, not in-kernel execution errors.
src/backend_gpu/math/matmul.cu	Same `cudaDeviceSynchronize` → `cudaGetLastError` substitution applied after cuBLAS `cublasSgemm`; cuBLAS execution errors will go undetected in release builds.
src/backend_gpu/math/broadcast_add.cu	Same async substitution pattern; adds an extra trailing blank line that was not present before.
src/backend_gpu/math/mean.cu	Async substitution applied after `cudaFree` calls for intermediate partials; same release-build error-detection limitation.
src/backend_gpu/math/relu.cu	Same async substitution pattern as other math ops.
src/backend_gpu/math/scalar_mul.cu	Same async substitution pattern as other math ops.
src/backend_gpu/math/square.cu	Same async substitution pattern as other math ops.
src/backend_gpu/math/sub.cu	Same async substitution pattern as other math ops.

Sequence Diagram

sequenceDiagram
    participant CPU as Host (CPU)
    participant NS as NULL Stream (GPU)
    participant Kernel as GPU Kernel

    Note over CPU,Kernel: gradInitializer — zero all gradient buffers
    CPU->>NS: cudaMemsetAsync(grad_buf, 0, bytes) [returns immediately]
    CPU->>NS: cudaMemcpy H2D root grad = 1.0f [blocks until queued ops complete]
    NS-->>CPU: sync point (cudaMemcpy completes)

    Note over CPU,Kernel: Training forward pass
    CPU->>NS: kernel_launch<<<grid,block>>>() [returns immediately]
    Note over NS: SC_DEBUG=off → cudaGetLastError() only
    NS-->>CPU: launch error check (exec errors NOT detected)
    Kernel->>NS: executes asynchronously

    Note over CPU,Kernel: Backprop dispatch
    CPU->>NS: backprop_kernel<<<...>>>() [returns immediately]
    Note over NS: SC_DEBUG=off → cudaGetLastError()
    NS-->>CPU: launch error check only
    Kernel->>NS: executes asynchronously

    Note over CPU,Kernel: SGD weight update
    CPU->>NS: gpu_sgd_k<<<grid,block>>>() [returns immediately]
    Note over NS: SC_DEBUG=off → cudaGetLastError()
    NS-->>CPU: returns true (exec errors silent)

_{Reviews (4): Last reviewed commit: "Update benchmarks/bench_softcuda.cpp" | Re-trigger Greptile}

…uild

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

souls-syntax added 2 commits May 1, 2026 02:05

changed cudaDeviceSynchronize with cudaGetLastError and made memset a…

7e2e73a

…sync

Some sync issue were still left removed them i guess testing suite wi…

818f035

…ll be written next i guess

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread src/backend_gpu/math/add.cu Outdated

Comment thread src/backend_cpu/backprop/backprop_cuda_bridge.cu Outdated

Comment thread src/backend_gpu/backprop/backprop_gpu.cu Outdated

souls-syntax and others added 2 commits May 1, 2026 02:47

Added definition gaurd against cudasync and cudalasterror for debug b…

a40d938

…uild

Update src/backend_cpu/backprop/backprop_cuda_bridge.cu

0dd00c8

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread benchmarks/bench_softcuda.cpp Outdated

Update benchmarks/bench_softcuda.cpp

8620e3b

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

souls-syntax merged commit 41ad978 into main Apr 30, 2026
1 check passed

souls-syntax deleted the perf/Some_sync_issue branch April 30, 2026 21:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve CUDA synchronization and make memset asynchronous#40

Improve CUDA synchronization and make memset asynchronous#40
souls-syntax merged 5 commits into
mainfrom
perf/Some_sync_issue

souls-syntax commented Apr 30, 2026

Uh oh!

greptile-apps Bot commented Apr 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

souls-syntax commented Apr 30, 2026

Uh oh!

greptile-apps Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Apr 30, 2026 •

edited

Loading