Improve CUDA synchronization and make memset asynchronous #40
Greptile Summary
This PR removes
Confidence Score: 4/5
Safe to merge with awareness that kernel execution errors are now silent in release builds, a trade-off the team has been made aware of through prior review comments. Previously flagged P1 findings (silent in-kernel errors via
Important Files Changed
Sequence Diagram
sequenceDiagram
participant CPU as Host (CPU)
participant NS as NULL Stream (GPU)
participant Kernel as GPU Kernel
Note over CPU,Kernel: gradInitializer — zero all gradient buffers
CPU->>NS: cudaMemsetAsync(grad_buf, 0, bytes) [returns immediately]
CPU->>NS: cudaMemcpy H2D root grad = 1.0f [blocks until queued ops complete]
NS-->>CPU: sync point (cudaMemcpy completes)
Note over CPU,Kernel: Training forward pass
CPU->>NS: kernel_launch<<<grid,block>>>() [returns immediately]
Note over NS: SC_DEBUG=off → cudaGetLastError() only
NS-->>CPU: launch error check (exec errors NOT detected)
Kernel->>NS: executes asynchronously
Note over CPU,Kernel: Backprop dispatch
CPU->>NS: backprop_kernel<<<...>>>() [returns immediately]
Note over NS: SC_DEBUG=off → cudaGetLastError()
NS-->>CPU: launch error check only
Kernel->>NS: executes asynchronously
Note over CPU,Kernel: SGD weight update
CPU->>NS: gpu_sgd_k<<<grid,block>>>() [returns immediately]
Note over NS: SC_DEBUG=off → cudaGetLastError()
NS-->>CPU: returns true (exec errors silent)
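The launch-check behaviour described in the diagram could look roughly like the sketch below. It is only an illustration: the helper name `check_launch`, the kernel `scale_k`, and the assumption that `SC_DEBUG` is a preprocessor define are all hypothetical and not taken from the PR. What it shows is the trade-off itself: `cudaGetLastError()` alone catches launch and configuration errors without blocking, while errors raised during kernel execution only surface at a later synchronizing call unless the debug path adds `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (name and SC_DEBUG-as-macro are assumptions, not the PR's code).
static bool check_launch(const char* what) {
    // Always done: reports launch/configuration errors without waiting for the kernel.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: launch failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
#ifdef SC_DEBUG
    // Debug builds only: block until the kernel finishes so execution errors show up here.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: execution failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
#endif
    return true;
}

// Illustrative kernel, not from the PR: multiplies each element by a.
__global__ void scale_k(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Host wrapper: launches asynchronously and returns the result of the launch check.
// Expects n > 0 (a zero-sized grid is an invalid launch configuration).
bool scale(float* d_x, float a, int n) {
    const int block = 256;
    const int grid = (n + block - 1) / block;
    scale_k<<<grid, block>>>(d_x, a, n);
    return check_launch("scale_k");
}
```

With `SC_DEBUG` undefined, a `true` return here means only that the launch was accepted; a fault inside the kernel would be reported by whichever later CUDA call synchronizes first.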
Reviews (4): Last reviewed commit: "Update benchmarks/bench_softcuda.cpp"
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
cudaDeviceSynchronize was blocking the host process after every call and costing us performance, so it has been removed.
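For reference, the gradInitializer sequence from the diagram (asynchronous zero-fill followed by a host-to-device copy of the root gradient) might be sketched as follows. The function signature, parameter names, and the flat `float` buffer layout are assumptions made for illustration; only the name `gradInitializer` and the two calls come from the diagram. No explicit `cudaDeviceSynchronize()` is needed between the two calls, because both are issued to the default stream and the copy is ordered after the memset.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical signature: zero `count` floats at d_grad, then set d_grad[root_index] to 1.0f.
bool gradInitializer(float* d_grad, size_t count, size_t root_index) {
    if (count == 0 || root_index >= count) return false;

    // Enqueue the zero-fill on the default (NULL) stream; this call may return
    // before the fill has completed on the device.
    if (cudaMemsetAsync(d_grad, 0, count * sizeof(float), 0) != cudaSuccess) {
        return false;
    }

    // The copy uses the same default stream, so it runs after the memset.
    // All-zero bytes are 0.0f, so only the root entry needs an explicit write.
    const float one = 1.0f;
    return cudaMemcpy(d_grad + root_index, &one, sizeof(float),
                      cudaMemcpyHostToDevice) == cudaSuccess;
}
```

If the project later moves to non-default streams, the copy would need to be issued on the same stream as the memset (for example with `cudaMemcpyAsync`) or ordered with an event to keep this guarantee.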