Fix GPU integration issues and improve benchmarking setup #39
…eing wrong as it never called that function
…nce nvcc free and made make.sh slightly more better
@greptile you there?
@greptile why you not commenting on the PR??
This PR was already merged by the time I was engaged, so there is no pending review window to comment on. That said, the PR touched 27 files with 2327 additions. If you'd like, I can do a post-merge review and leave a summary comment on the PR thread.
Just say the word and I'll go through the diff and share findings.
yes do that
Greptile Summary
This PR wires up real GPU backpropagation (CUDA kernels for all core ops + GPU SGD), switches the forward matmul to cuBLAS, adds an AOT hardware profiler that writes …
Confidence Score: 4/5
Safe to merge after fixing the uint32_t overflow, the unconditional CUDA device name call, and adding error checking to the CUDA bridge helpers.
Three P1 findings: a uint32_t overflow producing incorrect benchmark output, a guaranteed crash in bench_pytorch.py on CPU-only machines, and silent CUDA error discarding in gradient helpers that can corrupt training silently. The core GPU autograd and cuBLAS integration are architecturally solid.
benchmarks/bench_softcuda.cpp (overflow), benchmarks/bench_pytorch.py (unconditional CUDA call), src/backend_cpu/backprop/backprop_cuda_bridge.cu (unhandled CUDA errors), temp (should be deleted)
Important Files Changed
Sequence Diagram
sequenceDiagram
participant C as Caller (sc_graph_step)
participant FW as tensor_graph_forward_evaluate
participant GI as gradInitializer
participant BP as backprop__ (CPU dispatcher)
participant GPUd as backprop_gpu_dispatch (CUDA)
participant SGD as tensor_sgd
C->>FW: forward pass
FW->>FW: cudaMemcpy H->D for CPU-resident parent tensors
FW->>FW: tensor_evaluate_GPU / CPU per node
C->>GI: zero grad buffers (CPU + GPU via bridge)
GI->>GI: soft_cuda_memset_zero(device_ptr_grad)
GI->>GI: seed root grad = 1 (CPU + GPU)
C->>BP: backward (reverse node order)
alt node has device_ptr_grad
BP->>GPUd: backprop_gpu_dispatch(node, parent_a, parent_b)
GPUd-->>BP: success / false
opt GPU op unsupported
BP->>BP: D->H copy of grad + parent fwd data
BP->>BP: backprop_cpu(node)
BP->>BP: H->D copy of parent grads
end
else CPU node
BP->>BP: backprop_cpu(node)
end
C->>SGD: tensor_sgd(nodes, lr)
alt both weight and grad on GPU
SGD->>SGD: tensor_sgd_gpu (in-place)
else
SGD->>SGD: D->H grad copy, CPU SGD, H->D weight push
end
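The backward branch of the diagram can be read as a GPU-first dispatch with a CPU fallback. The sketch below mirrors only that control flow and is not the PR's code: the `Node` struct, its `gpu_op_supported` flag, the `copy_d2h`/`copy_h2d` helpers, the `backprop_node` wrapper and the `trace` vector are hypothetical stand-ins, and `backprop_gpu_dispatch` is simplified to a single argument (the diagram passes the node plus two parents).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical node type: only device_ptr_grad is named in the diagram.
struct Node {
    void *device_ptr_grad = nullptr;  // non-null when the gradient lives on the GPU
    bool gpu_op_supported = false;    // assumption: "a CUDA kernel exists for this op"
};

// Records which steps ran, purely for illustration.
std::vector<std::string> trace;

// Stub for the CUDA dispatcher: returns false when the op has no GPU kernel.
bool backprop_gpu_dispatch(const Node &n) {
    trace.push_back("gpu_dispatch");
    return n.gpu_op_supported;
}

// Stubs for the CPU path and the two transfer directions.
void backprop_cpu(const Node &) { trace.push_back("backprop_cpu"); }
void copy_d2h(const Node &) { trace.push_back("d2h"); }
void copy_h2d(const Node &) { trace.push_back("h2d"); }

// One backward step for a single node, following the alt/opt blocks above.
void backprop_node(const Node &n) {
    if (n.device_ptr_grad == nullptr) {  // "else CPU node"
        backprop_cpu(n);
        return;
    }
    if (backprop_gpu_dispatch(n)) {      // GPU kernel handled the op
        return;
    }
    copy_d2h(n);      // D->H copy of grad + parent forward data
    backprop_cpu(n);  // CPU fallback
    copy_h2d(n);      // H->D copy of parent grads
}
```

In this reading, every unsupported GPU op costs two extra transfers per node, which is why the fallback path matters for benchmark numbers.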
#include "internal_header.h"
#include <cuda_runtime.h>
#include <cassert>

void soft_cuda_memset_zero(void *ptr, size_t bytes) {
    cudaMemset(ptr, 0, bytes);
}

void soft_cuda_memcpy_h2d(void *dst, const void *src, size_t bytes) {
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
}

void soft_cuda_memcpy_d2h(void *dst, const void *src, size_t bytes) {
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
}
CUDA errors silently ignored in gradient helpers
All three helpers (soft_cuda_memset_zero, soft_cuda_memcpy_h2d, soft_cuda_memcpy_d2h) discard the cudaError_t return value. A failed cudaMemcpy during the gradient copy or seed will leave gradient buffers in an undefined state, producing silently wrong weight updates with no indication of failure. At a minimum the errors should be logged via the existing debug() macro; ideally they should be propagated to the callers in backprop_b.cpp so backprop__() can return false.
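One way to act on this finding is sketched below. It is a minimal illustration rather than the repository's code: the `_checked` helper names and the `soft_cuda_ok` function are assumptions, and `fprintf` to `stderr` stands in for the repo's `debug()` macro, whose definition is not shown here. So that the sketch also builds with a plain C++ compiler, a small stand-in block defines the few CUDA names it uses whenever the file is not compiled by nvcc; that stand-in is for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#else
// Illustration-only stand-ins so the sketch builds without the CUDA toolkit.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidValue = 1 };
enum cudaMemcpyKind { cudaMemcpyHostToDevice = 1, cudaMemcpyDeviceToHost = 2 };
inline const char *cudaGetErrorString(cudaError_t e) {
    return e == cudaSuccess ? "no error" : "invalid argument";
}
inline cudaError_t cudaMemset(void *p, int v, std::size_t n) {
    if (n == 0) return cudaSuccess;
    if (!p) return cudaErrorInvalidValue;
    std::memset(p, v, n);
    return cudaSuccess;
}
inline cudaError_t cudaMemcpy(void *d, const void *s, std::size_t n, cudaMemcpyKind) {
    if (n == 0) return cudaSuccess;
    if (!d || !s) return cudaErrorInvalidValue;
    std::memcpy(d, s, n);
    return cudaSuccess;
}
#endif

// Logs a failed CUDA call and converts its status into a bool.
static bool soft_cuda_ok(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "[soft_cuda] %s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Hypothetical checked variants: each returns false instead of dropping the error.
bool soft_cuda_memset_zero_checked(void *ptr, std::size_t bytes) {
    return soft_cuda_ok(cudaMemset(ptr, 0, bytes), "cudaMemset");
}

bool soft_cuda_memcpy_h2d_checked(void *dst, const void *src, std::size_t bytes) {
    return soft_cuda_ok(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice),
                        "cudaMemcpy H->D");
}

bool soft_cuda_memcpy_d2h_checked(void *dst, const void *src, std::size_t bytes) {
    return soft_cuda_ok(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost),
                        "cudaMemcpy D->H");
}
```

With helpers shaped like this, a caller such as backprop__() could stop and return false as soon as one of them reports a failure, instead of continuing with gradient buffers in an undefined state.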
============== OUTPUT OF echo * ============================
benchmarks build build_wsl CMakeLists.txt compile_commands.json docs include main.cpp make.sh model.bin notes README.md requirements.txt scripts src summary.txt tests

============== OUTPUT OF ls -a ==============================
.cache
.git
.github
.venv
benchmarks
build
build_wsl
docs
include
notes
scripts
src
tests
.clang-format
.gitignore
CMakeLists.txt
compile_commands.json -> build/compile_commands.json
main.cpp
make.sh
model.bin
README.md
requirements.txt
summary.txt
temp
No description provided.