Fix GPU integration issues and improve benchmarking setup by souls-syntax · Pull Request #39 · builders-lab/soft-cuda

souls-syntax · 2026-04-30T19:53:07Z

No description provided.

…eing wrong as it never called that function

…nce nvcc free and made make.sh slightly more better

souls-syntax · 2026-04-30T19:57:46Z

@greptile you there?

souls-syntax · 2026-04-30T19:58:30Z

@greptile why you not commenting on the PR??

greptile-apps · 2026-04-30T19:59:48Z

This PR was already merged by the time I was engaged — there's no pending review window to comment on.

That said, the PR touched 27 files with 2327 additions. If you'd like, I can do a post-merge review and leave a summary comment on the PR thread covering things like:

Error handling in the GPU backward pass functions
Memory safety in the CUDA kernel changes
Benchmarking setup correctness

Just say the word and I'll go through the diff and share findings.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

souls-syntax · 2026-04-30T20:00:19Z

yes do that

greptile-apps · 2026-04-30T20:03:18Z

Greptile Summary

This PR wires up real GPU backpropagation (CUDA kernels for all core ops + GPU SGD), switches the forward matmul to cuBLAS, adds an AOT hardware profiler that writes CONFIG.soft, and introduces a suite of C++ and Python benchmarks. The core autograd and backend-dispatch changes look structurally sound, but there are three correctness bugs in the new benchmark/bridge code that should be fixed before merge:

total_flops is uint32_t in bench_softcuda.cpp but 2 × 4096³ ≈ 137 B overflows, making all reported GFLOPs values wrong.
bench_pytorch.py calls torch.cuda.get_device_name(0) unconditionally at the top level, crashing on any CPU-only machine.
The three CUDA helper wrappers in backprop_cuda_bridge.cu silently discard cudaError_t, risking undetected gradient corruption during training.

Confidence Score: 4/5

Safe to merge after fixing the uint32_t overflow, the unconditional CUDA device name call, and adding error checking to the CUDA bridge helpers.

Three P1 findings: a uint32_t overflow producing incorrect benchmark output, a guaranteed crash in bench_pytorch.py on CPU-only machines, and silent CUDA error discarding in gradient helpers that can corrupt training silently. The core GPU autograd and cuBLAS integration are architecturally solid.

benchmarks/bench_softcuda.cpp (overflow), benchmarks/bench_pytorch.py (unconditional CUDA call), src/backend_cpu/backprop/backprop_cuda_bridge.cu (unhandled CUDA errors), temp (should be deleted)

Important Files Changed

Filename	Overview
benchmarks/bench_softcuda.cpp	New comprehensive benchmark suite; contains a uint32_t overflow for total_flops (4096³ × 2 wraps around), producing incorrect GFLOPs output for the matmul benchmark.
benchmarks/bench_pytorch.py	New PyTorch comparison benchmark; crashes without CUDA due to unconditional torch.cuda.get_device_name(0) call, and contains mismatched labels (512×512 header vs actual 4096×4096 run).
src/backend_cpu/backprop/backprop_cuda_bridge.cu	New CUDA helper wrappers for host↔device copies; all three functions silently discard cudaError_t return values, risking undetected gradient corruption.
src/backend_gpu/backprop/backprop_gpu.cu	New GPU backward pass dispatcher implementing CUDA kernels for ADD, SUB, ReLU, SQUARE, MEAN, BROADCAST_ADD, and matrix multiply; includes GPU SGD kernel. Error handling via cudaDeviceSynchronize is present.
src/backend_cpu/backprop/backprop_b.cpp	Significantly reworked: GPU backward is now dispatched via backprop_gpu_dispatch with a CPU fallback path; MUL_MATRIX case enabled; GPU grad buffers are seeded and zeroed during gradInitializer.
src/core/graph/assignBackend.cu	Backend assignment generalised to all op types via JSON key lookup; fallback added when GPU allocation fails; forward-evaluate guarded against NULL device_ptr before cudaMemcpy.
src/init/config/profiler_core.cu	New AOT hardware profiler measuring CPU/GPU crossover thresholds; mul_scalar and broadcast_add thresholds are incorrectly copied from the add crossover rather than being measured independently.
src/backend_gpu/math/matmul.cu	Switched forward matmul from a naive CUDA kernel to cuBLAS SGEMM via a lazily-initialised global cublasHandle_t; handle is never destroyed (resource leak, acceptable for demo).
benchmarks/CMakeLists.txt	New benchmark build configuration; hardcodes /opt/cuda/targets/x86_64-linux/lib and links cublas twice, breaking portability on non-x86 or non-Arch environments.
temp	Accidentally committed shell debug output (echo *, ls -a); should be deleted and gitignored.
src/core/graph/train.cpp	SGD now dispatches to a GPU kernel when both weights and gradients are on device, with CPU fallback and push-back to GPU after CPU update.
src/backend_gpu/kernels/sgemm_double_buffer.cuh	New double-buffered SGEMM kernel using cuda::barrier and cooperative_groups; currently commented out in matmul.cu in favour of cuBLAS.

Sequence Diagram

sequenceDiagram
    participant C as Caller (sc_graph_step)
    participant FW as tensor_graph_forward_evaluate
    participant GI as gradInitializer
    participant BP as backprop__ (CPU dispatcher)
    participant GPUd as backprop_gpu_dispatch (CUDA)
    participant SGD as tensor_sgd

    C->>FW: forward pass
    FW->>FW: cudaMemcpy H->D for CPU-resident parent tensors
    FW->>FW: tensor_evaluate_GPU / CPU per node

    C->>GI: zero grad buffers (CPU + GPU via bridge)
    GI->>GI: soft_cuda_memset_zero(device_ptr_grad)
    GI->>GI: seed root grad = 1 (CPU + GPU)

    C->>BP: backward (reverse node order)
    alt node has device_ptr_grad
        BP->>GPUd: backprop_gpu_dispatch(node, parent_a, parent_b)
        GPUd-->>BP: success / false
        opt GPU op unsupported
            BP->>BP: D->H copy of grad + parent fwd data
            BP->>BP: backprop_cpu(node)
            BP->>BP: H->D copy of parent grads
        end
    else CPU node
        BP->>BP: backprop_cpu(node)
    end

    C->>SGD: tensor_sgd(nodes, lr)
    alt both weight and grad on GPU
        SGD->>SGD: tensor_sgd_gpu (in-place)
    else
        SGD->>SGD: D->H grad copy, CPU SGD, H->D weight push
    end

Comments Outside Diff (5)

benchmarks/bench_softcuda.cpp, line 570 (link)

uint32_t overflow for total_flops

total_flops is declared as uint32_t but the value 2 * 4096 * 4096 * 4096 = 137,438,953,472 vastly exceeds UINT32_MAX (≈ 4.29 × 10⁹), so the stored value wraps around and the GFLOPs figure printed by result() will be wildly wrong.

And update the call to result() accordingly:
```
result(label, total / REPS, (double)total_flops * 1e-9);
```
benchmarks/bench_pytorch.py, line 301-302 (link)

Unconditional get_device_name crashes without CUDA

device is conditionally set to CPU when CUDA is unavailable, but torch.cuda.get_device_name(0) on the very next line throws a RuntimeError on any machine without a CUDA device. The script should guard this call.
benchmarks/CMakeLists.txt, line 115-119 (link)

Hardcoded architecture-specific CUDA library path

/opt/cuda/targets/x86_64-linux/lib is hardcoded, making the build fail on non-x86 or non-Arch-Linux environments (e.g., ARM, Ubuntu packages, conda CUDA). CMake's built-in CUDA::cublas imported target (available when find_package(CUDAToolkit) is used) resolves the path portably and should be preferred. Also, cublas is linked twice — once alone and once together with cudart.
```
# Replace the two redundant target_link_libraries and target_link_directories calls with:
target_link_libraries(bench_softcuda PRIVATE CUDA::cublas CUDA::cudart)
```
benchmarks/bench_pytorch.py, line 317-354 (link)

Mismatched benchmark labels

Several label/header pairs are inconsistent: Benchmark 1 header says [PyTorch CPU] but a and b tensors live on device (which may be the GPU). Benchmark 2 header reads Matmul 512×512 but the actual dimensions are 4096 × 4096, and the result() call still says Matmul 512×512 [PyTorch OpenBLAS] despite running on a CUDA device. These mismatches will produce confusing benchmark output.
src/init/config/profiler_core.cu, line 2790-2791 (link)

mul_scalar / broadcast_add thresholds borrow the add crossover without measurement

These two ops piggyback the add crossover instead of being profiled independently. broadcast_add in particular involves a column-wise reduction and has different GPU/CPU characteristics than a simple element-wise add. CONFIG.soft will therefore contain wrong thresholds for these two ops, defeating the purpose of AOT profiling.

_{Reviews (1): Last reviewed commit: "make backprop_p use the forward cuda fun..." | Re-trigger Greptile}

greptile-apps · 2026-04-30T20:03:28Z

+#include "internal_header.h"
+#include <cuda_runtime.h>
+#include <cassert>
+
+void soft_cuda_memset_zero(void *ptr, size_t bytes) {
+    cudaMemset(ptr, 0, bytes);
+}
+
+void soft_cuda_memcpy_h2d(void *dst, const void *src, size_t bytes) {
+    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
+}
+
+void soft_cuda_memcpy_d2h(void *dst, const void *src, size_t bytes) {
+    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
+}


CUDA errors silently ignored in gradient helpers

All three helpers (soft_cuda_memset_zero, soft_cuda_memcpy_h2d, soft_cuda_memcpy_d2h) discard the cudaError_t return value. A failed cudaMemcpy during the gradient copy or seed will leave gradient buffers in an undefined state, producing silently wrong weight updates with no indication of failure. At a minimum the errors should be logged via the existing debug() macro; ideally they should be propagated to the callers in backprop_b.cpp so backprop__() can return false.

greptile-apps · 2026-04-30T20:03:29Z

+
+============== OUTPUT OF echo * ============================
+
+benchmarks build build_wsl CMakeLists.txt compile_commands.json docs include main.cpp make.sh model.bin notes README.md requirements.txt scripts src summary.txt tests
+
+
+============== OUTPUT OF ls -a ==============================
+.cache
+.git
+.github
+.venv
+benchmarks
+build
+build_wsl
+docs
+include
+notes
+scripts
+src
+tests
+.clang-format
+.gitignore
+CMakeLists.txt
+compile_commands.json -> build/compile_commands.json
+main.cpp
+make.sh
+model.bin
+README.md
+requirements.txt
+summary.txt
+temp


Debug artifact committed to repo

This file is a shell diagnostic dump (echo * / ls -a output) and has no place in version control. It should be deleted and added to .gitignore to prevent future accidents.

souls-syntax added 10 commits April 29, 2026 14:05

Some git functions

c1e1a8e

Some header declaration error

3f3522b

Backward pass for GPU seems to be working

18a227b

Written out profiler and it's working too

0d65570

Added some benchmarkings

58a13c3

Added benchmarking

92f18a4

More benchmark edits

493062d

BROKEN: something went wrong with cublas intergration seeing to it

534238a

GPU path is working how did it even pass me weird, it was main file b…

e47480e

…eing wrong as it never called that function

make backprop_p use the forward cuda func delaration we did and is he…

b7f6f25

…nce nvcc free and made make.sh slightly more better

souls-syntax merged commit 00c7591 into main Apr 30, 2026
1 check passed

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix GPU integration issues and improve benchmarking setup#39

Fix GPU integration issues and improve benchmarking setup#39
souls-syntax merged 10 commits into
mainfrom
feat/GPU_backward_func

souls-syntax commented Apr 30, 2026

Uh oh!

souls-syntax commented Apr 30, 2026

Uh oh!

souls-syntax commented Apr 30, 2026

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 30, 2026

Uh oh!

souls-syntax commented Apr 30, 2026

Uh oh!

greptile-apps Bot commented Apr 30, 2026 •

edited

Loading

Comments Outside Diff (5)

Uh oh!

greptile-apps Bot Apr 30, 2026

Uh oh!

greptile-apps Bot Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

souls-syntax commented Apr 30, 2026

Uh oh!

souls-syntax commented Apr 30, 2026

Uh oh!

souls-syntax commented Apr 30, 2026

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 30, 2026

Uh oh!

souls-syntax commented Apr 30, 2026

Uh oh!

greptile-apps Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (5)

Uh oh!

greptile-apps Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Apr 30, 2026 •

edited

Loading