Skip to content

Fix GPU integration issues and improve benchmarking setup#39

Merged
souls-syntax merged 10 commits into
mainfrom
feat/GPU_backward_func
Apr 30, 2026
Merged

Fix GPU integration issues and improve benchmarking setup#39
souls-syntax merged 10 commits into
mainfrom
feat/GPU_backward_func

Conversation

@souls-syntax

Copy link
Copy Markdown
Member

No description provided.

@souls-syntax

Copy link
Copy Markdown
Member Author

@greptile you there?

@souls-syntax

Copy link
Copy Markdown
Member Author

@greptile why you not commenting on the PR??

@souls-syntax souls-syntax merged commit 00c7591 into main Apr 30, 2026
1 check passed
@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

This PR was already merged by the time I was engaged — there's no pending review window to comment on.

That said, the PR touched 27 files with 2327 additions. If you'd like, I can do a post-merge review and leave a summary comment on the PR thread covering things like:

  • Error handling in the GPU backward pass functions
  • Memory safety in the CUDA kernel changes
  • Benchmarking setup correctness

Just say the word and I'll go through the diff and share findings.

Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.

@souls-syntax

Copy link
Copy Markdown
Member Author

yes do that

@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR wires up real GPU backpropagation (CUDA kernels for all core ops + GPU SGD), switches the forward matmul to cuBLAS, adds an AOT hardware profiler that writes CONFIG.soft, and introduces a suite of C++ and Python benchmarks. The core autograd and backend-dispatch changes look structurally sound, but there are three correctness bugs in the new benchmark/bridge code that should be fixed before merge:

  • total_flops is uint32_t in bench_softcuda.cpp but 2 × 4096³ ≈ 137 B overflows, making all reported GFLOPs values wrong.
  • bench_pytorch.py calls torch.cuda.get_device_name(0) unconditionally at the top level, crashing on any CPU-only machine.
  • The three CUDA helper wrappers in backprop_cuda_bridge.cu silently discard cudaError_t, risking undetected gradient corruption during training.

Confidence Score: 4/5

Safe to merge after fixing the uint32_t overflow, the unconditional CUDA device name call, and adding error checking to the CUDA bridge helpers.

Three P1 findings: a uint32_t overflow producing incorrect benchmark output, a guaranteed crash in bench_pytorch.py on CPU-only machines, and silent CUDA error discarding in gradient helpers that can corrupt training silently. The core GPU autograd and cuBLAS integration are architecturally solid.

benchmarks/bench_softcuda.cpp (overflow), benchmarks/bench_pytorch.py (unconditional CUDA call), src/backend_cpu/backprop/backprop_cuda_bridge.cu (unhandled CUDA errors), temp (should be deleted)

Important Files Changed

Filename Overview
benchmarks/bench_softcuda.cpp New comprehensive benchmark suite; contains a uint32_t overflow for total_flops (4096³ × 2 wraps around), producing incorrect GFLOPs output for the matmul benchmark.
benchmarks/bench_pytorch.py New PyTorch comparison benchmark; crashes without CUDA due to unconditional torch.cuda.get_device_name(0) call, and contains mismatched labels (512×512 header vs actual 4096×4096 run).
src/backend_cpu/backprop/backprop_cuda_bridge.cu New CUDA helper wrappers for host↔device copies; all three functions silently discard cudaError_t return values, risking undetected gradient corruption.
src/backend_gpu/backprop/backprop_gpu.cu New GPU backward pass dispatcher implementing CUDA kernels for ADD, SUB, ReLU, SQUARE, MEAN, BROADCAST_ADD, and matrix multiply; includes GPU SGD kernel. Error handling via cudaDeviceSynchronize is present.
src/backend_cpu/backprop/backprop_b.cpp Significantly reworked: GPU backward is now dispatched via backprop_gpu_dispatch with a CPU fallback path; MUL_MATRIX case enabled; GPU grad buffers are seeded and zeroed during gradInitializer.
src/core/graph/assignBackend.cu Backend assignment generalised to all op types via JSON key lookup; fallback added when GPU allocation fails; forward-evaluate guarded against NULL device_ptr before cudaMemcpy.
src/init/config/profiler_core.cu New AOT hardware profiler measuring CPU/GPU crossover thresholds; mul_scalar and broadcast_add thresholds are incorrectly copied from the add crossover rather than being measured independently.
src/backend_gpu/math/matmul.cu Switched forward matmul from a naive CUDA kernel to cuBLAS SGEMM via a lazily-initialised global cublasHandle_t; handle is never destroyed (resource leak, acceptable for demo).
benchmarks/CMakeLists.txt New benchmark build configuration; hardcodes /opt/cuda/targets/x86_64-linux/lib and links cublas twice, breaking portability on non-x86 or non-Arch environments.
temp Accidentally committed shell debug output (echo *, ls -a); should be deleted and gitignored.
src/core/graph/train.cpp SGD now dispatches to a GPU kernel when both weights and gradients are on device, with CPU fallback and push-back to GPU after CPU update.
src/backend_gpu/kernels/sgemm_double_buffer.cuh New double-buffered SGEMM kernel using cuda::barrier and cooperative_groups; currently commented out in matmul.cu in favour of cuBLAS.

Sequence Diagram

sequenceDiagram
    participant C as Caller (sc_graph_step)
    participant FW as tensor_graph_forward_evaluate
    participant GI as gradInitializer
    participant BP as backprop__ (CPU dispatcher)
    participant GPUd as backprop_gpu_dispatch (CUDA)
    participant SGD as tensor_sgd

    C->>FW: forward pass
    FW->>FW: cudaMemcpy H->D for CPU-resident parent tensors
    FW->>FW: tensor_evaluate_GPU / CPU per node

    C->>GI: zero grad buffers (CPU + GPU via bridge)
    GI->>GI: soft_cuda_memset_zero(device_ptr_grad)
    GI->>GI: seed root grad = 1 (CPU + GPU)

    C->>BP: backward (reverse node order)
    alt node has device_ptr_grad
        BP->>GPUd: backprop_gpu_dispatch(node, parent_a, parent_b)
        GPUd-->>BP: success / false
        opt GPU op unsupported
            BP->>BP: D->H copy of grad + parent fwd data
            BP->>BP: backprop_cpu(node)
            BP->>BP: H->D copy of parent grads
        end
    else CPU node
        BP->>BP: backprop_cpu(node)
    end

    C->>SGD: tensor_sgd(nodes, lr)
    alt both weight and grad on GPU
        SGD->>SGD: tensor_sgd_gpu (in-place)
    else
        SGD->>SGD: D->H grad copy, CPU SGD, H->D weight push
    end
Loading

Comments Outside Diff (5)

  1. benchmarks/bench_softcuda.cpp, line 570 (link)

    P1 uint32_t overflow for total_flops

    total_flops is declared as uint32_t but the value 2 * 4096 * 4096 * 4096 = 137,438,953,472 vastly exceeds UINT32_MAX (≈ 4.29 × 10⁹), so the stored value wraps around and the GFLOPs figure printed by result() will be wildly wrong.

    And update the call to result() accordingly:

    result(label, total / REPS, (double)total_flops * 1e-9);
  2. benchmarks/bench_pytorch.py, line 301-302 (link)

    P1 Unconditional get_device_name crashes without CUDA

    device is conditionally set to CPU when CUDA is unavailable, but torch.cuda.get_device_name(0) on the very next line throws a RuntimeError on any machine without a CUDA device. The script should guard this call.

  3. benchmarks/CMakeLists.txt, line 115-119 (link)

    P2 Hardcoded architecture-specific CUDA library path

    /opt/cuda/targets/x86_64-linux/lib is hardcoded, making the build fail on non-x86 or non-Arch-Linux environments (e.g., ARM, Ubuntu packages, conda CUDA). CMake's built-in CUDA::cublas imported target (available when find_package(CUDAToolkit) is used) resolves the path portably and should be preferred. Also, cublas is linked twice — once alone and once together with cudart.

    # Replace the two redundant target_link_libraries and target_link_directories calls with:
    target_link_libraries(bench_softcuda PRIVATE CUDA::cublas CUDA::cudart)
  4. benchmarks/bench_pytorch.py, line 317-354 (link)

    P2 Mismatched benchmark labels

    Several label/header pairs are inconsistent: Benchmark 1 header says [PyTorch CPU] but a and b tensors live on device (which may be the GPU). Benchmark 2 header reads Matmul 512×512 but the actual dimensions are 4096 × 4096, and the result() call still says Matmul 512×512 [PyTorch OpenBLAS] despite running on a CUDA device. These mismatches will produce confusing benchmark output.

  5. src/init/config/profiler_core.cu, line 2790-2791 (link)

    P2 mul_scalar / broadcast_add thresholds borrow the add crossover without measurement

    These two ops piggyback the add crossover instead of being profiled independently. broadcast_add in particular involves a column-wise reduction and has different GPU/CPU characteristics than a simple element-wise add. CONFIG.soft will therefore contain wrong thresholds for these two ops, defeating the purpose of AOT profiling.

Reviews (1): Last reviewed commit: "make backprop_p use the forward cuda fun..." | Re-trigger Greptile

Comment on lines +1 to +15
#include "internal_header.h"
#include <cuda_runtime.h>
#include <cassert>

void soft_cuda_memset_zero(void *ptr, size_t bytes) {
cudaMemset(ptr, 0, bytes);
}

void soft_cuda_memcpy_h2d(void *dst, const void *src, size_t bytes) {
cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
}

void soft_cuda_memcpy_d2h(void *dst, const void *src, size_t bytes) {
cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 CUDA errors silently ignored in gradient helpers

All three helpers (soft_cuda_memset_zero, soft_cuda_memcpy_h2d, soft_cuda_memcpy_d2h) discard the cudaError_t return value. A failed cudaMemcpy during the gradient copy or seed will leave gradient buffers in an undefined state, producing silently wrong weight updates with no indication of failure. At a minimum the errors should be logged via the existing debug() macro; ideally they should be propagated to the callers in backprop_b.cpp so backprop__() can return false.

Comment thread temp
Comment on lines +1 to +31

============== OUTPUT OF echo * ============================

benchmarks build build_wsl CMakeLists.txt compile_commands.json docs include main.cpp make.sh model.bin notes README.md requirements.txt scripts src summary.txt tests


============== OUTPUT OF ls -a ==============================
.cache
.git
.github
.venv
benchmarks
build
build_wsl
docs
include
notes
scripts
src
tests
.clang-format
.gitignore
CMakeLists.txt
compile_commands.json -> build/compile_commands.json
main.cpp
make.sh
model.bin
README.md
requirements.txt
summary.txt
temp

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Debug artifact committed to repo

This file is a shell diagnostic dump (echo * / ls -a output) and has no place in version control. It should be deleted and added to .gitignore to prevent future accidents.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant