Skip to content

s85io/nabla

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

∇ Nabla

Autodiff and tensor computing for modern C++.

Nabla is a header-only C++23 library for building and training neural networks. It follows the conventions of the C++ standard library: concepts over inheritance, value semantics, allocator-aware containers, execution policies, and zero-cost abstractions.

No frameworks. No runtime. No Python. Just C++.


Philosophy

One type. Everything is a nabla::tensor. Inputs, outputs, gradients, parameters, activations — one type flows through the entire graph. If two things are tensors, they compose.

One contract. Every differentiable operation satisfies one concept: differentiable_op. Three methods — forward(), backward(), operands(). No base classes, no registration macros, no factory patterns.

One direction. Data flows forward through operations. Gradients flow backward through the same operations, in reverse. The type at every junction is the same tensor. This is why the whole system compiles and why adding a new operation takes five minutes.


Quick start

cmake -B build
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure
#include <nabla/nabla.hpp>

using namespace nabla;

int main() {
    // Tensors — like std::vector, but with gradients
    tensor x(dynamic_shape{{32, 768}});
    tensor W(dynamic_shape{{768, 768}});
    W.set_requires_grad(true);

    // Forward — operations build the graph automatically
    auto y = linear_forward(x, W, nullptr, 32, 768, 768);
    auto z = layer_norm_forward(y, gamma, beta, 32, 768);
    auto loss = cross_entropy_forward(z, targets, 32, vocab_size);

    // Backward — one call, gradients flow through everything
    loss.backward();

    // W.grad() now contains ∂loss/∂W
    adamw.step(W);
}

Architecture

include/nabla/
├── tensor.hpp        Core tensor type, autodiff tape, execution policies, SIMD
├── ops.hpp           Convenience header: includes all ops/*.hpp
├── ops/              Individual operations (one per file)
│   ├── matmul.hpp    matmul: A @ B
│   ├── linear.hpp    linear: x @ W^T + b
│   ├── softmax.hpp   softmax with numerical stability
│   ├── layer_norm.hpp layer normalization + optional bias
│   ├── rms_norm.hpp  RMS normalization
│   ├── gelu.hpp      GELU activation
│   ├── dropout.hpp   Dropout (training/eval modes)
│   ├── embedding.hpp Token/position embeddings
│   ├── add.hpp       Element-wise addition
│   ├── rope.hpp      Rotary positional embeddings
│   ├── swiglu.hpp    SwiGLU FFN
│   ├── cross_entropy.hpp  Cross-entropy loss
│   ├── attention.hpp  Multi-head attention
│   └── swish.hpp     Swish activation wrapper
├── gpu.hpp           Device memory, backend abstraction
├── training.hpp      AdamW optimizer, learning rate scheduling
└── nabla.hpp         Convenience header (includes tensor, ops, gpu, training)

Operations organization

Each operation (13 total) is now in its own header for clarity and modularity:

Operation File Forward Backward
matmul ops/matmul.hpp A @ B ∂L/∂A = ∂L/∂Y @ B^T, ∂L/∂B = A^T @ ∂L/∂Y
linear ops/linear.hpp x @ W^T + b ∂L/∂x = ∂L/∂y @ W, + weight/bias grads
softmax ops/softmax.hpp exp(x_i) / Σexp(x_j) Jacobian-vector product
layer_norm ops/layer_norm.hpp (x - μ) / σ · γ + β chain rule with normalization
rms_norm ops/rms_norm.hpp (x / rms) · γ chain rule with rms
gelu ops/gelu.hpp 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))) GELU derivative
dropout ops/dropout.hpp mask(x) / (1-p) pass-through masked gradient
embedding ops/embedding.hpp table[idx] scatter: table.grad[idx] += ∂L/∂y
add ops/add.hpp a + b ∂L/∂a = ∂L/∂y, ∂L/∂b = ∂L/∂y
rope ops/rope.hpp 2D rotation by θ(pos) inverse rotation
swiglu ops/swiglu.hpp swish(x·W_g) ⊙ (x·W_u) @ W_d full chain through gates
cross_entropy ops/cross_entropy.hpp −log P[target] softmax(logits) − one_hot
attention ops/attention.hpp softmax(QK^T/√d) @ V six-step backward pass

Core types

basic_tensor<Scalar, Shape, Allocator>

The central type. Modeled after std::basic_string — parameterised on scalar type, shape representation, and allocator.

// Dynamic shape, float — the default
using tensor   = basic_tensor<float, dynamic_shape>;

// Double precision
using tensor_d = basic_tensor<double, dynamic_shape>;

// Compile-time shape — zero overhead for known dimensions
template <size_t... Dims>
using fixed_tensor = basic_tensor<float, static_shape<Dims...>>;

// GPU-resident tensor — same type, different allocator
template <typename T = float>
using device_tensor = basic_tensor<T, dynamic_shape,
                                    device_allocator<T>>;

A tensor carries:

  • Data — contiguous storage, accessible via data(), as_span(), iterators
  • Shape — static or dynamic, accessible via shape(), total_size()
  • Gradient — same layout as data, accessible via grad(), grad_data()
  • Creator — pointer to the operation that produced this tensor (the tape)

differentiable_op<Op, T>

The single concept that every operation must satisfy:

template <typename Op, typename T>
concept differentiable_op = requires(Op op, const T& grad_output) {
    { op.forward() }               -> convertible_to<T>;
    { op.backward(grad_output) }   -> same_as<void>;
    { op.operands() }              -> ranges::range;
};

No inheritance. No virtual functions in user code. If your struct has these three methods with the right signatures, it is an operation.

Execution policies

Modeled after std::execution::seq / std::execution::par:

auto y = nabla::matmul_forward(execution::seq,  a, b, M, K, N);  // scalar loop
auto y = nabla::matmul_forward(execution::simd, a, b, M, K, N);  // SIMD
auto y = nabla::matmul_forward(execution::gpu,  a, b, M, K, N);  // GPU

Same function, same signature, different policy — different codepath. Resolved at compile time.


Training loop example

A complete training loop with Nabla:

#include <nabla/nabla.hpp>

using namespace nabla;

int main() {
    // Model parameters
    tensor W_embed(dynamic_shape{{vocab_size, hidden}});
    tensor W_q(dynamic_shape{{hidden, hidden}});
    tensor W_k(dynamic_shape{{hidden, hidden}});
    tensor W_v(dynamic_shape{{hidden, hidden}});
    tensor W_out(dynamic_shape{{hidden, hidden}});
    tensor gamma(dynamic_shape{{hidden}});

    // All require gradients
    W_embed.set_requires_grad(true);
    W_q.set_requires_grad(true);
    // ... etc

    // Optimizer
    adamw_optimizer<float> optimizer(1e-3f);
    auto param_id_q = optimizer.add_param(W_q);

    // Training loop
    for (auto& [tokens, targets] : dataloader) {
        // Forward
        auto x = embedding_forward(W_embed, tokens, hidden);
        auto normed = layer_norm_forward(x, gamma, nullptr, batch, hidden);
        auto attn = nabla::multi_head_attention(...);  // from ops/attention.hpp
        auto loss = cross_entropy_forward(logits, targets, batch, vocab_size);

        // Backward
        loss.backward();

        // Update
        optimizer.begin_step();
        optimizer.step(param_id_q, W_q);
        // ...

        W_q.zero_grad();
    }
}

Adding a new operation

Every operation is a plain struct satisfying differentiable_op. No inheritance, no registration.

Create include/nabla/ops/myop.hpp:

#pragma once
#include "../tensor.hpp"

namespace nabla::ops {

template <scalar T, typename Shape, typename Alloc>
struct myop {
    using tensor_type = basic_tensor<T, Shape, Alloc>;

    tensor_type* input;
    tensor_type  result;

    myop(tensor_type& x) : input(&x), result(x.shape()) {}

    tensor_type forward() {
        auto xi = input->data();
        auto ri = result.data();
        for (size_t i = 0; i < input->total_size(); ++i) {
            ri[i] = f(xi[i]);  // your forward formula
        }
        return result;
    }

    void backward(const tensor_type& grad_output) {
        if (!input->requires_grad()) return;
        auto go = grad_output.data();
        auto gx = input->grad_data();
        for (size_t i = 0; i < input->total_size(); ++i) {
            gx[i] += go[i] * df(xi[i]);  // your backward formula
        }
    }

    auto operands() const { return std::array<tensor_type*, 1>{input}; }
};

} // namespace nabla::ops

namespace nabla {
// Free function wrapper (optional but recommended)
template <scalar T, typename S, typename A>
basic_tensor<T, S, A> myop_forward(basic_tensor<T, S, A>& x) {
    ops::myop<T, S, A> op(x);
    auto result = op.forward();
    if (x.requires_grad()) {
        result.set_requires_grad(true);
        result.set_creator(
            std::make_shared<tape_node<ops::myop<T, S, A>, basic_tensor<T, S, A>>>(std::move(op))
        );
    }
    return result;
}
}

Then add #include "ops/myop.hpp" to include/nabla/ops.hpp.

Rules: inputs are tensor_type* (non-owning), outputs and caches are tensor_type (owning), backward accumulates with +=, always check requires_grad(), use T{} literals not float.


Design decisions

Why concepts, not virtual? Virtual dispatch costs ~2ns per call. In a transformer with 32 layers × 12 operations × thousands of tokens, that adds up. More importantly, virtual functions prevent inlining — the compiler cannot see through the call. With concepts and templates, the compiler sees every operation and can optimise across operation boundaries.

Why allocators for GPU? The same reason the standard library uses allocators: to decouple what you store from where you store it. A device_tensor is the same basic_tensor with a different allocator — all algorithms work unchanged. This is the same insight behind std::pmr::vector.

Why type-erasure for the tape? The autodiff tape needs to store heterogeneous operations in a single container. We use type-erasure (tape_node<Op, T> inheriting tape_node_base) — the same pattern as std::function and std::any. The user never sees or inherits the base class; it is an implementation detail.

Why execution policies? std::sort(std::execution::par, ...) switches from sequential to parallel with one token. Nabla does the same: matmul_forward(execution::gpu, a, b, M, K, N) switches from CPU to GPU. The dispatch is compile-time (overload resolution on the policy type), so there is no runtime cost.

Why header-only? Tensor operations are templates parameterised on scalar type, shape, and allocator. Templates must be in headers. The GPU kernels are the exception — they may require separate compilation depending on the backend toolchain (nvcc, hipcc, etc.).

Why decompose ops.hpp? A 1894-line monolithic file violates Single Responsibility Principle and makes the library hard to navigate. Each operation now lives in a focused header (ops/<name>.hpp), making it easy to find, understand, and extend individual operations. ops.hpp becomes a convenience header for including all operations at once. This pattern mirrors the C++ standard library (e.g., <algorithm> includes many focused algorithms).


Examples

nanoGPT Shakespeare training

A full GPT implementation training on the complete works of Shakespeare:

cmake -B build && cmake --build build
./build/examples/nanogpt_shakespeare data/input.txt

The example trains a 2-layer GPT model with 2 heads, 64 embedding dimension, and demonstrates:

  • Token & position embeddings
  • Multi-layer transformer blocks with causal self-attention
  • AdamW optimizer with cosine learning rate scheduling
  • Loss tracking and sample generation

XOR network

A complete autodiff training loop solving XOR:

./build/examples/xor_network

GPU compute (CPU fallback)

./build/examples/gpu_compute

Requirements

  • C++23 compiler (GCC 13+, Clang 17+)
  • CMake 3.25+
  • Optional: <experimental/simd> for SIMD kernels (GCC 11+)
  • Optional: CUDA/ROCm/SYCL/Metal for GPU (currently CPU fallback only)

Name

Nabla — the ∇ operator. The gradient of a scalar field. The one symbol that captures what this library does: compute ∇L with respect to every parameter, automatically, efficiently, in modern C++.


License

Apache 2.0

About

Autodiff and tensor computing for modern C++

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors