∇ Nabla

Autodiff and tensor computing for modern C++.

Nabla is a header-only C++23 library for building and training neural networks. It follows the conventions of the C++ standard library: concepts over inheritance, value semantics, allocator-aware containers, execution policies, and zero-cost abstractions.

No frameworks. No runtime. No Python. Just C++.

Philosophy

One type. Everything is a nabla::tensor. Inputs, outputs, gradients, parameters, activations — one type flows through the entire graph. If two things are tensors, they compose.

One contract. Every differentiable operation satisfies one concept: differentiable_op. Three methods — forward(), backward(), operands(). No base classes, no registration macros, no factory patterns.

One direction. Data flows forward through operations. Gradients flow backward through the same operations, in reverse. The type at every junction is the same tensor. This is why the whole system compiles and why adding a new operation takes five minutes.

Quick start

cmake -B build
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure

#include <nabla/nabla.hpp>

using namespace nabla;

int main() {
    // Tensors — like std::vector, but with gradients
    tensor x(dynamic_shape{{32, 768}});
    tensor W(dynamic_shape{{768, 768}});
    W.set_requires_grad(true);

    // Forward — operations build the graph automatically
    auto y = linear_forward(x, W, nullptr, 32, 768, 768);
    auto z = layer_norm_forward(y, gamma, beta, 32, 768);
    auto loss = cross_entropy_forward(z, targets, 32, vocab_size);

    // Backward — one call, gradients flow through everything
    loss.backward();

    // W.grad() now contains ∂loss/∂W
    adamw.step(W);
}

Architecture

include/nabla/
├── tensor.hpp        Core tensor type, autodiff tape, execution policies, SIMD
├── ops.hpp           Convenience header: includes all ops/*.hpp
├── ops/              Individual operations (one per file)
│   ├── matmul.hpp    matmul: A @ B
│   ├── linear.hpp    linear: x @ W^T + b
│   ├── softmax.hpp   softmax with numerical stability
│   ├── layer_norm.hpp layer normalization + optional bias
│   ├── rms_norm.hpp  RMS normalization
│   ├── gelu.hpp      GELU activation
│   ├── dropout.hpp   Dropout (training/eval modes)
│   ├── embedding.hpp Token/position embeddings
│   ├── add.hpp       Element-wise addition
│   ├── rope.hpp      Rotary positional embeddings
│   ├── swiglu.hpp    SwiGLU FFN
│   ├── cross_entropy.hpp  Cross-entropy loss
│   ├── attention.hpp  Multi-head attention
│   └── swish.hpp     Swish activation wrapper
├── gpu.hpp           Device memory, backend abstraction
├── training.hpp      AdamW optimizer, learning rate scheduling
└── nabla.hpp         Convenience header (includes tensor, ops, gpu, training)

Operations organization

Each operation (13 total) is now in its own header for clarity and modularity:

Operation	File	Forward	Backward
`matmul`	`ops/matmul.hpp`	A @ B	∂L/∂A = ∂L/∂Y @ B^T, ∂L/∂B = A^T @ ∂L/∂Y
`linear`	`ops/linear.hpp`	x @ W^T + b	∂L/∂x = ∂L/∂y @ W, + weight/bias grads
`softmax`	`ops/softmax.hpp`	exp(x_i) / Σexp(x_j)	Jacobian-vector product
`layer_norm`	`ops/layer_norm.hpp`	(x - μ) / σ · γ + β	chain rule with normalization
`rms_norm`	`ops/rms_norm.hpp`	(x / rms) · γ	chain rule with rms
`gelu`	`ops/gelu.hpp`	0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³)))	GELU derivative
`dropout`	`ops/dropout.hpp`	mask(x) / (1-p)	pass-through masked gradient
`embedding`	`ops/embedding.hpp`	table[idx]	scatter: table.grad[idx] += ∂L/∂y
`add`	`ops/add.hpp`	a + b	∂L/∂a = ∂L/∂y, ∂L/∂b = ∂L/∂y
`rope`	`ops/rope.hpp`	2D rotation by θ(pos)	inverse rotation
`swiglu`	`ops/swiglu.hpp`	swish(x·W_g) ⊙ (x·W_u) @ W_d	full chain through gates
`cross_entropy`	`ops/cross_entropy.hpp`	−log P[target]	softmax(logits) − one_hot
`attention`	`ops/attention.hpp`	softmax(QK^T/√d) @ V	six-step backward pass

Core types

`basic_tensor<Scalar, Shape, Allocator>`

The central type. Modeled after std::basic_string — parameterised on scalar type, shape representation, and allocator.

// Dynamic shape, float — the default
using tensor   = basic_tensor<float, dynamic_shape>;

// Double precision
using tensor_d = basic_tensor<double, dynamic_shape>;

// Compile-time shape — zero overhead for known dimensions
template <size_t... Dims>
using fixed_tensor = basic_tensor<float, static_shape<Dims...>>;

// GPU-resident tensor — same type, different allocator
template <typename T = float>
using device_tensor = basic_tensor<T, dynamic_shape,
                                    device_allocator<T>>;

A tensor carries:

Data — contiguous storage, accessible via data(), as_span(), iterators
Shape — static or dynamic, accessible via shape(), total_size()
Gradient — same layout as data, accessible via grad(), grad_data()
Creator — pointer to the operation that produced this tensor (the tape)

`differentiable_op<Op, T>`

The single concept that every operation must satisfy:

template <typename Op, typename T>
concept differentiable_op = requires(Op op, const T& grad_output) {
    { op.forward() }               -> convertible_to<T>;
    { op.backward(grad_output) }   -> same_as<void>;
    { op.operands() }              -> ranges::range;
};

No inheritance. No virtual functions in user code. If your struct has these three methods with the right signatures, it is an operation.

Execution policies

Modeled after std::execution::seq / std::execution::par:

auto y = nabla::matmul_forward(execution::seq,  a, b, M, K, N);  // scalar loop
auto y = nabla::matmul_forward(execution::simd, a, b, M, K, N);  // SIMD
auto y = nabla::matmul_forward(execution::gpu,  a, b, M, K, N);  // GPU

Same function, same signature, different policy — different codepath. Resolved at compile time.

Training loop example

A complete training loop with Nabla:

#include <nabla/nabla.hpp>

using namespace nabla;

int main() {
    // Model parameters
    tensor W_embed(dynamic_shape{{vocab_size, hidden}});
    tensor W_q(dynamic_shape{{hidden, hidden}});
    tensor W_k(dynamic_shape{{hidden, hidden}});
    tensor W_v(dynamic_shape{{hidden, hidden}});
    tensor W_out(dynamic_shape{{hidden, hidden}});
    tensor gamma(dynamic_shape{{hidden}});

    // All require gradients
    W_embed.set_requires_grad(true);
    W_q.set_requires_grad(true);
    // ... etc

    // Optimizer
    adamw_optimizer<float> optimizer(1e-3f);
    auto param_id_q = optimizer.add_param(W_q);

    // Training loop
    for (auto& [tokens, targets] : dataloader) {
        // Forward
        auto x = embedding_forward(W_embed, tokens, hidden);
        auto normed = layer_norm_forward(x, gamma, nullptr, batch, hidden);
        auto attn = nabla::multi_head_attention(...);  // from ops/attention.hpp
        auto loss = cross_entropy_forward(logits, targets, batch, vocab_size);

        // Backward
        loss.backward();

        // Update
        optimizer.begin_step();
        optimizer.step(param_id_q, W_q);
        // ...

        W_q.zero_grad();
    }
}

Adding a new operation

Every operation is a plain struct satisfying differentiable_op. No inheritance, no registration.

Create include/nabla/ops/myop.hpp:

#pragma once
#include "../tensor.hpp"

namespace nabla::ops {

template <scalar T, typename Shape, typename Alloc>
struct myop {
    using tensor_type = basic_tensor<T, Shape, Alloc>;

    tensor_type* input;
    tensor_type  result;

    myop(tensor_type& x) : input(&x), result(x.shape()) {}

    tensor_type forward() {
        auto xi = input->data();
        auto ri = result.data();
        for (size_t i = 0; i < input->total_size(); ++i) {
            ri[i] = f(xi[i]);  // your forward formula
        }
        return result;
    }

    void backward(const tensor_type& grad_output) {
        if (!input->requires_grad()) return;
        auto go = grad_output.data();
        auto gx = input->grad_data();
        for (size_t i = 0; i < input->total_size(); ++i) {
            gx[i] += go[i] * df(xi[i]);  // your backward formula
        }
    }

    auto operands() const { return std::array<tensor_type*, 1>{input}; }
};

} // namespace nabla::ops

namespace nabla {
// Free function wrapper (optional but recommended)
template <scalar T, typename S, typename A>
basic_tensor<T, S, A> myop_forward(basic_tensor<T, S, A>& x) {
    ops::myop<T, S, A> op(x);
    auto result = op.forward();
    if (x.requires_grad()) {
        result.set_requires_grad(true);
        result.set_creator(
            std::make_shared<tape_node<ops::myop<T, S, A>, basic_tensor<T, S, A>>>(std::move(op))
        );
    }
    return result;
}
}

Then add #include "ops/myop.hpp" to include/nabla/ops.hpp.

Rules: inputs are tensor_type* (non-owning), outputs and caches are tensor_type (owning), backward accumulates with +=, always check requires_grad(), use T{} literals not float.

Design decisions

Why concepts, not virtual? Virtual dispatch costs ~2ns per call. In a transformer with 32 layers × 12 operations × thousands of tokens, that adds up. More importantly, virtual functions prevent inlining — the compiler cannot see through the call. With concepts and templates, the compiler sees every operation and can optimise across operation boundaries.

Why allocators for GPU? The same reason the standard library uses allocators: to decouple what you store from where you store it. A device_tensor is the same basic_tensor with a different allocator — all algorithms work unchanged. This is the same insight behind std::pmr::vector.

Why type-erasure for the tape? The autodiff tape needs to store heterogeneous operations in a single container. We use type-erasure (tape_node<Op, T> inheriting tape_node_base) — the same pattern as std::function and std::any. The user never sees or inherits the base class; it is an implementation detail.

Why execution policies? std::sort(std::execution::par, ...) switches from sequential to parallel with one token. Nabla does the same: matmul_forward(execution::gpu, a, b, M, K, N) switches from CPU to GPU. The dispatch is compile-time (overload resolution on the policy type), so there is no runtime cost.

Why header-only? Tensor operations are templates parameterised on scalar type, shape, and allocator. Templates must be in headers. The GPU kernels are the exception — they may require separate compilation depending on the backend toolchain (nvcc, hipcc, etc.).

Why decompose ops.hpp? A 1894-line monolithic file violates Single Responsibility Principle and makes the library hard to navigate. Each operation now lives in a focused header (ops/<name>.hpp), making it easy to find, understand, and extend individual operations. ops.hpp becomes a convenience header for including all operations at once. This pattern mirrors the C++ standard library (e.g., <algorithm> includes many focused algorithms).

Examples

nanoGPT Shakespeare training

A full GPT implementation training on the complete works of Shakespeare:

cmake -B build && cmake --build build
./build/examples/nanogpt_shakespeare data/input.txt

The example trains a 2-layer GPT model with 2 heads, 64 embedding dimension, and demonstrates:

Token & position embeddings
Multi-layer transformer blocks with causal self-attention
AdamW optimizer with cosine learning rate scheduling
Loss tracking and sample generation

XOR network

A complete autodiff training loop solving XOR:

./build/examples/xor_network

GPU compute (CPU fallback)

./build/examples/gpu_compute

Requirements

C++23 compiler (GCC 13+, Clang 17+)
CMake 3.25+
Optional: <experimental/simd> for SIMD kernels (GCC 11+)
Optional: CUDA/ROCm/SYCL/Metal for GPU (currently CPU fallback only)

Name

Nabla — the ∇ operator. The gradient of a scalar field. The one symbol that captures what this library does: compute ∇L with respect to every parameter, automatically, efficiently, in modern C++.

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
examples		examples
include/nabla		include/nabla
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

∇ Nabla

Philosophy

Quick start

Architecture

Operations organization

Core types

`basic_tensor<Scalar, Shape, Allocator>`

`differentiable_op<Op, T>`

Execution policies

Training loop example

Adding a new operation

Design decisions

Examples

nanoGPT Shakespeare training

XOR network

GPU compute (CPU fallback)

Requirements

Name

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

∇ Nabla

Philosophy

Quick start

Architecture

Operations organization

Core types

basic_tensor<Scalar, Shape, Allocator>

differentiable_op<Op, T>

Execution policies

Training loop example

Adding a new operation

Design decisions

Examples

nanoGPT Shakespeare training

XOR network

GPU compute (CPU fallback)

Requirements

Name

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`basic_tensor<Scalar, Shape, Allocator>`

`differentiable_op<Op, T>`

Packages