Autodiff and tensor computing for modern C++.
Nabla is a header-only C++23 library for building and training neural networks. It follows the conventions of the C++ standard library: concepts over inheritance, value semantics, allocator-aware containers, execution policies, and zero-cost abstractions.
No frameworks. No runtime. No Python. Just C++.
One type. Everything is a nabla::tensor. Inputs, outputs, gradients, parameters, activations — one type flows through the entire graph. If two things are tensors, they compose.
One contract. Every differentiable operation satisfies one concept: differentiable_op. Three methods — forward(), backward(), operands(). No base classes, no registration macros, no factory patterns.
One direction. Data flows forward through operations. Gradients flow backward through the same operations, in reverse. The type at every junction is the same tensor. This is why the whole system compiles and why adding a new operation takes five minutes.
cmake -B build
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure#include <nabla/nabla.hpp>
using namespace nabla;
int main() {
// Tensors — like std::vector, but with gradients
tensor x(dynamic_shape{{32, 768}});
tensor W(dynamic_shape{{768, 768}});
W.set_requires_grad(true);
// Forward — operations build the graph automatically
auto y = linear_forward(x, W, nullptr, 32, 768, 768);
auto z = layer_norm_forward(y, gamma, beta, 32, 768);
auto loss = cross_entropy_forward(z, targets, 32, vocab_size);
// Backward — one call, gradients flow through everything
loss.backward();
// W.grad() now contains ∂loss/∂W
adamw.step(W);
}include/nabla/
├── tensor.hpp Core tensor type, autodiff tape, execution policies, SIMD
├── ops.hpp Convenience header: includes all ops/*.hpp
├── ops/ Individual operations (one per file)
│ ├── matmul.hpp matmul: A @ B
│ ├── linear.hpp linear: x @ W^T + b
│ ├── softmax.hpp softmax with numerical stability
│ ├── layer_norm.hpp layer normalization + optional bias
│ ├── rms_norm.hpp RMS normalization
│ ├── gelu.hpp GELU activation
│ ├── dropout.hpp Dropout (training/eval modes)
│ ├── embedding.hpp Token/position embeddings
│ ├── add.hpp Element-wise addition
│ ├── rope.hpp Rotary positional embeddings
│ ├── swiglu.hpp SwiGLU FFN
│ ├── cross_entropy.hpp Cross-entropy loss
│ ├── attention.hpp Multi-head attention
│ └── swish.hpp Swish activation wrapper
├── gpu.hpp Device memory, backend abstraction
├── training.hpp AdamW optimizer, learning rate scheduling
└── nabla.hpp Convenience header (includes tensor, ops, gpu, training)
Each operation (13 total) is now in its own header for clarity and modularity:
| Operation | File | Forward | Backward |
|---|---|---|---|
matmul |
ops/matmul.hpp |
A @ B | ∂L/∂A = ∂L/∂Y @ B^T, ∂L/∂B = A^T @ ∂L/∂Y |
linear |
ops/linear.hpp |
x @ W^T + b | ∂L/∂x = ∂L/∂y @ W, + weight/bias grads |
softmax |
ops/softmax.hpp |
exp(x_i) / Σexp(x_j) | Jacobian-vector product |
layer_norm |
ops/layer_norm.hpp |
(x - μ) / σ · γ + β | chain rule with normalization |
rms_norm |
ops/rms_norm.hpp |
(x / rms) · γ | chain rule with rms |
gelu |
ops/gelu.hpp |
0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))) | GELU derivative |
dropout |
ops/dropout.hpp |
mask(x) / (1-p) | pass-through masked gradient |
embedding |
ops/embedding.hpp |
table[idx] | scatter: table.grad[idx] += ∂L/∂y |
add |
ops/add.hpp |
a + b | ∂L/∂a = ∂L/∂y, ∂L/∂b = ∂L/∂y |
rope |
ops/rope.hpp |
2D rotation by θ(pos) | inverse rotation |
swiglu |
ops/swiglu.hpp |
swish(x·W_g) ⊙ (x·W_u) @ W_d | full chain through gates |
cross_entropy |
ops/cross_entropy.hpp |
−log P[target] | softmax(logits) − one_hot |
attention |
ops/attention.hpp |
softmax(QK^T/√d) @ V | six-step backward pass |
The central type. Modeled after std::basic_string — parameterised on scalar type, shape representation, and allocator.
// Dynamic shape, float — the default
using tensor = basic_tensor<float, dynamic_shape>;
// Double precision
using tensor_d = basic_tensor<double, dynamic_shape>;
// Compile-time shape — zero overhead for known dimensions
template <size_t... Dims>
using fixed_tensor = basic_tensor<float, static_shape<Dims...>>;
// GPU-resident tensor — same type, different allocator
template <typename T = float>
using device_tensor = basic_tensor<T, dynamic_shape,
device_allocator<T>>;A tensor carries:
- Data — contiguous storage, accessible via
data(),as_span(), iterators - Shape — static or dynamic, accessible via
shape(),total_size() - Gradient — same layout as data, accessible via
grad(),grad_data() - Creator — pointer to the operation that produced this tensor (the tape)
The single concept that every operation must satisfy:
template <typename Op, typename T>
concept differentiable_op = requires(Op op, const T& grad_output) {
{ op.forward() } -> convertible_to<T>;
{ op.backward(grad_output) } -> same_as<void>;
{ op.operands() } -> ranges::range;
};No inheritance. No virtual functions in user code. If your struct has these three methods with the right signatures, it is an operation.
Modeled after std::execution::seq / std::execution::par:
auto y = nabla::matmul_forward(execution::seq, a, b, M, K, N); // scalar loop
auto y = nabla::matmul_forward(execution::simd, a, b, M, K, N); // SIMD
auto y = nabla::matmul_forward(execution::gpu, a, b, M, K, N); // GPUSame function, same signature, different policy — different codepath. Resolved at compile time.
A complete training loop with Nabla:
#include <nabla/nabla.hpp>
using namespace nabla;
int main() {
// Model parameters
tensor W_embed(dynamic_shape{{vocab_size, hidden}});
tensor W_q(dynamic_shape{{hidden, hidden}});
tensor W_k(dynamic_shape{{hidden, hidden}});
tensor W_v(dynamic_shape{{hidden, hidden}});
tensor W_out(dynamic_shape{{hidden, hidden}});
tensor gamma(dynamic_shape{{hidden}});
// All require gradients
W_embed.set_requires_grad(true);
W_q.set_requires_grad(true);
// ... etc
// Optimizer
adamw_optimizer<float> optimizer(1e-3f);
auto param_id_q = optimizer.add_param(W_q);
// Training loop
for (auto& [tokens, targets] : dataloader) {
// Forward
auto x = embedding_forward(W_embed, tokens, hidden);
auto normed = layer_norm_forward(x, gamma, nullptr, batch, hidden);
auto attn = nabla::multi_head_attention(...); // from ops/attention.hpp
auto loss = cross_entropy_forward(logits, targets, batch, vocab_size);
// Backward
loss.backward();
// Update
optimizer.begin_step();
optimizer.step(param_id_q, W_q);
// ...
W_q.zero_grad();
}
}Every operation is a plain struct satisfying differentiable_op. No inheritance, no registration.
Create include/nabla/ops/myop.hpp:
#pragma once
#include "../tensor.hpp"
namespace nabla::ops {
template <scalar T, typename Shape, typename Alloc>
struct myop {
using tensor_type = basic_tensor<T, Shape, Alloc>;
tensor_type* input;
tensor_type result;
myop(tensor_type& x) : input(&x), result(x.shape()) {}
tensor_type forward() {
auto xi = input->data();
auto ri = result.data();
for (size_t i = 0; i < input->total_size(); ++i) {
ri[i] = f(xi[i]); // your forward formula
}
return result;
}
void backward(const tensor_type& grad_output) {
if (!input->requires_grad()) return;
auto go = grad_output.data();
auto gx = input->grad_data();
for (size_t i = 0; i < input->total_size(); ++i) {
gx[i] += go[i] * df(xi[i]); // your backward formula
}
}
auto operands() const { return std::array<tensor_type*, 1>{input}; }
};
} // namespace nabla::ops
namespace nabla {
// Free function wrapper (optional but recommended)
template <scalar T, typename S, typename A>
basic_tensor<T, S, A> myop_forward(basic_tensor<T, S, A>& x) {
ops::myop<T, S, A> op(x);
auto result = op.forward();
if (x.requires_grad()) {
result.set_requires_grad(true);
result.set_creator(
std::make_shared<tape_node<ops::myop<T, S, A>, basic_tensor<T, S, A>>>(std::move(op))
);
}
return result;
}
}Then add #include "ops/myop.hpp" to include/nabla/ops.hpp.
Rules: inputs are tensor_type* (non-owning), outputs and caches are tensor_type (owning), backward accumulates with +=, always check requires_grad(), use T{} literals not float.
Why concepts, not virtual? Virtual dispatch costs ~2ns per call. In a transformer with 32 layers × 12 operations × thousands of tokens, that adds up. More importantly, virtual functions prevent inlining — the compiler cannot see through the call. With concepts and templates, the compiler sees every operation and can optimise across operation boundaries.
Why allocators for GPU? The same reason the standard library uses allocators: to decouple what you store from where you store it. A device_tensor is the same basic_tensor with a different allocator — all algorithms work unchanged. This is the same insight behind std::pmr::vector.
Why type-erasure for the tape? The autodiff tape needs to store heterogeneous operations in a single container. We use type-erasure (tape_node<Op, T> inheriting tape_node_base) — the same pattern as std::function and std::any. The user never sees or inherits the base class; it is an implementation detail.
Why execution policies? std::sort(std::execution::par, ...) switches from sequential to parallel with one token. Nabla does the same: matmul_forward(execution::gpu, a, b, M, K, N) switches from CPU to GPU. The dispatch is compile-time (overload resolution on the policy type), so there is no runtime cost.
Why header-only? Tensor operations are templates parameterised on scalar type, shape, and allocator. Templates must be in headers. The GPU kernels are the exception — they may require separate compilation depending on the backend toolchain (nvcc, hipcc, etc.).
Why decompose ops.hpp? A 1894-line monolithic file violates Single Responsibility Principle and makes the library hard to navigate. Each operation now lives in a focused header (ops/<name>.hpp), making it easy to find, understand, and extend individual operations. ops.hpp becomes a convenience header for including all operations at once. This pattern mirrors the C++ standard library (e.g., <algorithm> includes many focused algorithms).
A full GPT implementation training on the complete works of Shakespeare:
cmake -B build && cmake --build build
./build/examples/nanogpt_shakespeare data/input.txtThe example trains a 2-layer GPT model with 2 heads, 64 embedding dimension, and demonstrates:
- Token & position embeddings
- Multi-layer transformer blocks with causal self-attention
- AdamW optimizer with cosine learning rate scheduling
- Loss tracking and sample generation
A complete autodiff training loop solving XOR:
./build/examples/xor_network./build/examples/gpu_compute- C++23 compiler (GCC 13+, Clang 17+)
- CMake 3.25+
- Optional:
<experimental/simd>for SIMD kernels (GCC 11+) - Optional: CUDA/ROCm/SYCL/Metal for GPU (currently CPU fallback only)
Nabla — the ∇ operator. The gradient of a scalar field. The one symbol that captures what this library does: compute ∇L with respect to every parameter, automatically, efficiently, in modern C++.
Apache 2.0