A Rust-native deep learning framework built on libtorch.
Same GPU kernels as PyTorch. No Python. No GIL. No GC. Just Rust.
PyTorch Users • Thesis • Getting Started • Graph Builder • Graph Tree • Training • Multi-GPU • HuggingFace • Parity • Benchmarks • Roadmap • Migration Guide • Data Loading
What's new in 0.5.0 -- the `fdl` CLI maturity pass. New proc-macro crate `flodl-cli-macros` adds `#[derive(FdlArgs)]` -- any Rust binary gets typed argv parsing, JSON schema, shell completions, and env-var fallback for free. `fdl.yml` consolidates to a single `commands:` map with three clean kinds (`run:` / `path:` / preset). New `--env` overlays and `fdl config show` surface per-environment config with per-field origin annotations, so you can see the resolved YAML before running a two-hour job. Migration from 0.4.0: see UPGRADE.md.
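A minimal sketch of the derive in use; the struct, its fields, and the `parse()` entry point are illustrative assumptions, not the crate's exact API:

```rust
use flodl_cli_macros::FdlArgs;

// Illustrative only: field names and the parse() entry point are assumptions.
#[derive(FdlArgs)]
struct TrainArgs {
    /// Number of epochs (env-var fallback if missing from argv).
    epochs: usize,
    /// Learning rate.
    lr: f64,
    /// Optional checkpoint to resume from.
    resume: Option<String>,
}

fn main() {
    let args = TrainArgs::parse(); // typed argv + env fallback, per the derive
    println!("training {} epochs at lr={}", args.epochs, args.lr);
}
```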
**PyTorch**

```python
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.GELU(),
    nn.LayerNorm(16),
    nn.Linear(16, 2),
)
pred = model(x)
loss = F.mse_loss(pred, target)
loss.backward()
optimizer.step()
```

**floDl**

```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .through(Linear::new(16, 2)?)
    .build()?;
let pred = model.forward(&x)?;
let loss = mse_loss(&pred, &target)?;
loss.backward()?;
optimizer.step()?;
```
Same concepts, same names, same GPU kernels underneath. The `?` operator replaces silent failures with explicit, compile-checked error handling. `Drop` replaces the garbage collector. The full migration guide covers every op, module, and pattern.
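A tiny illustration of what that buys (a sketch: `Tensor::randn` and the exact error text are assumptions):

```rust
// Sketch only: Tensor::randn is an assumed constructor. A shape mismatch
// surfaces as a Result::Err at the call site instead of a runtime panic
// deep inside C++.
let layer = Linear::new(2, 16)?;
let x = Variable::new(Tensor::randn(&[4, 3])?, false); // inner dim 3 != 2

match layer.forward(&x) {
    Ok(_) => unreachable!("shape mismatch should not succeed"),
    Err(e) => eprintln!("caught at the call site: {e}"),
}
```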
New to Rust? Read Rust for PyTorch Users — 10 patterns in 15 minutes.
With the CLI (recommended, no Rust needed):
```sh
curl -sL https://flodl.dev/fdl -o fdl && chmod +x fdl
./fdl setup          # detect hardware, download libtorch, configure build environment
./fdl init my-proj   # scaffold a new project with training template
```

The `fdl` script auto-downloads a pre-compiled CLI binary (~750 KB, pure Rust, no libtorch dependency). It detects your GPUs, downloads the right libtorch variant, and configures Docker or native builds. See the full CLI reference for all commands.
One-liner with Docker (no Rust, no setup):
```sh
curl -sL https://flodl.dev/init.sh | sh -s my-project
cd my-project
./fdl build   # first build (~5 min, downloads libtorch)
./fdl run     # train the model
```

Native -- Rust 1.85+ and libtorch:

```sh
./fdl libtorch download   # auto-detects CPU or CUDA
cargo add flodl && cargo build
```

For CUDA: `cargo add flodl --features cuda` + CUDA toolkit.
Using tch-rs or PyTorch C++?
`fdl` also works as a standalone libtorch manager outside of flodl: download any CPU/CUDA variant, switch between installs, compile from source for mixed GPU architectures (e.g. sm_61 + sm_120 in one build), and emit a machine-readable diagnostics report. No flodl buy-in required. See docs/cli.md § Standalone and the `flodl-cli` crate.
Both paths generate an annotated training template. Edit src/main.rs to
build your model:
```rust
use flodl::*;

let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)
    .through(LayerNorm::new(16)?)
    .also(Linear::new(16, 16)?)   // residual connection
    .through(Linear::new(16, 2)?)
    .build()?;

let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.01);

model.train();
for (input_t, target_t) in &batches {
    let input = Variable::new(input_t.clone(), true);
    let target = Variable::new(target_t.clone(), false);
    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;
    optimizer.zero_grad();
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}
```

For framework-managed training (same code on CPU, single GPU, multi-GPU),
the universal Trainer takes a step closure and owns the loop:
```rust
// One step: forward + loss, returns the loss Variable.
fn train_step(model: &dyn Module, batch: &[Tensor]) -> Result<Variable> {
    let input = Variable::new(batch[0].clone(), false);
    let target = Variable::new(batch[1].clone(), false);
    mse_loss(&model.forward(&input)?, &target)
}

Trainer::builder(
    |dev| build_model_on(dev),
    |params| Adam::new(params, 0.01),
    train_step,
)
.dataset(dataset)
.batch_size(32)
.num_epochs(num_epochs)
.run()?
.join()?;
```

`Trainer::setup` (you own the loop, the framework owns the setup) is the in-between tier. See Tutorial 4: Training for all three tiers.
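A compressed sketch of that middle tier, reusing the `Trainer::setup` shape shown in the Multi-GPU section below (`model` and `builder` are placeholders):

```rust
// Middle tier, sketched: the framework owns setup (device detection,
// replication, optimizer), you own the loop. Same Trainer::setup shape
// as the Multi-GPU section; `builder` is a data-loader builder.
Trainer::setup(&model, &builder, |p| Adam::new(p, 0.01))?;

for batch in model.epoch(0) {
    let _loss = model.forward_batch(&batch?)?;
    model.step()?; // sync + optimizer + zero_grad
}
```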
floDl's fluent graph builder lets you describe complex architectures as
readable data flow — no boilerplate, no nn.Module subclassing.
```rust
let model = FlowBuilder::from(Linear::new(2, 16)?)
    .through(GELU)                 // activation
    .through(LayerNorm::new(16)?)  // normalization
    .also(Linear::new(16, 16)?)    // residual connection
    .through(Linear::new(16, 2)?)  // output projection
    .build()?;
```

`build()` returns a `Graph` that implements `Module` — you can nest it inside other graphs. Things get interesting when architectures get complex:
```rust
let g = FlowBuilder::from(encoder).tag("encoded")
    .split(modules![head_a, head_b, head_c]).merge(MergeOp::Mean)
    .loop_body(refinement_block).for_n(3).tag("refined")
    .gate(router, modules![expert_a, expert_b]).using(&["encoded"])
    .switch(selector, modules![light_path, heavy_path]).using(&["refined"])
    .through(StateAdd).using(&["memory"]).tag("memory")
    .loop_body(decoder).while_cond(halt_condition, 10)
    .through(output_head)
    .build()?;
```

Every construct — `split`/`merge`, `also`, `loop_body`, `gate`, `switch`, `map`, `tag`/`using` — composes cleanly. Forward references (`using` before `tag`) carry state across calls, enabling recurrent architectures without special-casing (a minimal sketch follows).
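A sketch of that recurrent pattern with placeholder module names (`input_proj`, `update_net`, `readout`):

```rust
// The pattern is the point: "state" is read by using() before it is
// written by tag(), so each forward() call consumes the state produced
// by the previous call.
let cell = FlowBuilder::from(input_proj)
    .through(StateAdd).using(&["state"]) // forward reference: last call's state
    .through(update_net).tag("state")    // written here, read next call
    .through(readout)
    .build()?;
```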
| Method | What it does |
|---|---|
| `from(m).through(m)` | Linear chain |
| `also(m)` | Residual: input + m(input) |
| `fork(m)` | Side branch: capture output as tag, stream continues |
| `split(modules![...]).merge(op)` | Parallel branches, merged by Add or Mean |
| `tag(name)` / `using(refs)` | Named references — backward or forward (across calls) |
| `loop_body(body).for_n(n)` | Fixed iteration with BPTT |
| `loop_body(body).while_cond` / `until_cond` | Conditional loops |
| `gate(router, modules![...])` | Soft routing — weighted combination |
| `switch(selector, modules![...])` | Hard routing — only selected branch |
| `map(body).each()` / `.over(tag)` / `.slices(n)` | Element-wise, tagged, or sliced iteration |
| `input(names)` | Auxiliary graph inputs for multi-input architectures |
See the Graph Builder Tutorial and the full showcase.
This is where floDl goes beyond PyTorch. Graphs nest inside graphs with label-path addressing — dot-separated paths that let you reach into any subgraph from the root. Train components independently, compose them into larger architectures, and control training phases declaratively.
```rust
// Build components independently
let scan = FlowBuilder::from(scan_net).tag("hidden")
    .label("scan").build()?;
let read = FlowBuilder::from(read_net).tag("confidence")
    .label("read").build()?;
let encoder = FlowBuilder::from(scan)
    .through(read)
    .label("encoder").build()?;

// Compose into full model
let model = FlowBuilder::from(encoder)
    .through(classifier)
    .build()?;
```

Every tag and subgraph is addressable through dotted paths from the root:
model.validate_path("encoder")?; // -> Subgraph
model.validate_path("encoder.scan.hidden")?; // -> Tag (three levels deep)
model.validate_path("encoder.read.confidence")?; // -> TagFreeze and thaw entire subtrees by path — no manual parameter iteration:
```rust
// Phase 1: train only the classifier, encoder is frozen
model.freeze("encoder")?;
let fresh_params = model.parameters(); // only unfrozen params
let mut opt = Adam::new(&fresh_params, 1e-3);
// ... train ...

// Phase 2: thaw scan, keep read frozen (it's proven)
model.thaw("encoder.scan")?;
let mut opt = Adam::with_groups()
    .group(&model.parameters_at("encoder.scan")?, 1e-4) // low LR
    .group(&model.parameters_at("classifier")?, 1e-3)
    .build();
```

Train a component standalone, save it, load it into a larger model:
```rust
// Pre-trained encoder saved earlier
encoder.save_checkpoint("encoder_v1.fdl.gz")?;

// Load into the composed model — namespace + hash validated
model.load_subgraph_checkpoint("encoder", "encoder_v1.fdl.gz")?;
model.freeze("encoder.read")?; // lock what's proven
```

Metrics flow up through the tree automatically:
model.record_at("encoder.scan.loss", scan_loss)?;
model.record_at("encoder.read.accuracy", read_acc)?;
model.record_scalar("total_loss", total)?;
model.flush(&[]); // single call flushes the entire tree
// Trends across boundaries — drive training decisions
if model.trend_at("encoder.scan.loss")?.stalled(10, 1e-4) {
model.thaw("encoder.read")?; // scan stalled, unfreeze read
}
// Monitor sees all metrics with dotted names automatically
monitor.log(epoch, elapsed, &model);
// -> total_loss, encoder.scan.loss, encoder.read.accuracyThis is progressive model composition: each component is trained and validated independently before becoming a building block in a larger architecture. Checkpoints, metrics, and training phases compose just like the graphs themselves.
See the full Graph Tree Tutorial.
Drop-in monitor with adaptive ETA, resource tracking, and a live web dashboard — no external dependencies, no separate process.
```rust
use flodl::monitor::Monitor;

let mut monitor = Monitor::new(num_epochs);
monitor.serve(3000)?; // optional: live dashboard at http://localhost:3000

for epoch in 0..num_epochs {
    let t = std::time::Instant::now();
    // ... training ...
    monitor.log(epoch, t.elapsed(), &model); // sees entire graph tree
}
monitor.finish();
```

```text
epoch   1/100  loss=1.5264  [49ms ETA 4.8s]
epoch  10/100  loss=0.3817  [25ms ETA 2.2s]  VRAM: 2.1/6.0 GB (82%)
epoch  50/100  loss=0.0023  [24ms ETA 1.2s]  VRAM: 2.1/6.0 GB (82%)
epoch 100/100  loss=0.0012  [23ms]           VRAM: 2.1/6.0 GB (82%)
training complete in 2.8s | loss: 0.0012
```
Interactive benchmark dashboard — real data from a 100-epoch training run
The live dashboard updates via Server-Sent Events (no WebSocket, no npm), tracks CPU/GPU/RAM/VRAM, and supports late join — open it mid-training and all past epochs backfill instantly.
```rust
monitor.save_html("training_report.html"); // self-contained archive
monitor.export_csv("training.csv")?;       // for external analysis
```

Tags double as observation points. Collect metrics during training and use trend queries to make programmatic training decisions:
```rust
for epoch in 0..num_epochs {
    for (input, target) in &batches {
        let pred = graph.forward(&input)?;
        graph.collect(&["hidden"])?;               // from graph tag
        graph.record_scalar("loss", loss.item()?); // external metric
    }
    graph.flush(&["hidden", "loss"]);

    // Programmatic training control
    if graph.trend("loss").stalled(5, 1e-4) {
        optimizer.set_lr(optimizer.lr() * 0.5); // decay LR
    }
    if graph.trend("loss").converged(5, 1e-5) {
        break; // early stopping
    }
}
```

| Method | What it does |
|---|---|
| `g.collect(tags)` / `g.flush(tags)` | Batch -> epoch metric aggregation |
| `g.record_scalar(tag, value)` | Inject external metrics (loss, accuracy) |
| `g.trend(tag).slope(n)` | OLS slope over last n epochs |
| `g.trend(tag).stalled(n, tol)` | Is \|slope\| below tolerance? |
| `g.trend(tag).improving(n)` | Is loss decreasing? |
| `g.trend(tag).converged(n, tol)` | Is variance below tolerance? |
| `g.trends(tags).all_improving(n)` | Group queries across branches |
```rust
let svg = g.svg(Some("model.svg"))?;              // architecture diagram
g.svg_with_profile(Some("profile.svg"))?;         // timing heatmap
g.plot_html("training.html", &["loss", "head"])?; // interactive curves
```

See the Training Monitor Tutorial and the Observation example.
Trainer::setup() gives you transparent heterogeneous multi-GPU training with
zero changes to your training loop. floDl detects your GPUs, picks the best
strategy, and balances work automatically: the slowest GPU anchors the pace
while faster ones run ahead intelligently.
Graph DDP -- one line to go from single-GPU to multi-GPU:
```rust
// Detect GPUs, replicate model, set optimizer, enable training
Trainer::setup(&model, &builder, |p| Adam::new(p, 0.001))?;

// Training loop is IDENTICAL for 1 or N GPUs
for batch in model.epoch(0) {
    let loss = model.forward_batch(&batch?)?;
    model.step()?; // AllReduce + sync + optimizer + zero_grad
}
```

DDP Builder -- thread-per-GPU, works with any `Module`:
```rust
let state = Trainer::builder(model_factory, optim_factory, train_fn)
    .dataset(dataset)
    .batch_size(32)
    .num_epochs(10)
    .policy(ApplyPolicy::Cadence)  // ElChe for mixed GPUs
    .backend(AverageBackend::Nccl) // or Cpu for A/B testing
    .run()?
    .join()?;
```

| | Graph DDP | DDP Builder |
|---|---|---|
| Works with | Graph builder | Any `Module` |
| GPU model | Scatter per batch | Thread per GPU (Local SGD) |
| Mixed GPUs | El Che auto-enabled | `ApplyPolicy` x `AverageBackend` |
| Setup | One line (`Trainer::setup`) | Builder pattern |
| Dashboard | Integrated | Stderr logging |
A/B testing: swap `AverageBackend::Nccl` for `AverageBackend::Cpu` with one line. If the loss curves match, you have validated the cheaper backend for your workload.
See the Multi-GPU Tutorial, DDP Builder Tutorial, Data Loading Tutorial, and DDP Reference.
The repo ships with ddp-bench/,
a workspace member that reproduces published training setups (Logistic /
MLP / LeNet-5 / ResNet-20 / Char-RNN / GPT-nano / Conv-AE on MNIST,
CIFAR-10, Shakespeare) to build scientifically valid solo baselines, then
measures DDP/ElChe convergence quality against them across all 8
backend × policy combinations:
```sh
fdl ddp-bench --list       # list models and modes
fdl ddp-bench quick        # 1-epoch smoke test
fdl ddp-bench validate     # full sweep vs structured baselines
fdl ddp-bench --model gpt-nano --mode nccl-cadence --epochs 50 --lr-scale 2
fdl ddp-bench --report runs/report.md   # convergence report from saved runs
```

Every run produces a high-frequency Timeline (CPU/GPU utilization, sync events, anchor changes, idle gaps) saved as JSON / CSV / interactive HTML under `runs/<model>/<mode>/`.
The framework ships ready-to-use parsers for common benchmarks (all implement `BatchDataSet` and plug straight into `DataLoader::builder`):
```rust
use flodl::data::datasets::{Cifar10, Mnist, Shakespeare};

let mnist = Mnist::parse(&images_gz, &labels_gz)?;
let cifar = Cifar10::parse(&[&batch1, &batch2, /* ... */])?;
let text  = Shakespeare::parse(&corpus, /*seq_len=*/ 128)?;
```

`ddp-bench` downloads and caches the underlying files on first run.
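A sketch of that wiring; only `DataLoader::builder` and `BatchDataSet` come from this README, the builder's method names below are assumptions:

```rust
use flodl::data::{datasets::Mnist, DataLoader};

// Sketch: Mnist implements BatchDataSet, so it should plug into the
// builder directly. Builder method names are assumptions.
let mnist = Mnist::parse(&images_gz, &labels_gz)?;
let loader = DataLoader::builder(mnist)
    .batch_size(64)
    .build()?;

for batch in &loader {
    // each batch arrives as tensors, ready for Variable::new + forward()
    let _ = batch?;
}
```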
The flodl-hf sibling crate loads
real HuggingFace checkpoints directly into floDl, with no Python, no
ONNX detour, and no conversion script. Six BERT-family architectures
(BERT, RoBERTa, DistilBERT, ALBERT, XLM-RoBERTa, DeBERTa-v2) are wired
end-to-end with PyTorch-verified numerical parity across four task
heads each.
```rust
use flodl_hf::models::auto::AutoModelForSequenceClassification;

let clf = AutoModelForSequenceClassification::from_pretrained(
    "cardiffnlp/twitter-roberta-base-sentiment-latest",
)?;

let results = clf.predict(&["I love this framework"])?;
for (label, score) in &results[0] {
    println!("{} ({:.3})", label, score);
}
```

`AutoModel` reads `config.json`, dispatches to the right family, and returns a typed handle. Swap the repo id for `bert-base-uncased`, `albert-base-v2`, `xlm-roberta-base`, `microsoft/deberta-v3-base`, or any fine-tune on top of those, and the caller stays identical.
| Entry point | Task | Output |
|---|---|---|
| `AutoModel` | Backbone (hidden states) | `[batch, seq_len, hidden]` |
| `AutoModelForSequenceClassification` | Whole-text labels | `Vec<Vec<(label, score)>>` |
| `AutoModelForTokenClassification` | Per-token labels (NER) | `Vec<Vec<TokenPrediction>>` |
| `AutoModelForQuestionAnswering` | Extractive answer span | `Answer { text, start, end, score }` |
| `AutoModelForMaskedLM` | Fill-mask candidates | `Vec<(token, prob)>` (top-k) |
Three feature profiles cover deployment shapes: full Hub + tokenizer
(default), vision-only (Hub without the regex/unicode surface), and
offline safetensors-only (no network, no async runtime, no TLS).
Inside an existing flodl project, `fdl add flodl-hf --playground` scaffolds a side crate with a runnable AutoModel example so you can verify a real checkpoint loads before wiring it into your main code; `fdl add flodl-hf --install` appends `flodl-hf = "=0.5.3"` to your root Cargo.toml.
```sh
fdl add flodl-hf --playground   # try it: ./flodl-hf/ sandbox crate
fdl add flodl-hf --install      # wire it: pin to your root Cargo.toml
fdl flodl-hf classify           # runs AutoModel on a real fine-tune
```

Fine-tuning uses the same loop code on CPU, single GPU, or N GPUs:
`Trainer::setup_head` distributes the head transparently when more devices are available, and `compute_loss(enc, labels)` mirrors the one-call shape of HF Python's `model(..., labels=...).loss`. Round-trip the trained head back out with `fdl flodl-hf export`, then verify it loads into HF Python's `AutoModelFor*` with `fdl flodl-hf verify-export`.
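Pieced together from the names above, a hedged sketch of the fine-tune loop; only `Trainer::setup_head` and `compute_loss` appear in this README, while `encode()`, `step()`, and the optimizer choice are assumptions:

```rust
// Hedged sketch: encode(), step(), and AdamW hyperparameters are assumed.
let clf = AutoModelForSequenceClassification::from_pretrained("bert-base-uncased")?;
Trainer::setup_head(&clf, |p| AdamW::new(p, 2e-5))?;

for (texts, labels) in &batches {
    let enc = clf.encode(texts)?;               // assumed tokenize + encode step
    let loss = clf.compute_loss(&enc, labels)?; // mirrors HF's model(..., labels=...).loss
    loss.backward()?;
    clf.step()?;                                // assumed: optimizer step + zero_grad
}
```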
See the HuggingFace Integration Tutorial for the full feature matrix, per-family entry points, tokenizer usage, local-disk loading, the fine-tune walkthrough, the export round-trip recipe, and the 30-cell parity matrix.
floDl covers the modules, losses, and optimizers you actually use:
| Category | Count | Highlights |
|---|---|---|
| NN Modules | 30+ | Linear, Conv1d/2d/3d + transpose, GRU/LSTM, MultiheadAttention, Bilinear, all norms (Layer/RMS/Group/Batch/Instance), all pooling, Embedding/EmbeddingBag, PixelShuffle, Upsample, Unfold/Fold |
| Activations | 17 | ReLU, LeakyReLU, ELU, GELU, SiLU, Mish, SELU, Softplus, Hardswish, PReLU, Softmax, ... |
| Losses | 15 | MSE, CrossEntropy, BCE, NLL, CTC, Focal, Triplet, KLDiv, SmoothL1, Cosine, Hinge, Margin, Poisson, ... |
| Optimizers | 7 | SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam — all with parameter groups |
| Schedulers | 8 | Step, Cosine, Exponential, MultiStep, OneCycle, Cyclic, Warmup (composable), Plateau |
| Init | 9 | Xavier, Kaiming, orthogonal, truncated normal, uniform, normal |
| Tensor Ops | 100+ | Full arithmetic, trig, reductions, shape, indexing, comparisons, fused ops |
| Autograd | 90+ | Differentiable backward for every op above |
Fused Adam/AdamW on CUDA (single kernel for all parameters). Fused gradient clipping via foreach ops. Mixed precision with `AutocastGuard` + `GradScaler`. CUDA Graphs for replay-based training.
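A sketch of how those pieces compose; `AutocastGuard::new()` and `GradScaler`'s method names follow the PyTorch convention and are assumptions here:

```rust
// Sketch: autocast region + dynamic loss scaling for fp16 training.
let mut scaler = GradScaler::new();

for (input, target) in &batches {
    let _autocast = AutocastGuard::new(); // ops below run in reduced precision (assumed API)
    let pred = model.forward(input)?;
    let loss = mse_loss(&pred, target)?;

    optimizer.zero_grad();
    let scaled = scaler.scale(&loss)?; // scale up so fp16 gradients don't underflow
    scaled.backward()?;
    scaler.step(&mut optimizer)?;      // unscale; skip the step on inf/NaN
    scaler.update();                   // adapt the scale factor
}
```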
The full migration guide has side-by-side code for every op, module, and pattern.
Same CUDA kernels as PyTorch — the difference comes from what happens between kernel launches. Ten models, ten interleaved rounds, locked GPU clocks (RTX 5060 Ti, v0.3.0 vs PyTorch 2.10.0):
| Model | PyTorch | flodl | Delta |
|---|---|---|---|
| transformer | 3183.0 ms | 2199.8 ms | -31% |
| mlp | 291.1 ms | 207.0 ms | -29% |
| residual_tower | 406.9 ms | 309.7 ms | -24% |
| feedback_fixed | 275.3 ms | 231.3 ms | -16% |
| gated_routing | 248.0 ms | 217.3 ms | -12% |
| iterative_refine | 230.7 ms | 206.0 ms | -11% |
| gru_seq | 1105.1 ms | 1057.5 ms | -4% |
| conv_autoenc | 398.2 ms | 395.3 ms | -1% |
| lstm_seq | 692.3 ms | 692.3 ms | 0% |
| convnet | 1298.0 ms | 1298.2 ms | 0% |
Wins 8 of 10, ties 2, zero regressions. The ties (convnet, lstm_seq) are compute-bound -- both frameworks saturate the GPU, confirming identical CUDA kernels. The gap appears where framework overhead matters: dispatch-bound architectures (transformer -31%, mlp -29%), graph routing (residual_tower -24%), and recurrent loops (feedback_fixed -16%).
Benchmark Report | Interactive dashboard
ResNet-20 on CIFAR-10, 200 epochs -- heterogeneous GPUs (RTX 5060 Ti + GTX 1060, 2.5x speed ratio). Published reference: 91.25% (He et al. 2015, Table 6):
| Mode | Eval | vs Published | Time | vs Solo-0 |
|---|---|---|---|---|
| solo-0 (fast GPU only) | 91.66% | +0.41% | 3127s | -- |
| nccl-async | 92.44% | +1.19% | 2697s | 1.2x |
| nccl-cadence | 92.42% | +1.17% | 2650s | 1.2x |
| cpu-async | 92.43% | +1.18% | 2614s | 1.2x |
| cpu-cadence | 92.04% | +0.79% | 2670s | 1.2x |
Every ElChe mode surpasses the published accuracy while finishing faster than the fast GPU alone. 200 epochs gives ElChe's proportional scheduling room to calibrate and shine -- the smaller models (logistic through gpt-nano) confirm DDP convergence across architectures.
DDP Benchmark Report -- full results for 8 models across 9 DDP modes
Deterministic memory. Python adds ~3-5 µs of framework overhead per GPU op. Go's GC can't manage VRAM — an earlier Go implementation required 5 phases of lifecycle management (refcounting, GC callbacks, VRAM budgets, pending-free queues). Rust replaces all of that with `impl Drop for Tensor`. Memory is freed the instant a tensor leaves scope.
Zero-cost safety. Every op returns Result<T> — no silent failures.
Ownership ensures tensors are freed exactly once. The borrow checker
prevents data races at compile time.
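Concretely (a sketch; `Tensor::zeros` is an assumed constructor name):

```rust
{
    // Assumed constructor name; the scope is the point.
    let activations = Tensor::zeros(&[1024, 4096])?;
    // ... use activations ...
} // impl Drop for Tensor runs here: VRAM returned immediately,
  // no GC pause, no refcount cycle, no pending-free queue.
```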
Same GPU kernels. floDl binds libtorch — the C++ library under PyTorch. CUDA, cuBLAS, cuDNN are identical. floDl replaces the dispatch path, autograd tracking, and graph execution.
Training Tools
| Tool | What it does |
|---|---|
| `clip_grad_norm` / `clip_grad_value` | Fused gradient clipping (2 kernels total via foreach ops) |
| `save_checkpoint` / `load_checkpoint` | Named `.fdl` checkpoints, structural hash, partial loading, `LoadReport` |
| `migrate_checkpoint` | Remap parameter names across versions |
| `Parameter::freeze` / `unfreeze` | Per-parameter gradient control |
| `GradScaler` | Dynamic loss scaling for fp16 training |
| `cast_parameters` | Cast model parameters to any dtype |
| `CpuWorker` / `ModelSnapshot` | Background checkpoint saving |
| `CudaGraph` | Capture/replay training steps for fixed-shape models |
Module Traits
Beyond `forward`/`parameters`, `Module` provides optional methods the graph recognizes automatically:
| Method | What happens |
|---|---|
| `as_named_input()` | `using()` refs arrive as a named map |
| `reset()` | Loops auto-call before iterating — clears per-forward state |
| `detach_state()` | Break gradient chains on retained state |
| `sub_modules()` | Recursive device placement, training mode, parameter collection |
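A hypothetical stateful module showing where those hooks fit; only the method names come from the table above, every signature is an assumption:

```rust
use std::cell::RefCell;

// Hypothetical module; the real Module trait may differ.
struct RunningState {
    state: RefCell<Option<Variable>>, // interior mutability: forward takes &self
}

impl Module for RunningState {
    fn forward(&self, x: &Variable) -> Result<Variable> {
        let out = match self.state.borrow().as_ref() {
            Some(prev) => x.add(prev)?, // assumed Variable::add
            None => x.clone(),
        };
        *self.state.borrow_mut() = Some(out.clone());
        Ok(out)
    }

    fn reset(&self) {
        // Loops auto-call this before iterating: clear per-forward state.
        *self.state.borrow_mut() = None;
    }

    fn detach_state(&self) {
        // Break the gradient chain on state retained across calls.
        if let Some(s) = self.state.borrow_mut().as_mut() {
            *s = s.detach(); // assumed Variable::detach
        }
    }
}
```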
Build Profiles
```toml
# Optimize floDl in dev builds — your code stays fast to compile.
[profile.dev.package.flodl]
opt-level = 3
[profile.dev.package.flodl-sys]
opt-level = 3

# Release: cross-crate optimization for maximum throughput.
[profile.release]
lto = "thin"
codegen-units = 1
```

| Profile | flodl | Your code | Typical rebuild |
|---|---|---|---|
| `cargo build` | -O3 (cached) | -O0 (fast) | < 2s |
| `cargo build --release` | -O3 + LTO | -O3 + LTO | full link |
Multi-GPU (DDP)
| Component | What it does |
|---|---|
| `Trainer::setup` | One-liner: detect GPUs, distribute, set optimizer, train |
| `Trainer::builder` | Thread-per-GPU with Local SGD, any `Module` |
| `ApplyPolicy` | Sync / Cadence / Async (when to average) |
| `AverageBackend` | Nccl / Cpu (how to average, A/B testable) |
| `ElChe` | Heterogeneous GPU cadence strategy |
| `NcclComms` / `NcclRankComm` | NCCL AllReduce, Broadcast, abort handles |
| `CudaEvent` / `CudaStream` | Async GPU-CPU pipeline, timing |
| `DataLoader` | Resident/streaming/distributed, VRAM-aware prefetch, auto OOM fallback |
Every differentiable path is verified against finite-difference gradients:
- 117 autograd op-level checks (every op + compositions)
- Module-level checks (every NN module, input + parameter gradients)
- Exact optimizer step verifications (SGD, Adam, AdamW, RMSprop, Adagrad, RAdam, NAdam)
- 1027 library tests, zero clippy warnings — all tests run on both CPU and CUDA
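For flavor, here is the shape of such a check as a self-contained sketch (hypothetical helper, not the framework's actual test harness):

```rust
// Finite-difference gradient check: compare an analytic derivative
// against the central difference (f(x+h) - f(x-h)) / 2h.
fn finite_diff_ok(f: impl Fn(f64) -> f64, x: f64, analytic: f64) -> bool {
    let h = 1e-5;
    let numeric = (f(x + h) - f(x - h)) / (2.0 * h);
    (numeric - analytic).abs() < 1e-6 * (1.0 + analytic.abs())
}

fn main() {
    // d/dx of x^3 at x = 2.0 is 3 * 4 = 12.0.
    assert!(finite_diff_ok(|x| x.powi(3), 2.0, 12.0));
}
```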
Developed and tested from NVIDIA Pascal (GTX 1060 6GB) to Blackwell
(RTX 5060 Ti 16GB). PyTorch dropped Pascal support after 2.5.1 — floDl
links libtorch's stable C API, which supports every architecture the driver
supports. If nvidia-smi works, floDl trains on it.
| Background | Start here |
|---|---|
| New to Rust | Rust for PyTorch Users — 10 patterns in 15 minutes |
| Know Rust, new to DL | Tensors then Training |
| Know PyTorch | Porting Guide (or /port with AI) then Graph Builder |
| Scaling to multi-GPU | Multi-GPU Training then DDP Builder |
| Bringing a HuggingFace model | HuggingFace Integration: load BERT, RoBERTa, DistilBERT, ALBERT, XLM-R, or DeBERTa-v2; classify, NER, QA, or fill-mask; fine-tune and round-trip back to the HF ecosystem |
| Just show me code | quickstart or showcase |
- Rust for PyTorch Users — 10 Rust patterns in 15 minutes
- Tensors — creation, ops, memory, CUDA
- Autograd — variables, gradients, backward
- Modules — all layers, convolutions, RNNs, attention, normalization
- Training — losses, optimizers, mixed precision, full loop
- Graph Builder — fluent API from simple to complex
- Advanced Graphs — forward refs, loops, gates, switches
- Visualization — DOT/SVG, profiling heatmaps
- Utilities — checkpoints, clipping, freezing, initialization, scheduling, verbosity-gated logging
- Training Monitor — ETA, resource tracking, live dashboard
- Graph Tree — hierarchical composition, freeze/thaw, subgraph checkpoints
- Multi-GPU Training — Trainer::setup, El Che, auto-balancing, DataLoader integration
- DDP Builder — thread-per-GPU, Local SGD, A/B testable backends
- Data Loading — DataLoader, resident/streaming modes, VRAM-aware prefetch, DDP integration
- HuggingFace Integration — load BERT, RoBERTa, DistilBERT, ALBERT, XLM-RoBERTa, DeBERTa-v2 checkpoints, AutoModel dispatch across four task heads (seqcls, NER, QA, MLM), fine-tune with `Trainer::setup_head`, round-trip export to the HF ecosystem, PyTorch parity
- `quickstart` — build, train, and monitor a model with residual connections
- `sine_wave` — sine regression with monitor, checkpoint round-trip
- `mixed_precision` — float16 training with `GradScaler`
- `transfer_learning` — checkpoint, partial load, freeze, fine-tune
- `schedulers` — warmup + cosine + plateau composition
- `observation` — collect, flush, trend queries, early stopping
- `showcase` — every graph builder method in one graph
- Porting Guide — module mapping, FlowBuilder patterns, training loop translation
- AI-assisted porting — point any AI coding assistant at the skill guide for automated translation. With Claude Code: `/port my_model.py`
- `fdl api-ref` — generate a structured API reference for your flodl version. Used by AI tools and useful on its own.
```text
+-----------------------------------------------------------+
|              User Code / Model Definitions                |
+-----------------------------------------------------------+
| monitor/   ETA, resource tracking, live web dashboard     |
+-----------------------------------------------------------+
| graph/     Fluent builder, graph tree, execution, DOT/SVG |
+-----------------------------------------------------------+
| data/      DataLoader, resident/streaming, prefetch       |
+-----------------------------------------------------------+
| nn/        Modules, losses, optimizers, DDP, NCCL         |
+-----------------------------------------------------------+
| autograd/  Reverse-mode AD, gradient tracking             |
+-----------------------------------------------------------+
| tensor/    Owned tensors with Drop, CPU + CUDA            |
+-----------------------------------------------------------+
| flodl-sys  FFI bindings to libtorch C++ shim              |
+-----------------------------------------------------------+
|              libtorch / CUDA / NCCL                       |
+-----------------------------------------------------------+
```
floDl started as a question: what would a deep learning framework look like if you designed it around Rust's ownership model instead of fighting a garbage collector?
An earlier attempt in Go proved the architecture — the graph builder, the module system, the observation engine — but hit a wall: Go's GC cannot manage GPU memory deterministically. That required building five layers of memory management infrastructure on top of the language, not with it.
Rust solved this at the language level. `impl Drop for Tensor` replaced hundreds of lines of lifecycle management. The graph builder, module composition, and design philosophy carried forward; the memory fights didn't.
floDl is open-sourced software licensed under the MIT license.
