A lightweight, ultra-scale RL post-training framework built around low-precision bases and BF16 adapters — so frontier-scale RL fits on a single node.
Today's leading LLMs cross the trillion-parameter mark, and the conventional RL post-training recipe demands high-precision, multi-node, full-parameter updates. Orbit takes a different route: it holds the base model at its deployment precision (INT4 / FP4 / FP8) and computes gradients only for a small BF16 OFT or LoRA adapter. The result is RL post-training of 1T-class models on a single 8×B200 node, with no precision gap between training and rollout.
We have used Orbit to run stable, end-to-end RL on Kimi-K2.6 (~1T), DeepSeek V4-Flash, DeepSeek V4-Pro (~1.6T), and the Qwen3 MoE family — all on a single-node setup.
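Weight footprints alone show why keeping the base at deployment precision matters. The sketch below is a back-of-the-envelope helper (`weight_gb` is a hypothetical name, not part of Orbit) that counts only the stored base weights in decimal GB. It ignores quantization scales, KV cache, activations, and adapter state, so real usage is higher.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Orbit): approximate storage for base
# weights only, in decimal GB.
#   params_billion * 1e9 params * bits / 8 bits-per-byte / 1e9 bytes-per-GB
#   = params_billion * bits / 8
# Ignores quantization scales, KV cache, activations, and adapter state.
weight_gb() {
  local params_billion=$1 bits=$2
  echo $(( params_billion * bits / 8 ))
}

for bits in 16 8 4; do
  echo "1T params at ${bits}-bit: $(weight_gb 1000 "$bits") GB"
done
```

At 16-bit, 1T parameters need about 2000 GB for the weights alone, more than eight B200 GPUs hold in total; at 4-bit the same weights take about 500 GB. For comparison, full-parameter mixed-precision Adam training is commonly budgeted at about 16 bytes per parameter (16-bit weights and gradients plus FP32 master weights and two FP32 moments), roughly 16 TB for 1T parameters.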
| | Capability | What it means |
|---|---|---|
| 🪶 | Adapter-first RL | BF16 OFT/LoRA adapters on a frozen low-precision base. Same kernels and quantization scheme at train and serve time. |
| 🛰️ | Single-node trillion-scale | 1T-class models fit on a single 8×B200 node. No cross-node orchestration, no precision drift. |
| ⚡ | Low-precision native | First-class support for INT4, NVFP4, FP8, and BF16, with parity preflight gates between Megatron and SGLang. |
| 🧩 | PEFT-native | LoRA and OFT adapters; PEFT KL launchers compute reference log-probs in-model (no separate reference workers), with async adapter double-buffering. |
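The "no separate reference workers" point follows from how the adapters are parameterized. The derivation below uses the standard LoRA and OFT forms, with $W_0$ a frozen base weight. It assumes (our reading, not a statement of Orbit's internals) that the KL reference policy is the frozen base and that a per-token log-ratio estimator is used:

```math
\begin{aligned}
\text{LoRA:}\quad & W = W_0 + BA, \qquad B \in \mathbb{R}^{d\times r},\ A \in \mathbb{R}^{r\times k},\ B = 0 \text{ at init} \\
\text{OFT:}\quad & W = R\,W_0, \qquad R^\top R = I,\ R = I \text{ at init} \\
\text{Reference:}\quad & \pi_{\text{ref}} = \pi_{W_0} \quad (\text{LoRA with the } BA \text{ branch skipped, or OFT with } R = I) \\
\text{KL term:}\quad & \hat{k}_t = \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t})
\end{aligned}
```

Because switching either adapter off recovers exactly $W_0$, the base weights already resident for training can produce $\log \pi_{\text{ref}}$ in a second forward pass with the adapter disabled, instead of a separate reference copy of the model.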
Supported runtime: Python 3.12, CUDA 13.2, PyTorch 2.11. This is currently the only configuration the public launchers and helper scripts target.
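A quick way to confirm an environment matches that runtime is to ask its Python interpreter directly. The sketch below is hypothetical (`version_ok` and `orbit_runtime_check` are illustrative names, not Orbit scripts); it relies only on `platform.python_version()`, `torch.__version__`, and `torch.version.cuda`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight (not an Orbit script): compare the active environment
# with the supported runtime of Python 3.12, CUDA 13.2, PyTorch 2.11.

# version_ok ACTUAL EXPECTED: true if ACTUAL equals EXPECTED or extends it with
# "." or "+" (so 3.12.8 and 2.11.0+cu132 match, but 2.110 does not match 2.11).
version_ok() {
  local actual=$1 expected=$2
  [[ $actual == "$expected" || $actual == "$expected".* || $actual == "$expected"+* ]]
}

# orbit_runtime_check [PYTHON]: report any mismatch; returns non-zero on failure.
orbit_runtime_check() {
  local py=${1:-python} py_ver torch_ver cuda_ver status=0
  py_ver=$("$py" -c 'import platform; print(platform.python_version())') || return 1
  torch_ver=$("$py" -c 'import torch; print(torch.__version__)') || return 1
  # torch.version.cuda is the CUDA version PyTorch was built against.
  cuda_ver=$("$py" -c 'import torch; print(torch.version.cuda)') || return 1
  version_ok "$py_ver" 3.12 || { echo "Python $py_ver, want 3.12" >&2; status=1; }
  version_ok "$torch_ver" 2.11 || { echo "PyTorch $torch_ver, want 2.11" >&2; status=1; }
  version_ok "$cuda_ver" 13.2 || { echo "torch CUDA $cuda_ver, want 13.2" >&2; status=1; }
  return "$status"
}
```

For example, assuming the environment lives in uv's default project location, run `orbit_runtime_check .venv/bin/python` from the `orbit` checkout after `uv sync`.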
Orbit expects sibling backend checkouts next to it:
<workspace>/orbit
<workspace>/Megatron-Bridge
<workspace>/Megatron-Bridge/3rdparty/Megatron-LM
<workspace>/sglang
Keep these checkouts at the refs recorded in `pyproject.toml` under `tool.orbit.release.backend-pins`. `tool.uv.sources` currently points at local paths; when the backend repos are public, swap those entries for Git URLs at the same `rev` values.
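Before a long build, it can help to confirm the sibling layout exists and to list which commit each backend has checked out. The helpers below are a hypothetical sketch (`check_orbit_layout` and `show_backend_revs` are not shipped with Orbit); comparing the printed commits against `tool.orbit.release.backend-pins` is left as a manual step, since the exact format of that table is not described here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Orbit): verify the sibling checkout layout.
# Usage: check_orbit_layout <workspace>
check_orbit_layout() {
  local ws=${1:?usage: check_orbit_layout <workspace>} missing=0 d
  for d in orbit Megatron-Bridge Megatron-Bridge/3rdparty/Megatron-LM sglang; do
    if [[ ! -d $ws/$d ]]; then
      echo "missing: $ws/$d" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Print each backend's checked-out commit ("?" if it is not a git checkout) so it
# can be compared by hand with tool.orbit.release.backend-pins in pyproject.toml.
show_backend_revs() {
  local ws=${1:?usage: show_backend_revs <workspace>} d
  for d in Megatron-Bridge Megatron-Bridge/3rdparty/Megatron-LM sglang; do
    printf '%-40s %s\n' "$d" "$(git -C "$ws/$d" rev-parse HEAD 2>/dev/null || echo '?')"
  done
}
```

Typical use from inside `orbit`: `check_orbit_layout .. && show_backend_revs ..`.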
The whole CUDA stack builds from a single `uv sync`. `env.sh` carries the settings that can't live in `pyproject.toml` (site CUDA paths, build toggles, runtime loader paths) and auto-detects all of them:
cd orbit
uv python pin 3.12
source env.sh # auto-detects CUDA_HOME / GPU arch / python
uv sync --extra allinone # builds torch, TE, sglang, megatron, deep-ep, deep-gemm, sgl-kernel, flash-attn, ... from source

The first build compiles everything from source; budget roughly 1–2 hours on a CUDA 13.2 + B200 machine. To override the auto-detected values, export any of these knobs before running `source env.sh`: `CUDA_HOME`, `TORCH_CUDA_ARCH_LIST`, `MAX_JOBS`, `UV_CACHE_DIR`.
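As an illustration of setting those override knobs, the snippet below exports each one before sourcing `env.sh`. Every value is an assumption chosen for the example (the toolkit path, cache location, and job cap are hypothetical, not Orbit defaults); `TORCH_CUDA_ARCH_LIST=10.0` corresponds to the B200's compute capability.

```bash
# Example knob overrides (illustrative values, not Orbit defaults). Paste into
# the shell that will build, before `source env.sh`; exports made inside a
# separately executed script would not reach your shell.
export CUDA_HOME=${CUDA_HOME:-/usr/local/cuda-13.2}         # hypothetical toolkit path
export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST:-10.0}  # B200 is compute capability 10.0
# Parallel source builds (flash-attn in particular) can exhaust host RAM;
# half the cores, minimum 1, is a conservative starting point.
half_cores=$(( $(nproc) / 2 ))
export MAX_JOBS=${MAX_JOBS:-$(( half_cores > 0 ? half_cores : 1 ))}
export UV_CACHE_DIR=${UV_CACHE_DIR:-/scratch/$(id -un)/uv-cache}  # hypothetical large disk

# Only meaningful from the orbit checkout; skip quietly elsewhere.
if [[ -f env.sh ]]; then
  source env.sh
fi
```

Lowering `MAX_JOBS` trades build time for memory headroom; raise it if the build host has RAM to spare.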
Alternatively, `CUDA-13-install.md` covers installing the CUDA stack from prebuilt wheels instead of building it from source.
Release maintainers: verify a public clean-room install with `scripts/release/clean_room_gate.sh` after setting `PUBLIC_ORBIT_URL`. This gate targets the future public Git-ref release; it is not expected to pass against the interim local-path backend sources.
Run any recipe under examples/ — each launcher is an independent bash entrypoint with all hyperparameters inlined.
# A high-precision BF16 OFT run on Qwen3-4B
bash examples/high_precision/run-qwen3-4b-instruct-2507-bf16-math-oft.sh
# A low-precision FP8 OFT run on Qwen3-4B
bash examples/low_precision/run-qwen3-4b-fp8-math-oft.sh

Site-specific paths are passed in through environment variables. The most common ones:
| Variable | Required | Purpose |
|---|---|---|
| `ORBIT_VENV` | usually | Python environment with Orbit and its backends installed. |
| `CUDA_HOME` | usually | CUDA 13.2 toolkit root. |
| `TRAIN_JSONL` | yes | Training prompt JSONL. |
| `HF_CKPT` | yes | Hugging Face checkpoint directory (quantized for low-precision recipes). |
| `MEGATRON_LOAD` | yes | Megatron distributed checkpoint root. |
| `TEST_JSONL` | if eval is on | Eval JSONL. Skip with `DISABLE_EVAL=1`. |
| `SAVE_DIR` | no | Output checkpoint directory. |
| `ENABLE_WANDB` | no | `auto` enables W&B if `$HOME/.wandb_key` exists; `0` disables it. |
See examples/README.md for the full environment knob reference and the async PEFT double-buffer notes.
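A preflight that mirrors the table can catch a missing path before Ray starts. The functions below are a hypothetical sketch (`orbit_check_env` and `resolve_wandb` are not Orbit scripts); they encode only what the table states, plus one labeled assumption: an unset `ENABLE_WANDB` is treated as `auto`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight (not an Orbit script) mirroring the table above.

# Fail early if a required launcher input is unset or points nowhere.
orbit_check_env() {
  local status=0 var
  local -a required=(TRAIN_JSONL HF_CKPT MEGATRON_LOAD)
  # TEST_JSONL is only needed when eval is on.
  [[ ${DISABLE_EVAL:-0} == 1 ]] || required+=(TEST_JSONL)
  for var in "${required[@]}"; do
    if [[ -z ${!var:-} ]]; then
      echo "unset: $var" >&2; status=1
    elif [[ ! -e ${!var} ]]; then
      echo "not found: $var=${!var}" >&2; status=1
    fi
  done
  return "$status"
}

# "auto" enables W&B only if $HOME/.wandb_key exists; "0" disables it.
# Assumption: unset means "auto"; any other value is passed through unchanged.
resolve_wandb() {
  case ${ENABLE_WANDB:-auto} in
    0) echo 0 ;;
    auto) if [[ -f $HOME/.wandb_key ]]; then echo 1; else echo 0; fi ;;
    *) echo "$ENABLE_WANDB" ;;
  esac
}

# Example (all paths are placeholders):
#   export TRAIN_JSONL=/data/math/train.jsonl HF_CKPT=/ckpt/qwen3-4b-fp8 \
#          MEGATRON_LOAD=/ckpt/qwen3-4b-fp8-dist DISABLE_EVAL=1
#   orbit_check_env && bash examples/low_precision/run-qwen3-4b-fp8-math-oft.sh
```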
To smoke-test the command path without paying for a full run, shrink every size knob to 1:
NUM_ROLLOUT=1 TOTAL_EPOCHS=1 TRAIN_ROWS=1 \
ROLLOUT_BATCH_SIZE=1 N_SAMPLES_PER_PROMPT=1 GLOBAL_BATCH_SIZE=1 \
DISABLE_EVAL=1 ENABLE_WANDB=0 \
bash examples/high_precision/run-qwen3-4b-instruct-2507-bf16-math-oft.sh

To inspect the final Python argv without starting Ray or loading the model, prepend `ORBIT_DRY_RUN_ARGV=1`.
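To reuse the smoke-test settings and the `ORBIT_DRY_RUN_ARGV=1` dry run across launchers, they can be wrapped in a small shell function. This is a hypothetical convenience (`orbit_smoke` is not part of Orbit); it only sets the environment variables shown in the smoke-test command and then runs the launcher with `bash`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Orbit): run a launcher with the smoke-test
# knobs from the command above, optionally in argv dry-run mode.
# Usage: orbit_smoke <launcher.sh> [--dry-run]
orbit_smoke() {
  local launcher=${1:?usage: orbit_smoke <launcher.sh> [--dry-run]}
  local -a overrides=(
    NUM_ROLLOUT=1 TOTAL_EPOCHS=1 TRAIN_ROWS=1
    ROLLOUT_BATCH_SIZE=1 N_SAMPLES_PER_PROMPT=1 GLOBAL_BATCH_SIZE=1
    DISABLE_EVAL=1 ENABLE_WANDB=0
  )
  # ORBIT_DRY_RUN_ARGV=1 shows the final Python argv without starting Ray.
  if [[ ${2:-} == --dry-run ]]; then
    overrides+=(ORBIT_DRY_RUN_ARGV=1)
  fi
  env "${overrides[@]}" bash "$launcher"
}
```

For example, `orbit_smoke examples/low_precision/run-qwen3-4b-fp8-math-oft.sh --dry-run` checks a recipe's argv before the real smoke run without `--dry-run`.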
Orbit is under active development. On deck:
- More launcher recipes: broader model coverage (additional Qwen, Llama, GLM, and DeepSeek variants), more datasets, and more precision combinations.
- Docker / containerized environments: reproducible images and a documented env-setup path, so getting a launcher running takes minutes, not a CUDA 13.2 module hunt.
- On-policy distillation: recipes and reference runs for `ADVANTAGE_ESTIMATOR=on_policy_distillation`, including teacher/student preflight.
- Public Git-ref backends: flip `tool.uv.sources` from local paths to public Git URLs for Megatron-Bridge, Megatron-LM, and SGLang once the upstream repos land.
- Troubleshooting & ops docs: common resolver, import, and launcher smoke-test failures, plus a multi-node story for sites with capacity beyond a single 8×B200 box.
Have a request? Open an issue or PR.
@article{spherelab2026orbit,
author = {Qiu, Zeju and Chen, Le and Liu, Lixin and Xiao, Tim Z.
and Feng, Yao and Huang, Yangyi and Liu, Zhen and Shi, Han
and Wen, Yandong and Yu, Zhouliang and Sch{\"o}lkopf, Bernhard
and Liu, Weiyang},
title = {Orbit: Stable and Efficient Reinforcement Learning for Trillion-Parameter LLMs},
journal = {SphereLab Blog},
year = {2026},
note = {https://spherelab.ai/orbit}
}

Orbit stands on the shoulders of these excellent projects:
Orbit is released under the Apache License 2.0.
