I wanted to see if I could make a small LLM post-training stack that matched the way I think about RL algorithms... to me, that would be something like:
- Select a model to train from.
- Select some prompts + grader.
- Select a "loss function" to train with SGD.
Most LLM post-training codebases tangle these up with trainer glue, tokenization, and distributed plumbing... I find that confusing. So, here is a go at making something more "paper style" that still runs an LLM.
Warning: this is a proof of concept... not reliable infra.
```bash
pip install -e ".[test,viz]"
```

```python
from nanopost.trainer import Trainer
from nanopost.config import Preset
from nanopost.losses import Loss

Trainer(Preset.QWEN_1_5B_1xA100.make()).run(
    loss_fn=Loss.DG.make(),
    tasks='gsm8k',
)
```

Tests run on CPU without a GPU. Training needs one.
The research object is four types, all in nanopost/base.py:
- `LossFn`: takes a `LearnerBatch`, returns `(loss, metrics)`.
- `RewardFn`: takes decoded completions, returns scores.
- `LearnerBatch`: policy, rollout, and reference logprobs, advantages, rewards, mask. No raw tokens.
- `TaskInstance`: name, prompt, eval_fn.
If you can describe your algorithm in those terms, you can implement it here. A new loss is one file:
```python
import dataclasses

import torch

from nanopost import base


@dataclasses.dataclass
class SimplePG(base.LossFn):
    """Vanilla policy gradient with advantage weighting."""

    kl_coef: float = 0.04

    def __call__(self, batch: base.LearnerBatch) -> tuple[torch.Tensor, base.Metrics]:
        # Shift the mask by one; logprobs are assumed to cover next-token positions.
        mask = batch.loss_mask[:, 1:]
        pg = -batch.policy_logprobs * batch.advantages
        # k3 estimator of KL(policy || reference): exp(d) - d - 1, with d = ref - policy.
        log_ratio = batch.reference_logprobs - batch.policy_logprobs
        kl = torch.exp(log_ratio) - log_ratio - 1
        total = pg + self.kl_coef * kl
        # Masked mean per sequence, then averaged over the batch.
        loss = ((total * mask).sum(dim=1) / mask.sum(dim=1)).mean()
        return loss, {'kl': (kl * mask).sum().item() / mask.sum().item()}
```

Built-in losses live in nanopost/losses/: GRPOLoss, DGLoss, KondoLoss, PMPOLoss, MaxRLLoss.
One self-contained file each. Same structure for rewards (nanopost/rewards.py) and tasks (nanopost/tasks/).
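As a rough illustration of the reward side, here is a minimal sketch of a GSM8K-style grader: it takes decoded completions and returns one score per completion. It is written as a plain function so it runs on its own. The name `numeric_match_reward`, the `target` argument, and the "last number in the text" rule are assumptions for illustration, not the interface defined in nanopost/base.py.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing a leading sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[float]:
    """Return the last number mentioned in `text`, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numeric_match_reward(completions: list[str], target: float) -> list[float]:
    """Hypothetical grader: 1.0 if a completion's final number equals `target`, else 0.0."""
    scores = []
    for completion in completions:
        value = last_number(completion)
        matched = value is not None and abs(value - target) < 1e-6
        scores.append(1.0 if matched else 0.0)
    return scores


print(numeric_match_reward(["So the answer is 42.", "I think it is 1,000", "no idea"], 42.0))
# [1.0, 0.0, 0.0]
```

A binary grader like this is the simplest case; anything that maps completions to floats would fit the same description.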
Everything else lives in Trainer and TrainConfig: batching, gradient accumulation, DDP,
reference-model offloading, actor/learner staleness, the learning-rate schedule, checkpointing, wandb.
Writing a loss doesn't touch any of it. When you need to, it's all in one file.
Pick a preset, override what you care about, run it:
```python
cfg = Preset.QWEN_1_5B_1xA100.make(
    num_steps=2048,
    kl_coef=0.02,
    sync_interval=4,  # actor re-copies from learner every N steps
    wandb_log=False,
)
Trainer(cfg).run(loss_fn=SimplePG(), tasks='gsm8k')
```

Main knobs: num_steps, prompts_per_step, responses_per_prompt, learning_rate, kl_coef.
Full list in nanopost/config.py.
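To make the sync_interval knob concrete, here is a small standalone sketch of how stale the actor's weights are at each step. It assumes the actor copies the learner's weights whenever `step % sync_interval == 0` and that the learner applies one update per step. That schedule and the function name `actor_staleness` are illustrative assumptions, not a description of what Trainer actually does.

```python
def actor_staleness(num_steps: int, sync_interval: int) -> list[int]:
    """Number of learner updates the actor is behind when generating at each step.

    Assumed schedule (hypothetical): the actor re-copies the learner's weights
    at the start of every step where `step % sync_interval == 0`, and the
    learner applies one update at the end of every step.
    """
    if sync_interval < 1:
        raise ValueError("sync_interval must be at least 1")
    staleness = []
    last_sync = 0
    for step in range(num_steps):
        if step % sync_interval == 0:
            last_sync = step  # actor now matches the learner
        staleness.append(step - last_sync)
    return staleness


print(actor_staleness(num_steps=8, sync_interval=4))  # [0, 1, 2, 3, 0, 1, 2, 3]
print(actor_staleness(num_steps=4, sync_interval=1))  # [0, 0, 0, 0]
```

Under this assumption, sync_interval=1 is fully on-policy, and larger values trade fresher rollouts for fewer weight copies, with staleness never exceeding sync_interval - 1.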
Each preset is a rough (model, batch sizes, max_new_tokens) starting point.
Pick the closest one and expect to tune from there.
| Preset | Hardware | Model | Notes |
|---|---|---|---|
| `QWEN_0_5B_SMOKE` | CPU or any GPU | Qwen 2.5 0.5B | 2-step pipeline check |
| `QWEN_0_5B_FAST` | any GPU | Qwen 2.5 0.5B | quick iteration loop |
| `QWEN_1_5B_1xA100` | 1x A100 40GB | Qwen 2.5 1.5B | a reasonable 40GB guess |
| `QWEN_1_5B_1x80GB` | 1x A100/H100 80GB | Qwen 2.5 1.5B | a larger-batch starting point |
| `GEMMA_1B_1xA100` | 1x A100 40GB | Gemma 3 1B | a reasonable 40GB guess |
| `GEMMA_1B_1x80GB` | 1x A100/H100 80GB | Gemma 3 1B | a larger-batch starting point |
| `QWEN_3B_8xA100` | 8x A100 40GB | Qwen 2.5 3B | experimental DDP path |
Each run writes out/{run_name}.jsonl (one row per step) and out/{run_name}-config.json.
Load every run as one tidy DataFrame:
```python
from nanopost import analysis

df = analysis.load("out")
```

Columns include step, reward_mean, reward_std, response_len_mean, loss, grad_norm, lr,
generation_time_s, learning_time_s, and loss-specific metrics like loss_kl_ref and loss_clipfrac.
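If you would rather not depend on the analysis helper, the per-step logs can be read directly. This is a minimal stdlib sketch, assuming only the layout described above (one JSON object per line in out/{run_name}.jsonl). The `load_runs` and `mean_reward_by_run` names and the added `run_name` field are hypothetical, and analysis.load may well do more than this.

```python
import json
from pathlib import Path


def load_runs(out_dir: str) -> list[dict]:
    """Read every `*.jsonl` file in `out_dir` into a flat list of per-step rows."""
    rows = []
    # The glob skips the `-config.json` files, which do not end in `.jsonl`.
    for path in sorted(Path(out_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                row = json.loads(line)
                row["run_name"] = path.stem  # hypothetical field: which run the row came from
                rows.append(row)
    return rows


def mean_reward_by_run(rows: list[dict]) -> dict[str, float]:
    """Average `reward_mean` over the logged steps of each run."""
    per_run: dict[str, list[float]] = {}
    for row in rows:
        per_run.setdefault(row["run_name"], []).append(row["reward_mean"])
    return {name: sum(vals) / len(vals) for name, vals in per_run.items()}
```

Calling `mean_reward_by_run(load_runs("out"))` gives a quick per-run summary without any plotting dependencies.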
Modal is a serverless GPU cloud. One function decorator spins up an A100 container.
```bash
modal secret create hf-token HF_TOKEN=hf_your_token
modal run examples/launch_modal.py --smoke          # quick pipeline check
modal run --detach examples/launch_modal.py         # full sweep
modal volume get nanopost-results / ./out/ --force  # download results
```

- PPO. Schulman et al. 2017. The clipped surrogate GRPO reuses.
- GRPO / DeepSeekMath. Shao et al. 2024. Group-relative advantages.
- DeepSeek-R1. GRPO with verified rewards.
- PMPO. Abdolmaleki et al. 2024.
- Delightful Policy Gradient. DG loss, per-token delight gating.
- Does This Gradient Spark Joy?. Kondo loss, binary top-k gating.
- Delightful Distributed Policy Gradient. Actor/learner staleness setting.
- MaxRL. Guo et al. 2025. Maximum likelihood RL.
- nanoGPT. Karpathy's minimal GPT training.