nanopost

I wanted to see if I could make a small LLM post-training stack that matches the way I think about RL algorithms. To me, that looks something like:

Select a model to train from.
Select some prompts + grader.
Select a "loss function" to train with SGD.

Most LLM post-training codebases tangle these up with trainer glue, tokenization, and distributed plumbing... I find that confusing. So, here is a go at making something more "paper style" that still runs an LLM.

Warning: this is a proof of concept... not reliable infra.

pip install -e ".[test,viz]"

from nanopost.trainer import Trainer
from nanopost.config import Preset
from nanopost.losses import Loss

Trainer(Preset.QWEN_1_5B_1xA100.make()).run(
    loss_fn=Loss.DG.make(),
    tasks='gsm8k',
)

Tests run on CPU without a GPU. Training needs one.

base.py is the paper

The research object is four types, all in nanopost/base.py:

LossFn: takes a LearnerBatch, returns (loss, metrics).
RewardFn: takes decoded completions, returns scores.
LearnerBatch: policy, rollout, and reference logprobs, plus advantages, rewards, and a loss mask. No raw tokens.
TaskInstance: name, prompt, eval_fn.

If you can describe your algorithm in those terms, you can implement it here. A new loss is one file:

import dataclasses
import torch
from nanopost import base

@dataclasses.dataclass
class SimplePG(base.LossFn):
    """Vanilla policy gradient with advantage weighting."""
    kl_coef: float = 0.04

    def __call__(self, batch: base.LearnerBatch) -> tuple[torch.Tensor, base.Metrics]:
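        # Shift the mask by one position; the per-token logprobs are assumed
        # to be one step shorter than the token sequence.
        mask = batch.loss_mask[:, 1:]
        pg = -batch.policy_logprobs * batch.advantages
        # Non-negative KL estimate: exp(d) - d - 1, with d = reference - policy.
        log_ratio = batch.reference_logprobs - batch.policy_logprobs
        kl = torch.exp(log_ratio) - log_ratio - 1
        total = pg + self.kl_coef * kl
        # Per-sequence token mean; the clamp avoids 0/0 on fully masked rows.
        loss = ((total * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)).mean()
        return loss, {'kl': ((kl * mask).sum() / mask.sum().clamp(min=1)).item()}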
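
The KL term in SimplePG has the form exp(d) - d - 1 with d = reference logprob minus policy logprob. A small plain-Python check, with no torch and no nanopost types, of why that form is a sensible penalty: it is zero when the two models agree, positive otherwise, and its expectation under the policy equals KL(policy || reference). The function name and the two toy distributions are made up for illustration.

```python
import math


def k3_kl(policy_logprob: float, reference_logprob: float) -> float:
    """Per-token estimate exp(d) - d - 1, with d = reference - policy."""
    d = reference_logprob - policy_logprob
    return math.expm1(d) - d  # expm1(d) is exp(d) - 1


# Zero when the models agree, positive in both directions otherwise.
same = k3_kl(-1.2, -1.2)
ref_lower = k3_kl(-1.0, -2.0)
ref_higher = k3_kl(-2.0, -1.0)

# Toy categorical distributions (illustrative values only).
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]

# Exact expectation of the estimator under the policy ...
expected_k3 = sum(
    q * k3_kl(math.log(q), math.log(p)) for q, p in zip(policy, reference)
)
# ... matches the textbook KL(policy || reference).
true_kl = sum(q * math.log(q / p) for q, p in zip(policy, reference))
```

Unlike the naive estimate -d, each per-token value here is non-negative, so the masked mean cannot go below zero on a noisy batch.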

Built-in losses live in nanopost/losses/: GRPOLoss, DGLoss, KondoLoss, PMPOLoss, MaxRLLoss. One self-contained file each. Same structure for rewards (nanopost/rewards.py) and tasks (nanopost/tasks/).
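
To make "takes decoded completions, returns scores" concrete, here is a rough sketch of a GSM8K-style exact-match grader in plain Python. It is not the code in nanopost/rewards.py: the function names, the extra answer argument, and the "last number in the text" rule are all assumptions for illustration.

```python
import re
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number in the text, with commas stripped."""
    matches = _NUMBER.findall(text)
    return matches[-1].replace(",", "") if matches else None


def exact_match_rewards(completions: list[str], answer: str) -> list[float]:
    """Score 1.0 when a completion's final number equals the answer, else 0.0."""
    target = float(answer.replace(",", ""))
    scores = []
    for text in completions:
        found = last_number(text)
        hit = found is not None and float(found) == target
        scores.append(1.0 if hit else 0.0)
    return scores


scores = exact_match_rewards(
    ["She sells 9 eggs for $2 each, so 18.", "The answer is 17", "no idea"],
    answer="18",
)
```

The grader only ever sees strings, which is the point of the split: the loss never needs to know how a score was produced.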

trainer.py is the infra

Everything else lives in Trainer and TrainConfig: batching, gradient accumulation, DDP, reference-model offloading, actor/learner staleness, the learning-rate schedule, checkpointing, wandb. Writing a loss doesn't touch any of it. When you need to, it's all in one file.

Pick a preset, override what you care about, run it:

cfg = Preset.QWEN_1_5B_1xA100.make(
    num_steps=2048,
    kl_coef=0.02,
    sync_interval=4,      # actor re-copies from learner every N steps
    wandb_log=False,
)
Trainer(cfg).run(loss_fn=SimplePG(), tasks='gsm8k')
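
One way to picture sync_interval: if the actor copies the learner's weights every N steps, then in between it generates with weights that are up to N - 1 steps old. The sketch below assumes the copy happens at steps divisible by N, before generation; the trainer may time it differently, and both function names are hypothetical.

```python
def actor_version(step: int, sync_interval: int) -> int:
    """Learner step whose weights the actor holds while generating at `step`."""
    return (step // sync_interval) * sync_interval


def staleness(step: int, sync_interval: int) -> int:
    """How many learner updates behind the actor is at `step`."""
    return step - actor_version(step, sync_interval)


# With sync_interval=4 the lag cycles 0, 1, 2, 3 and resets on each copy.
lags = [staleness(step, 4) for step in range(9)]
```

That lag is why LearnerBatch carries rollout logprobs next to policy logprobs: a loss can compare the two to see how off-policy each token is.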

Main knobs: num_steps, prompts_per_step, responses_per_prompt, learning_rate, kl_coef. Full list in nanopost/config.py.
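
prompts_per_step and responses_per_prompt together set the batch: each prompt gets a group of sampled responses. A rough sketch of GRPO-style group-relative advantages over such a batch follows. Whether the trainer computes advantages exactly this way (population std, this eps, a flat prompt-major layout) is an assumption, and the function name is made up.

```python
import statistics


def group_relative_advantages(
    rewards: list[float], responses_per_prompt: int, eps: float = 1e-6
) -> list[float]:
    """Normalize rewards within each prompt's group of responses.

    `rewards` is flat and prompt-major: [p0r0, p0r1, ..., p1r0, p1r1, ...].
    """
    if len(rewards) % responses_per_prompt != 0:
        raise ValueError("rewards must divide evenly into groups")
    advantages: list[float] = []
    for start in range(0, len(rewards), responses_per_prompt):
        group = rewards[start:start + responses_per_prompt]
        mean = statistics.fmean(group)
        std = statistics.pstdev(group)  # population std of the group
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Two prompts, four responses each. The second prompt has no correct answers.
adv = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0], responses_per_prompt=4)
```

A group where every response gets the same reward produces all-zero advantages, so it contributes no policy-gradient signal. That is one reason responses_per_prompt matters on hard prompts.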

Presets

Each preset is a rough (model, batch sizes, max_new_tokens) starting point. Pick the closest one and expect to tune from there.

| Preset | Hardware | Model | Notes |
| --- | --- | --- | --- |
| `QWEN_0_5B_SMOKE` | CPU or any GPU | Qwen 2.5 0.5B | 2-step pipeline check |
| `QWEN_0_5B_FAST` | any GPU | Qwen 2.5 0.5B | quick iteration loop |
| `QWEN_1_5B_1xA100` | 1x A100 40GB | Qwen 2.5 1.5B | a reasonable 40GB guess |
| `QWEN_1_5B_1x80GB` | 1x A100/H100 80GB | Qwen 2.5 1.5B | a larger-batch starting point |
| `GEMMA_1B_1xA100` | 1x A100 40GB | Gemma 3 1B | a reasonable 40GB guess |
| `GEMMA_1B_1x80GB` | 1x A100/H100 80GB | Gemma 3 1B | a larger-batch starting point |
| `QWEN_3B_8xA100` | 8x A100 40GB | Qwen 2.5 3B | experimental DDP path |
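
For intuition, the "pick a preset, override what you care about" pattern can be sketched with an enum of frozen dataclasses. This is not the real nanopost/config.py: SketchConfig, SketchPreset, and every default value below are placeholders, and only the field names and the make(**overrides) call shape come from the examples above.

```python
import dataclasses
import enum


@dataclasses.dataclass(frozen=True)
class SketchConfig:
    model: str
    num_steps: int = 1000           # placeholder default
    prompts_per_step: int = 8       # placeholder default
    responses_per_prompt: int = 8   # placeholder default
    kl_coef: float = 0.04           # placeholder default
    wandb_log: bool = True


class SketchPreset(enum.Enum):
    QWEN_0_5B_FAST = SketchConfig(model="Qwen 2.5 0.5B")
    QWEN_1_5B_1xA100 = SketchConfig(model="Qwen 2.5 1.5B")

    def make(self, **overrides) -> SketchConfig:
        """Copy the preset with selected fields replaced."""
        return dataclasses.replace(self.value, **overrides)


cfg = SketchPreset.QWEN_1_5B_1xA100.make(num_steps=2048, kl_coef=0.02)
```

One nice property of this shape: dataclasses.replace raises TypeError on an unknown field, so a typo in an override fails at config time instead of being ignored.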

Analysis

Each run writes out/{run_name}.jsonl (one row per step) and out/{run_name}-config.json. Load every run as one tidy DataFrame:

from nanopost import analysis
df = analysis.load("out")

Columns include step, reward_mean, reward_std, response_len_mean, loss, grad_norm, lr, generation_time_s, learning_time_s, and loss-specific metrics like loss_kl_ref and loss_clipfrac.
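
If you would rather not pull in the analysis module, the log layout is simple enough to read by hand. A stdlib-only sketch, assuming each line of out/{run_name}.jsonl is one JSON object with at least step and reward_mean; load_rows and final_reward_by_run are hypothetical helpers, not part of nanopost.

```python
import json
from pathlib import Path


def load_rows(out_dir: str) -> list[dict]:
    """Read every *.jsonl file in out_dir into one list, tagged by run name."""
    rows = []
    for path in sorted(Path(out_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    rows.append({"run_name": path.stem, **json.loads(line)})
    return rows


def final_reward_by_run(rows: list[dict]) -> dict[str, float]:
    """Return reward_mean at the highest logged step of each run."""
    latest: dict[str, dict] = {}
    for row in rows:
        run = row["run_name"]
        if run not in latest or row["step"] >= latest[run]["step"]:
            latest[run] = row
    return {run: row["reward_mean"] for run, row in latest.items()}
```

The *.jsonl glob skips the {run_name}-config.json files, so configs never get mixed into the per-step rows.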

Cloud GPUs (Modal)

Modal is a serverless GPU cloud. One function decorator spins up an A100 container.

modal secret create hf-token HF_TOKEN=hf_your_token
modal run examples/launch_modal.py --smoke           # quick pipeline check
modal run --detach examples/launch_modal.py          # full sweep
modal volume get nanopost-results / ./out/ --force   # download results

References

PPO. Schulman et al. 2017. The clipped surrogate objective that GRPO reuses.
GRPO / DeepSeekMath. Shao et al. 2024. Group-relative advantages.
DeepSeek-R1. GRPO with verified rewards.
PMPO. Abdolmaleki et al. 2024.
Delightful Policy Gradient. DG loss, per-token delight gating.
Does This Gradient Spark Joy? Kondo loss, binary top-k gating.
Delightful Distributed Policy Gradient. Actor/learner staleness setting.
MaxRL. Guo et al. 2025. Maximum likelihood RL.
nanoGPT. Karpathy's minimal GPT training.
