iosband/nanopost

nanopost

I wanted to see if I could make a small LLM post training stack that matched the way I think about RL algorithms... to me, that would be something like:

  • Select a model to train from.
  • Select some prompts + grader.
  • Select a "loss function" to train with SGD.

Most LLM post-training codebases tangle these up with trainer glue, tokenization, and distributed plumbing... I find that confusing. So, here is a go at making something more "paper style" that still runs an LLM.

Warning: this is a proof of concept... not reliable infra.

pip install -e ".[test,viz]"

from nanopost.trainer import Trainer
from nanopost.config import Preset
from nanopost.losses import Loss

Trainer(Preset.QWEN_1_5B_1xA100.make()).run(
    loss_fn=Loss.DG.make(),
    tasks='gsm8k',
)

Tests run on CPU. Training needs a GPU.

base.py is the paper

The research object is four types, all in nanopost/base.py:

  • LossFn: takes a LearnerBatch, returns (loss, metrics).
  • RewardFn: takes decoded completions, returns scores.
  • LearnerBatch: policy, rollout, and reference logprobs, plus advantages, rewards, and the loss mask. No raw tokens.
  • TaskInstance: name, prompt, eval_fn.

If you can describe your algorithm in those terms, you can implement it here. A new loss is one file:

import dataclasses
import torch
from nanopost import base

@dataclasses.dataclass
class SimplePG(base.LossFn):
    """Vanilla policy gradient with advantage weighting."""
    kl_coef: float = 0.04

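    def __call__(self, batch: base.LearnerBatch) -> tuple[torch.Tensor, base.Metrics]:
        # Shift the mask by one position, presumably because logprobs only
        # exist for predicted tokens (positions 1..T-1).
        mask = batch.loss_mask[:, 1:]
        pg = -batch.policy_logprobs * batch.advantages
        # KL penalty of the form exp(d) - d - 1 with d = reference - policy.
        log_ratio = batch.reference_logprobs - batch.policy_logprobs
        kl = torch.exp(log_ratio) - log_ratio - 1
        total = pg + self.kl_coef * kl
        # Per-sequence token mean, then batch mean. The clamp avoids 0/0
        # when a row has no unmasked tokens.
        tokens = mask.sum(dim=1).clamp(min=1)
        loss = ((total * mask).sum(dim=1) / tokens).mean()
        return loss, {'kl': (kl * mask).sum().item() / max(mask.sum().item(), 1)}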
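
The KL term in SimplePG has the form exp(d) - d - 1, where d is the reference logprob minus the policy logprob. This is often called the "k3" estimator. Here is a minimal scalar sketch of why it works as a penalty: it is zero when the two logprobs agree and positive otherwise, because exp(d) >= 1 + d. The function name `k3_kl` and the numbers are illustrative, not part of nanopost.

```python
import math


def k3_kl(policy_logprob: float, reference_logprob: float) -> float:
    """Per-token penalty exp(d) - d - 1 with d = reference - policy."""
    d = reference_logprob - policy_logprob
    return math.exp(d) - d - 1.0


# Illustrative logprobs only; real values come from the models.
same = k3_kl(-1.2, -1.2)      # policy and reference agree
higher = k3_kl(-0.5, -1.5)    # policy puts more mass on the token
lower = k3_kl(-1.5, -0.5)     # policy puts less mass on the token
print(same, higher, lower)
```

Unlike the raw log-ratio, each per-token value is non-negative, so the masked mean cannot go below zero.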

Built-in losses live in nanopost/losses/: GRPOLoss, DGLoss, KondoLoss, PMPOLoss, MaxRLLoss. One self-contained file each. Same structure for rewards (nanopost/rewards.py) and tasks (nanopost/tasks/).
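
To show what "same structure" could look like for a reward, here is a rough sketch of a grader that takes decoded completions and returns scores. It is written as a plain callable so it runs without nanopost installed. The class name `LastNumberMatch`, its `target` field, and the exact call signature are assumptions; the real `base.RewardFn` interface may differ.

```python
import dataclasses
import re


@dataclasses.dataclass
class LastNumberMatch:
    """Hypothetical reward: 1.0 if the last number in a completion equals target."""
    target: str

    def __call__(self, completions: list[str]) -> list[float]:
        scores = []
        for text in completions:
            # Strip thousands separators, then collect integers and decimals.
            numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
            scores.append(1.0 if numbers and numbers[-1] == self.target else 0.0)
        return scores


reward = LastNumberMatch(target="42")
scores = reward(["so the answer is 42", "I think 41", "no idea"])
print(scores)
```

A string comparison keeps the sketch short; a real grader would likely normalise numeric formats before comparing.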

trainer.py is the infra

Everything else lives in Trainer and TrainConfig: batching, gradient accumulation, DDP, reference-model offloading, actor/learner staleness, the learning-rate schedule, checkpointing, wandb. Writing a loss doesn't touch any of it. When you need to, it's all in one file.

Pick a preset, override what you care about, run it:

cfg = Preset.QWEN_1_5B_1xA100.make(
    num_steps=2048,
    kl_coef=0.02,
    sync_interval=4,      # actor re-copies from learner every N steps
    wandb_log=False,
)
Trainer(cfg).run(loss_fn=SimplePG(), tasks='gsm8k')

Main knobs: num_steps, prompts_per_step, responses_per_prompt, learning_rate, kl_coef. Full list in nanopost/config.py.
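
As a rough guide to how these knobs interact, the sketch below works out the sampling budget and the actor's staleness. It assumes each step samples `responses_per_prompt` completions for each of `prompts_per_step` prompts, and that the actor syncs at steps 0, N, 2N, ... for `sync_interval=N`. Both are readings of the knob names and the comment above, not confirmed behaviour, and the batch sizes are made up.

```python
def completions_per_step(prompts_per_step: int, responses_per_prompt: int) -> int:
    """Completions sampled per step, assuming one group of responses per prompt."""
    return prompts_per_step * responses_per_prompt


def actor_staleness(step: int, sync_interval: int) -> int:
    """Learner updates since the last actor sync, assuming syncs at 0, N, 2N, ..."""
    return step % sync_interval


# Illustrative values, not taken from any preset.
per_step = completions_per_step(prompts_per_step=16, responses_per_prompt=8)
total = per_step * 2048  # num_steps=2048, as in the example above
staleness = [actor_staleness(s, sync_interval=4) for s in range(6)]
print(per_step, total, staleness)
```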

Presets

Each preset is a rough (model, batch sizes, max_new_tokens) starting point. Pick the closest one and expect to tune from there.

| Preset | Hardware | Model | Notes |
|---|---|---|---|
| QWEN_0_5B_SMOKE | CPU or any GPU | Qwen 2.5 0.5B | 2-step pipeline check |
| QWEN_0_5B_FAST | any GPU | Qwen 2.5 0.5B | quick iteration loop |
| QWEN_1_5B_1xA100 | 1x A100 40GB | Qwen 2.5 1.5B | a reasonable 40GB guess |
| QWEN_1_5B_1x80GB | 1x A100/H100 80GB | Qwen 2.5 1.5B | a larger-batch starting point |
| GEMMA_1B_1xA100 | 1x A100 40GB | Gemma 3 1B | a reasonable 40GB guess |
| GEMMA_1B_1x80GB | 1x A100/H100 80GB | Gemma 3 1B | a larger-batch starting point |
| QWEN_3B_8xA100 | 8x A100 40GB | Qwen 2.5 3B | experimental DDP path |
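
For readers curious how a `Preset.X.make(**overrides)` call can work, here is one possible sketch using an enum of frozen dataclasses. `SketchConfig`, `SketchPreset`, and every default value are hypothetical stand-ins; the real definitions are in nanopost/config.py and may be built differently.

```python
import dataclasses
import enum


@dataclasses.dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for TrainConfig; all defaults are made up."""
    model_name: str
    prompts_per_step: int = 8
    responses_per_prompt: int = 4
    num_steps: int = 1024
    kl_coef: float = 0.04


class SketchPreset(enum.Enum):
    """Each member holds a default config; make() returns an overridden copy."""
    SMALL = SketchConfig(model_name="small-model", prompts_per_step=4)
    LARGE = SketchConfig(model_name="large-model", prompts_per_step=32)

    def make(self, **overrides) -> SketchConfig:
        # dataclasses.replace raises TypeError on unknown field names,
        # so a typo in an override fails loudly instead of being ignored.
        return dataclasses.replace(self.value, **overrides)


cfg = SketchPreset.SMALL.make(num_steps=2048, kl_coef=0.02)
print(cfg)
```

Because the stored configs are frozen and `make()` returns a copy, overriding one run's settings does not change the preset for the next run.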

Analysis

Each run writes out/{run_name}.jsonl (one row per step) and out/{run_name}-config.json. Load every run as one tidy DataFrame:

from nanopost import analysis
df = analysis.load("out")

Columns include step, reward_mean, reward_std, response_len_mean, loss, grad_norm, lr, generation_time_s, learning_time_s, and loss-specific metrics like loss_kl_ref and loss_clipfrac.
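
Since the log format is one JSON row per step, a single run can also be inspected with only the standard library. The sketch below writes a tiny file with made-up rows, so it runs anywhere, then reads it back. The `read_run` helper and the values are illustrative; only the column names come from the list above.

```python
import json
import pathlib
import tempfile


def read_run(path: pathlib.Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up rows standing in for out/{run_name}.jsonl.
rows_in = [
    {"step": 0, "reward_mean": 0.10, "loss": 0.52},
    {"step": 1, "reward_mean": 0.25, "loss": 0.47},
    {"step": 2, "reward_mean": 0.40, "loss": 0.41},
]

with tempfile.TemporaryDirectory() as tmp:
    run_path = pathlib.Path(tmp) / "demo.jsonl"
    run_path.write_text("\n".join(json.dumps(r) for r in rows_in) + "\n")
    rows = read_run(run_path)

best = max(rows, key=lambda r: r["reward_mean"])
print(best["step"], best["reward_mean"])
```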

Cloud GPUs (Modal)

Modal is a serverless GPU cloud. One function decorator spins up an A100 container.

modal secret create hf-token HF_TOKEN=hf_your_token
modal run examples/launch_modal.py --smoke           # quick pipeline check
modal run --detach examples/launch_modal.py          # full sweep
modal volume get nanopost-results / ./out/ --force   # download results
