A decoder-only transformer (a small GPT), built and trained from scratch in PyTorch. The goal is understanding every part, not competing with frontier models. You write the core; this scaffold gives you the structure, the plumbing, and a roadmap.
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
Each step is one concept. The plumbing (data, config, batching, device selection) is already written so you can focus on the model and the loop.
- 1. Data: run python data/prepare.py. Downloads ~1MB of Shakespeare, builds a char-level vocab, writes train.bin / val.bin / meta.pkl. Char-level = no tokenizer yet.
- 2. Attention: implement CausalSelfAttention.forward in model.py
- 3. MLP: implement MLP.forward
- 4. Block: wire attention + MLP with residuals in Block.forward
- 5. Full model: GPT.forward (embeddings -> blocks -> loss)
- 6. Sampling: GPT.generate
- 7. Train: fill the loop in train.py, then python train.py
- 8. Generate: finish sample.py, then python sample.py
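To make step 1 concrete, here is a minimal sketch of what "char-level vocab" means: every distinct character gets an integer id, and encoding is a per-character lookup. The names stoi, itos, encode and decode are illustrative assumptions, not necessarily what data/prepare.py uses, and the sample string stands in for the real corpus.

```python
# Hypothetical illustration; data/prepare.py may use different names and details.
text = "To be, or not to be"  # stand-in for the downloaded corpus

# The vocabulary is every distinct character, in sorted order.
chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode(text)
print(vocab_size, ids[:5], decode(ids[:5]))
```

Because the mapping is one-to-one, decode(encode(s)) returns s for any string built from characters in the vocab. The model only ever sees the integer ids.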
When that works end-to-end, milestone 2: implement real BPE in tokenizer.py and move from characters to subword tokens.
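As a preview of milestone 2, the core of BPE training is a loop: count adjacent token pairs, merge the most frequent pair into a new token id, repeat. The sketch below is one common byte-level formulation, not the interface tokenizer.py expects; the function names and the choice to start new ids at 256 (one past the largest byte value) are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]


def merge(ids, pair, new_id):
    """Replace each left-to-right, non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(ids, num_merges, first_new_id=256):
    """Run num_merges merge steps; return the compressed ids and the merge table."""
    merges = {}
    for k in range(num_merges):
        if len(ids) < 2:
            break  # nothing left to pair up
        pair = most_frequent_pair(ids)
        new_id = first_new_id + k
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


# Start from raw UTF-8 bytes, so the base vocabulary is the 256 byte values.
ids = list("aaabdaaabac".encode("utf-8"))
compressed, merges = train_bpe(ids, num_merges=3)
print(len(ids), "->", len(compressed), merges)
```

Encoding new text then means replaying the learned merges in the order they were created. That part is left out here.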
A ~10M-parameter model on Shakespeare trains in minutes on a GPU, an hour or so on CPU. It won't be smart — it'll learn to produce text that looks like Shakespeare (character names, line breaks, archaic phrasing). That "it went from noise to structure" moment is the whole point.
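For a sense of where a figure like ~10M comes from, the parameter count of a GPT-style model can be estimated from a handful of hyperparameters. The values below are hypothetical and are not read from config.py. The estimate counts weight matrices only (no biases or LayerNorm parameters) and assumes the output head shares weights with the token embedding.

```python
# Hypothetical hyperparameters, chosen for illustration; check config.py for the real ones.
n_layer = 6       # number of transformer blocks
n_embd = 384      # embedding / residual stream width
block_size = 256  # maximum context length
vocab_size = 65   # assumed size of a char-level vocab

# Embedding tables: one row per token id, one row per position.
token_embedding = vocab_size * n_embd
position_embedding = block_size * n_embd

# Per block, weights only:
#   attention: q, k, v projections (3 * n_embd^2) + output projection (n_embd^2)
#   MLP: n_embd -> 4*n_embd -> n_embd (2 * 4 * n_embd^2)
attention_weights = 4 * n_embd * n_embd
mlp_weights = 8 * n_embd * n_embd
per_block = attention_weights + mlp_weights

total = token_embedding + position_embedding + n_layer * per_block
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

With these assumed values almost all of the parameters sit in the blocks, which is why the count scales roughly as 12 * n_layer * n_embd^2.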
config.py         all hyperparameters in one place
model.py          the transformer                    <- you implement this
train.py          training loop                      <- you implement the loop
sample.py         generate from a checkpoint
tokenizer.py      BPE (milestone 2)
data/prepare.py   download + encode corpus (done)