Two from-scratch PyTorch large language models with a Claude-Code-style plugin runtime, audio I/O, vision, and a DevOps tool layer. Every block has beginner-friendly comments, every config knob is documented, and the full training pipeline runs end-to-end on a laptop at the "tiny" tier before you commit to renting GPUs.
| Model | Size | Modalities | Specialization | Context |
|---|---|---|---|---|
| Aurora | ~72 B dense | text + image + audio in, text + audio out | Omni-modal assistant (GPT-4o-class) | 256K |
| Forge | ~460 B MoE / ~35 B active | text + image + audio + video + PDF in, text out (audio optional) | Coding, software engineering, DevOps (Opus-class) | 1M |
| Tiny | ~50 M | All modalities | Smoke-testing & education (same architectures, miniaturised) | 4K |
- Original architectures for both models. Every block (RMSNorm, RoPE + YaRN,
GQA, MLA, SwiGLU, fine-grained MoE with auxiliary-loss-free
load balancing, SigLIP-style vision tower, Whisper-style audio encoder,
EnCodec-style audio codec) is implemented with
torch.nn primitives. No transformers model imports.
- Tokenizers: byte-level BPE + 4096 audio code tokens trained via
scripts/train_tokenizer.py.
- Training pipelines: text pretraining, long-context extension, audio codec pretrain, omni-modal pretrain, code-screenshot pretrain, SFT, omni-SFT, DPO, tool-use SFT, DevOps SFT, RLEF (reinforcement learning from execution feedback), plus LoRA/QLoRA fine-tuning. FSDP2 and DeepSpeed-ZeRO-3 supported.
- Inference: KV-cache sampler (temperature / top-p / top-k / min-p / repetition penalty), multimodal chat with audio in/out + video frame sampling, minimal OpenAI-compatible HTTP server.
- Claude-Code-style runtime for Forge: agent loop, built-in tools
(filesystem, shell, http, notebook, PDF/document parser,
DevOps log + metric tools), MCP-protocol client, plugin manifests,
markdown skills (including a ready-to-use
log_triage skill), lifecycle hooks, slash commands (including /postmortem), sub-agents.
- Benchmark runners: HumanEval, MBPP, LiveCodeBench, MMLU, GSM8K, SWE-bench Lite.
- No trained checkpoints. The repo is functional code; training is
the user's responsibility, with full instructions in
docs/training-guide.md.
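
The sampler knobs listed under Inference (temperature, top-k, top-p, min-p) can be illustrated with a small pure-Python sketch. This is not the repo's sampler: `filter_probs` and `sample` are hypothetical names, the repetition penalty and the KV cache are left out, and the order in which the filters are applied is an assumption.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Softmax the logits, drop filtered tokens, and renormalise what is left."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be between 0 and 1")

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids ranked from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)

    if top_k > 0:  # keep only the k most likely tokens
        keep &= set(order[:top_k])
    if top_p < 1.0:  # keep the smallest prefix whose mass reaches top_p
        cumulative, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus
    if min_p > 0.0:  # drop tokens far less likely than the best token
        floor = min_p * probs[order[0]]
        keep &= {i for i in order if probs[i] >= floor}

    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def sample(logits, rng=random, **filters):
    """Draw one token id from the filtered distribution."""
    probs = filter_probs(logits, **filters)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

The most likely token always survives every filter, so the kept mass is never zero. Calling `sample(logits, temperature=0.3, top_p=0.9)` would then draw from a sharper, truncated distribution.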
- docs/architecture-aurora-72b.md
- docs/architecture-forge-460b.md
- docs/training-guide.md: end-to-end how-to-train
- docs/plugin-system.md: Claude-Code-style runtime reference
- docs/datasets.md: every dataset, license, fetch instructions
- docs/benchmarks.md: how to run each eval
- docs/beginner-glossary.md: RoPE, MoE, MLA, LoRA, … in plain English
chatbot/
├── README.md
├── LICENSE # Apache-2.0
├── pyproject.toml
├── requirements.txt
├── requirements-train.txt
├── configs/
│ ├── models/ # tiny, aurora-72b, forge-460b
│ ├── training/ # pretrain, long_context, sft, dpo, lora, tool-use-sft
│ └── runtime/ # default, aurora-chat, forge-coder
├── docs/ # all the architecture / training / plugin docs
├── plugins_examples/ # two ready-to-copy plugins
├── scripts/ # CLI entrypoints
├── src/chatbot/ # the Python package itself
│ ├── models/{common, vision, aurora_50b, forge_250b}
│ ├── tokenizer/
│ ├── data/
│ ├── training/
│ ├── inference/
│ ├── runtime/
│ ├── eval/
│ └── utils/
├── tests/
└── legacy/v0-tiny-chatbot/ # the original ~1M-param educational chatbot, preserved
# Windows / PowerShell
git clone https://github.com/<you>/chatbot.git
cd chatbot
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -e .
# For training:
pip install -r requirements-train.txt

# Linux / macOS
git clone https://github.com/<you>/chatbot.git
cd chatbot
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -r requirements-train.txt # for training

Verify the package imports and the tests pass:
python -c "import chatbot; print(chatbot.__version__)"
pytest tests/ -q

Check that the model configs really hit ~72B / ~460B before you commit to serious compute:
python scripts/count_params.py --model tiny
python scripts/count_params.py --model aurora-72b
python scripts/count_params.py --model forge-460b

A four-command flow that exercises tokenizer training, pretraining, SFT, and chat, all on the tiny config, on CPU, in minutes:
# 1) Tokenizer.
python scripts/train_tokenizer.py `
--files tests/conftest.py `
--vocab-size 1024 `
--output checkpoints/tiny-tok.json
# 2) Pretrain a few steps. (Loss should decrease on a fixed seed.)
python scripts/pretrain.py `
--model tiny `
--tokenizer checkpoints/tiny-tok.json `
max_steps=10 micro_batch_size=2
# 3) SFT a few more steps.
python scripts/sft.py `
--model tiny `
--tokenizer checkpoints/tiny-tok.json `
--resume-from outputs/pretrain/latest `
max_steps=10 micro_batch_size=2
# 4) Chat with the tiny checkpoint.
python scripts/chat.py `
--model tiny `
--checkpoint outputs/sft/latest `
--tokenizer checkpoints/tiny-tok.json

If those four commands succeed end-to-end, the bigger configs exercise the same code paths on a cluster, just with bigger numbers in YAML.
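
The trailing `key=value` arguments in the commands above (for example `max_steps=10`) override individual config leaves. How the scripts parse them is not shown here, so the following is only a sketch of one plausible approach; `apply_overrides` is a hypothetical helper, not a function from the repo.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict in place and return it."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        try:
            # Turn "10" into 10, "0.3" into 0.3, "True" into True, and so on.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # anything else (paths, names) stays a plain string
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Walk down the tree, creating intermediate levels as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


config = {"max_steps": 1000, "runtime": {"temperature": 0.7}}
apply_overrides(config, ["max_steps=10", "micro_batch_size=2", "runtime.temperature=0.3"])
print(config)
```

Dotted keys address nested sections, which is why the same mechanism can cover both flat training knobs and nested runtime settings.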
Every stage is one script invocation. See
docs/training-guide.md for full
hyperparameter tables, hardware sizing, and the long-context extension
recipe. Headline commands:
# Aurora-72B
torchrun --nproc-per-node=8 scripts/pretrain.py \
--model aurora-72b --training pretrain \
--tokenizer checkpoints/aurora-tokenizer.json
torchrun --nproc-per-node=8 scripts/pretrain.py \
--model aurora-72b --training long_context \
--tokenizer checkpoints/aurora-tokenizer.json \
resume_from=outputs/pretrain/latest
torchrun --nproc-per-node=8 scripts/sft.py --model aurora-72b --tokenizer ...
torchrun --nproc-per-node=8 scripts/dpo.py --model aurora-72b --tokenizer ...

Forge adds one extra stage (tool-use SFT) at the end. Compute budgets are substantial:
| Model | Pretrain (full) | SFT (full) | LoRA | QLoRA |
|---|---|---|---|---|
| Tiny (~50M) | 1 GPU | 1 GPU | 1 GPU | 1 GPU |
| Aurora-72B | 256–1024× H100 | 32–64× H100 | 8× H100 | 2–4× A100 / RTX 6000 |
| Forge-460B | 1024–4096× H100 | 64–128× H100 | 16× H100 | 4–8× H100 (NF4) |
LoRA / QLoRA paths (in scripts/lora_finetune.py) make continued
training feasible on a small fraction of that hardware; see the LoRA and QLoRA columns above.
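
LoRA shrinks the training footprint by freezing each weight matrix and training two low-rank factors in its place. A rough sketch of the arithmetic follows; the hidden size and rank are hypothetical values chosen for illustration, not numbers taken from the repo's configs.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters in the two LoRA factors: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


d_model = 8192  # hypothetical hidden size
rank = 16       # hypothetical LoRA rank

full = full_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)
print(f"full matrix: {full:,} params")
print(f"LoRA factors: {lora:,} params ({lora / full:.4%} of full)")
```

Only the small factors need gradients and optimizer state, which is where most of the memory saving comes from. QLoRA additionally stores the frozen base weights in 4-bit form.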
python scripts/chat.py `
--model aurora-72b `
--checkpoint outputs/dpo/latest `
--tokenizer checkpoints/aurora-tokenizer.json `
--runtime aurora-chat

The runtime config (configs/runtime/aurora-chat.yaml) controls system
prompt, temperature, top-p, top-k, min-p, repetition penalty, stop
sequences, max-new-tokens, and context window. Override any leaf from
the CLI:
python scripts/chat.py ... runtime.temperature=0.3

python scripts/agent.py `
--model forge-460b `
--checkpoint outputs/tool_use_sft/latest `
--tokenizer checkpoints/forge-tokenizer.json `
--runtime forge-coder

This launches the Claude-Code-style loop: the model can read the
filesystem, run shell commands (subject to allowlist), make HTTP
requests, execute Python notebook cells, and call any plugin or MCP
server discovered under ~/.chatbot/plugins/.
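
To make "subject to allowlist" concrete, here is a minimal sketch of how a shell tool might vet a command before running it. The allowlist contents, the `parse_allowed` helper and the rejected-character set are all assumptions for illustration; the repo's actual policy may differ.

```python
import shlex

# Hypothetical allowlist; a real deployment would load this from config.
ALLOWED_COMMANDS = {"ls", "cat", "git", "pytest"}

# Characters that let a shell chain, redirect, or substitute commands.
FORBIDDEN_CHARS = set(";|&<>`$\n")


def parse_allowed(command_line, allowed=ALLOWED_COMMANDS):
    """Return the argv list if the command is permitted, otherwise None."""
    if any(ch in FORBIDDEN_CHARS for ch in command_line):
        return None
    try:
        argv = shlex.split(command_line)
    except ValueError:
        return None  # unbalanced quotes and similar parse errors
    if not argv or argv[0] not in allowed:
        return None
    return argv


for candidate in ["git status", "rm -rf /", "ls && rm notes.txt"]:
    print(f"{candidate!r} -> {parse_allowed(candidate)}")
```

The returned argv is meant to be passed to something like `subprocess.run(argv, shell=False)`, so the command never goes through a shell. Checking only the executable name is a coarse policy: an allowlisted program such as `git` can still do a lot, so a stricter tool would also vet subcommands and paths.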
Plugin authoring guide: docs/plugin-system.md.
Working examples: plugins_examples/.
python scripts/eval.py `
--bench humaneval `
--model forge-460b `
--checkpoint outputs/dpo/latest `
--tokenizer checkpoints/forge-tokenizer.json `
--limit 164

See docs/benchmarks.md for the full benchmark
list and instructions for SWE-bench Lite (which needs a Docker-backed
sandbox per problem — we ship the patch-extraction half; you supply the
runner).
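
Code benchmarks such as HumanEval are commonly reported as pass@k. Whether scripts/eval.py reports it this way is an assumption; the sketch below only shows the standard unbiased estimator, with `pass_at_k` as a hypothetical helper name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for one problem, 3 of them passed the tests.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is the mean of this value over all problems. For k=1 it reduces to the plain fraction of passing samples.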
Architectures are original, but the design ideas are openly inspired by:
- Llama 3.1 / 3.2-Vision, Gemma 3, Mistral — dense + SwiGLU + GQA + RoPE.
- SigLIP / SigLIP2 — vision encoder pattern.
- DeepSeek-V3 — fine-grained MoE, shared experts, AL-free balancing, MLA.
- Qwen3-Coder — long-context, repo-level packed training.
- Claude Code — plugin / skill / hook / slash-command / MCP runtime layer.
The code in this repository is our own. We use PyTorch (framework),
tokenizers (BPE library), datasets (data loading), and
bitsandbytes (quantization), but the model architectures themselves are
implemented from scratch.
Apache-2.0, see LICENSE. The model code is licensed as the
project's own work; trained weights (if you produce any) are governed by
the licenses of the datasets you trained on — see
docs/datasets.md for license metadata on every
dataset entry in the registry.
Q: Do you ship checkpoints?
No. The repo is functional code. Training the real 72B / 460B models is
a compute commitment we don't make on your behalf — but you can do it
yourself with the scripts here, with the exact recipes documented in
docs/training-guide.md.
Q: Can I run the 72B / 460B models on a single GPU?
For full inference: no. In bf16 the weights alone are roughly 144 GB
(Aurora) and 920 GB (Forge), and the KV cache at full context adds tens of
gigabytes more. For fine-tuning: QLoRA brings the requirement down to a
handful of GPUs (see the table above). For learning / experimentation: use the tiny config.
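
A back-of-the-envelope sketch of where that memory goes. The parameter count comes from the table at the top; the layer count, KV-head count and head dimension are hypothetical placeholders, not values read from the Aurora config, and the cache formula assumes a plain GQA layout (MLA compresses the cache and would come out smaller).

```python
GIB = 2 ** 30


def weight_gib(n_params, bytes_per_param):
    """Memory for the weights alone at a given precision."""
    return n_params * bytes_per_param / GIB


def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """KV cache size: one key and one value vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch / GIB


# Weights for a ~72B dense model at two precisions.
print(f"bf16 weights: {weight_gib(72e9, 2):.1f} GiB")
print(f"4-bit weights: {weight_gib(72e9, 0.5):.1f} GiB")

# KV cache at 256K tokens under the hypothetical shape below.
cache = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=256 * 1024)
print(f"KV cache at 256K context: {cache:.1f} GiB")
```

Swapping in the real values from configs/models/ gives the actual figures. The point is that weights and cache are separate budgets and both have to fit.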
Q: Why not transformers?
We deliberately implement architectures from scratch so the code is
readable end-to-end. transformers is great for production, but it
hides the moving parts that this repo is designed to teach.
Q: Do these models actually beat Claude / GPT?
The repo is the model code, not the result of training it. We're not claiming the resulting checkpoints will match production models, which took thousands of GPU-years. We are claiming the architectures and recipes are sound and faithfully reproduce the ideas behind those models.
Open an issue if anything is unclear; PRs welcome. The codebase is deliberately small (about 5K lines of model / runtime code) so it's easy to read and modify.