Aibys2 is a runnable from-scratch LLM starter for an Indonesia-rooted AI ecosystem.
The goal is not to build an "Indonesian-only AI". The stronger direction is:
- use English-first pretraining for broad reasoning, code, science, and general language capability
- add Indonesian strength through curated local data, bilingual instruction tuning, document workflows, OCR, tool calling, and Aibys domain apps
- keep the scripts open and reproducible so other builders can run the same pipeline for free
This repo currently focuses on a working training pipeline. It does not ship pretrained weights.
Aibys2 is free to use, modify, train with, and build on, including for commercial AI models, paid AI applications, and hosted AI services.
Attribution is required:
Aibys2 by Arlchoose-code
You may not sell, repackage, or redistribute this source code repository itself as a standalone paid product without permission. See LICENSE and NOTICE.
The tiny pipeline has already been tested end-to-end:
prepare sample data
-> train tokenizer
-> tokenize train/valid text
-> initialize Aibys2 from random weights
-> train
-> evaluate on valid split
-> save checkpoints
Verified command:
py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yamlExpected outputs include:
data/tokenized/train.bin
data/tokenized/valid.bin
checkpoints/tiny/step_10.pt
checkpoints/tiny/step_20.pt
Tests:
py -3.10 -m pytest -qExpected result:
3 passed
Aibys2 is a decoder-only Transformer skeleton with modern model components:
- RMSNorm pre-normalization
- SwiGLU feed-forward layers
- Multi-head Latent Attention style KV compression
- alternating local/global attention windows
- routed + shared MoE layers
- Multi-Token Prediction head
- vision-token interface for later image stages
- chat template and JSON tool-calling utilities
The code is intentionally configurable. The tiny model proves the scripts work; mini/base configs follow the same path with larger dimensions, more data, and more compute.
On this machine, default python points to Python 3.15 beta. PyTorch is not available for that version. Use Python 3.10 for training:
py -3.10 -m pip install -r requirements.txtThen use py -3.10 for all training commands:
py -3.10 scripts/smoke.pyIf you are on Linux/macOS with a normal Python 3.10-3.12 environment, python may be enough:
python -m pip install -r requirements.txt
python scripts/smoke.pyconfigs/
tiny.yaml tiny model architecture for smoke training
mini.yaml development-size model architecture
base.yaml larger target architecture
train_tiny.yaml tiny training run config
train_mini.yaml mini training run config
train_sft_tiny.yaml tiny SFT continuation config
data/
raw/ raw source text, separated by source/domain
processed/ train.txt and valid.txt
tokenized/ train.bin and valid.bin
vision/ future image/caption/OCR data
tool/ future tool-calling data
model/ Aibys2 model architecture
tokenizer/ tokenizer training and chat template
training/ data/loss/optimizer/trainer utilities
tools/ tool schema/parser/executor
inference/ generation, chat loop, FastAPI server skeleton
export/ GGUF/Ollama placeholders
scripts/ runnable pipeline scripts
tests/ pytest checks
docs/ deeper docs
Put original corpora under data/raw/:
data/raw/
english/
web.txt
books.txt
code.txt
science.txt
indonesia/
wiki_id.txt
news_id.txt
legal_docs.txt
medical_docs.txt
invoices_ocr.txt
mixed/
bilingual_chat.txt
aibys_app_data.txt
Then create the curated training files:
data/processed/train.txt
data/processed/valid.txt
After tokenization:
data/tokenized/train.bin
data/tokenized/valid.bin
Current text pretraining scripts expect plain UTF-8 text. One training example can be one line or paragraph. Keep the validation split separate from training data.
Detailed dataset format docs:
docs/DATASETS.md
That file covers text pretraining, chat/SFT, tool calling, image/vision manifests, and local app usage.
Recommended early data mix:
60-75% English general/code/science
15-30% Indonesian general/domain data
5-10% bilingual/chat/tool/document examples
Reasoning:
- English data gives the model stronger general and technical grounding.
- Indonesian data teaches local language, context, terminology, and workflows.
- Aibys-specific data makes the model useful for document AI apps, not just chat.
For larger runs, prioritize quality over size:
- deduplicate aggressively
- remove broken encoding and boilerplate
- filter spam, repeated templates, and low-information pages
- keep source metadata outside the training text if possible
- keep a clean validation set that is never included in train
Run the full tiny pipeline:
py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yamlThis does:
1. scripts/prepare_sample_data.py
2. tokenizer/train_tokenizer.py
3. scripts/tokenize_data.py for train.txt
4. scripts/tokenize_data.py for valid.txt
5. scripts/train.py
This is the fastest sanity check. It trains a tiny random model for 20 steps and saves checkpoints.
Use this when replacing the sample data with your real data.
Generate sample data:
py -3.10 scripts/prepare_sample_data.pyFor real data, put your own files into data/raw/, then build:
data/processed/train.txt
data/processed/valid.txt
Use the included simple builder:
py -3.10 scripts/build_text_dataset.py --raw-dir data/raw --output-dir data/processed --valid-ratio 0.02This is a simple cleaner/splitter for local .txt files. It is not a web crawler.
Tiny/dev tokenizer:
py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 8000Mini/base tokenizer:
py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 64000If tokenizers is unavailable and you only want a dependency-free smoke run:
py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --byte-fallbackFor serious training, use the real BPE tokenizer, not byte fallback.
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/train.txt --output data/tokenized/train.bin
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/valid.txt --output data/tokenized/valid.binThe .bin files are uint16 token ID streams. This is simple and fast enough for the starter pipeline.
Important: vocab_size in the model config must be large enough for the tokenizer. If tokenizer vocab is 8000, model config should use vocab_size: 8000. If tokenizer vocab is 64000, model config should use vocab_size: 64000.
Tiny:
py -3.10 scripts/train.py --config configs/train_tiny.yamlMini:
py -3.10 scripts/train.py --config configs/train_mini.yamlTraining starts from random initialization. Checkpoints are saved to the configured output_dir.
Training configs are separate from model architecture configs.
Example: configs/train_tiny.yaml
model_config: configs/tiny.yaml
train_bin: data/tokenized/train.bin
valid_bin: data/tokenized/valid.bin
output_dir: checkpoints/tiny
batch_size: 4
seq_len: 32
max_steps: 20
learning_rate: 0.0003
weight_decay: 0.1
warmup_steps: 2
eval_every: 10
save_every: 10
log_every: 1
device: autoMeaning:
| Key | Meaning |
|---|---|
model_config |
Which architecture YAML to load |
train_bin |
Tokenized training data |
valid_bin |
Tokenized validation data |
output_dir |
Where checkpoints are saved |
batch_size |
Number of samples per optimizer step |
seq_len |
Tokens per sample |
max_steps |
Number of optimizer steps |
learning_rate |
AdamW learning rate |
weight_decay |
AdamW weight decay |
warmup_steps |
Linear warmup before cosine decay |
eval_every |
Validate every N steps |
save_every |
Save checkpoint every N steps |
log_every |
Print training loss every N steps |
device |
auto, cpu, or cuda |
Model size is controlled by:
configs/tiny.yaml
configs/mini.yaml
configs/base.yaml
Key parameters:
| Key | Meaning |
|---|---|
vocab_size |
Must match tokenizer size |
hidden_dim |
Main model width |
num_layers |
Number of Transformer blocks |
num_heads |
Attention heads |
num_kv_heads |
Reserved for GQA-style compatibility |
mla_latent_dim |
Latent KV compression size |
ffn_dim |
FFN hidden dimension |
max_seq_len |
Maximum context length |
local_window |
Sliding window size for local attention layers |
global_every |
Every Nth layer uses global attention |
num_routed_experts |
MoE routed experts |
num_active_experts |
Routed experts active per token |
num_shared_experts |
Shared experts always active |
moe_every |
Every Nth layer uses MoE |
mtp_depth |
Number of future-token heads |
vision_image_size |
Future vision input size |
vision_max_tokens |
Future max image tokens |
The training direction stays the same:
raw data
-> processed train/valid text
-> train tokenizer
-> tokenize to .bin
-> train from random init
-> eval
-> checkpoint
Scaling means changing config, data size, and compute.
Use:
configs/mini.yaml
configs/train_mini.yaml
Make sure:
- tokenizer vocab is
64000 configs/mini.yamlhasvocab_size: 64000data/tokenized/train.binis much larger thanseq_len- GPU memory is enough for
batch_sizeandseq_len
Command:
py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 64000
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/train.txt --output data/tokenized/train.bin
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/valid.txt --output data/tokenized/valid.bin
py -3.10 scripts/train.py --config configs/train_mini.yamlCreate a new train config, for example:
configs/train_base.yaml
Use:
model_config: configs/base.yaml
train_bin: data/tokenized/train.bin
valid_bin: data/tokenized/valid.bin
output_dir: checkpoints/base
batch_size: 1
seq_len: 2048
max_steps: 100000
learning_rate: 0.0003
weight_decay: 0.1
warmup_steps: 2000
eval_every: 1000
save_every: 1000
log_every: 10
device: cudaThen run:
py -3.10 scripts/train.py --config configs/train_base.yamlBase-scale training will need more infrastructure before a serious run:
- distributed training
- sharded checkpoints
- larger binary dataset loading
- better validation/evaluation
The current starter proves the path and architecture wiring. It is not yet optimized for multi-GPU training.
Goal: base language model.
Data:
- English web/books/science/code
- Indonesian Wikipedia/news/docs
- Aibys domain documents
Objective:
- next-token prediction
Current script:
py -3.10 scripts/train.py --config configs/train_tiny.yamlFull tiny pretraining pipeline:
py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yamlGoal: assistant behavior.
Data:
- English and Indonesian chat examples
- summarization
- document QA
- coding/debugging
- medical/legal/research style outputs
Status:
- chat template exists
scripts/prepare_sample_sft.pycan create sample chat/tool datascripts/build_sft_dataset.pycan convert chat JSONL into Aibys2 template text- a dedicated SFT trainer is not implemented yet
Tiny SFT continuation pipeline:
py -3.10 scripts/run_sft_pipeline.py --base-checkpoint checkpoints/tiny/step_20.ptThis converts chat/tool JSONL to Aibys2 template text, tokenizes it, then continues training from a base checkpoint.
Goal: structured JSON function calls.
Data:
- conversations with tool definitions
- expected tool calls
- tool results
- final answers after tool results
Status:
- parser/executor utilities exist
- sample tool-call SFT data can be converted with
scripts/build_sft_dataset.py - dedicated tool-call evaluation/training metrics are not implemented yet
Tool calling uses the same chat/SFT JSONL format with a tools field. See docs/DATASETS.md for the exact JSON example.
Goal: image understanding, OCR, visual QA.
Data:
- image-caption pairs
- document images
- OCR targets
- receipts/invoices/forms
Status:
- vision token module exists
scripts/build_vision_manifest.pyvalidates image-caption manifests- full vision training stage is planned
Vision data uses data/vision/images/ plus data/vision/captions.jsonl. See docs/DATASETS.md.
Checkpoints are saved as:
checkpoints/tiny/step_10.pt
checkpoints/tiny/step_20.pt
For mini:
checkpoints/mini/step_1000.pt
checkpoints/mini/step_2000.pt
...
Current checkpoint format:
{
"model": model.state_dict(),
"optimizer": optimizer.state_dict(),
"scaler": scaler.state_dict() if scaler is not None else None,
"config": model_cfg.__dict__,
"train_config": train_cfg,
"step": step,
}Resume training:
py -3.10 scripts/train.py --config configs/train_tiny.yaml --resume checkpoints/tiny/step_20.ptGenerate from a checkpoint:
py -3.10 scripts/generate.py --checkpoint checkpoints/tiny/step_20.pt --prompt "Aibys2 is" --max-new-tokens 32The tiny sample checkpoint will produce low-quality/random text because it only trains for 20 steps on sample data. The command exists to verify checkpoint loading and inference wiring.
Aibys2 is the model/checkpoint. Any interface can call it directly.
from inference.runtime import Aibys2Runtime
runtime = Aibys2Runtime("checkpoints/tiny/step_20.pt", "tokenizer/aibys2.json")
text = runtime.generate("Aibys2 is", max_new_tokens=32)
print(text)The HTTP server is only an optional adapter. It is not an OpenAI model and does not call OpenAI.
Server skeleton:
py -3.10 scripts/serve.pyRun without checkpoint:
py -3.10 scripts/serve.pyRun with your local checkpoint:
$env:AIBYS2_CHECKPOINT="checkpoints/tiny/step_20.pt"
$env:AIBYS2_TOKENIZER="tokenizer/aibys2.json"
py -3.10 scripts/serve.pyEndpoints:
GET /v1/models
POST /v1/chat/completions
POST /v1/images/analyze
The chat endpoint loads and runs your local Aibys2 checkpoint when AIBYS2_CHECKPOINT is set.
Files:
export/to_gguf.py
export/Modelfile
GGUF export is a placeholder. Real export needs a trained checkpoint and a conversion mapping compatible with llama.cpp/Ollama.
You are probably using Python 3.15 beta. Use Python 3.10:
py -3.10 scripts/train.py --config configs/train_tiny.yamlYour .bin file is too small for the configured context length.
Fix one of these:
- add more data
- lower
seq_len - use a smaller config for smoke testing
Tokenizer vocab and model vocab_size do not match.
Fix:
- if tokenizer is
8000, setvocab_size: 8000 - if tokenizer is
64000, setvocab_size: 64000
This is normal for a random tiny model on tiny data. The sample run proves code execution, not model quality.
Working:
- tokenizer training
- train/valid tokenization
- raw
.txtto train/valid builder - tiny model training from zero
- gradient accumulation
- mixed precision config (
bf16/fp16where supported) - optional MTP loss in the main train script
- validation loss
- checkpoint saving
- resume checkpoint
- checkpoint loading for generation
- top-k, top-p, repetition penalty generation
- SFT/chat JSONL to template text converter
- vision manifest builder
- local runtime class for apps
- optional HTTP adapter that can load a local checkpoint
- tests
Not production-ready yet:
- distributed training
- sharded checkpoints
- memory-mapped or sharded large dataset loader
- serious data dedup/filtering beyond the simple builder
- SFT/instruction trainer
- vision training
- tool-calling training
- GGUF export
Aibys2 should be a foundation for Indonesian builders and workflows, not a narrow language demo.
The model should learn broad capability from English and global technical data, then become useful locally through Indonesian data, document AI tasks, tool calling, OCR, domain workflows, and open reproducible scripts.
See docs/TRAINING_PIPELINE.md for additional notes.