Aibys2

Aibys2 is a runnable from-scratch LLM starter for an Indonesia-rooted AI ecosystem.

The goal is not to build an "Indonesian-only AI". The stronger direction is:

use English-first pretraining for broad reasoning, code, science, and general language capability
add Indonesian strength through curated local data, bilingual instruction tuning, document workflows, OCR, tool calling, and Aibys domain apps
keep the scripts open and reproducible so other builders can run the same pipeline for free

This repo currently focuses on a working training pipeline. It does not ship pretrained weights.

License

Aibys2 is free to use, modify, train with, and build on, including for commercial AI models, paid AI applications, and hosted AI services.

Attribution is required:

Aibys2 by Arlchoose-code

You may not sell, repackage, or redistribute this source code repository itself as a standalone paid product without permission. See LICENSE and NOTICE.

What Works Now

The tiny pipeline has already been tested end-to-end:

prepare sample data
-> train tokenizer
-> tokenize train/valid text
-> initialize Aibys2 from random weights
-> train
-> evaluate on valid split
-> save checkpoints

Verified command:

py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yaml

Expected outputs include:

data/tokenized/train.bin
data/tokenized/valid.bin
checkpoints/tiny/step_10.pt
checkpoints/tiny/step_20.pt

Tests:

py -3.10 -m pytest -q

Expected result:

3 passed

Architecture Direction

Aibys2 is a decoder-only Transformer skeleton with modern model components:

RMSNorm pre-normalization
SwiGLU feed-forward layers
Multi-head Latent Attention style KV compression
alternating local/global attention windows
routed + shared MoE layers
Multi-Token Prediction head
vision-token interface for later image stages
chat template and JSON tool-calling utilities

The code is intentionally configurable. The tiny model proves the scripts work; mini/base configs follow the same path with larger dimensions, more data, and more compute.

Environment

On this machine, default python points to Python 3.15 beta. PyTorch is not available for that version. Use Python 3.10 for training:

py -3.10 -m pip install -r requirements.txt

Then use py -3.10 for all training commands:

py -3.10 scripts/smoke.py

If you are on Linux/macOS with a normal Python 3.10-3.12 environment, python may be enough:

python -m pip install -r requirements.txt
python scripts/smoke.py

Project Structure

configs/
  tiny.yaml              tiny model architecture for smoke training
  mini.yaml              development-size model architecture
  base.yaml              larger target architecture
  train_tiny.yaml        tiny training run config
  train_mini.yaml        mini training run config
  train_sft_tiny.yaml    tiny SFT continuation config

data/
  raw/                   raw source text, separated by source/domain
  processed/             train.txt and valid.txt
  tokenized/             train.bin and valid.bin
  vision/                future image/caption/OCR data
  tool/                  future tool-calling data

model/                   Aibys2 model architecture
tokenizer/               tokenizer training and chat template
training/                data/loss/optimizer/trainer utilities
tools/                   tool schema/parser/executor
inference/               generation, chat loop, FastAPI server skeleton
export/                  GGUF/Ollama placeholders
scripts/                 runnable pipeline scripts
tests/                   pytest checks
docs/                    deeper docs

Data Layout

Put original corpora under data/raw/:

data/raw/
  english/
    web.txt
    books.txt
    code.txt
    science.txt
  indonesia/
    wiki_id.txt
    news_id.txt
    legal_docs.txt
    medical_docs.txt
    invoices_ocr.txt
  mixed/
    bilingual_chat.txt
    aibys_app_data.txt

Then create the curated training files:

data/processed/train.txt
data/processed/valid.txt

After tokenization:

data/tokenized/train.bin
data/tokenized/valid.bin

Current text pretraining scripts expect plain UTF-8 text. One training example can be one line or paragraph. Keep the validation split separate from training data.

Detailed dataset format docs:

docs/DATASETS.md

That file covers text pretraining, chat/SFT, tool calling, image/vision manifests, and local app usage.

Data Mix Direction

Recommended early data mix:

60-75% English general/code/science
15-30% Indonesian general/domain data
5-10% bilingual/chat/tool/document examples

Reasoning:

English data gives the model stronger general and technical grounding.
Indonesian data teaches local language, context, terminology, and workflows.
Aibys-specific data makes the model useful for document AI apps, not just chat.

For larger runs, prioritize quality over size:

deduplicate aggressively
remove broken encoding and boilerplate
filter spam, repeated templates, and low-information pages
keep source metadata outside the training text if possible
keep a clean validation set that is never included in train

One-Command Tiny Training

Run the full tiny pipeline:

py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yaml

This does:

1. scripts/prepare_sample_data.py
2. tokenizer/train_tokenizer.py
3. scripts/tokenize_data.py for train.txt
4. scripts/tokenize_data.py for valid.txt
5. scripts/train.py

This is the fastest sanity check. It trains a tiny random model for 20 steps and saves checkpoints.

Manual Training Pipeline

Use this when replacing the sample data with your real data.

1. Prepare Data

Generate sample data:

py -3.10 scripts/prepare_sample_data.py

For real data, put your own files into data/raw/, then build:

data/processed/train.txt
data/processed/valid.txt

Use the included simple builder:

py -3.10 scripts/build_text_dataset.py --raw-dir data/raw --output-dir data/processed --valid-ratio 0.02

This is a simple cleaner/splitter for local .txt files. It is not a web crawler.

2. Train Tokenizer

Tiny/dev tokenizer:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 8000

Mini/base tokenizer:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 64000

If tokenizers is unavailable and you only want a dependency-free smoke run:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --byte-fallback

For serious training, use the real BPE tokenizer, not byte fallback.

3. Tokenize Train And Valid

py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/train.txt --output data/tokenized/train.bin
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/valid.txt --output data/tokenized/valid.bin

The .bin files are uint16 token ID streams. This is simple and fast enough for the starter pipeline.

Important: vocab_size in the model config must be large enough for the tokenizer. If tokenizer vocab is 8000, model config should use vocab_size: 8000. If tokenizer vocab is 64000, model config should use vocab_size: 64000.

4. Train Model From Zero

Tiny:

py -3.10 scripts/train.py --config configs/train_tiny.yaml

Mini:

py -3.10 scripts/train.py --config configs/train_mini.yaml

Training starts from random initialization. Checkpoints are saved to the configured output_dir.

Training Configs

Training configs are separate from model architecture configs.

Example: configs/train_tiny.yaml

model_config: configs/tiny.yaml
train_bin: data/tokenized/train.bin
valid_bin: data/tokenized/valid.bin
output_dir: checkpoints/tiny
batch_size: 4
seq_len: 32
max_steps: 20
learning_rate: 0.0003
weight_decay: 0.1
warmup_steps: 2
eval_every: 10
save_every: 10
log_every: 1
device: auto

Meaning:

Key	Meaning
`model_config`	Which architecture YAML to load
`train_bin`	Tokenized training data
`valid_bin`	Tokenized validation data
`output_dir`	Where checkpoints are saved
`batch_size`	Number of samples per optimizer step
`seq_len`	Tokens per sample
`max_steps`	Number of optimizer steps
`learning_rate`	AdamW learning rate
`weight_decay`	AdamW weight decay
`warmup_steps`	Linear warmup before cosine decay
`eval_every`	Validate every N steps
`save_every`	Save checkpoint every N steps
`log_every`	Print training loss every N steps
`device`	`auto`, `cpu`, or `cuda`

Model Configs

Model size is controlled by:

configs/tiny.yaml
configs/mini.yaml
configs/base.yaml

Key parameters:

Key	Meaning
`vocab_size`	Must match tokenizer size
`hidden_dim`	Main model width
`num_layers`	Number of Transformer blocks
`num_heads`	Attention heads
`num_kv_heads`	Reserved for GQA-style compatibility
`mla_latent_dim`	Latent KV compression size
`ffn_dim`	FFN hidden dimension
`max_seq_len`	Maximum context length
`local_window`	Sliding window size for local attention layers
`global_every`	Every Nth layer uses global attention
`num_routed_experts`	MoE routed experts
`num_active_experts`	Routed experts active per token
`num_shared_experts`	Shared experts always active
`moe_every`	Every Nth layer uses MoE
`mtp_depth`	Number of future-token heads
`vision_image_size`	Future vision input size
`vision_max_tokens`	Future max image tokens

How To Scale To Bigger Models

The training direction stays the same:

raw data
-> processed train/valid text
-> train tokenizer
-> tokenize to .bin
-> train from random init
-> eval
-> checkpoint

Scaling means changing config, data size, and compute.

Tiny To Mini

Use:

configs/mini.yaml
configs/train_mini.yaml

Make sure:

tokenizer vocab is 64000
configs/mini.yaml has vocab_size: 64000
data/tokenized/train.bin is much larger than seq_len
GPU memory is enough for batch_size and seq_len

Command:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 64000
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/train.txt --output data/tokenized/train.bin
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/valid.txt --output data/tokenized/valid.bin
py -3.10 scripts/train.py --config configs/train_mini.yaml

Mini To Base

Create a new train config, for example:

configs/train_base.yaml

Use:

model_config: configs/base.yaml
train_bin: data/tokenized/train.bin
valid_bin: data/tokenized/valid.bin
output_dir: checkpoints/base
batch_size: 1
seq_len: 2048
max_steps: 100000
learning_rate: 0.0003
weight_decay: 0.1
warmup_steps: 2000
eval_every: 1000
save_every: 1000
log_every: 10
device: cuda

Then run:

py -3.10 scripts/train.py --config configs/train_base.yaml

Base-scale training will need more infrastructure before a serious run:

distributed training
sharded checkpoints
larger binary dataset loading
better validation/evaluation

The current starter proves the path and architecture wiring. It is not yet optimized for multi-GPU training.

Recommended Training Stages

Stage 1: Text Pretraining

Goal: base language model.

Data:

English web/books/science/code
Indonesian Wikipedia/news/docs
Aibys domain documents

Objective:

next-token prediction

Current script:

py -3.10 scripts/train.py --config configs/train_tiny.yaml

Full tiny pretraining pipeline:

py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yaml

Stage 2: Instruction Tuning

Goal: assistant behavior.

Data:

English and Indonesian chat examples
summarization
document QA
coding/debugging
medical/legal/research style outputs

Status:

chat template exists
scripts/prepare_sample_sft.py can create sample chat/tool data
scripts/build_sft_dataset.py can convert chat JSONL into Aibys2 template text
a dedicated SFT trainer is not implemented yet

Tiny SFT continuation pipeline:

py -3.10 scripts/run_sft_pipeline.py --base-checkpoint checkpoints/tiny/step_20.pt

This converts chat/tool JSONL to Aibys2 template text, tokenizes it, then continues training from a base checkpoint.

Stage 3: Tool Calling

Goal: structured JSON function calls.

Data:

conversations with tool definitions
expected tool calls
tool results
final answers after tool results

Status:

parser/executor utilities exist
sample tool-call SFT data can be converted with scripts/build_sft_dataset.py
dedicated tool-call evaluation/training metrics are not implemented yet

Tool calling uses the same chat/SFT JSONL format with a tools field. See docs/DATASETS.md for the exact JSON example.

Stage 4: Vision

Goal: image understanding, OCR, visual QA.

Data:

image-caption pairs
document images
OCR targets
receipts/invoices/forms

Status:

vision token module exists
scripts/build_vision_manifest.py validates image-caption manifests
full vision training stage is planned

Vision data uses data/vision/images/ plus data/vision/captions.jsonl. See docs/DATASETS.md.

Checkpoints

Checkpoints are saved as:

checkpoints/tiny/step_10.pt
checkpoints/tiny/step_20.pt

For mini:

checkpoints/mini/step_1000.pt
checkpoints/mini/step_2000.pt
...

Current checkpoint format:

{
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict() if scaler is not None else None,
    "config": model_cfg.__dict__,
    "train_config": train_cfg,
    "step": step,
}

Resume training:

py -3.10 scripts/train.py --config configs/train_tiny.yaml --resume checkpoints/tiny/step_20.pt

Generate from a checkpoint:

py -3.10 scripts/generate.py --checkpoint checkpoints/tiny/step_20.pt --prompt "Aibys2 is" --max-new-tokens 32

The tiny sample checkpoint will produce low-quality/random text because it only trains for 20 steps on sample data. The command exists to verify checkpoint loading and inference wiring.

Loading A Model In Your Own App

Aibys2 is the model/checkpoint. Any interface can call it directly.

from inference.runtime import Aibys2Runtime

runtime = Aibys2Runtime("checkpoints/tiny/step_20.pt", "tokenizer/aibys2.json")
text = runtime.generate("Aibys2 is", max_new_tokens=32)
print(text)

The HTTP server is only an optional adapter. It is not an OpenAI model and does not call OpenAI.

Local HTTP Adapter

Server skeleton:

py -3.10 scripts/serve.py

Run without checkpoint:

py -3.10 scripts/serve.py

Run with your local checkpoint:

$env:AIBYS2_CHECKPOINT="checkpoints/tiny/step_20.pt"
$env:AIBYS2_TOKENIZER="tokenizer/aibys2.json"
py -3.10 scripts/serve.py

Endpoints:

GET  /v1/models
POST /v1/chat/completions
POST /v1/images/analyze

The chat endpoint loads and runs your local Aibys2 checkpoint when AIBYS2_CHECKPOINT is set.

GGUF And Ollama

Files:

export/to_gguf.py
export/Modelfile

GGUF export is a placeholder. Real export needs a trained checkpoint and a conversion mapping compatible with llama.cpp/Ollama.

Common Errors

`torch is not installed`

You are probably using Python 3.15 beta. Use Python 3.10:

py -3.10 scripts/train.py --config configs/train_tiny.yaml

`tokenized data is shorter than seq_len`

Your .bin file is too small for the configured context length.

Fix one of these:

add more data
lower seq_len
use a smaller config for smoke testing

Token ID out of range

Tokenizer vocab and model vocab_size do not match.

Fix:

if tokenizer is 8000, set vocab_size: 8000
if tokenizer is 64000, set vocab_size: 64000

Training runs but loss is huge

This is normal for a random tiny model on tiny data. The sample run proves code execution, not model quality.

Current Limitations

Working:

tokenizer training
train/valid tokenization
raw .txt to train/valid builder
tiny model training from zero
gradient accumulation
mixed precision config (bf16/fp16 where supported)
optional MTP loss in the main train script
validation loss
checkpoint saving
resume checkpoint
checkpoint loading for generation
top-k, top-p, repetition penalty generation
SFT/chat JSONL to template text converter
vision manifest builder
local runtime class for apps
optional HTTP adapter that can load a local checkpoint
tests

Not production-ready yet:

distributed training
sharded checkpoints
memory-mapped or sharded large dataset loader
serious data dedup/filtering beyond the simple builder
SFT/instruction trainer
vision training
tool-calling training
GGUF export

Philosophy

Aibys2 should be a foundation for Indonesian builders and workflows, not a narrow language demo.

The model should learn broad capability from English and global technical data, then become useful locally through Indonesian data, document AI tasks, tool calling, OCR, domain workflows, and open reproducible scripts.

See docs/TRAINING_PIPELINE.md for additional notes.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
configs		configs
docs		docs
export		export
inference		inference
model		model
scripts		scripts
tests		tests
tokenizer		tokenizer
tools		tools
training		training
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Aibys2

License

What Works Now

Architecture Direction

Environment

Project Structure

Data Layout

Data Mix Direction

One-Command Tiny Training

Manual Training Pipeline

1. Prepare Data

2. Train Tokenizer

3. Tokenize Train And Valid

4. Train Model From Zero

Training Configs

Model Configs

How To Scale To Bigger Models

Tiny To Mini

Mini To Base

Recommended Training Stages

Stage 1: Text Pretraining

Stage 2: Instruction Tuning

Stage 3: Tool Calling

Stage 4: Vision

Checkpoints

Loading A Model In Your Own App

Local HTTP Adapter

GGUF And Ollama

Common Errors

torch is not installed

tokenized data is shorter than seq_len

Token ID out of range

Training runs but loss is huge

Current Limitations

Philosophy

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`torch is not installed`

`tokenized data is shorter than seq_len`

Packages