Aibys2

Aibys2 is a runnable from-scratch LLM starter for an Indonesia-rooted AI ecosystem.

The goal is not to build an "Indonesian-only AI". The stronger direction is:

  • use English-first pretraining for broad reasoning, code, science, and general language capability
  • add Indonesian strength through curated local data, bilingual instruction tuning, document workflows, OCR, tool calling, and Aibys domain apps
  • keep the scripts open and reproducible so other builders can run the same pipeline for free

This repo currently focuses on a working training pipeline. It does not ship pretrained weights.

License

Aibys2 is free to use, modify, train with, and build on, including for commercial AI models, paid AI applications, and hosted AI services.

Attribution is required:

Aibys2 by Arlchoose-code

You may not sell, repackage, or redistribute this source code repository itself as a standalone paid product without permission. See LICENSE and NOTICE.

What Works Now

The tiny pipeline has already been tested end-to-end:

prepare sample data
-> train tokenizer
-> tokenize train/valid text
-> initialize Aibys2 from random weights
-> train
-> evaluate on valid split
-> save checkpoints

Verified command:

py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yaml

Expected outputs include:

data/tokenized/train.bin
data/tokenized/valid.bin
checkpoints/tiny/step_10.pt
checkpoints/tiny/step_20.pt

Tests:

py -3.10 -m pytest -q

Expected result:

3 passed

Architecture Direction

Aibys2 is a decoder-only Transformer skeleton with modern model components:

  • RMSNorm pre-normalization
  • SwiGLU feed-forward layers
  • Multi-head Latent Attention style KV compression
  • alternating local/global attention windows
  • routed + shared MoE layers
  • Multi-Token Prediction head
  • vision-token interface for later image stages
  • chat template and JSON tool-calling utilities

The code is intentionally configurable. The tiny model proves the scripts work; mini/base configs follow the same path with larger dimensions, more data, and more compute.
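As a rough illustration of the local/global attention idea in the list above, the sketch below builds a causal attention mask in plain Python. The function name `causal_mask` and the convention that a window of `w` lets a token see itself plus the previous `w - 1` tokens are assumptions for illustration; they are not taken from the Aibys2 code.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    window=None gives a global causal mask; an integer gives a sliding
    window (assumed convention: self plus the previous window - 1 tokens).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future tokens
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


local = causal_mask(5, window=2)   # local layer: short lookback
global_ = causal_mask(5)           # global layer: full causal history
```

A local layer keeps attention cost bounded by the window size, while the occasional global layer lets information travel across the whole context.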

Environment

On the author's machine, the default python points to a Python 3.15 beta, for which PyTorch is not available. Use Python 3.10 for training (py is the Python launcher for Windows):

py -3.10 -m pip install -r requirements.txt

Then use py -3.10 for all training commands:

py -3.10 scripts/smoke.py

If you are on Linux/macOS with a normal Python 3.10-3.12 environment, python may be enough:

python -m pip install -r requirements.txt
python scripts/smoke.py

Project Structure

configs/
  tiny.yaml              tiny model architecture for smoke training
  mini.yaml              development-size model architecture
  base.yaml              larger target architecture
  train_tiny.yaml        tiny training run config
  train_mini.yaml        mini training run config
  train_sft_tiny.yaml    tiny SFT continuation config

data/
  raw/                   raw source text, separated by source/domain
  processed/             train.txt and valid.txt
  tokenized/             train.bin and valid.bin
  vision/                future image/caption/OCR data
  tool/                  future tool-calling data

model/                   Aibys2 model architecture
tokenizer/               tokenizer training and chat template
training/                data/loss/optimizer/trainer utilities
tools/                   tool schema/parser/executor
inference/               generation, chat loop, FastAPI server skeleton
export/                  GGUF/Ollama placeholders
scripts/                 runnable pipeline scripts
tests/                   pytest checks
docs/                    deeper docs

Data Layout

Put original corpora under data/raw/:

data/raw/
  english/
    web.txt
    books.txt
    code.txt
    science.txt
  indonesia/
    wiki_id.txt
    news_id.txt
    legal_docs.txt
    medical_docs.txt
    invoices_ocr.txt
  mixed/
    bilingual_chat.txt
    aibys_app_data.txt

Then create the curated training files:

data/processed/train.txt
data/processed/valid.txt

After tokenization:

data/tokenized/train.bin
data/tokenized/valid.bin

Current text pretraining scripts expect plain UTF-8 text. One training example can be one line or paragraph. Keep the validation split separate from training data.

Detailed dataset format docs:

docs/DATASETS.md

That file covers text pretraining, chat/SFT, tool calling, image/vision manifests, and local app usage.

Data Mix Direction

Recommended early data mix:

60-75% English general/code/science
15-30% Indonesian general/domain data
5-10% bilingual/chat/tool/document examples

Reasoning:

  • English data gives the model stronger general and technical grounding.
  • Indonesian data teaches local language, context, terminology, and workflows.
  • Aibys-specific data makes the model useful for document AI apps, not just chat.
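To turn the recommended mix into concrete numbers, a small helper can split a total token budget across sources. This is only a sketch: the function name `token_budget`, the source labels, the 1-billion-token total, and the specific fractions (one point inside the ranges above) are all illustrative assumptions.

```python
def token_budget(total_tokens: int, mix: dict) -> dict:
    """Split a total token budget across sources according to mix fractions."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1.0")
    return {name: round(total_tokens * frac) for name, frac in mix.items()}


# One point inside the recommended ranges (an illustrative choice).
mix = {"english": 0.70, "indonesian": 0.22, "bilingual_chat_tool": 0.08}
budget = token_budget(1_000_000_000, mix)
for name, tokens in budget.items():
    print(f"{name:>20}: {tokens:,} tokens")
```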

For larger runs, prioritize quality over size:

  • deduplicate aggressively
  • remove broken encoding and boilerplate
  • filter spam, repeated templates, and low-information pages
  • keep source metadata outside the training text if possible
  • keep a clean validation set that is never included in train
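A minimal sketch of the first few points, assuming line-level documents: it drops very short lines, lines containing the Unicode replacement character (a common sign of broken decoding), and exact duplicates after normalization. The names `normalize` and `clean_lines` and the `min_chars` threshold are hypothetical; this is not the filtering used by the repository's builder, and near-duplicate detection would need more than this.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    """Normalize Unicode, lowercase, and collapse whitespace for comparison."""
    return " ".join(unicodedata.normalize("NFKC", line).lower().split())


def clean_lines(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield lines that are long enough, cleanly decoded, and not seen before."""
    seen = set()
    for line in lines:
        text = line.strip()
        if len(text) < min_chars:
            continue  # drop low-information fragments
        if "\ufffd" in text:
            continue  # U+FFFD marks broken decoding
        digest = hashlib.sha1(normalize(text).encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield text
```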

One-Command Tiny Training

Run the full tiny pipeline:

py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yaml

This does:

1. scripts/prepare_sample_data.py
2. tokenizer/train_tokenizer.py
3. scripts/tokenize_data.py for train.txt
4. scripts/tokenize_data.py for valid.txt
5. scripts/train.py

This is the fastest sanity check. It trains a tiny, randomly initialized model for 20 steps and saves checkpoints.
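The five stages above could be chained roughly as follows. This is a sketch of the idea, not the contents of scripts/run_training_pipeline.py: the function names are hypothetical, and the flags are copied from the manual commands shown elsewhere in this README.

```python
import subprocess
import sys
from typing import List


def pipeline_commands(config: str, vocab_size: int = 8000) -> List[List[str]]:
    """Build the five stage commands in the order listed above."""
    py = sys.executable
    tok = "tokenizer/aibys2.json"
    return [
        [py, "scripts/prepare_sample_data.py"],
        [py, "tokenizer/train_tokenizer.py", "--input", "data/processed/train.txt",
         "--output", tok, "--vocab-size", str(vocab_size)],
        [py, "scripts/tokenize_data.py", "--tokenizer", tok,
         "--input", "data/processed/train.txt", "--output", "data/tokenized/train.bin"],
        [py, "scripts/tokenize_data.py", "--tokenizer", tok,
         "--input", "data/processed/valid.txt", "--output", "data/tokenized/valid.bin"],
        [py, "scripts/train.py", "--config", config],
    ]


def run_pipeline(config: str, dry_run: bool = True) -> None:
    """Print each stage; unless dry_run, execute it and stop on the first failure."""
    for cmd in pipeline_commands(config):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline("configs/train_tiny.yaml")` only prints the commands; pass `dry_run=False` from the repository root to execute them.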

Manual Training Pipeline

Use this when replacing the sample data with your real data.

1. Prepare Data

Generate sample data:

py -3.10 scripts/prepare_sample_data.py

For real data, put your own files into data/raw/, then build:

data/processed/train.txt
data/processed/valid.txt

Use the included simple builder:

py -3.10 scripts/build_text_dataset.py --raw-dir data/raw --output-dir data/processed --valid-ratio 0.02

This is a simple cleaner/splitter for local .txt files. It is not a web crawler.
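For intuition about what a `--valid-ratio 0.02` split involves, here is a minimal sketch of a seeded, leak-free split. It is an assumption about the general technique, not a description of how scripts/build_text_dataset.py works; the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Tuple


def split_train_valid(lines: List[str], valid_ratio: float = 0.02,
                      seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle unique lines with a fixed seed and hold out a validation slice."""
    # Drop repeats first so a validation line can never also appear in train.
    unique = list(dict.fromkeys(lines))
    random.Random(seed).shuffle(unique)
    n_valid = max(1, int(len(unique) * valid_ratio))
    return unique[n_valid:], unique[:n_valid]
```

A fixed seed keeps the validation set stable between runs, which makes validation loss comparable across experiments.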

2. Train Tokenizer

Tiny/dev tokenizer:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 8000

Mini/base tokenizer:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 64000

If tokenizers is unavailable and you only want a dependency-free smoke run:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --byte-fallback

For serious training, use the real BPE tokenizer, not byte fallback.

3. Tokenize Train And Valid

py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/train.txt --output data/tokenized/train.bin
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/valid.txt --output data/tokenized/valid.bin

The .bin files are uint16 token ID streams. This is simple and fast enough for the starter pipeline.

Important: vocab_size in the model config must cover every token ID the tokenizer can emit. In practice, set it equal to the tokenizer vocabulary: vocab_size: 8000 for an 8000-token tokenizer, vocab_size: 64000 for a 64000-token tokenizer.
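A quick way to check both points is to read a .bin file back and compare its largest token ID with the model's vocab_size. The sketch below uses only the standard library and assumes the file is a flat uint16 stream in native byte order with no header; the helper names are hypothetical.

```python
from array import array


def read_token_ids(path: str) -> array:
    """Load a flat uint16 token stream (native byte order assumed)."""
    ids = array("H")  # unsigned short: 2 bytes on common platforms
    assert ids.itemsize == 2, "expected 2-byte unsigned short"
    with open(path, "rb") as f:
        ids.frombytes(f.read())
    return ids


def check_vocab(ids: array, vocab_size: int) -> None:
    """Raise if any token ID cannot index an embedding table of vocab_size rows."""
    if vocab_size > 65536:
        raise ValueError("uint16 cannot represent IDs above 65535")
    if len(ids) and max(ids) >= vocab_size:
        raise ValueError(f"token ID {max(ids)} out of range for vocab_size={vocab_size}")
```

Note that uint16 tops out at 65535, so a 64000-token vocabulary fits, but a much larger tokenizer would need a wider storage type.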

4. Train Model From Zero

Tiny:

py -3.10 scripts/train.py --config configs/train_tiny.yaml

Mini:

py -3.10 scripts/train.py --config configs/train_mini.yaml

Training starts from random initialization. Checkpoints are saved to the configured output_dir.

Training Configs

Training configs are separate from model architecture configs.

Example: configs/train_tiny.yaml

model_config: configs/tiny.yaml
train_bin: data/tokenized/train.bin
valid_bin: data/tokenized/valid.bin
output_dir: checkpoints/tiny
batch_size: 4
seq_len: 32
max_steps: 20
learning_rate: 0.0003
weight_decay: 0.1
warmup_steps: 2
eval_every: 10
save_every: 10
log_every: 1
device: auto

Meaning:

| Key | Meaning |
| --- | --- |
| model_config | Which architecture YAML to load |
| train_bin | Tokenized training data |
| valid_bin | Tokenized validation data |
| output_dir | Where checkpoints are saved |
| batch_size | Number of samples per optimizer step |
| seq_len | Tokens per sample |
| max_steps | Number of optimizer steps |
| learning_rate | AdamW learning rate |
| weight_decay | AdamW weight decay |
| warmup_steps | Linear warmup before cosine decay |
| eval_every | Validate every N steps |
| save_every | Save checkpoint every N steps |
| log_every | Print training loss every N steps |
| device | auto, cpu, or cuda |
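The warmup_steps entry describes linear warmup followed by cosine decay. One common way to write such a schedule is sketched below. The function name `lr_at`, the `min_lr` floor, and the choice to reach the peak on the last warmup step are assumptions; the trainer in this repo may differ in those details.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, max_steps: int,
          min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))  # clamp for steps past max_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Values from configs/train_tiny.yaml as shown above.
for step in (0, 1, 2, 11, 20):
    print(step, lr_at(step, peak_lr=0.0003, warmup_steps=2, max_steps=20))
```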

Model Configs

Model size is controlled by:

configs/tiny.yaml
configs/mini.yaml
configs/base.yaml

Key parameters:

| Key | Meaning |
| --- | --- |
| vocab_size | Must match tokenizer size |
| hidden_dim | Main model width |
| num_layers | Number of Transformer blocks |
| num_heads | Attention heads |
| num_kv_heads | Reserved for GQA-style compatibility |
| mla_latent_dim | Latent KV compression size |
| ffn_dim | FFN hidden dimension |
| max_seq_len | Maximum context length |
| local_window | Sliding window size for local attention layers |
| global_every | Every Nth layer uses global attention |
| num_routed_experts | MoE routed experts |
| num_active_experts | Routed experts active per token |
| num_shared_experts | Shared experts always active |
| moe_every | Every Nth layer uses MoE |
| mtp_depth | Number of future-token heads |
| vision_image_size | Future vision input size |
| vision_max_tokens | Future max image tokens |
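To see how global_every and moe_every interact, the sketch below lists the attention and feed-forward type of each layer. The 1-based counting (so that global_every=4 makes layers 4, 8, ... global) is an assumption about the convention, and the example values are hypothetical rather than read from any config in this repo.

```python
from typing import List, Tuple


def layer_plan(num_layers: int, global_every: int, moe_every: int) -> List[Tuple[str, str]]:
    """Return (attention_kind, ffn_kind) per layer.

    Assumed convention: layers are counted from 1, so with global_every=4
    layers 4, 8, 12, ... are global and the rest use the local window.
    """
    plan = []
    for layer in range(1, num_layers + 1):
        attn = "global" if layer % global_every == 0 else "local"
        ffn = "moe" if layer % moe_every == 0 else "dense"
        plan.append((attn, ffn))
    return plan


# Hypothetical values chosen only to show the pattern.
plan = layer_plan(num_layers=8, global_every=4, moe_every=2)
for index, (attn, ffn) in enumerate(plan, start=1):
    print(f"layer {index}: {attn} attention, {ffn} FFN")
```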

How To Scale To Bigger Models

The training direction stays the same:

raw data
-> processed train/valid text
-> train tokenizer
-> tokenize to .bin
-> train from random init
-> eval
-> checkpoint

Scaling means changing config, data size, and compute.

Tiny To Mini

Use:

configs/mini.yaml
configs/train_mini.yaml

Make sure:

  • tokenizer vocab is 64000
  • configs/mini.yaml has vocab_size: 64000
  • data/tokenized/train.bin holds far more tokens than seq_len
  • GPU memory is enough for batch_size and seq_len

Command:

py -3.10 tokenizer/train_tokenizer.py --input data/processed/train.txt --output tokenizer/aibys2.json --vocab-size 64000
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/train.txt --output data/tokenized/train.bin
py -3.10 scripts/tokenize_data.py --tokenizer tokenizer/aibys2.json --input data/processed/valid.txt --output data/tokenized/valid.bin
py -3.10 scripts/train.py --config configs/train_mini.yaml

Mini To Base

Create a new train config, for example:

configs/train_base.yaml

Use:

model_config: configs/base.yaml
train_bin: data/tokenized/train.bin
valid_bin: data/tokenized/valid.bin
output_dir: checkpoints/base
batch_size: 1
seq_len: 2048
max_steps: 100000
learning_rate: 0.0003
weight_decay: 0.1
warmup_steps: 2000
eval_every: 1000
save_every: 1000
log_every: 10
device: cuda

Then run:

py -3.10 scripts/train.py --config configs/train_base.yaml
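It helps to check how many tokens a config actually processes before committing compute. The arithmetic below assumes each optimizer step consumes batch_size samples of seq_len tokens; the `grad_accum` parameter is a hypothetical stand-in for gradient accumulation, since the corresponding config key is not shown in this README.

```python
def tokens_seen(batch_size: int, seq_len: int, max_steps: int, grad_accum: int = 1) -> int:
    """Tokens consumed over a run under the stated assumptions."""
    return batch_size * seq_len * grad_accum * max_steps


tiny = tokens_seen(batch_size=4, seq_len=32, max_steps=20)
base = tokens_seen(batch_size=1, seq_len=2048, max_steps=100_000)
print(f"tiny: {tiny:,} tokens")   # 2,560
print(f"base: {base:,} tokens")   # 204,800,000
```

Under these assumptions the example base config processes about 205 million tokens, so a larger token budget means raising batch_size, gradient accumulation, or max_steps.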

Base-scale training will need more infrastructure before a serious run:

  • distributed training
  • sharded checkpoints
  • larger binary dataset loading
  • better validation/evaluation

The current starter proves the path and architecture wiring. It is not yet optimized for multi-GPU training.

Recommended Training Stages

Stage 1: Text Pretraining

Goal: base language model.

Data:

  • English web/books/science/code
  • Indonesian Wikipedia/news/docs
  • Aibys domain documents

Objective:

  • next-token prediction

Current script:

py -3.10 scripts/train.py --config configs/train_tiny.yaml

Full tiny pretraining pipeline:

py -3.10 scripts/run_training_pipeline.py --config configs/train_tiny.yaml

Stage 2: Instruction Tuning

Goal: assistant behavior.

Data:

  • English and Indonesian chat examples
  • summarization
  • document QA
  • coding/debugging
  • medical/legal/research style outputs

Status:

  • chat template exists
  • scripts/prepare_sample_sft.py can create sample chat/tool data
  • scripts/build_sft_dataset.py can convert chat JSONL into Aibys2 template text
  • a dedicated SFT trainer is not implemented yet

Tiny SFT continuation pipeline:

py -3.10 scripts/run_sft_pipeline.py --base-checkpoint checkpoints/tiny/step_20.pt

This converts chat/tool JSONL to Aibys2 template text, tokenizes it, then continues training from a base checkpoint.

Stage 3: Tool Calling

Goal: structured JSON function calls.

Data:

  • conversations with tool definitions
  • expected tool calls
  • tool results
  • final answers after tool results

Status:

  • parser/executor utilities exist
  • sample tool-call SFT data can be converted with scripts/build_sft_dataset.py
  • dedicated tool-call evaluation/training metrics are not implemented yet

Tool calling uses the same chat/SFT JSONL format with a tools field. See docs/DATASETS.md for the exact JSON example.
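As a rough picture of what a parser/executor pair does, the sketch below parses a JSON tool call and dispatches it to a Python function. The field names `name` and `arguments`, the `run_tool_call` helper, and the `add` tool are all hypothetical; the actual schema is the one documented in docs/DATASETS.md and implemented under tools/.

```python
import json
from typing import Any, Callable, Dict


def run_tool_call(raw: str, tools: Dict[str, Callable[..., Any]]) -> Any:
    """Parse a JSON tool call and dispatch it to a registered Python function."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise TypeError("tool call must be a JSON object")
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in tools:
        raise KeyError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise TypeError("arguments must be a JSON object")
    return tools[name](**args)


# Hypothetical tool and model output, for illustration only.
def add(a: float, b: float) -> float:
    return a + b


result = run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}', {"add": add})
```

Rejecting unknown tool names and non-object arguments before dispatch keeps malformed model output from reaching application code.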

Stage 4: Vision

Goal: image understanding, OCR, visual QA.

Data:

  • image-caption pairs
  • document images
  • OCR targets
  • receipts/invoices/forms

Status:

  • vision token module exists
  • scripts/build_vision_manifest.py validates image-caption manifests
  • full vision training stage is planned

Vision data uses data/vision/images/ plus data/vision/captions.jsonl. See docs/DATASETS.md.

Checkpoints

Checkpoints are saved as:

checkpoints/tiny/step_10.pt
checkpoints/tiny/step_20.pt

For mini:

checkpoints/mini/step_1000.pt
checkpoints/mini/step_2000.pt
...

Current checkpoint format:

{
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scaler": scaler.state_dict() if scaler is not None else None,
    "config": model_cfg.__dict__,
    "train_config": train_cfg,
    "step": step,
}

Resume training:

py -3.10 scripts/train.py --config configs/train_tiny.yaml --resume checkpoints/tiny/step_20.pt
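When resuming, it is convenient to pick the newest checkpoint automatically. A minimal sketch, assuming checkpoints follow the step_N.pt naming shown above; the helper `latest_checkpoint` is hypothetical and not part of the repo.

```python
import re
from pathlib import Path
from typing import Optional

_STEP_RE = re.compile(r"^step_(\d+)\.pt$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the step_N.pt file with the largest N, or None if there is none."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = _STEP_RE.match(path.name)
        # Compare step numbers as integers so step_1000 beats step_200.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The returned path can then be passed to `--resume`.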

Generate from a checkpoint:

py -3.10 scripts/generate.py --checkpoint checkpoints/tiny/step_20.pt --prompt "Aibys2 is" --max-new-tokens 32

The tiny sample checkpoint will produce low-quality/random text because it only trains for 20 steps on sample data. The command exists to verify checkpoint loading and inference wiring.

Loading A Model In Your Own App

Aibys2 is the model/checkpoint. Any interface can call it directly.

from inference.runtime import Aibys2Runtime

runtime = Aibys2Runtime("checkpoints/tiny/step_20.pt", "tokenizer/aibys2.json")
text = runtime.generate("Aibys2 is", max_new_tokens=32)
print(text)

The HTTP server is only an optional adapter. It is not an OpenAI model and does not call OpenAI.

Local HTTP Adapter

Run the server skeleton without a checkpoint:

py -3.10 scripts/serve.py

Run with your local checkpoint (the environment variables below use PowerShell syntax):

$env:AIBYS2_CHECKPOINT="checkpoints/tiny/step_20.pt"
$env:AIBYS2_TOKENIZER="tokenizer/aibys2.json"
py -3.10 scripts/serve.py

Endpoints:

GET  /v1/models
POST /v1/chat/completions
POST /v1/images/analyze

The chat endpoint loads and runs your local Aibys2 checkpoint when AIBYS2_CHECKPOINT is set.

GGUF And Ollama

Files:

export/to_gguf.py
export/Modelfile

GGUF export is a placeholder. Real export needs a trained checkpoint and a conversion mapping compatible with llama.cpp/Ollama.

Common Errors

torch is not installed

You are probably running a Python version that PyTorch does not support yet, such as the 3.15 beta. Use Python 3.10:

py -3.10 scripts/train.py --config configs/train_tiny.yaml

tokenized data is shorter than seq_len

Your .bin file is too small for the configured context length.

Fix one of these:

  • add more data
  • lower seq_len
  • use a smaller config for smoke testing
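Before launching a run, the token count of a .bin file can be estimated from its size alone. This sketch assumes a headerless uint16 stream (2 bytes per token) and that one training sample needs seq_len + 1 consecutive tokens (inputs plus shifted targets); the trainer's exact requirement may differ, and the helper names are hypothetical.

```python
import os


def num_tokens(bin_path: str, bytes_per_token: int = 2) -> int:
    """Count tokens in a flat token stream from its file size (uint16 = 2 bytes)."""
    return os.path.getsize(bin_path) // bytes_per_token


def can_sample(bin_path: str, seq_len: int) -> bool:
    """True if the file holds at least one input window plus its shifted target.

    Assumes next-token training needs seq_len + 1 consecutive tokens per sample.
    """
    return num_tokens(bin_path) >= seq_len + 1
```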

Token ID out of range

Tokenizer vocab and model vocab_size do not match.

Fix:

  • if tokenizer is 8000, set vocab_size: 8000
  • if tokenizer is 64000, set vocab_size: 64000

Training runs but loss is huge

This is normal for a random tiny model on tiny data. The sample run proves code execution, not model quality.
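A useful reference point: a model that spreads probability evenly over the vocabulary has a cross-entropy of ln(vocab_size), so a freshly initialized model usually starts near that value. The sketch assumes the reported loss is mean cross-entropy in nats; the function name is hypothetical.

```python
import math


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a model that gives every token equal probability."""
    return math.log(vocab_size)


print(f"vocab  8000: {uniform_loss(8000):.2f}")   # about 8.99
print(f"vocab 64000: {uniform_loss(64000):.2f}")  # about 11.07
```

A starting loss far above this baseline, or one that does not drop below it within the first steps, is more likely to point at a data or config problem than at normal early training.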

Current Limitations

Working:

  • tokenizer training
  • train/valid tokenization
  • raw .txt to train/valid builder
  • tiny model training from zero
  • gradient accumulation
  • mixed precision config (bf16/fp16 where supported)
  • optional MTP loss in the main train script
  • validation loss
  • checkpoint saving
  • resume checkpoint
  • checkpoint loading for generation
  • top-k, top-p, repetition penalty generation
  • SFT/chat JSONL to template text converter
  • vision manifest builder
  • local runtime class for apps
  • optional HTTP adapter that can load a local checkpoint
  • tests

Not production-ready yet:

  • distributed training
  • sharded checkpoints
  • memory-mapped or sharded large dataset loader
  • serious data dedup/filtering beyond the simple builder
  • SFT/instruction trainer
  • vision training
  • tool-calling training
  • GGUF export

Philosophy

Aibys2 should be a foundation for Indonesian builders and workflows, not a narrow language demo.

The model should learn broad capability from English and global technical data, then become useful locally through Indonesian data, document AI tasks, tool calling, OCR, domain workflows, and open reproducible scripts.

See docs/TRAINING_PIPELINE.md for additional notes.
