A from-scratch C inference engine for the hako model family. Own loader, own kernels, own tokenizer. hakm.
site · releases · models · hako-code (agent) · hako-edit (editor) · org
[Screenshots: --info (arch · params · GQA · ctx); --raw (greedy completion); one ChatML turn (honest tok/s split)]
- From scratch. Own GGUF→MLF2 loader, own Q4_K/Q6_K + int8 kernels, own byte-level BPE, own Qwen2 forward pass. No ggml, no llama.cpp, no torch.
- Three deps. libc + libm + pthread. That is the whole stack. make builds ./hakm on Linux / macOS / FreeBSD.
- Resident + fast. --serve keeps the model warm and reuses the KV prefix across turns. Batched prefill, FMA q4k, a Q6_K int8 path, threaded attention, f16 KV: ~8 tok/s decode on a CPU-only i3.
—— I ——
This repo is the engine. It runs the hako models
end to end and is spawned by hako-code
(the agent) and hako-edit (the editor)
as a hakm subprocess — one-shot --chat-stdin per turn, or a long-lived
--serve process that keeps the model resident and reuses the KV cache.
- No ggml, no llama.cpp, no torch, no ollama. The loader, the k-quant + int8 kernels, the BPE tokenizer, and the Qwen2 forward pass are all hand-written C11. libc + libm + pthread is the entire runtime.
- Own container format: MLF2. tools/gguf2mlf.py converts a Qwen2.5-Coder GGUF to a native MLF2 file once (pure stdlib): k-quant blocks copied verbatim, tokenizer embedded, arch params in a fixed 128-byte header. The runtime mmaps it, so RSS is just activations + KV cache.
- Runs the real weights. Qwen2.5-Coder 3B (hako-sho) and 7B (hako-koi) generate end to end: greedy + sampling, ChatML, multi-turn, tool-call dialect. Quant output is bit-exact vs an independent Python reimplementation.
- CPU-only, and honest about it. This is correctness-first; the goal is owning the stack on old hardware, not beating tuned GPU runtimes. Speed wins come from int8 fast paths and KV reuse, measured on a 4-core i3-8100.
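To make the "mmap the file, pay only for what you touch" idea concrete, here is a minimal sketch of a read-only file mapping on POSIX. It is not the code in src/loader.c: the mapped_file struct and the map_file / unmap_file names are hypothetical, and the sketch deliberately does not parse the MLF2 header, whose field layout is not described in this README.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical handle for illustration; not the real loader.c API. */
typedef struct {
    const unsigned char *base; /* start of the read-only mapping */
    size_t size;               /* file size in bytes */
} mapped_file;

/* Map a whole file read-only. Returns 0 on success, -1 on failure. */
static int map_file(const char *path, mapped_file *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    out->base = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Release the mapping and reset the handle. */
static void unmap_file(mapped_file *m)
{
    assert(m != NULL);
    if (m->base != NULL)
        munmap((void *)m->base, m->size);
    m->base = NULL;
    m->size = 0;
}
```

Because such a mapping is read-only and file-backed, the kernel can page weights in on demand and drop them again under memory pressure, which is the usual reason a design like this keeps private memory close to activations plus KV cache.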
—— II ——
git clone https://github.com/mithraeums/hako && cd hako
make # builds ./hakm — libc + libm + pthread, no deps
make UNIVERSAL=1 # (cross) override ARCH for a fat / portable build

Deps: libc + libm + pthread. No third-party libraries. Built with -march=native by default; pass ARCH="" for a portable generic build, or ARCH="-arch arm64 -arch x86_64" for a macOS universal binary.
Weights are never stored here — pull them from HuggingFace (huggingface.co/mithraeum) or convert your own:
# let the agent fetch + convert for you:
hako → :pull hako-sho
# or convert a Qwen2.5-Coder GGUF yourself, once:
python3 tools/gguf2mlf.py model.gguf ~/.hako/models/hako-sho/hako-sho.mlf2

| model | tier | base | license |
|---|---|---|---|
| hako-sho | mini · 3B | Qwen2.5-Coder-3B-Instruct | Qwen RESEARCH (non-commercial) |
| hako-koi | mid · 7B | Qwen2.5-Coder-7B-Instruct | Apache-2.0 |
Future tiers (14B/32B fine-tune, 50B+ max) are queued. Stock wraps carry no
version; the first real fine-tune of a tier earns v0.0.1.
M=~/.hako/models/hako-sho/hako-sho.mlf2
./hakm "$M" --info # arch banner, then exit
./hakm "$M" --raw -t 0 "def fibonacci(n):" # raw greedy completion
./hakm "$M" --sys "You are hako." "ring buffers in C" # one ChatML turn
./hakm "$M" --sys "You are hako." # interactive multi-turn REPL
./hakm --version # hakm v0.0.2

Flags: -n new-tokens · -t temp (0 = greedy) · -p top_p · -k top_k · -s seed · --sys TEXT · --ctx N · --raw · --info · --chat-stdin · --serve.
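For readers unfamiliar with what a top_p cutoff does, the following is a generic nucleus-sampling sketch. It is an illustration only and not the contents of src/sample.c: the cand struct, the sample_top_p name, and the choice to pass the random draw r in from the caller are all assumptions made to keep the example deterministic. It expects probabilities that are already normalised (that is, after temperature and softmax).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical scratch entry: a probability and the token id it belongs to. */
typedef struct {
    float p;
    int id;
} cand;

/* qsort comparator: highest probability first. */
static int by_prob_desc(const void *a, const void *b)
{
    float pa = ((const cand *)a)->p;
    float pb = ((const cand *)b)->p;
    return (pa < pb) - (pa > pb);
}

/*
 * Nucleus (top-p) pick. Keeps the smallest set of most-probable tokens
 * whose total mass reaches top_p, then samples inside that set using
 * r in [0, 1) as the random draw. scratch must hold n entries.
 */
static int sample_top_p(const float *probs, int n, float top_p, float r,
                        cand *scratch)
{
    assert(probs != NULL && scratch != NULL);
    assert(n > 0 && top_p > 0.0f && r >= 0.0f && r < 1.0f);
    for (int i = 0; i < n; i++) {
        scratch[i].p = probs[i];
        scratch[i].id = i;
    }
    qsort(scratch, (size_t)n, sizeof *scratch, by_prob_desc);

    /* Grow the kept prefix until its mass reaches top_p. */
    int kept = 0;
    float mass = 0.0f;
    while (kept < n && mass < top_p)
        mass += scratch[kept++].p;

    /* Sample inside the kept prefix, renormalised by its mass. */
    float target = r * mass;
    float acc = 0.0f;
    for (int i = 0; i < kept; i++) {
        acc += scratch[i].p;
        if (target < acc)
            return scratch[i].id;
    }
    return scratch[kept - 1].id; /* guard against rounding at the tail */
}
```

With probabilities {0.1, 0.6, 0.3} and top_p = 0.8, only tokens 1 and 2 can ever be returned; token 0 falls outside the nucleus. Greedy decoding (-t 0) is the degenerate case where the single most probable token is always taken.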
—— III ——
- Offline conversion: tools/gguf2mlf.py reads a GGUF and emits an MLF2 file, the native container. Quant blocks are copied verbatim (ggml's k-quant layout preserved), the tokenizer (vocab + merges) is embedded, and arch params sit in a fixed 128-byte header. Pure stdlib, with no torch / llama.cpp dependency.
- Runtime (C11, libc + libm + pthread):
  - loader.c: mmap the MLF2, resolve tensors by name. RSS = activations + KV.
  - quant.c: Q4_K / Q6_K dequant + int8 fast-path dot (quantize the activation to int8 once, dot straight against the k-quant weights; weights stay exact). Bit-exact vs tests/q_ref.py. AVX2-FMA + scalar paths.
  - nn.c: rmsnorm, softmax, SiLU, NeoX rope, int8 matmul, and a generic hakm_parallel worker pool (matmul rows + attention heads share it).
  - model.c: Qwen2 forward with GQA, QKV bias, KV cache (f16 when __F16C__), batched prefill, threaded attention, lm_head (tied on 3B, output.weight on 7B+).
  - bpe.c: Qwen2 byte-level BPE (GPT2 byte map + rank-greedy merges).
  - cli/main.c: the hakm CLI, covering one-shot, --raw, chat REPL, --chat-stdin, and the resident --serve loop.
- Serve loop: --serve is a resident framed-request loop. Each request's rendered conversation is prefix-matched against the KV-cached tokens; rollback is just pos = p, and only the unseen tail prefills. This is how the agent gets warm turns in seconds instead of re-prefilling the whole system prompt every turn. Optional per-request temperature (greedy on tool turns) and token cap.
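The prefix-match-and-rollback step of the serve loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in cli/main.c: the function names are hypothetical, tokens are plain int ids, and the "always recompute at least one token" rule is an assumption added here so that a fully cached request still produces fresh logits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the shared prefix between the tokens already in the KV cache
 * and the tokens of the newly rendered conversation.
 */
static size_t common_prefix(const int *cached, size_t n_cached,
                            const int *request, size_t n_request)
{
    assert(cached != NULL && request != NULL);
    size_t p = 0;
    while (p < n_cached && p < n_request && cached[p] == request[p])
        p++;
    return p;
}

/*
 * Roll the cache position back to the shared prefix and return how many
 * request tokens still need to be prefilled. *pos is the number of valid
 * cached tokens on entry and the rollback point on exit.
 */
static size_t plan_prefill(const int *cached, size_t *pos,
                           const int *request, size_t n_request)
{
    assert(pos != NULL);
    size_t p = common_prefix(cached, *pos, request, n_request);
    /* Assumption: if everything matched, redo the last token for logits. */
    if (p == n_request && p > 0)
        p--;
    *pos = p; /* rollback: cache entries past p get overwritten later */
    return n_request - p;
}
```

The rollback is cheap because nothing is erased: moving the position back is enough, since later prefill writes over the stale entries. A long system prompt that never changes therefore stays cached across turns.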
—— IV ——
Runs the real Qwen2.5-Coder 3B and 7B weights end to end (greedy + sampling,
ChatML, multi-turn REPL, --serve):
- 3B (hako-sho): qwen2 · 36 layers · d_model 2048 · 16/2 heads (GQA) · ffn 11008 · vocab 151936 · tied embeddings.
- 7B (hako-koi): qwen2 · 28 layers · d_model 3584 · 28/4 heads (GQA) · ffn 18944 · vocab 152064 · untied (output.weight).
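The head counts above determine both how query heads share KV heads and how large the KV cache gets. The sketch below works that arithmetic out from the listed numbers. The arch struct and function names are hypothetical, and it assumes head_dim = d_model / n_heads and contiguous query-head groups, which matches the figures given (128 for both models) but is not taken from the engine source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the architecture parameters listed above. */
typedef struct {
    size_t n_layers;
    size_t d_model;
    size_t n_heads;    /* query heads */
    size_t n_kv_heads; /* shared key/value heads (GQA) */
} arch;

/* Per-head width, assuming d_model splits evenly across query heads. */
static size_t head_dim(const arch *a)
{
    assert(a->n_heads > 0 && a->d_model % a->n_heads == 0);
    return a->d_model / a->n_heads;
}

/* KV head that query head q reads from, assuming contiguous groups. */
static size_t kv_head_for(const arch *a, size_t q)
{
    assert(q < a->n_heads && a->n_heads % a->n_kv_heads == 0);
    return q / (a->n_heads / a->n_kv_heads);
}

/* Bytes for K and V over all layers at a given context length. */
static size_t kv_cache_bytes(const arch *a, size_t ctx, size_t bytes_per_elem)
{
    return 2 * a->n_layers * a->n_kv_heads * head_dim(a) * ctx * bytes_per_elem;
}
```

Under these assumptions, hako-koi at ctx 4096 needs 234,881,024 bytes (224 MiB) of KV with 2-byte f16 elements versus 448 MiB with 4-byte floats, and hako-sho needs 144 MiB in f16. GQA is what keeps these numbers small: the cache scales with the 2 or 4 KV heads, not the 16 or 28 query heads.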
Quant correctness gate: C output bit-exact vs tests/q_ref.py (Q4_K + Q6_K,
maxdiff 0.0). Strict-FP build is batch-vs-single bit-identical. ASan + UBSan clean
on chat + serve paths. Build tests:
make test_quant && ./test_quant # Q4_K/Q6_K dequant vs reference

—— V ——
CPU-only, measured on a 4-core i3-8100 (the staging box itself):
- Decode ~8 tok/s, flat with depth (was 2.35 → ~3.4×), via the Q6_K int8 dot path, threaded attention, and an f16 KV cache that stops decode sagging as context grows.
- Batched prefill ~4.7×: matmul_batch streams each weight row from RAM once per batch (hot in L1 for the other B−1 dots); the FMA q4k kernel keeps 8-lane int32 partials vectorized with one hsum per row.
- Warm agent turns ~3–5s (were 5+ minutes): --serve keeps the model resident and reuses the KV prefix, so only the new tail prefills.
- 7B (hako-koi) is viable on 8GB: ~4 tok/s decode, RSS ~4.4GB.
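The loop order behind the batched-prefill win can be shown with plain floats. This is a simplified sketch, not the engine's matmul_batch: the name matmul_batch_sketch and the row-major layouts are assumptions, and it uses f32 weights where the real kernel works on k-quant blocks.

```c
#include <assert.h>
#include <stddef.h>

/*
 * out[b * rows + r] = dot(w[r, :], x[b, :]) for every batch item b.
 * The row loop is outermost, so each weight row is fetched from memory
 * once and then reused for all batch items while it is still in cache.
 */
static void matmul_batch_sketch(float *out, const float *w, const float *x,
                                size_t rows, size_t cols, size_t batch)
{
    assert(out != NULL && w != NULL && x != NULL);
    for (size_t r = 0; r < rows; r++) {
        const float *row = w + r * cols; /* streamed once per batch */
        for (size_t b = 0; b < batch; b++) {
            const float *xb = x + b * cols;
            float acc = 0.0f;
            for (size_t c = 0; c < cols; c++)
                acc += row[c] * xb[c];
            out[b * rows + r] = acc;
        }
    }
}
```

Swapping the two outer loops gives the same results but re-reads the full weight matrix once per token, which is the memory traffic that batching avoids when the weights are much larger than the cache.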
Honest framing: the int8 paths keep weights exact (greedy output is
bit-identical to the float path), and the tok/s print splits prefill from
decode rather than blending them. Decode is near the box's bandwidth/compute
floor; the next jumps are speculative decode (needs a draft model) or a
prefill GEMM micro-kernel — not more threads.
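The "quantize the activation once, then dot in integers" idea can be sketched as follows. This is a deliberately simplified illustration: the function names are hypothetical, and the weights here are plain int8 with a single scale, whereas the engine dots against Q4_K / Q6_K blocks with their own per-block scales.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Quantize x to int8 with one symmetric scale (amax maps to 127).
 * Returns the scale, so that x[i] is approximately q[i] * scale.
 */
static float quantize_i8(int8_t *q, const float *x, size_t n)
{
    assert(q != NULL && x != NULL);
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax)
            amax = a;
    }
    float scale = amax / 127.0f;
    for (size_t i = 0; i < n; i++) {
        float v = scale > 0.0f ? x[i] / scale : 0.0f;
        /* round half away from zero, then truncate into int8 range */
        q[i] = (int8_t)(v < 0.0f ? v - 0.5f : v + 0.5f);
    }
    return scale;
}

/* Integer dot product, converted back to float once at the end. */
static float dot_i8(const int8_t *w, float w_scale,
                    const int8_t *q, float q_scale, size_t n)
{
    assert(w != NULL && q != NULL);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)q[i];
    return (float)acc * w_scale * q_scale;
}
```

The activation vector is quantized once per matmul and reused for every weight row, so the per-row work is integer multiply-adds plus a single float rescale. That inner loop is the part SIMD kernels vectorize.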
—— VI ——
- --serve: resident framed-request loop with KV prefix reuse; warm agent turns in seconds instead of re-prefilling every turn.
- Batched prefill: matmul_batch + hakm_forward_batch (causal within batch, lm_head only for the last token). HAKM_BATCH knob.
- Q6_K int8 dot path: decode 2.88 → 8.40 tok/s; Q6_K was riding the float dequant path and dominating decode.
- Threaded attention + f16 KV cache: decode flat with depth; koi KV at ctx 4096 halved.
- FMA q4k kernel: one hsum per row; prefill + decode both up.
- Per-request temperature + stop sequences in --serve (greedy tool turns, break on </tool_call> / </write_file>).
- Honest tok/s: every one-shot path now prints prefill and decode rates separately (the blended figure hid a slow decode behind a fast prefill).
- --version / -v → hakm v0.0.2.
- First native runner: MLF2 loader, Q4_K/Q6_K dequant, Qwen2 GQA forward, NeoX rope, byte-level BPE, int8-activation × int4-weight fast dot, pthread matmul. Runs 3B + 7B end to end, no llama.cpp / no ollama.
—— VII ——
- Windows: the loader uses POSIX mmap; MinGW lacks sys/mman.h. An #ifdef _WIN32 mmap shim is the unlock. Linux / macOS / FreeBSD build today.
- Tokenizer pretokenizer is simplified (whitespace-delimited; merges, symbols, special tokens correct). It needs the full Qwen2 regex for digit/punctuation-run boundaries to exactly match reference tokenization.
- Decode ceiling: near the i3's bandwidth/compute floor. Real breaks = speculative decode (sho as a draft model for koi, which needs the training lane) or a smaller quant tier.
- Prefill: a q4k/q6k GEMM micro-kernel (hoist nibble-unpack across the batch) is the next pure-engine lever; NEON q6k path for arm64.

- MLF2 loader + own k-quant dequant (bit-exact)
- Qwen2 GQA forward, NeoX rope, byte-level BPE
- int8-activation × int4-weight fast dot (AVX2-FMA + scalar)
- --serve resident loop + KV prefix reuse
- batched prefill, Q6_K int8 path, threaded attention, f16 KV
- runs 3B (hako-sho) + 7B (hako-koi) end to end
- Windows mmap shim
- full Qwen2 pretokenizer regex
- NEON q6k path (arm64) + prefill GEMM micro-kernel
- speculative decode (draft model from the training lane)
—— VIII ——
src/loader.c mmap MLF2, resolve tensors
src/quant.c Q4_K / Q6_K dequant + int8 fast dot (bit-exact)
src/nn.c rmsnorm, softmax, SiLU, NeoX rope, matmul, worker pool
src/model.c Qwen2 forward — GQA, KV cache, batched prefill, threaded attn
src/bpe.c Qwen2 byte-level BPE (encode + decode)
src/sample.c temp / top-k / top-p / greedy
cli/main.c the hakm CLI — one-shot / raw / REPL / --chat-stdin / --serve
tools/gguf2mlf.py offline GGUF → MLF2 converter (pure stdlib)
include/mlf2.h the native container format
If you share the belief that simplicity empowers creativity, feel free to contribute.
- Forking this repo
- Submitting a Pull Request
- Bug reports and feature requests
Please ensure your code follows the existing style — C11, tabs, snake_case,
ASan + UBSan clean, quant changes proven bit-exact against tests/q_ref.py.
- Engine code: GPL-3.0 (LICENSE). From-scratch C; no ggml/llama.cpp/torch.
- Model weights keep their required upstream licenses (on HuggingFace, not here): hako-koi (7B) Apache-2.0; hako-sho (3B) Qwen RESEARCH, non-commercial only (the 3B base is research-licensed). See each model repo's LICENSE / NOTICE.
— SEE LICENSE — · GPL-3.0
— deus sol invictus mithras —



