Encoder vectorization + CLI flags for training on larger corpora by modulovalue · Pull Request #2 · Encrux/simple_dlm

modulovalue · 2026-04-30T15:30:59Z

Two commits that make it practical to point this codebase at corpora bigger and more diverse than Shakespeare. Tested locally on a 500 MB Wikipedia dump (522M chars, 7003-char vocab); steady at 27 it/s on M-series MPS using the same model.

Changes

1. Vectorize Encoder vocab build and bulk encode with numpy

The original Encoder ran two pure-Python passes: a per-character dict-membership check to build the vocab, and a per-character lookup to encode. Fine on Shakespeare, but on a 500 MB wiki dump those Python loops dominate startup (estimated minutes each). New code:

- set(text) for vocab (one C pass)
- encode_array(text): encodes text to UTF-32-LE, views as uint32 codepoints, gathers through a precomputed lookup table -- returns np.ndarray[int32]
- caches the inverse dict so decode() doesn't rebuild it on every call
- chunked bulk encode (4M chars/block) so peak memory stays bounded

encode, decode, vocab are unchanged. encode_array is additive.
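For readers who want to see the shape of the technique, here is a minimal sketch of the UTF-32-LE lookup-table approach described above. It is not the code from this PR: `build_lookup`, the `table` argument, and `chunk_chars` are names chosen for illustration, and the sketch assumes non-empty text whose characters all appear in the vocab.

```python
import numpy as np


def build_lookup(text: str) -> tuple[list[str], np.ndarray]:
    """Return the sorted vocab and a codepoint -> token-id lookup table."""
    vocab = sorted(set(text))  # set() scans the text once in C
    # The table spans every codepoint up to the largest one seen.
    # -1 marks codepoints that are not in the vocab.
    table = np.full(ord(vocab[-1]) + 1, -1, dtype=np.int32)
    table[[ord(c) for c in vocab]] = np.arange(len(vocab), dtype=np.int32)
    return vocab, table


def encode_array(text: str, table: np.ndarray, chunk_chars: int = 4_000_000) -> np.ndarray:
    """Encode text to int32 token ids, one chunk at a time."""
    out = np.empty(len(text), dtype=np.int32)
    for start in range(0, len(text), chunk_chars):
        piece = text[start:start + chunk_chars]
        # UTF-32-LE is fixed width: 4 bytes per codepoint and no BOM,
        # so the byte buffer can be viewed directly as uint32 codepoints.
        codepoints = np.frombuffer(piece.encode("utf-32-le"), dtype="<u4")
        # Gather: one vectorized table lookup per chunk.
        out[start:start + len(piece)] = table[codepoints]
    return out


text = "héllo wörld\n"
vocab, table = build_lookup(text)
ids = encode_array(text, table, chunk_chars=5)  # tiny chunk size to exercise the loop
print("".join(vocab[i] for i in ids) == text)
```

Chunking only bounds the transient UTF-32 byte buffer; the output array is still allocated in full.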

Measured on a 500 MB wiki dump (522M characters):

- vocab build: ~2.6 s (was estimated at minutes)
- encode_array: ~0.8 s (was estimated at minutes)

2. Make dataset path configurable and add training CLI flags

- Transformer(data_path=...) constructor argument; threaded through train.py, sample.py, export_onnx.py via a --data flag.
- --batch-size and --seq-len flags so hyperparameters can be tuned without editing source.
- --resume <checkpoint.pt> flag that loads a saved state_dict before training -- useful for picking up long runs after a crash. Only model weights are restored; optimizer state and step counter are not.
- Use the new encoder.encode_array() and store the corpus as int32 on device. Vocab fits easily in 32 bits, so int64 was wasting 50% of corpus memory. On a 500 MB corpus this saves ~2 GB of device memory.
- Read the corpus with f.read() instead of "\n".join(f.readlines()) -- the old form silently doubled every newline (no vocab change, since the encoder was building from the same join'd text).
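The newline doubling in the last item is easy to reproduce in isolation. The snippet below uses an in-memory file with made-up contents rather than the real corpus. Strictly, the join only inserts separators between lines, so the final line's newline is not doubled.

```python
import io

raw = "first line\nsecond line\n"

# readlines() keeps each line's trailing "\n", so joining on "\n" adds a second one.
old = "\n".join(io.StringIO(raw).readlines())
new = io.StringIO(raw).read()

print(repr(old))  # 'first line\n\nsecond line\n'
print(repr(new))  # 'first line\nsecond line\n'
print(set(old) == set(new))  # True: same characters, so the vocab is unaffected
```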

Defaults are unchanged, so uv run train --device mps behaves exactly the same as before. Sanity-checked on Shakespeare: same it/s, matching loss curve.
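As a rough picture of the flag surface, the sketch below wires the same flag names into `argparse`. Whether the repo uses `argparse` at all is an assumption, and so are the default values for `--device`, `--batch-size`, and `--seq-len`; only the `data/input.txt` default comes from the commit message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="train")
    parser.add_argument("--device", default="cpu")  # placeholder default
    parser.add_argument("--data", default="data/input.txt",
                        help="path to the training corpus")
    parser.add_argument("--batch-size", type=int, default=64)  # placeholder default
    parser.add_argument("--seq-len", type=int, default=256)  # placeholder default
    parser.add_argument("--resume", default=None, metavar="CHECKPOINT",
                        help="load model weights from this state_dict before training")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--device", "mps", "--data", "wiki.txt", "--seq-len", "512"]
)
print(args.data, args.batch_size, args.seq_len, args.resume)  # wiki.txt 64 512 None
```

Flags that are not passed keep their defaults, which is what lets a bare `uv run train --device mps` behave as before.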

Test plan

- uv run train --device mps (defaults to Shakespeare) -- identical it/s and loss curve
- uv run train --device mps --data <other.txt> -- runs cleanly on multilingual corpora
- Full 500 MB wiki dump trains at ~27 it/s on M-series MPS
- --resume checkpoints/checkpoint.pt loads a previous state_dict and continues training

The original Encoder had two pure-Python passes: it built the vocab dict by iterating every character and dispatching a dict membership test, and its encode() ran a per-character dict lookup. Both are O(N) in Python. On Shakespeare (~1M chars) that is fine. On larger corpora (e.g. a 500 MB wiki dump, ~520M chars) those Python loops dominate startup.

This commit:

- Builds the vocabulary as `sorted(set(text))` (one pass in C).
- Adds `encode_array(text)` which converts text to UTF-32-LE bytes via the C codec, views the buffer as `uint32` codepoints, and gathers through a precomputed lookup table indexed by ord(c). Output is a `np.ndarray[int32]` ready to be moved to a torch tensor.
- Caches the inverse dict so decode() does not rebuild it on every call.
- Bulk encoding is chunked (default 4M chars/block) so peak transient memory stays bounded for very large corpora.

Existing public API is preserved: `encode`, `decode`, and `vocab` behave the same. `encode_array` is additive.

Measured on a 500 MB wiki dump (522M characters):

- vocab build: ~2.6 s (was estimated at minutes)
- encode_array: ~0.8 s (was estimated at minutes)
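The cached inverse mapping can be pictured with a small stand-in class. `SketchEncoder` and its private attribute names are hypothetical, and treating `vocab` as a sorted list is an assumption; the real Encoder may store it differently.

```python
class SketchEncoder:
    """Illustrative stand-in for the Encoder, not the repo's class."""

    def __init__(self, text: str) -> None:
        self.vocab = sorted(set(text))
        self._stoi = {ch: i for i, ch in enumerate(self.vocab)}
        # Built once here, so decode() never has to invert the mapping again.
        self._itos = {i: ch for ch, i in self._stoi.items()}

    def encode(self, text: str) -> list[int]:
        return [self._stoi[ch] for ch in text]

    def decode(self, ids) -> str:
        # int() lets the ids come from a list or a numpy array alike.
        return "".join(self._itos[int(i)] for i in ids)


enc = SketchEncoder("to be or not to be")
ids = enc.encode("to be")
print(enc.decode(ids))  # to be
```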

Adds a small set of CLI knobs needed to point training at a different corpus and to recover from interruptions, plus a few correctness/perf tweaks that come along for the ride:

- `Transformer(data_path=...)` constructor argument; previously the path was hardcoded to "data/input.txt". Threaded through train.py, sample.py, and export_onnx.py via a `--data` flag (default unchanged).
- `--batch-size` and `--seq-len` flags so hyperparameters can be tuned without editing source.
- `--resume <checkpoint.pt>` flag that loads a saved state_dict before training. Useful for picking up a long run after a crash, machine reboot, or any other interruption. Only the model weights are restored; the optimizer state and step counter are not.
- Use the new `encoder.encode_array()` and store the corpus as `int32` on device. The vocabulary easily fits in 32 bits (this PR's wiki sample has ~720 chars, the full HF wiki dump has ~7000), so int64 was wasting 50% of corpus memory. On a 500 MB wiki corpus this saves ~2 GB of device memory.
- Read the corpus with `f.read()` instead of `"\n".join(f.readlines())`. The old form silently doubled every newline. No vocab change, the encoder was building from the same join'd text.

Sanity-checked: training on tiny Shakespeare with default flags gives the same it/s and matching loss curve as before.
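The ~2 GB figure follows directly from the token count. A quick check, assuming one token per character and taking the 522M character count from the measurements above:

```python
num_tokens = 522_000_000  # characters in the 500 MB wiki dump

int64_bytes = num_tokens * 8  # 8 bytes per token id
int32_bytes = num_tokens * 4  # 4 bytes per token id
saved_bytes = int64_bytes - int32_bytes

print(f"int64 corpus: {int64_bytes / 1e9:.2f} GB")  # 4.18 GB
print(f"int32 corpus: {int32_bytes / 1e9:.2f} GB")  # 2.09 GB
print(f"saved:        {saved_bytes / 1e9:.2f} GB")  # 2.09 GB
```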

Encrux · 2026-05-04T16:48:23Z

Looks good.

One thing: If you want to resume from a previous checkpoint, you also want to make sure you're resuming the optimizer state imo.
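One possible shape for that, sketched with plain PyTorch calls. The checkpoint layout (a dict with `model`, `optimizer`, and `step` keys), the helper names, and the use of AdamW are all assumptions for illustration; the PR as submitted saves and loads a bare state_dict.

```python
import os
import tempfile

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, step):
    # Hypothetical layout: one dict holding everything needed to resume.
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )


def load_checkpoint(path, model, optimizer):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])  # restores the optimizer's running statistics
    return ckpt["step"]


# Tiny stand-in model; the real Transformer would take its place.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
save_checkpoint(path, model, optimizer, step=1)

# A fresh model and optimizer, as after a crash or restart.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = torch.optim.AdamW(resumed_model.parameters(), lr=1e-3)
start_step = load_checkpoint(path, resumed_model, resumed_optimizer)
print(start_step)
```

A checkpoint saved this way is not interchangeable with a bare state_dict, so `--resume` would need to handle whichever format it is given.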

modulovalue added 2 commits April 30, 2026 17:28

modulovalue force-pushed the wiki-support branch from 681053b to 84b8639 Compare April 30, 2026 15:36

modulovalue changed the title ~~Train on larger / multilingual corpora (e.g. Wikipedia)~~ Encoder vectorization + CLI flags for training on larger corpora Apr 30, 2026

modulovalue wants to merge 2 commits into Encrux:master from modulovalue:wiki-support
