Encoder vectorization + CLI flags for training on larger corpora #2

Open: modulovalue wants to merge 2 commits into Encrux:master from modulovalue:wiki-support
modulovalue (Contributor) commented Apr 30, 2026

Two commits that make it practical to point this codebase at corpora bigger and more diverse than Shakespeare. Tested locally on a 500 MB Wikipedia dump (522M chars, 7003-char vocab); training holds steady at ~27 it/s on M-series MPS with the model unchanged.

Changes

1. Vectorize Encoder vocab build and bulk encode with numpy

The original Encoder ran two pure-Python passes: a per-character dict-membership check to build the vocab, and a per-character lookup to encode. Fine on Shakespeare, but on a 500 MB wiki dump those Python loops dominate startup (estimated minutes each). New code:

  • sorted(set(text)) for vocab (one pass in C)
  • encode_array(text): encodes text to UTF-32-LE, views as uint32 codepoints, gathers through a precomputed lookup table -- returns np.ndarray[int32]
  • caches the inverse dict so decode() doesn't rebuild it on every call
  • chunked bulk encode (4M chars/block) so peak memory stays bounded

encode, decode, vocab are unchanged. encode_array is additive.
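
The approach in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class name `SketchEncoder`, the `chunk_chars` parameter, and the use of -1 as a filler for unused table slots are assumptions made for the example.

```python
import numpy as np


class SketchEncoder:
    """Character-level encoder; a hypothetical stand-in for the PR's Encoder."""

    def __init__(self, text: str, chunk_chars: int = 4_000_000):
        # set() deduplicates in C; sorting gives each character a stable id.
        self.vocab = sorted(set(text))
        self.chunk_chars = chunk_chars
        # Lookup table indexed by codepoint. Slots for characters outside
        # the vocab keep the filler value -1 (an assumption of this sketch).
        max_cp = ord(self.vocab[-1]) if self.vocab else 0
        self._table = np.full(max_cp + 1, -1, dtype=np.int32)
        codepoints = np.array([ord(c) for c in self.vocab], dtype=np.int64)
        self._table[codepoints] = np.arange(len(self.vocab), dtype=np.int32)
        # Inverse mapping built once, so decode() never rebuilds it.
        self._itos = dict(enumerate(self.vocab))

    def encode_array(self, text: str) -> np.ndarray:
        out = np.empty(len(text), dtype=np.int32)
        # Work in blocks so the temporary UTF-32 buffer stays bounded.
        for start in range(0, len(text), self.chunk_chars):
            chunk = text[start:start + self.chunk_chars]
            # UTF-32-LE uses exactly 4 bytes per codepoint and no BOM,
            # so the bytes can be viewed directly as uint32 codepoints.
            cps = np.frombuffer(chunk.encode("utf-32-le"), dtype="<u4")
            out[start:start + len(chunk)] = self._table[cps]
        return out

    def decode(self, ids) -> str:
        return "".join(self._itos[int(i)] for i in ids)
```

Note that this sketch only handles text whose characters were seen when the vocab was built; a codepoint above the largest vocab character would raise an `IndexError` in the table lookup.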

Measured on a 500 MB wiki dump (522M characters):

  • vocab build: ~2.6 s (was estimated at minutes)
  • encode_array: ~0.8 s (was estimated at minutes)

2. Make dataset path configurable and add training CLI flags

  • Transformer(data_path=...) constructor argument; threaded through train.py, sample.py, export_onnx.py via a --data flag.
  • --batch-size and --seq-len flags so hyperparameters can be tuned without editing source.
  • --resume <checkpoint.pt> flag that loads a saved state_dict before training -- useful for picking up long runs after a crash. Only model weights are restored; optimizer state and step counter are not.
  • Use the new encoder.encode_array() and store the corpus as int32 on device. Vocab fits easily in 32 bits, so int64 was wasting 50% of corpus memory. On a 500 MB corpus this saves ~2 GB of device memory.
  • Read the corpus with f.read() instead of "\n".join(f.readlines()) -- the old form silently doubled every newline (no vocab change, since the encoder was building from the same join'd text).
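
The newline doubling mentioned in the last bullet is easy to reproduce with an in-memory file. The sample string below is made up for the illustration; `readlines()` keeps each line's trailing newline, so joining with "\n" inserts a second one between lines (a final trailing newline is not doubled, since `join` only adds separators between items).

```python
import io

raw = "first line\nsecond line\n"

# Old form: each element already ends in "\n", and join adds another.
joined = "\n".join(io.StringIO(raw).readlines())

# New form: the text is returned exactly as stored.
whole = io.StringIO(raw).read()

print(repr(joined))  # 'first line\n\nsecond line\n'
print(repr(whole))   # 'first line\nsecond line\n'
```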

Defaults are unchanged, so uv run train --device mps behaves exactly the same as before. Sanity-checked on Shakespeare: same it/s, matching loss curve.
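
For readers who have not opened the diff, the flag surface described above could look roughly like this with `argparse`. The flag names and the `data/input.txt` default come from this PR; the parser structure, the `--device` default, and the numeric defaults for `--batch-size` and `--seq-len` are placeholders invented for the sketch, not the repository's real values.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="train")
    p.add_argument("--device", default="cpu")  # assumed default
    # The previously hardcoded corpus path stays the default.
    p.add_argument("--data", default="data/input.txt")
    p.add_argument("--batch-size", type=int, default=64)  # placeholder value
    p.add_argument("--seq-len", type=int, default=256)    # placeholder value
    # Optional path to a checkpoint to load before training starts.
    p.add_argument("--resume", default=None, metavar="CHECKPOINT")
    return p


# Flags that are not passed fall back to their defaults.
args = build_parser().parse_args(
    ["--device", "mps", "--data", "wiki.txt", "--batch-size", "32"]
)
print(args.data, args.batch_size, args.seq_len, args.resume)
```

Keeping every new flag optional with the old value as its default is what lets an invocation without flags behave as before.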

Test plan

  • uv run train --device mps (defaults to Shakespeare) -- identical it/s and loss curve
  • uv run train --device mps --data <other.txt> -- runs cleanly on multilingual corpora
  • Full 500 MB wiki dump trains at ~27 it/s on M-series MPS
  • --resume checkpoints/checkpoint.pt loads a previous state_dict and continues training

The original Encoder had two pure-Python passes: it built the vocab dict
by iterating every character and dispatching a dict membership test, and
its encode() ran a per-character dict lookup. Both are O(N) in Python.
On Shakespeare (~1M chars) that is fine. On larger corpora (e.g. a
500MB wiki dump, ~520M chars) those Python loops dominate startup.

This commit:
- Builds the vocabulary as `sorted(set(text))` (one pass in C).
- Adds `encode_array(text)` which converts text to UTF-32-LE bytes via
  the C codec, views the buffer as `uint32` codepoints, and gathers
  through a precomputed lookup table indexed by ord(c). Output is a
  `np.ndarray[int32]` ready to be moved to a torch tensor.
- Caches the inverse dict so decode() does not rebuild it on every call.
- Bulk encoding is chunked (default 4M chars/block) so peak transient
  memory stays bounded for very large corpora.

Existing public API is preserved: `encode`, `decode`, and `vocab` behave
the same. `encode_array` is additive.

Measured on a 500 MB wiki dump (522M characters):
  vocab build: ~2.6 s   (was estimated at minutes)
  encode_array: ~0.8 s  (was estimated at minutes)

Adds a small set of CLI knobs needed to point training at a different
corpus and to recover from interruptions, plus a few correctness/perf
tweaks that come along for the ride:

- `Transformer(data_path=...)` constructor argument; previously the
  path was hardcoded to "data/input.txt". Threaded through train.py,
  sample.py, and export_onnx.py via a `--data` flag (default unchanged).
- `--batch-size` and `--seq-len` flags so hyperparameters can be tuned
  without editing source.
- `--resume <checkpoint.pt>` flag that loads a saved state_dict before
  training. Useful for picking up a long run after a crash, machine
  reboot, or any other interruption. Only the model weights are
  restored; the optimizer state and step counter are not.
- Use the new `encoder.encode_array()` and store the corpus as `int32`
  on device. The vocabulary easily fits in 32 bits (this PR's wiki
  sample has ~720 chars, the full HF wiki dump has ~7000), so int64
  was wasting 50% of corpus memory. On a 500MB wiki corpus this saves
  ~2 GB of device memory.
- Read the corpus with `f.read()` instead of `"\n".join(f.readlines())`.
  The old form silently doubled every newline. No vocab change, since
  the encoder was building from the same join'd text.
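
The ~2 GB figure in the int32 bullet can be checked with a few lines of arithmetic, assuming one token id per character and the 522M-character corpus size quoted in this PR:

```python
import numpy as np

n_chars = 522_000_000  # corpus size quoted in this PR

bytes_int64 = n_chars * np.dtype(np.int64).itemsize  # 8 bytes per id
bytes_int32 = n_chars * np.dtype(np.int32).itemsize  # 4 bytes per id
saved_gb = (bytes_int64 - bytes_int32) / 1e9

print(f"int64: {bytes_int64 / 1e9:.2f} GB")
print(f"int32: {bytes_int32 / 1e9:.2f} GB")
print(f"saved: {saved_gb:.2f} GB")

# A ~7000-entry vocabulary is far below the int32 maximum.
assert 7003 <= np.iinfo(np.int32).max
```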

Sanity-checked: training on tiny Shakespeare with default flags gives
the same it/s and matching loss curve as before.
modulovalue changed the title from "Train on larger / multilingual corpora (e.g. Wikipedia)" to "Encoder vectorization + CLI flags for training on larger corpora" on Apr 30, 2026

Encrux (Owner) commented May 4, 2026

Looks good.

One thing: If you want to resume from a previous checkpoint, you also want to make sure you're resuming the optimizer state imo.
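
One way this suggestion could be implemented is sketched below. It is only an illustration: the helper names and the `"model"` / `"optimizer"` / `"step"` dictionary keys are hypothetical, and the checkpoints this PR loads are described as a bare state_dict, so an actual change would also need to handle files in that older format.

```python
import torch
from torch import nn


def save_checkpoint(path, model, optimizer, step):
    # Bundle everything needed to continue the run, not just the weights.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        path,
    )


def load_checkpoint(path, model, optimizer, device="cpu"):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    # Restores per-parameter optimizer state (e.g. Adam moment estimates).
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```

Without the optimizer state, an Adam-style optimizer restarts with zeroed moment estimates, so the first steps after resuming can differ from what an uninterrupted run would have done.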
