A standalone C++ tokenizer project for reusable shell-model tokenization.
This repository provides:
- Parquet-to-text corpus ingestion in C++
- shard and supplemental corpus handling
- reusable corpus scan/cache indexing
- shared SentencePiece BPE tokenizer training in C++
- tokenizer runtime encode/decode/inspect support
- manifests, reports, and debugging artifacts
This project is meant to be used across multiple repositories so every consumer shares the same tokenizer contract.
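As a rough sketch of what that shared contract looks like from a consumer's side, assuming the consumer links against the SentencePiece C++ library directly and loads the trained model artifact (the repository's own wrapper headers are not shown here, and the model path is illustrative):

```cpp
#include <iostream>
#include <string>
#include <vector>

#include <sentencepiece_processor.h>  // provided by the sentencepiece package

int main() {
    sentencepiece::SentencePieceProcessor sp;

    // Every consumer repository loads the same trained model artifact,
    // so token ids stay stable across projects.
    const auto status =
        sp.Load("C:\\Tokenizer\\manifests\\tokenizer\\shared_tokenizer.model");
    if (!status.ok()) {
        std::cerr << "failed to load model: " << status.ToString() << '\n';
        return 1;
    }

    // Encode text to ids, then decode back; the round trip should
    // reproduce the input for text the model covers.
    std::vector<int> ids;
    sp.Encode("shared tokenizer contract", &ids);

    std::string decoded;
    sp.Decode(ids, &decoded);
    std::cout << ids.size() << " tokens -> " << decoded << '\n';
    return 0;
}
```

Because the round trip runs through the one shared model, any drift in vocabulary or normalization shows up immediately as a decode mismatch rather than as silent divergence between repositories.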
For the full reference, read:
Current project scope:
- real Arrow/Parquet ingestion
- resumable shard export
- supplemental corpus inclusion
- cached corpus indexing
- shared tokenizer training
- report and sample generation
Not yet included:
- corpus sanitation or dedup pipeline
- published license selection
Build and quick start:

```powershell
# Install dependencies through the vcpkg manifest.
# Paths containing spaces must be quoted and invoked with the call operator (&).
& "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\vcpkg\vcpkg.exe" install --x-manifest-root=C:\Tokenizer --triplet x64-windows

# Configure and build with the Visual Studio CMake
& "C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -S C:\Tokenizer -B C:\Tokenizer\build -G "Visual Studio 17 2022" -DCMAKE_TOOLCHAIN_FILE="C:\Program Files\Microsoft Visual Studio\2022\Community\VC\vcpkg\scripts\buildsystems\vcpkg.cmake" -DCMAKE_PREFIX_PATH="C:\Tokenizer\vcpkg_installed\x64-windows"
& "C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" --build C:\Tokenizer\build --config Debug

# Put the Debug tools and runtime DLLs on PATH for this session
$env:PATH = "C:\Tokenizer\build\Debug;C:\Tokenizer\vcpkg_installed\x64-windows\debug\bin;C:\Tokenizer\vcpkg_installed\x64-windows\bin;" + $env:PATH

# Ingest Parquet corpora, scan the corpus, train, then inspect the result
C:\Tokenizer\build\Debug\tokenizer_parquet_ingest_tool.exe --parquet-root C:\Datasets --output-root C:\Tokenizer
C:\Tokenizer\build\Debug\tokenizer_train_tool.exe --source-repo-root C:\SourceRepo --dataset-root C:\Datasets --output-root C:\Tokenizer --corpus-root C:\CorpusShards --scan-only
C:\Tokenizer\build\Debug\tokenizer_train_tool.exe --source-repo-root C:\SourceRepo --dataset-root C:\Datasets --output-root C:\Tokenizer --corpus-root C:\CorpusShards
C:\Tokenizer\build\Debug\tokenizer_inspect_tool.exe --model-path C:\Tokenizer\manifests\tokenizer\shared_tokenizer.model --text "The available inputs do not provide enough evidence, so the shell should preserve uncertainty."
```
Running the tests:

```powershell
# Rebuild, refresh PATH, then run the test suite
& "C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" --build C:\Tokenizer\build --config Debug
$env:PATH = "C:\Tokenizer\build\Debug;C:\Tokenizer\vcpkg_installed\x64-windows\debug\bin;C:\Tokenizer\vcpkg_installed\x64-windows\bin;" + $env:PATH
& "C:\Program Files\Microsoft Visual Studio\2022\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\ctest.exe" --test-dir C:\Tokenizer\build -C Debug --output-on-failure
```
Tool usage examples:

```powershell
# Parquet ingest with explicit batch and shard sizes, then with defaults
C:\Tokenizer\build\Debug\tokenizer_parquet_ingest_tool.exe --parquet-root C:\Datasets --output-root C:\Tokenizer --batch-size 65536 --shard-size-mb 256
C:\Tokenizer\build\Debug\tokenizer_parquet_ingest_tool.exe --parquet-root C:\Datasets --output-root C:\Tokenizer

# Scan-only pass: index the corpus without training
C:\Tokenizer\build\Debug\tokenizer_train_tool.exe --source-repo-root C:\SourceRepo --dataset-root C:\Datasets --output-root C:\Tokenizer --corpus-root C:\CorpusShards --scan-only

# Full training run
C:\Tokenizer\build\Debug\tokenizer_train_tool.exe --source-repo-root C:\SourceRepo --dataset-root C:\Datasets --output-root C:\Tokenizer --corpus-root C:\CorpusShards

# Ignore the cached corpus index and rescan from scratch
C:\Tokenizer\build\Debug\tokenizer_train_tool.exe --source-repo-root C:\SourceRepo --dataset-root C:\Datasets --output-root C:\Tokenizer --corpus-root C:\CorpusShards --force-rescan

# Emit verbose trainer output
C:\Tokenizer\build\Debug\tokenizer_train_tool.exe --source-repo-root C:\SourceRepo --dataset-root C:\Datasets --output-root C:\Tokenizer --corpus-root C:\CorpusShards --verbose-trainer

# Inspect a trained model against sample text
C:\Tokenizer\build\Debug\tokenizer_inspect_tool.exe --model-path C:\Tokenizer\manifests\tokenizer\shared_tokenizer.model --text "Source public://pmc/PMC4457059 reports that the claim is tied to the described study."

# Read the generated reports; follow the ingest and training logs live
Get-Content C:\Tokenizer\reports\tokenizer\tokenizer_report.md
Get-Content C:\Tokenizer\reports\tokenizer\tokenizer_samples.md
Get-Content C:\Tokenizer\reports\tokenizer\parquet_ingest_log.md -Tail 50 -Wait
Get-Content C:\Tokenizer\reports\tokenizer\tokenizer_training_log.md -Tail 50 -Wait
```
Repository layout:
- include/: public headers
- src/: implementation and CLI tools
- tests/: local fixture and integration coverage
- docs/: detailed documentation
- exports/: generated corpus shards
- manifests/: progress files, manifests, config, model outputs
- reports/: human-readable logs and Markdown reports
Training defaults:
- dataset root: C:\Datasets
- corpus root: C:\CorpusShards
- output root: C:\Tokenizer
- default model path: C:\Tokenizer\manifests\tokenizer\shared_tokenizer.model

Parquet ingest defaults:
- parquet root: C:\Datasets
- output root: C:\Tokenizer
Detailed docs:
Additional repository guidance:
This repository is close to being publishable as a standalone project, but one major non-technical choice is still intentionally unresolved:
- no open-source license has been selected yet
That choice should be made explicitly by the project owner rather than guessed in code or docs.