A native C implementation of KittenTTS — a lightweight, CPU-optimized text-to-speech engine built on ONNX Runtime. Produces 24 kHz float32 audio from text using pre-downloaded model files.
This library is a C port of the official Python implementation at KittenML/KittenTTS.
| Dependency | Version | Purpose |
|---|---|---|
| CMake | ≥ 3.18 | Build system |
| ONNX Runtime | ≥ 1.16 | Neural inference |
| libsndfile | any | WAV file output |
| libpcre2-8 | any | Text normalization regex |
| espeak-ng | any | Phonemization (via subprocess) |
All except ONNX Runtime are found automatically via pkg-config. ONNX Runtime must be pointed to explicitly (see below).
# Download ONNX Runtime for your platform from:
# https://github.com/microsoft/onnxruntime/releases
# Or install via package manager (e.g. Homebrew on macOS):
# brew install onnxruntime
cd c/
mkdir build && cd build
cmake .. -DONNXRUNTIME_ROOT=/path/to/onnxruntime
cmake --build . --parallel

This produces:
- libkittentts.dylib / libkittentts.so: shared library
- libkittentts.a: static library
- kittentts-cli: command-line tool
To install system-wide:
cmake --install . --prefix /usr/local

Models are distributed via Hugging Face Hub. Download them manually or use the Python package to cache them:
# Using Python (one-time download):
pip install kittentts
python -c "from kittentts import KittenTTS; KittenTTS('KittenML/kitten-tts-nano-0.8')"

After download, locate the files:
~/.cache/huggingface/hub/models--KittenML--kitten-tts-nano-0.8/snapshots/<hash>/
kitten_tts_nano_v0_8.onnx ← model file
voices.npz ← voice embeddings
Available model variants:
| Model | Size | Parameters |
|---|---|---|
| kitten-tts-nano-0.8 | ~15 MB | 15M |
| kitten-tts-nano-int8-0.8 | ~25 MB | 15M quantized |
| kitten-tts-micro-0.8 | ~40 MB | 40M |
| kitten-tts-mini-0.8 | ~80 MB | 80M |
kittentts-cli [options] "Text to speak"
Options:
--model PATH Path to the .onnx model file (required)
--voices PATH Path to the voices .npz file (required)
--output PATH Output WAV file (default: output.wav)
--voice NAME Voice name (default: expr-voice-5-m)
--speed FLOAT Speech speed multiplier (default: 1.0)
--backend NAME Execution backend: cpu|cuda|amd_gpu (default: auto)
--no-clean Disable text normalization
--list-voices List available voices and exit
--help Show this help
Basic usage:
kittentts-cli \
--model nano.onnx \
--voices voices.npz \
--output hello.wav \
"Hello, world."

List available voices:
kittentts-cli --model nano.onnx --voices voices.npz --list-voices

Custom voice and speed:
kittentts-cli \
--model nano.onnx \
--voices voices.npz \
--voice expr-voice-2-f \
--speed 1.2 \
--output fast.wav \
"The quick brown fox jumps over the lazy dog."

Skip text normalization (for input that is already normalized):
kittentts-cli --model nano.onnx --voices voices.npz --no-clean \
--output raw.wav "She sells sea shells by the sea shore."

GPU inference (requires onnxruntime-gpu):
kittentts-cli --model nano.onnx --voices voices.npz \
--backend cuda --output gpu.wav "Testing GPU synthesis."

Include <kittentts.h> and link with -lkittentts.
// Create engine from pre-downloaded files.
// backend: NULL = auto, "cpu", "cuda", "amd_gpu"
KittenTTS *tts = kittentts_create("nano.onnx", "voices.npz", NULL);
if (!tts) {
    fprintf(stderr, "Error: %s\n", kittentts_last_error());
    return 1;
}
// Always destroy when done.
kittentts_destroy(tts);

Generate audio into a buffer:
size_t n_samples;
float *audio = kittentts_generate(tts,
    "Hello, world.",      // text (UTF-8)
    "expr-voice-5-m",     // voice name
    1.0f,                 // speed
    1,                    // clean_text: normalize numbers, currency, etc.
    &n_samples);
if (!audio) {
    fprintf(stderr, "Error: %s\n", kittentts_last_error());
} else {
    // audio is float32 at 24 kHz, n_samples long
    // ... use audio ...
    kittentts_free_audio(audio);
}

Write directly to a WAV file:
int rc = kittentts_generate_to_file(tts,
    "Hello, world.",
    "output.wav",
    "expr-voice-5-m",
    1.0f,                 // speed
    24000,                // sample rate
    1);                   // clean_text

Streaming is useful for long texts or low-latency playback pipelines; the callback fires once per sentence chunk:
void on_chunk(const float *chunk, size_t n_samples, void *userdata) {
    // Stream chunk to audio device, append to buffer, etc.
    // chunk is valid only for the duration of this call.
    fwrite(chunk, sizeof(float), n_samples, (FILE *)userdata);
}

FILE *out = fopen("stream.raw", "wb");
if (!out) {
    // Without this check, the callback would write to a NULL stream.
    perror("fopen");
    return 1;
}
int rc = kittentts_generate_stream(tts,
    "Long text spanning many sentences...",
    "expr-voice-5-m",
    1.0f,                 // speed
    1,                    // clean_text
    on_chunk,
    out);
fclose(out);

List available voices:
int count;
const char **voices = kittentts_available_voices(tts, &count);
for (int i = 0; i < count; i++)
    printf("%s\n", voices[i]);
// Pointers are valid for the lifetime of tts; do not free.

Error handling:
// kittentts_last_error() is thread-local and valid until the next API call
// on the same thread.
const char *err = kittentts_last_error();

When clean_text is enabled (the default), the preprocessor expands written forms into spoken-friendly text before phonemization:
| Input | Output |
|---|---|
| $1,200.50 | "one thousand two hundred dollars and fifty cents" |
| March 21st | "March twenty-first" |
| 3:45 PM | "three forty-five PM" |
| 100km/h | "one hundred kilometers per hour" |
| 1.5e-3 | "one point five times ten to the power of negative three" |
| IV | "four" (Roman numerals) |
| I'm | "I am" (contractions) |
| 192.168.1.1 | "one nine two dot one six eight dot one dot one" |
Pass clean_text=0 / --no-clean if your input is already normalized or phonetic.
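To make the number rows in the table concrete, here is a minimal sketch of how an integer might be spelled out in English. It is an illustration only and not the library's preprocessor; the names `spell_number`, `append_below_1000`, and `append_word` are hypothetical, and the supported range (0 to 999,999,999) is an assumption chosen to keep the example short.

```c
#include <stdio.h>
#include <string.h>

static const char *const ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const TENS[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"
};

/* Append word to dst, preceded by a space if dst is non-empty.
   Output is truncated if it would exceed cap bytes. */
static void append_word(char *dst, size_t cap, const char *word) {
    size_t len = strlen(dst);
    snprintf(dst + len, cap - len, "%s%s", len ? " " : "", word);
}

/* Spell out 1..999 and append it to dst; appends nothing for 0. */
static void append_below_1000(unsigned n, char *dst, size_t cap) {
    if (n >= 100) {
        append_word(dst, cap, ONES[n / 100]);
        append_word(dst, cap, "hundred");
        n %= 100;
    }
    if (n >= 20) {
        char pair[24];
        if (n % 10)
            snprintf(pair, sizeof pair, "%s-%s", TENS[n / 10], ONES[n % 10]);
        else
            snprintf(pair, sizeof pair, "%s", TENS[n / 10]);
        append_word(dst, cap, pair);
    } else if (n > 0) {
        append_word(dst, cap, ONES[n]);
    }
}

/* Hypothetical helper: spell out n (0..999,999,999) into dst.
   Returns 0 on success, -1 if n is out of range or cap is 0. */
int spell_number(unsigned long n, char *dst, size_t cap) {
    if (cap == 0 || n > 999999999UL) return -1;
    dst[0] = '\0';
    if (n == 0) {
        append_word(dst, cap, "zero");
        return 0;
    }
    if (n >= 1000000UL) {
        append_below_1000((unsigned)(n / 1000000UL), dst, cap);
        append_word(dst, cap, "million");
        n %= 1000000UL;
    }
    if (n >= 1000UL) {
        append_below_1000((unsigned)(n / 1000UL), dst, cap);
        append_word(dst, cap, "thousand");
        n %= 1000UL;
    }
    append_below_1000((unsigned)n, dst, cap);
    return 0;
}
```

With this sketch, `spell_number(1200, buf, sizeof buf)` fills `buf` with "one thousand two hundred", the integer part of the first table row. Currency, ordinals, times, and units would each need their own rules layered on top.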
- Sample rate: 24,000 Hz
- Channels: 1 (mono)
- Sample format: IEEE float32
- WAV files use the SF_FORMAT_WAV | SF_FORMAT_FLOAT libsndfile encoding
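Code that consumes the float32 buffer directly may need 16-bit integers or the clip duration. The sketch below shows one way to do both in process. It assumes samples are nominally in the range [-1.0, 1.0], which this document does not state, so out-of-range values are clamped. The names `float_to_pcm16`, `duration_seconds`, and `KITTEN_SAMPLE_RATE` are hypothetical and not part of the library.

```c
#include <stddef.h>
#include <stdint.h>

/* 24,000 Hz, taken from the format list above (hypothetical macro name). */
#define KITTEN_SAMPLE_RATE 24000

/* Convert n float samples to signed 16-bit PCM.
   Assumes a nominal range of [-1.0, 1.0]; values outside it are clamped
   and NaN is mapped to silence. */
void float_to_pcm16(const float *in, int16_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float s = in[i];
        if (s != s) s = 0.0f;        /* NaN check */
        if (s > 1.0f) s = 1.0f;
        if (s < -1.0f) s = -1.0f;
        /* Scale to the symmetric range and round half away from zero. */
        float scaled = s * 32767.0f;
        out[i] = (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    }
}

/* Duration in seconds of a mono buffer holding n_samples samples. */
double duration_seconds(size_t n_samples) {
    return (double)n_samples / KITTEN_SAMPLE_RATE;
}
```

Scaling by 32767 rather than 32768 keeps the mapping symmetric around zero, at the cost of never producing -32768.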
To convert to 16-bit PCM WAV for broader compatibility:
ffmpeg -i output.wav -acodec pcm_s16le output_16bit.wav

KittenTTS C is licensed under the Apache License 2.0 (see LICENSE).
The project includes or links against several third-party components with their own licenses:
- miniz (vendored): MIT License (see THIRD_PARTY_LICENSES.md)
- ONNX Runtime (linked): MIT License
- libsndfile (linked): LGPL 2.1
- libpcre2-8 (linked): BSD 3-Clause License
- espeak-ng (subprocess): GPL 3.0 (optional runtime dependency, not linked)
See THIRD_PARTY_LICENSES.md for detailed compliance information and distribution guidance.