rediceli/KittenTTS_C

KittenTTS C Library

A native C implementation of KittenTTS — a lightweight, CPU-optimized text-to-speech engine built on ONNX Runtime. Produces 24 kHz float32 audio from text using pre-downloaded model files.

This library is a C port of the official Python implementation at KittenML/KittenTTS.

Requirements

Dependency     Version   Purpose
CMake          ≥ 3.18    Build system
ONNX Runtime   ≥ 1.16    Neural inference
libsndfile     any       WAV file output
libpcre2-8     any       Text normalization regex
espeak-ng      any       Phonemization (via subprocess)

The library dependencies other than ONNX Runtime are found automatically via pkg-config. The ONNX Runtime location must be passed explicitly with -DONNXRUNTIME_ROOT (see below).

Building

# Download ONNX Runtime for your platform from:
#   https://github.com/microsoft/onnxruntime/releases
# Or install via package manager (e.g. Homebrew on macOS):
#   brew install onnxruntime

cd c/
mkdir build && cd build
cmake .. -DONNXRUNTIME_ROOT=/path/to/onnxruntime
cmake --build . --parallel

This produces:

  • libkittentts.dylib / libkittentts.so — shared library
  • libkittentts.a — static library
  • kittentts-cli — command-line tool

To install system-wide:

cmake --install . --prefix /usr/local

Getting Model Files

Models are distributed via Hugging Face Hub. Download manually or use the Python package to cache them:

# Using Python (one-time download):
pip install kittentts
python -c "from kittentts import KittenTTS; KittenTTS('KittenML/kitten-tts-nano-0.8')"

After download, locate the files:

~/.cache/huggingface/hub/models--KittenML--kitten-tts-nano-0.8/snapshots/<hash>/
  kitten_tts_nano_v0_8.onnx   ← model file
  voices.npz                  ← voice embeddings

Available model variants:

Model                      Size     Parameters
kitten-tts-nano-0.8        ~15 MB   15M
kitten-tts-nano-int8-0.8   ~25 MB   15M (quantized)
kitten-tts-micro-0.8       ~40 MB   40M
kitten-tts-mini-0.8        ~80 MB   80M

CLI Usage

kittentts-cli [options] "Text to speak"

Options:
  --model    PATH   Path to the .onnx model file (required)
  --voices   PATH   Path to the voices .npz file (required)
  --output   PATH   Output WAV file (default: output.wav)
  --voice    NAME   Voice name (default: expr-voice-5-m)
  --speed    FLOAT  Speech speed multiplier (default: 1.0)
  --backend  NAME   Execution backend: cpu|cuda|amd_gpu (default: auto)
  --no-clean        Disable text normalization
  --list-voices     List available voices and exit
  --help            Show this help

Examples

Basic usage:

kittentts-cli \
  --model nano.onnx \
  --voices voices.npz \
  --output hello.wav \
  "Hello, world."

List available voices:

kittentts-cli --model nano.onnx --voices voices.npz --list-voices

Custom voice and speed:

kittentts-cli \
  --model nano.onnx \
  --voices voices.npz \
  --voice expr-voice-2-f \
  --speed 1.2 \
  --output fast.wav \
  "The quick brown fox jumps over the lazy dog."

Skip text normalization (the text is passed to the phonemizer as-is):

kittentts-cli --model nano.onnx --voices voices.npz --no-clean \
  --output raw.wav "She sells sea shells by the sea shore."

GPU inference (requires a CUDA-enabled ONNX Runtime build):

kittentts-cli --model nano.onnx --voices voices.npz \
  --backend cuda --output gpu.wav "Testing GPU synthesis."

Library API

Include <kittentts.h> and link with -lkittentts.

Lifecycle

// Create engine from pre-downloaded files.
// backend: NULL = auto, "cpu", "cuda", "amd_gpu"
KittenTTS *tts = kittentts_create("nano.onnx", "voices.npz", NULL);
if (!tts) {
    fprintf(stderr, "Error: %s\n", kittentts_last_error());
    return 1;
}

// Always destroy when done.
kittentts_destroy(tts);

Batch synthesis

size_t n_samples;
float *audio = kittentts_generate(tts,
    "Hello, world.",   // text (UTF-8)
    "expr-voice-5-m",  // voice name
    1.0f,              // speed
    1,                 // clean_text: normalize numbers, currency, etc.
    &n_samples);

if (!audio) {
    fprintf(stderr, "Error: %s\n", kittentts_last_error());
} else {
    // audio is float32 at 24 kHz, n_samples long
    // ... use audio ...
    kittentts_free_audio(audio);
}
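
For a quick sanity check on the returned buffer, duration and peak level can be computed directly from the samples. This is a minimal sketch that assumes the 24 kHz mono float32 format described in this README; `audio_duration_seconds` and `audio_peak` are illustrative names and are not part of the library API.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of kittentts.h. */

/* Duration in seconds of n_samples mono samples at sample_rate Hz. */
double audio_duration_seconds(size_t n_samples, unsigned sample_rate) {
    if (sample_rate == 0) return 0.0;
    return (double)n_samples / (double)sample_rate;
}

/* Largest absolute sample value; 0.0f for an empty buffer. */
float audio_peak(const float *audio, size_t n_samples) {
    float peak = 0.0f;
    for (size_t i = 0; i < n_samples; i++) {
        float a = audio[i] < 0.0f ? -audio[i] : audio[i];
        if (a > peak) peak = a;
    }
    return peak;
}
```

With the buffer from the example above, `audio_duration_seconds(n_samples, 24000)` gives the clip length, and a peak above 1.0 would indicate samples that will clip if converted to integer PCM.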

Write directly to WAV

int rc = kittentts_generate_to_file(tts,
    "Hello, world.",
    "output.wav",
    "expr-voice-5-m",
    1.0f,    // speed
    24000,   // sample rate
    1);      // clean_text

Streaming synthesis

Useful for long texts or low-latency playback pipelines — the callback fires once per sentence chunk:

void on_chunk(const float *chunk, size_t n_samples, void *userdata) {
    // Stream chunk to audio device, append to buffer, etc.
    // chunk is valid only for the duration of this call.
    fwrite(chunk, sizeof(float), n_samples, (FILE *)userdata);
}

FILE *out = fopen("stream.raw", "wb");
if (!out) {
    perror("fopen");
    return 1;
}
int rc = kittentts_generate_stream(tts,
    "Long text spanning many sentences...",
    "expr-voice-5-m",
    1.0f,      // speed
    1,         // clean_text
    on_chunk,
    out);
fclose(out);
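
Because each chunk is only valid during the callback, a consumer that needs the whole utterance afterwards has to copy the samples out. The following is a sketch of one way to do that with a growable buffer. `SampleBuffer`, `sample_buffer_append`, and `collect_chunk` are hypothetical names; only the callback signature is taken from the example above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, not part of kittentts.h. */
typedef struct {
    float *data;
    size_t len;     /* samples stored */
    size_t cap;     /* samples allocated */
    int failed;     /* set to 1 if an allocation failed */
} SampleBuffer;

/* Append n samples, doubling capacity as needed. Returns 0 on success. */
int sample_buffer_append(SampleBuffer *b, const float *src, size_t n) {
    if (b->failed) return -1;
    if (n == 0) return 0;
    if (n > b->cap - b->len) {
        size_t new_cap = b->cap ? b->cap : 4096;
        while (new_cap - b->len < n) new_cap *= 2;
        float *p = realloc(b->data, new_cap * sizeof *p);
        if (!p) { b->failed = 1; return -1; }
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n * sizeof *src);
    b->len += n;
    return 0;
}

/* Same signature as on_chunk above; userdata points to a SampleBuffer. */
void collect_chunk(const float *chunk, size_t n_samples, void *userdata) {
    sample_buffer_append((SampleBuffer *)userdata, chunk, n_samples);
}
```

To use it, zero-initialize the buffer (`SampleBuffer buf = {0};`), pass `collect_chunk` and `&buf` where the example above passes `on_chunk` and `out`, check `buf.failed` after the call returns, and `free(buf.data)` when done.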

List available voices

int count;
const char **voices = kittentts_available_voices(tts, &count);
for (int i = 0; i < count; i++)
    printf("%s\n", voices[i]);
// Pointers are valid for the lifetime of tts; do not free.
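
Validating a user-supplied voice name before synthesis amounts to a linear search over that array. A minimal sketch follows; `voice_exists` is a hypothetical helper, and it assumes voice names are matched case-sensitively.

```c
#include <string.h>

/* Hypothetical helper, not part of kittentts.h.
   Returns 1 if name matches an entry in voices[0..count-1], else 0. */
int voice_exists(const char **voices, int count, const char *name) {
    if (!voices || !name) return 0;
    for (int i = 0; i < count; i++) {
        if (voices[i] && strcmp(voices[i], name) == 0) return 1;
    }
    return 0;
}
```

Called as `voice_exists(voices, count, "expr-voice-2-f")` with the array and count obtained above, this lets a program reject an unknown voice with its own message before starting synthesis.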

Error handling

// kittentts_last_error() is thread-local and valid until the next API call
// on the same thread.
const char *err = kittentts_last_error();
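
Since the returned string is only valid until the next API call on the same thread, code that logs or propagates errors later should copy the message first. The sketch below shows such a copy; `save_error` is a hypothetical helper that takes the pointer returned by `kittentts_last_error()` as its first argument.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of kittentts.h.
   Copies err into buf, truncating if needed; the result is always
   NUL-terminated. A NULL err is stored as an empty string. Returns buf. */
char *save_error(const char *err, char *buf, size_t cap) {
    if (!buf || cap == 0) return buf;
    snprintf(buf, cap, "%s", err ? err : "");
    return buf;
}
```

A typical call would be `char msg[256]; save_error(kittentts_last_error(), msg, sizeof msg);` immediately after the failing call and before any other library call.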

Text Normalization

When clean_text is enabled (the default), the preprocessor converts written forms such as numbers, dates, and abbreviations into spoken-friendly text before phonemization:

Input         Output
$1,200.50     "one thousand two hundred dollars and fifty cents"
March 21st    "March twenty-first"
3:45 PM       "three forty-five PM"
100km/h       "one hundred kilometers per hour"
1.5e-3        "one point five times ten to the power of negative three"
IV            "four" (Roman numerals)
I'm           "I am" (contractions)
192.168.1.1   "one nine two dot one six eight dot one dot one"

Pass clean_text=0 / --no-clean if your input is already normalized or phonetic.
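
The IP address row in the table, where each digit is read individually, is simple enough to sketch. This illustrates that single rule only and is not the library's implementation (the requirements list PCRE2 for normalization regexes); `spell_dotted_digits` is a hypothetical name.

```c
#include <string.h>

/* Hypothetical helper illustrating one normalization rule.
   Writes each digit of in as a word and each '.' as "dot", separated by
   single spaces. Returns 0 on success, -1 if in contains any other
   character or if out (of size cap) is too small. */
int spell_dotted_digits(const char *in, char *out, size_t cap) {
    static const char *const words[] = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };
    size_t used = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    for (const char *p = in; *p; p++) {
        const char *w;
        if (*p >= '0' && *p <= '9') w = words[*p - '0'];
        else if (*p == '.') w = "dot";
        else return -1;
        size_t wl = strlen(w);
        size_t need = wl + (used ? 1 : 0);
        if (need >= cap - used) return -1;  /* keep room for the NUL */
        if (used) out[used++] = ' ';
        memcpy(out + used, w, wl);
        used += wl;
        out[used] = '\0';
    }
    return 0;
}
```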

Audio Output Format

  • Sample rate: 24,000 Hz
  • Channels: 1 (mono)
  • Sample format: IEEE float32
  • WAV files use the SF_FORMAT_WAV | SF_FORMAT_FLOAT libsndfile encoding

To convert to 16-bit PCM WAV for broader compatibility:

ffmpeg -i output.wav -acodec pcm_s16le output_16bit.wav
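
The same conversion can be done in process when the samples are already in memory. Below is a sketch of a float32 to 16-bit PCM conversion with clipping. `float_to_pcm16` is a hypothetical helper, and scaling by 32767 with round-to-nearest is one common convention among several.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of kittentts.h.
   Converts n float samples in [-1.0, 1.0] to signed 16-bit PCM.
   Out-of-range samples are clipped; NaN is written as 0. */
void float_to_pcm16(const float *in, int16_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float s = in[i];
        if (!(s == s)) s = 0.0f;          /* NaN check without <math.h> */
        if (s > 1.0f) s = 1.0f;
        if (s < -1.0f) s = -1.0f;
        float scaled = s * 32767.0f;
        /* Round half away from zero; the cast then truncates. */
        out[i] = (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
    }
}
```

The resulting buffer can be written with libsndfile using a 16-bit PCM subtype, or passed to an audio API that expects signed 16-bit samples.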

Licensing

KittenTTS C is licensed under the Apache License 2.0 (see LICENSE).

The project includes or links against several third-party components with their own licenses:

  • miniz (vendored): MIT License — see THIRD_PARTY_LICENSES.md
  • ONNX Runtime (linked): MIT License
  • libsndfile (linked): LGPL 2.1
  • libpcre2-8 (linked): BSD 3-Clause License
  • espeak-ng (subprocess): GPL 3.0 (optional runtime dependency, not linked)

See THIRD_PARTY_LICENSES.md for detailed compliance information and distribution guidance.
