On-device Qwen3 TTS in Rust: the speech model runs in ggml (GGUF weights), and the vocoder runs in ONNX Runtime. No server required—everything stays on your machine.
| If you want to… | Start here |
|---|---|
| Turn text into a WAV file | Quick start → Synthesize |
| Match a reference voice (speaker / style) | Voice clone prompts |
| Try it interactively in the terminal | Interactive TUI |
| Embed the engine in your own Rust app | Use the qts library crate (see Crates) |
| Tune GPU / CPU backends | Runtime configuration |
You need: Rust, CMake on your PATH, and Git (for the ggml submodule).
- Clone and fetch ggml

  ```sh
  git clone https://github.com/yet-another-ai/qts.git
  cd qts
  git submodule update --init --recursive
  ```

- Build the CLI (the first build compiles vendored ggml and can take a few minutes)

  ```sh
  cargo build --release -p qts_cli
  ```

- Download model files — this repo does not ship weights. Grab a main GGUF plus the shared vocoder ONNX from Hugging Face (or export your own; see docs/models.md) and put them in one folder, for example:

  ```
  models/
    qwen3-tts-0.6b-f16.gguf   # or another supported q4_k / q5_k / q6_k / q8_0 variant
    qwen3-tts-vocoder.onnx
  ```

  Those names match the default lookup used by `--model-dir` (see `ModelPaths`).

- Synthesize

  ```sh
  cargo run --release -p qts_cli -- synthesize \
    --model-dir models \
    --text "Hello from local TTS." \
    --out hello.wav
  ```
On Apple Silicon, default features include Metal and CoreML where applicable. On Linux / Windows, the default build also enables the NVIDIA-oriented vocoder EPs cuda, nvrtx, and tensorrt; DirectML remains available via an extra feature flag on Windows (see Build options).
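A successful quick start leaves `hello.wav` in the working directory. As a quick sanity check, the standard-library sketch below reads the WAV header (the `wav_summary` helper is ours, not part of qts):

```python
import wave

def wav_summary(path: str) -> dict:
    """Return basic header info for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }

# After the quick start:
#   print(wav_summary("hello.wav"))
```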
| Path | What it is |
|---|---|
| `crates/` | Rust: GGML bindings, TTS engine (`qts`), CLI/TUI (`qts_cli`) |
| `scripts/` | Python (uv): export GGUF/ONNX and voice-clone protobuf prompts |
| `docs/` | Models, testing, releases, Hugging Face card template |
| `testdata/` | Small fixtures only; keep large checkpoints outside the repo |
| Crate | Role |
|---|---|
| `qts_ggml_sys` | CMake + bindgen FFI to vendored ggml (submodule) |
| `qts_ggml` | Thin wrappers + sys re-export |
| `qts` | Library: GGUF load, tokenizer, transformer inference, speaker encoding, vocoder bridge, protobuf voice-clone types |
| `qts_cli` | `synthesize`, `profile`, and interactive `tui` |
CLI (same engine as the library):
```sh
cargo build -p qts_cli
cargo build -p qts_cli --features metal      # Apple GPU (GGML)
cargo build -p qts_cli --features vulkan     # Vulkan (GGML); needs SDK + `glslc` where applicable
cargo build -p qts_cli --features tensorrt   # NVIDIA TensorRT vocoder
cargo build -p qts_cli --features directml   # Windows vocoder (ONNX DirectML)
cargo build -p qts_cli --features cuda       # NVIDIA vocoder (ONNX CUDA)
```

Library-only examples:

```sh
cargo build -p qts --features metal
cargo build -p qts --features vulkan
```

GPU features are declared on `qts_ggml_sys` / `qts`; details and version pins live in VERSIONS.md. For the vocoder, `qts` and `qts_cli` forward the native ONNX Runtime EP feature set directly, including acl, armnn, azure, cann, coreml, cuda, directml, migraphx, nnapi, nvrtx, onednn, openvino, qnn, rknpu, tensorrt, tvm, vitis, webgpu, and xnnpack. The default feature set now includes cuda, nvrtx, and tensorrt in addition to the existing GGML defaults.
ONNX Runtime build note: ort does not ship prebuilt binaries for every EP combination. Its documented prebuilt bundles cover platform-native EPs like directml, xnnpack, and coreml, plus separate bundles for cuda + tensorrt, webgpu, and nvrtx. If you enable a mixed combination outside those bundles, ort may fall back to downloading a CPU-only runtime unless you compile ONNX Runtime from source. In practice, if you want a single build with cuda, nvrtx, and tensorrt all available together, plan on a source-built ORT.
Runtime behavior: with GPU features enabled, auto prefers Metal on Apple and Vulkan on other platforms, then falls back to CPU if init fails. Builds without those features use CPU only for GGML.
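The fallback behavior described above amounts to walking an ordered chain and taking the first backend that initializes. A minimal Python sketch of that idea (function names and the availability check are illustrative, not the engine's actual code):

```python
def resolve_backend(chain, try_init):
    """Walk a fallback chain and return the first backend that initializes.

    `chain` is an ordered list like ["metal", "vulkan", "cpu"]; `try_init`
    returns True when a backend comes up. Mirrors the documented behavior:
    auto prefers the platform GPU backend, then falls back to CPU.
    """
    for backend in chain:
        if try_init(backend):
            return backend
    raise RuntimeError(f"no backend in chain initialized: {chain}")

# e.g. Metal is absent and Vulkan init fails, so CPU wins:
available = {"cpu"}
print(resolve_backend(["metal", "vulkan", "cpu"], lambda b: b in available))  # prints "cpu"
```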
Full workspace:
```sh
cargo build --workspace
cargo test --workspace
```

Export and prompt tooling live under `scripts/`:

```sh
uv sync
uv run export-model-artifacts --help
uv run export-voice-clone-prompt --help
```

`qts` ships its protobuf schema in `crates/qts/proto/`. Regenerate the checked-in Python stub with `uv run generate-voice-clone-prompt-pb2` after schema changes.
Where to download, how to export, and layout options: docs/models.md.
Default files in one directory (used by `--model-dir`):

- `qwen3-tts-vocoder.onnx`
- One of: `qwen3-tts-0.6b-f16.gguf`, `qwen3-tts-0.6b-q8_0.gguf`, … (see `ModelPaths` for the full preference order)
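To make the lookup concrete, here is a Python sketch of a `--model-dir` scan. The preference order below is an assumption for illustration only; the authoritative order lives in `ModelPaths`:

```python
from pathlib import Path

# Illustrative order only (f16 first, then quantized variants);
# the real preference order is defined by ModelPaths in the qts crate.
GGUF_PREFERENCE = [
    "qwen3-tts-0.6b-f16.gguf",
    "qwen3-tts-0.6b-q8_0.gguf",
    "qwen3-tts-0.6b-q6_k.gguf",
    "qwen3-tts-0.6b-q5_k.gguf",
    "qwen3-tts-0.6b-q4_k.gguf",
]
VOCODER = "qwen3-tts-vocoder.onnx"

def pick_model_files(model_dir: str) -> tuple[Path, Path]:
    """Return (gguf, vocoder) paths, preferring earlier GGUF names."""
    d = Path(model_dir)
    vocoder = d / VOCODER
    if not vocoder.is_file():
        raise FileNotFoundError(vocoder)
    for name in GGUF_PREFERENCE:
        if (d / name).is_file():
            return d / name, vocoder
    raise FileNotFoundError(f"no supported GGUF in {d}")
```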
| Repo | Role |
|---|---|
| GitHub `yet-another-ai/qts` | Source of truth for code, export scripts, tests, docs |
| Hugging Face `dsh0416/Qwen3-TTS-12Hz-0.6B-Base-QTS` | Published GGUF + ONNX artifacts |
Typical flow: change and export from a pinned commit here → upload only binaries to Hugging Face → keep the HF model card in sync with this repo’s docs (template: docs/huggingface-model-card.md).
Release packaging helper:
```sh
cargo xtask hf-release --model Qwen/qts-12Hz-0.6B-Base
```

Add `--hf-repo-dir /path/to/cloned-hf-repo` to sync into an existing clone. CI (`.github/workflows/`) builds release binaries and can publish tagged releases; see the workflow comments for `HF_TOKEN` and related setup.
```sh
cargo run --release -p qts_cli -- synthesize \
  --model-dir models \
  --text "Your line here." \
  --out out.wav
```

Useful knobs include `--threads`, `--frames` (max audio frames), `--temperature`, `--top-p`, `--top-k`, `--language-id`, and `--chunk-size` (see `--help` on the binary). Backend overrides: `--backend`, `--vocoder-ep`, plus fallback chains. `--vocoder-ep` accepts `auto` or any enabled native ORT EP token such as `coreml`, `directml`, `cuda`, `openvino`, `tensorrt`, or `xnnpack`.
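For readers unfamiliar with the sampling knobs, the sketch below shows the standard temperature / top-k / nucleus (top-p) pipeline those flags control. It is the textbook algorithm, not qts's exact implementation:

```python
import math
import random

def sample_top_k_top_p(logits, top_k, top_p, temperature, rng=random):
    """Draw a token index using temperature + top-k + nucleus sampling."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Keep only the top-k most likely tokens.
    probs.sort(key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]
    # Shrink further to the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    mass = sum(p for _, p in kept)
    r = rng.random() * mass
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

Lower temperature and smaller top-k / top-p make output more deterministic; higher values trade stability for variety.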
To stay aligned with upstream Qwen3 TTS, conditioning uses protobuf prompts (exported from Python), not raw reference audio at synthesis time.
Modes:
- xvector-only — speaker identity from the reference clip.
- ICL — identity plus reference text and codec prompt (closer to upstream `create_voice_clone_prompt`).
xvector-only example:

```sh
uv sync
uv run export-voice-clone-prompt \
  --model Qwen/qts-12Hz-0.6B-Base \
  --ref-audio testdata/hello.wav \
  --x-vector-only-mode \
  --out target/hello.xvector.voice-clone-prompt.pb
cargo run --release -p qts_cli -- synthesize \
  --model-dir models \
  --text "hello" \
  --voice-clone-prompt target/hello.xvector.voice-clone-prompt.pb \
  --out target/hello-from-xvector.wav
```

ICL example:
```sh
uv run export-voice-clone-prompt \
  --model Qwen/qts-12Hz-0.6B-Base \
  --ref-audio testdata/hello.wav \
  --ref-text "hello" \
  --out target/hello.voice-clone-prompt.pb
cargo run --release -p qts_cli -- synthesize \
  --model-dir models \
  --text "hello" \
  --voice-clone-prompt target/hello.voice-clone-prompt.pb \
  --out target/hello-from-icl.wav
```

The engine reads fields such as `ref_spk_embedding`, `ref_code`, `ref_text`, and the `icl_mode` / `x_vector_only_mode` flags. Legacy wrapper:
```sh
uv run python scripts/export_voice_clone_prompt.py --help
```

The interactive TUI loads once; then you type lines and hear audio via cpal.
```sh
cargo run --release -p qts_cli -- tui \
  --model-dir models \
  --voice-clone-prompt target/hello.xvector.voice-clone-prompt.pb \
  --language en \
  --chunk-size 4
```

| Key / input | Action |
|---|---|
| Enter | Synthesize current line |
| F2 | Cycle English / Chinese / Japanese |
| Esc, Ctrl-C, or `:q` | Quit |
The header shows the active transformer backend and vocoder execution provider. --language en|zh|ja is a friendly alias; --language-id still sets the raw codec id. --chunk-size trades startup latency vs scheduling overhead (codec frames per playback chunk).
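The chunk-size tradeoff is simple arithmetic, assuming the 12 codec frames per second implied by the "12Hz" in the model name (an assumption on our part):

```python
CODEC_FRAME_RATE_HZ = 12  # assumption: read from the "12Hz" in the model name

def chunk_tradeoff(chunk_size_frames: int) -> tuple[float, float]:
    """Seconds of audio buffered per playback chunk, and chunks the
    scheduler must deliver per second of audio."""
    seconds_per_chunk = chunk_size_frames / CODEC_FRAME_RATE_HZ
    chunks_per_second = CODEC_FRAME_RATE_HZ / chunk_size_frames
    return seconds_per_chunk, chunks_per_second

# --chunk-size 4 buffers 1/3 s of audio per chunk (3 chunks/s);
# --chunk-size 1 starts sooner but wakes the pipeline 12 times per second.
```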
Apple (CoreML vocoder example)
```sh
cargo run --release -p qts_cli -- tui \
  --model-dir models \
  --backend auto \
  --backend-fallback metal,vulkan,cpu \
  --vocoder-ep coreml \
  --chunk-size 4
```

Windows (DirectML vocoder example)

```sh
cargo run --release -p qts_cli --no-default-features --features vulkan,directml -- tui \
  --model-dir models \
  --backend auto \
  --backend-fallback vulkan,cpu \
  --vocoder-ep directml \
  --chunk-size 4
```

Default auto chains:
| Platform | Transformer | Vocoder |
|---|---|---|
| Apple | `metal,vulkan,cpu` | `coreml,cpu` |
| Windows | `vulkan,cpu` | `cuda,nvrtx,tensorrt,directml,cpu` |
| Linux / Other | `vulkan,cpu` | `cuda,nvrtx,tensorrt,cpu` |
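The same defaults, as a programmatic mirror for scripting around the CLI (the dictionary itself is illustrative, not from the codebase; the tokens are the CLI's backend / EP names):

```python
# Mirror of the documented default auto chains, keyed by platform.
DEFAULT_CHAINS = {
    "apple":   {"transformer": ["metal", "vulkan", "cpu"],
                "vocoder": ["coreml", "cpu"]},
    "windows": {"transformer": ["vulkan", "cpu"],
                "vocoder": ["cuda", "nvrtx", "tensorrt", "directml", "cpu"]},
    "other":   {"transformer": ["vulkan", "cpu"],
                "vocoder": ["cuda", "nvrtx", "tensorrt", "cpu"]},
}

def default_chain(platform: str, stage: str) -> list[str]:
    """Look up the default chain; unknown platforms fall into "other"."""
    return DEFAULT_CHAINS.get(platform, DEFAULT_CHAINS["other"])[stage]
```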
| Concern | CLI flags | Environment variables |
|---|---|---|
| GGML backend | `--backend`, `--backend-fallback` | `QWEN3_TTS_BACKEND`, `QWEN3_TTS_BACKEND_FALLBACK` |
| ONNX vocoder EP | `--vocoder-ep`, `--vocoder-ep-fallback` | `QWEN3_TTS_VOCODER_EP`, `QWEN3_TTS_VOCODER_EP_FALLBACK` |
| Experimental talker KV cache | `--talker-kv-mode f16\|turboquant` | — |
| Multi-GPU adapter index | — | `QWEN3_TTS_GPU_DEVICE` (default 0; e.g. `Vulkan0`, `MTL0`) |
When using cargo run -p qts_cli directly, Cargo features (e.g. --features vulkan or --features cuda) must include the backend / execution provider you select with QWEN3_TTS_BACKEND or QWEN3_TTS_VOCODER_EP, or init will fail. The vocoder accepts the native ORT EP tokens cpu, acl, armnn, azure, cann, coreml, cuda, directml, migraphx, nnapi, nvrtx, onednn, openvino, qnn, rknpu, tensorrt, tvm, vitis, webgpu, and xnnpack when the matching feature is enabled.
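A sketch of that fail-fast check, with made-up names and feature sets (qts's real validation lives in the CLI):

```python
def check_backend(token: str, enabled: set[str]) -> str:
    """Fail fast when a requested backend/EP was not compiled in.

    Illustrative only: mirrors the documented rule that selecting a
    backend whose Cargo feature is missing makes init fail.
    """
    if token != "auto" and token not in enabled:
        raise RuntimeError(
            f"'{token}' requested but its Cargo feature is not enabled; "
            f"rebuild with --features {token}"
        )
    return token

# e.g. a build made with `--features vulkan` exposes cpu + vulkan:
enabled = {"cpu", "vulkan"}
```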
Profiling: cargo xtask profile runs the CLI with matching features and sets QWEN3_TTS_BACKEND for you (important for Vulkan on macOS). Example:
```sh
cargo xtask profile cpu --model-dir models --text "hello" --frames 64 --runs 3
cargo xtask profile metal --model-dir models --text "hello" --frames 64
```

Manual equivalent:

```sh
QWEN3_TTS_BACKEND=vulkan cargo run --release -p qts_cli --features vulkan -- profile \
  --text "hello" --model-dir models --frames 64
```

`profile` prints per-stage timings; `--out run1.wav` keeps audio from the first run.
Experimental note: --talker-kv-mode turboquant switches the talker KV cache to a quantized GGML-backed storage path. The cache itself now lives on the selected backend, while host-side quantization and upload are still part of the write-back path. profile reports talker KV allocation plus kv_download, kv_quantize, and kv_upload timing buckets.
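turboquant's exact storage layout is not documented here, but the general scheme behind GGML-style quantized storage (a per-block scale plus small integer codes) can be sketched as:

```python
def quantize_q8_block(values):
    """Quantize one block of floats to int8 codes with a per-block scale.

    Illustrates the general scheme behind GGML-style q8_0 storage;
    turboquant's actual format may differ.
    """
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes

def dequantize_q8_block(scale, codes):
    """Reverse the mapping; round-trip error is at most scale / 2."""
    return [scale * c for c in codes]
```

The write-back path the profiler reports (`kv_download`, `kv_quantize`, `kv_upload`) corresponds to moving such blocks between host and the selected backend.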
- Fast tests: `cargo test --workspace` (no large downloads).
- Optional integration tests (real checkpoints): set `QWEN3_TTS_MODEL_DIR` — see docs/testing.md.
Benchmarks (need `QWEN3_TTS_BENCH_MODEL_DIR`, etc.):

```sh
cargo xtask bench cpu
cargo xtask bench metal
cargo xtask bench vulkan
```

Set `QWEN3_TTS_BENCH_TALKER_KV_MODE=turboquant` to compare the experimental talker KV cache against the default f16 path.
Alias definition: .cargo/config.toml.
Apache License 2.0 — see LICENSE and NOTICE.
qts is a normal Rust rlib. A Godot extension can depend on it from a gdext crate without a separate C ABI, unless you choose to add one.
- predict-woo/qwen3-tts.cpp for architecture and tensor naming.
- QwenLM/Qwen3-TTS for the model and naming conventions.