A simple, powerful CLI for generating high-quality speech from text using Qwen3-TTS models locally.
- Voice Cloning: Clone any voice from a 3-second audio sample
- Preset Voices: 9 premium voices covering multiple languages and dialects
- Voice Design: Generate custom voices from natural language descriptions
- Multi-language: Supports 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
- Emotion Control: Fine-tune tone, emotion, and speaking style with text instructions
- High Quality: 12Hz tokenizer for natural-sounding speech
- MP3 Export: Direct MP3 output for easy sharing
git clone <repo-url>
cd qtts
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./qtts.py "Hello world!" -s Ryan -l English
- Python 3.12 (recommended; see Troubleshooting if using 3.13/3.14)
- ffmpeg (for MP3 conversion):
# macOS
brew install ffmpeg sox
# Ubuntu/Debian
sudo apt install ffmpeg sox
# Windows (using Chocolatey)
choco install ffmpeg sox
- CUDA (optional, for GPU acceleration; highly recommended)
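The prerequisites above can be checked before installing. The following is a minimal sketch, not part of the project: `missing_tools` is a hypothetical helper, and the command names (`python3.12`, `ffmpeg`, `sox`) are taken from the list and install commands above and may differ on your system, particularly on Windows.

```bash
#!/usr/bin/env bash
# missing_tools NAME...: print each command name that is not found on PATH.
# Hypothetical helper for illustration; it is not shipped with qtts.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Command names assumed from the prerequisites and install commands above.
missing="$(missing_tools python3.12 ffmpeg sox)"
if [ -n "$missing" ]; then
  printf 'Missing prerequisites:\n%s\n' "$missing"
else
  echo "All prerequisites found."
fi
```

CUDA is not checked here because it is optional and its detection depends on the platform.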
- Clone or download this repository:
cd qtts
- Create a virtual environment with Python 3.12:
# macOS/Linux with brew-installed python3.12
python3.12 -m venv venv
source venv/bin/activate
# Or use conda
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
- Install dependencies:
pip install -r requirements.txt
- (Optional) For better performance with NVIDIA GPU only:
# Only works on Linux/Windows with CUDA-capable GPU
# Will NOT work on macOS or CPU-only systems
pip install flash-attn --no-build-isolation
Note: flash-attention requires a CUDA GPU. If installation fails, you can skip this step; the CLI works fine without it, just a bit slower.
- Make the script executable:
chmod +x qtts.py
To make qtts available from anywhere on your system, use the wrapper script method:
Note: A pyproject.toml is included for future pip/PyPI distribution, but pip install is not recommended due to potential llvmlite build issues on Python 3.12+. The wrapper script method below is the recommended approach.
- Make the wrapper script executable:
chmod +x qtts-wrapper.sh
- Create a symlink in your user bin directory (no sudo required):
ln -sf "$(pwd)/qtts-wrapper.sh" ~/.local/bin/qtts
Note: Make sure ~/.local/bin is in your PATH. If not, add this to your ~/.bashrc or ~/.zshrc:
export PATH="$HOME/.local/bin:$PATH"
- Alternatively, for system-wide installation (requires sudo):
sudo ln -sf "$(pwd)/qtts-wrapper.sh" /usr/local/bin/qtts
- Verify installation:
which qtts
qtts --list-speakers
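Whether ~/.local/bin is already on your PATH can be tested directly rather than by eye. This is a small sketch under the assumption that you use a POSIX-style shell; `path_contains` is a hypothetical helper name, not something the project provides.

```bash
#!/usr/bin/env bash
# path_contains DIR: succeed if DIR is one of the colon-separated entries in PATH.
# Hypothetical helper for illustration only.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/.local/bin"; then
  echo "~/.local/bin is already in PATH"
else
  echo 'Add this to ~/.bashrc or ~/.zshrc: export PATH="$HOME/.local/bin:$PATH"'
fi
```

Wrapping PATH in colons lets the pattern match whole entries only, so /opt does not match /opt/tools.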
After installation, you can run qtts from any directory without the ./ prefix or activating the virtual environment.
# If installed system-wide:
qtts "Hello, welcome to Qwen3-TTS!" -s Vivian -l English
# Or from the project directory:
./qtts.py "Hello, welcome to Qwen3-TTS!" -s Vivian -l English
This generates output.mp3 with the Vivian voice speaking in English.
qtts "I'm so excited to meet you!" -s Ryan -i "Very happy and energetic"
Clone a voice from a reference audio sample:
./qtts.py "This is my cloned voice" -m clone \
--ref-audio path/to/reference.wav \
--ref-text "The text spoken in the reference audio"
Create a unique voice from a description:
./qtts.py "Hello there!" -m design \
-i "Young male voice, 25 years old, cheerful and confident tone"
# If installed system-wide:
qtts [TEXT] [OPTIONS]
# Or from the project directory:
./qtts.py [TEXT] [OPTIONS]
Note: The rest of this guide uses qtts for brevity. If not installed system-wide, use ./qtts.py instead.
| Option | Description | Default |
|---|---|---|
| -o, --output | Output file path | output.mp3 |
| -m, --mode | Generation mode: custom, clone, or design | custom |
| -l, --language | Target language (see list below) | Auto |

Custom mode (-m custom):
| Option | Description | Default |
|---|---|---|
| -s, --speaker | Preset speaker name | Vivian |
| -i, --instruct | Emotion/tone instruction (optional) | None |
Available Speakers:
| Speaker | Description | Native Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Seasoned male, low and mellow | Chinese |
| Dylan | Youthful Beijing male, clear | Chinese (Beijing) |
| Eric | Lively Chengdu male, husky | Chinese (Sichuan) |
| Ryan | Dynamic male, strong rhythm | English |
| Aiden | Sunny American male | English |
| Ono_Anna | Playful Japanese female | Japanese |
| Sohee | Warm Korean female | Korean |
List all speakers:
qtts --list-speakers

Clone mode (-m clone):
| Option | Description | Required |
|---|---|---|
| --ref-audio | Path or URL to reference audio (3+ seconds) | Yes |
| --ref-text | Transcript of the reference audio | Yes |
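A local reference clip can be checked for length before cloning. The sketch below assumes ffprobe (installed alongside ffmpeg) is available and uses the 3 to 10 second range suggested in the tips later in this guide; the function names are hypothetical, and the check does not apply to URLs.

```bash
#!/usr/bin/env bash
# duration_in_range SECONDS: succeed if SECONDS is between 3 and 10 inclusive.
# The 3-10 second range is this guide's suggestion, not a hard limit of the CLI.
duration_in_range() {
  awk -v d="$1" 'BEGIN { exit !(d >= 3 && d <= 10) }'
}

# audio_duration FILE: print the duration in seconds as reported by ffprobe.
audio_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# check_reference FILE: warn when a local reference clip is outside the range.
check_reference() {
  local secs
  secs="$(audio_duration "$1")" || return 1
  if duration_in_range "$secs"; then
    echo "ok: $1 is ${secs}s"
  else
    echo "warning: $1 is ${secs}s; 3-10 seconds is suggested" >&2
  fi
}
```

For example, `check_reference path/to/reference.wav` could be run before passing the same file to --ref-audio.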

Design mode (-m design):
| Option | Description | Required |
|---|---|---|
| -i, --instruct | Natural language voice description | Yes |

Other options:
| Option | Description |
|---|---|
| --model | Custom model path or HuggingFace ID |
| --device | Device: cuda:0, cpu, mps (auto-detected) |
| --list-languages | Show all supported languages |
Supported values for -l, --language: Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
Use --list-languages to display the full list.
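In scripts it can help to validate a language name before calling the CLI. This is a sketch only: `is_supported_language` is a hypothetical helper that hard-codes the list above, and it assumes the names are case-sensitive, which the guide does not state.

```bash
#!/usr/bin/env bash
# is_supported_language NAME: succeed if NAME appears in the list documented above.
# Hypothetical helper; the authoritative list comes from qtts --list-languages.
is_supported_language() {
  case "$1" in
    Auto|Chinese|English|Japanese|Korean|German|French|Russian|Portuguese|Spanish|Italian)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

lang="English"
if is_supported_language "$lang"; then
  echo "would run: qtts \"Hello\" -l $lang"
else
  echo "unsupported language: $lang" >&2
fi
```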
qtts "Welcome to the future of text-to-speech" -s Aiden -l English
# Chinese text ("Actually, I have found that I am someone who is especially good at reading other people's emotions.")
# with the instruction "Say it in a very angry tone"
qtts "其实我真的有发现,我是一个特别善于观察别人情绪的人。" \
-s Vivian -l Chinese -i "用特别愤怒的语气说"
# Japanese text ("Hello, world!")
qtts "こんにちは、世界!" -s Ono_Anna -l Japanese
qtts "I am solving the equation" -m clone \
--ref-audio "https://example.com/voice.wav" \
--ref-text "Original text from the audio" \
-l English
qtts "Prepare for trouble, and make it double!" -m design \
-i "Male villain voice, 30s, dramatic and theatrical with a hint of menace" \
-l English -o villain.mp3
Generate multiple files with different voices:
qtts "Hello world" -s Vivian -o output1.mp3
qtts "Hello world" -s Ryan -o output2.mp3
qtts "Hello world" -s Aiden -o output3.mp3
- Use high-quality reference audio (clear, minimal background noise)
- Reference audio should be 3-10 seconds long
- Provide accurate transcription for best results
- Reference audio quality directly affects output quality
- Use each speaker's native language for best quality
- Keep instructions clear and concise
- Examples: "Very happy", "Angry and frustrated", "Calm and soothing"
- Be specific: age, gender, tone, emotional state
- Good: "25-year-old female, energetic teacher voice, warm and encouraging"
- Avoid: "Nice voice"
- The first run downloads the models (~1-3GB), so expect a wait
- GPU (CUDA) recommended for faster generation
- Models are cached after first use
- Use shorter texts for faster processing
- Set specific language when known (faster than Auto)
- Auto mode works well for mixed-language text
- Each speaker performs best in their native language
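The multi-voice batch shown in the examples (one qtts call per speaker, numbered output files) can also be written as a loop. This is a sketch, assuming qtts is installed on your PATH; `batch_speak` and the `DRY_RUN` variable are hypothetical names introduced here, and only the -s and -o options documented above are used.

```bash
#!/usr/bin/env bash
# batch_speak TEXT SPEAKER...: one qtts call per speaker, writing
# output1.mp3, output2.mp3, ... in the order the speakers are given.
# With DRY_RUN=1 the commands are printed instead of executed.
batch_speak() {
  local text="$1" n=0 speaker
  shift
  for speaker in "$@"; do
    n=$((n + 1))
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Display only: the text is not re-escaped for copy and paste.
      printf 'qtts "%s" -s %s -o output%d.mp3\n' "$text" "$speaker" "$n"
    else
      qtts "$text" -s "$speaker" -o "output${n}.mp3" || return 1
    fi
  done
}

# Preview the commands without generating audio.
DRY_RUN=1 batch_speak "Hello world" Vivian Ryan Aiden
```

Dropping `DRY_RUN=1` runs the same calls for real and stops at the first failure.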
Problem: If you're using Python 3.13 or 3.14, you may encounter build errors with llvmlite. It is a dependency of numba, which librosa requires, and qwen-tts in turn needs librosa.
Solution: Use Python 3.12 instead:
# If you have Python 3.12 installed via Homebrew
rm -rf venv
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Why this happens: The llvmlite package (required by numba, which is required by librosa) doesn't have prebuilt wheels for Python 3.13/3.14 yet, and building from source requires LLVM 20 specifically (macOS typically has LLVM 15 or 21).
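To see whether your default interpreter is one of the affected versions, a quick check along these lines may help. It is a sketch: `llvmlite_risky` and `python_minor` are hypothetical helpers, and the set of affected versions (3.13 and 3.14) is simply the one named in this guide.

```bash
#!/usr/bin/env bash
# llvmlite_risky VERSION: succeed if VERSION (major.minor) is one that this
# guide flags for llvmlite build errors.
llvmlite_risky() {
  case "$1" in
    3.13|3.14) return 0 ;;
    *) return 1 ;;
  esac
}

# python_minor INTERPRETER: print "major.minor" for the given interpreter.
python_minor() {
  "$1" -c 'import sys; print("%d.%d" % sys.version_info[:2])'
}

if command -v python3 >/dev/null 2>&1; then
  ver="$(python_minor python3)"
  if llvmlite_risky "$ver"; then
    echo "python3 is $ver: create the venv with python3.12 instead"
  else
    echo "python3 is $ver"
  fi
fi
```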
Alternative: Use a conda environment, which provides prebuilt numba/llvmlite packages:
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
conda install -c conda-forge numba librosa -y
pip install -r requirements.txt
- Use CPU mode: --device cpu
- Close other applications using GPU
- Use smaller model (0.6B instead of 1.7B)
- Install ffmpeg (see Prerequisites)
- Or save as WAV: -o output.wav
- Check internet connection
- Models download from HuggingFace on first use
- Use --model with a local path if you pre-downloaded
- Check reference audio quality (for clone mode)
- Try different speaker/language combinations
- Ensure text is clean (no special formatting)
- Use GPU if available
- Install flash-attention: pip install flash-attn --no-build-isolation
- Use 0.6B model instead of 1.7B
- Custom Mode: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice (~600MB)
- Clone Mode: Qwen/Qwen3-TTS-12Hz-0.6B-Base (~600MB)
- Design Mode: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign (~1.7GB)
Models are automatically downloaded from HuggingFace on first use and cached locally.
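The mode-to-model pairing listed above can be captured in a small lookup, which is handy when passing --model explicitly. This is a sketch that only restates the list: `default_model` is a hypothetical helper, and whether the CLI resolves its defaults in exactly this way is an assumption.

```bash
#!/usr/bin/env bash
# default_model MODE: print the HuggingFace model ID this guide lists for MODE.
# Hypothetical helper; it mirrors the list above and does not query the CLI.
default_model() {
  case "$1" in
    custom) echo "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice" ;;
    clone)  echo "Qwen/Qwen3-TTS-12Hz-0.6B-Base" ;;
    design) echo "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

echo "design mode model: $(default_model design)"
```

A call such as `qtts "Hello there!" -m design -i "Calm and soothing" --model "$(default_model design)"` would then name the model explicitly.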
Minimum:
- CPU: Any modern processor
- RAM: 8GB
- Storage: 5GB free
Recommended:
- GPU: NVIDIA GPU with 4GB+ VRAM (CUDA support)
- RAM: 16GB
- Storage: 10GB free
- Tokenizer: Qwen3-TTS-Tokenizer-12Hz
- Architecture: Discrete multi-codebook LM
- Precision: bfloat16 (GPU) / float32 (CPU)
- Sample Rate: 24kHz output
- Latency: ~2-5 seconds per sentence (GPU)
This project uses Qwen3-TTS models, which are licensed under Apache 2.0.
Built on top of Qwen3-TTS by Alibaba Cloud.
Model page: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
For issues with the Qwen3-TTS models themselves, refer to:
- GitHub: https://github.com/QwenLM/Qwen3-TTS
- HuggingFace: https://huggingface.co/Qwen
For CLI-specific issues, check the troubleshooting section above.
Happy voice synthesizing with qtts!