ComfyUI custom nodes for VoxCPM - Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning.
Run this node online: RunningHub (CN) | RunningHub (Global)
GitHub Repository: HM-RunningHub/ComfyUI_RH_VoxCPM
- Voice Design: Create unique voices from text descriptions (gender, age, tone, emotion, pace)
- Controllable Cloning: Clone a voice with optional style guidance via reference audio
- Ultimate Cloning: Reproduce every vocal nuance through audio continuation (VoxCPM2 only)
- LoRA Fine-tuning: Load custom LoRA weights for personalized voice generation
- LoRA / Full Training: Train VoxCPM LoRA (or full fine-tune) directly from a ComfyUI workflow, reusing the upstream training loop
- Auto ASR: Automatically recognize reference audio text via FunASR SenseVoiceSmall when `reference_audio_text` is empty
- Reference Denoising: Optional ZipEnhancer denoising for reference audio before cloning
```shell
cd ComfyUI/custom_nodes
git clone https://github.com/HM-RunningHub/ComfyUI_RH_VoxCPM.git
cd ComfyUI_RH_VoxCPM
pip install -r requirements.txt
```

Alternatively, search for ComfyUI_RH_VoxCPM in ComfyUI Manager and install.
| Model | Params | Size | Recommended |
|---|---|---|---|
| VoxCPM2 | 2B | ~4.6 GB | ⭐ Best quality |
| VoxCPM1.5 | 800M | ~1.9 GB | Good balance |
| VoxCPM-0.5B | 640M | ~1.5 GB | Lightweight |
Download with huggingface-cli or ModelScope:

```shell
# From Hugging Face
huggingface-cli download openbmb/VoxCPM2 --local-dir ComfyUI/models/voxcpm/VoxCPM2

# From ModelScope
pip install modelscope
modelscope download --model openbmb/VoxCPM2 --local_dir ComfyUI/models/voxcpm/VoxCPM2
```

Expected directory layout:

```
ComfyUI/
└── models/
    └── voxcpm/
        ├── VoxCPM2/                  # Main model (required)
        │   ├── config.json
        │   ├── model.safetensors
        │   ├── audiovae.pth
        │   ├── tokenizer.json
        │   ├── tokenizer_config.json
        │   └── special_tokens_map.json
        ├── loras/                    # LoRA weights (optional)
        │   └── my_custom_voice.pth
        └── speech_zipenhancer_ans_multiloss_16k_base/   # Denoiser (optional)
```

Optional models:

```shell
# SenseVoiceSmall for auto ASR (from ModelScope)
modelscope download --model iic/SenseVoiceSmall --local_dir ComfyUI/models/SenseVoice/SenseVoiceSmall

# ZipEnhancer denoiser (from ModelScope)
modelscope download --model iic/speech_zipenhancer_ans_multiloss_16k_base --local_dir ComfyUI/models/voxcpm/speech_zipenhancer_ans_multiloss_16k_base
```

Download example workflows from the examples/ directory and import into ComfyUI:
- Basic Workflow: Single-speaker speech generation with voice design / cloning
- Multi-Speaker Workflow: Fixed 5-speaker dialogue generation with per-speaker voice control
- LoRA Training Workflow: Build a tiny dataset from two audio clips and run a LoRA fine-tune
Notes:
- RunningHub VoxCPM Multi-Speaker is the fixed 5-speaker version
- RunningHub VoxCPM Multi-Speaker (Dynamic Audio) uses the same script format but grows reference-audio inputs automatically
- If the dynamic inputs do not appear after updating the plugin, refresh the ComfyUI frontend page or reopen the workflow
- Voice Design: Fill `control_instruction` (e.g. "A warm young woman") and leave `reference_audio` empty. The model creates a brand-new voice from your description alone.
- Controllable Cloning: Upload `reference_audio` and keep `ultimate_clone` OFF. Use `control_instruction` to steer emotion, pace, and style while preserving the reference timbre.
- Ultimate Cloning: Upload `reference_audio` and turn `ultimate_clone` ON. The model treats the reference as a spoken prefix and continues from it, faithfully reproducing every vocal detail. `control_instruction` is ignored in this mode. If `reference_audio_text` is empty, ASR will auto-recognize it.
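The mode is selected purely by which inputs are filled in. A minimal Python sketch of that decision logic (illustrative only; `select_mode` is not part of the plugin's API):

```python
def select_mode(control_instruction: str, reference_audio, ultimate_clone: bool) -> str:
    """Pick the generation mode from the node inputs (illustrative sketch)."""
    if reference_audio is None:
        # No reference audio: the voice is created from the description alone.
        return "voice_design"
    if ultimate_clone:
        # Reference is treated as a spoken prefix; control_instruction is ignored.
        return "ultimate_clone"
    # Reference sets the timbre; control_instruction steers emotion, pace, style.
    return "controllable_clone"

print(select_mode("A warm young woman", None, False))  # voice_design
print(select_mode("", b"wav-bytes", True))             # ultimate_clone
```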
Load VoxCPM/VoxCPM2 model from local directory with optional LoRA weights.
| Input | Type | Description |
|---|---|---|
| model_name | COMBO | Model directory under models/voxcpm/ |
| optimize | BOOLEAN | Enable torch.compile optimization (default: off) |
| lora_name | COMBO | LoRA weights under models/voxcpm/loras/ (optional, default: None) |
Generate speech with voice design, controllable cloning, or ultimate cloning.
| Input | Type | Description |
|---|---|---|
| model | VOXCPM_MODEL | Model from Load Model node |
| text | STRING | Target text to synthesize |
| cfg_value | FLOAT | Guidance scale (default: 2.0) |
| inference_steps | INT | LocDiT flow-matching steps (default: 10) |
| seed | INT | Random seed for reproducibility |
| control_instruction | STRING | Voice description for voice design mode (optional) |
| reference_audio | AUDIO | Reference audio for cloning (optional) |
| ultimate_clone | BOOLEAN | Enable ultimate cloning mode (default: off) |
| reference_audio_text | STRING | Transcript of reference audio; auto ASR if empty (optional) |
| normalize_text | BOOLEAN | Text normalization (default: off) |
| denoise_reference | BOOLEAN | Denoise reference audio via ZipEnhancer (default: off) |
| max_len | INT | Maximum token length during generation (default: 4096) |
| retry_badcase | BOOLEAN | Auto-retry when output quality is poor (default: on) |
Generate multi-speaker dialogue from a tagged script. Supports up to 5 speakers with individual voice control.
| Input | Type | Description |
|---|---|---|
| model | VOXCPM_MODEL | Model from Load Model node |
| script | STRING | Tagged script, e.g. [spk1]Hello[spk2]Hi there |
| cfg_value | FLOAT | Guidance scale (default: 2.0) |
| inference_steps | INT | LocDiT flow-matching steps (default: 10) |
| seed | INT | Random seed for reproducibility |
| audio_1 ~ audio_5 | AUDIO | Reference audio for each speaker (optional) |
| control_1 ~ control_5 | STRING | Voice description for each speaker (optional) |
| normalize_text | BOOLEAN | Text normalization (default: off) |
| denoise_reference | BOOLEAN | Denoise reference audio via ZipEnhancer (default: off) |
| max_len | INT | Maximum token length during generation (default: 4096) |
| retry_badcase | BOOLEAN | Auto-retry when output quality is poor (default: on) |
For multi-speaker reference-audio workflows. The script still uses [spk1]...[spk2]... tags, while speaker control instructions are merged into a single multiline input using the same tag format. The node shows 2 reference-audio inputs by default and automatically adds the next one when all current inputs are connected, with no fixed upper limit. At execution time, audio_1 maps to spk1, audio_2 maps to spk2, and so on, so tags like spk10 and spk20 are supported as well.
Usage tips:
- You need to connect all currently visible `audio_*` inputs before the next one is added
- This auto-growth behavior depends on the frontend extension script; if it does not update after installing a new version, refresh the page
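Under the stated mapping (audio_1 to spk1, audio_2 to spk2, and so on), a tagged script splits into per-speaker turns with a simple regex. A hedged sketch; `parse_script` is a hypothetical helper, not the node's actual parser:

```python
import re

def parse_script(script: str):
    """Split a [spkN]-tagged script into (speaker, text) turns (illustrative only)."""
    turns = []
    # Each match is a [spkN] tag followed by everything up to the next tag.
    for m in re.finditer(r"\[(spk\d+)\]([^\[]*)", script):
        speaker, text = m.group(1), m.group(2).strip()
        if text:
            turns.append((speaker, text))
    return turns

print(parse_script("[spk1]Hello[spk2]Hi there[spk10]Good morning"))
# [('spk1', 'Hello'), ('spk2', 'Hi there'), ('spk10', 'Good morning')]
```

Note that multi-digit tags such as `spk10` parse the same way, matching the node's unbounded speaker count.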
| Input | Type | Description |
|---|---|---|
| model | VOXCPM_MODEL | Model from Load Model node |
| script | STRING | Tagged script, e.g. [spk1]Hello[spk2]Hi there |
| speaker_controls | STRING | Multiline tagged controls, e.g. [spk1]Sichuan accent\n[spk2]Adult female, northeastern accent |
| cfg_value | FLOAT | Guidance scale (default: 2.0) |
| inference_steps | INT | LocDiT flow-matching steps (default: 10) |
| seed | INT | Random seed for reproducibility |
| audio_1 ~ audio_N | AUDIO | Dynamic reference-audio inputs mapped to spk1 ~ spkN by slot order; starts with 2, auto-grows when filled, and has no fixed upper limit |
| normalize_text | BOOLEAN | Text normalization (default: off) |
| denoise_reference | BOOLEAN | Denoise reference audio via ZipEnhancer (default: off) |
| max_len | INT | Maximum token length during generation (default: 4096) |
| retry_badcase | BOOLEAN | Auto-retry when output quality is poor (default: on) |
⚠️ The training nodes rely on the upstream training modules (`voxcpm.training.*`). They pull `transformers` / `datasets` / `safetensors` / `argbind` via `requirements.txt`, and require a full VoxCPM source tree to be available: either install the full repo, or drop a checkout next to this plugin (e.g. `ComfyUI/custom_nodes/VoxCPM/src/voxcpm/training/`) or inside `<plugin>/voxcpm/src/`.
Typical workflow:
- Dataset Entry wraps a single (audio, text) pair into a training sample.
- Dataset Build aggregates samples into a `train.jsonl` manifest (an existing jsonl path also works).
- Train LoRA / Train Full runs the training loop. Artifacts are written to `ComfyUI/output/voxcpm_train/<name>_<timestamp>/`; with `copy_to_loras_dir` enabled, LoRA weights are also copied to `ComfyUI/models/voxcpm/loras/` so the Load Model node picks them up after a frontend refresh.
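The manifest is plain JSON Lines, one sample per line. A sketch of writing and reading such a file; the field names (`audio`, `text`, `dataset_id`, `ref_audio`) mirror the node inputs described here but are assumptions about the real schema:

```python
import json
import os
import tempfile

# Hypothetical sample entries; field names are assumed, not the verified schema.
samples = [
    {"audio": "clips/voice_000.wav", "text": "Hello there.", "dataset_id": 0},
    {"audio": "clips/voice_001.wav", "text": "How are you?", "dataset_id": 0,
     "ref_audio": "refs/style.wav"},
]

manifest = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(manifest, "w", encoding="utf-8") as f:
    for s in samples:
        # One JSON object per line -- the "jsonl" format the nodes consume.
        f.write(json.dumps(s, ensure_ascii=False) + "\n")

with open(manifest, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```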
| Input | Type | Description |
|---|---|---|
| audio | AUDIO | Training clip |
| text | STRING | Optional transcript for the clip. If left blank, FunASR SenseVoiceSmall is used to auto-transcribe the audio |
| dataset_id | INT | Optional dataset id for multi-dataset training (default: 0) |
| ref_audio | AUDIO | Optional voice-style reference audio. When provided it is written to the manifest as `ref_audio` and used by the training pipeline for voice conditioning (requires a voxcpm build after 2026-04) |
Returns entry (feed into Dataset Build) and text (the transcript actually used, handy for preview/reuse). Auto-ASR requires the SenseVoiceSmall model under models/SenseVoice/SenseVoiceSmall.
| Input | Type | Description |
|---|---|---|
| entry_1, entry_2 | VOXCPM_DATA_ENTRY | At least two samples |
| entry_3 ~ entry_8 | VOXCPM_DATA_ENTRY | Additional samples (optional) |
| extra_manifest | STRING | Path to an existing jsonl to append (optional) |
| sample_rate | INT | Sample rate to save WAVs at; match the base model AudioVAE (default: 16000) |
| dataset_name | STRING | Output directory prefix |
Outputs manifest_path (path to train.jsonl) and num_samples.
| Input | Type | Description |
|---|---|---|
| model_name | COMBO | Base model directory under models/voxcpm/ |
| train_manifest | STRING | Training manifest (jsonl) path (use Dataset Build output) |
| output_name | STRING | Output name prefix (the final folder is suffixed with a timestamp) |
| num_iters | INT | Total training steps (default: 500) |
| batch_size | INT | Per-step batch size (default: 1) |
| grad_accum_steps | INT | Gradient accumulation steps (default: 1) |
| learning_rate | FLOAT | Learning rate (default: 1e-4) |
| lora_rank | INT | LoRA rank (default: 32) |
| lora_alpha | INT | LoRA alpha (default: 32) |
| val_manifest | STRING | Optional validation manifest |
| warmup_steps | INT | Warmup steps (default: 100) |
| weight_decay | FLOAT | Weight decay (default: 0.01) |
| max_grad_norm | FLOAT | Gradient clipping; 0 = disabled (default: 1.0) |
| num_workers | INT | Data loader workers (default: 2) |
| log_interval | INT | Log interval in steps (default: 10) |
| save_interval | INT | Checkpoint interval; 0 = save only at the end (default: 0) |
| lora_dropout | FLOAT | LoRA dropout (default: 0.0) |
| enable_lm | BOOLEAN | Apply LoRA to the LM (default: on) |
| enable_dit | BOOLEAN | Apply LoRA to the DiT (default: on) |
| enable_proj | BOOLEAN | Apply LoRA to projection layers (default: off) |
| copy_to_loras_dir | BOOLEAN | Copy final LoRA to models/voxcpm/loras/ (default: on) |
Outputs lora_path (folder containing lora_weights.safetensors + lora_config.json) and info (summary string).
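For sizing runs, note how `batch_size` and `grad_accum_steps` combine, assuming `num_iters` counts optimizer steps as the table's "Total training steps" implies (illustrative arithmetic only; values are examples, not recommendations):

```python
# Example hyperparameters, not recommendations.
num_iters = 500                 # total optimizer steps (table default)
batch_size = 2
grad_accum_steps = 4

effective_batch = batch_size * grad_accum_steps   # samples contributing to each update
samples_seen = num_iters * effective_batch        # total samples processed over the run
print(effective_batch, samples_seen)  # 8 4000
```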
Mirrors the LoRA node without LoRA-specific inputs.
This project is licensed under the Apache License 2.0.
This project is based on VoxCPM, developed by OpenBMB / ModelBest.