Skip to content

HM-RunningHub/ComfyUI_RH_VoxCPM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

18 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

ComfyUI_RH_VoxCPM

License

δΈ­ζ–‡θ―΄ζ˜Ž

ComfyUI custom nodes for VoxCPM β€” Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning.

Run this node online: RunningHub (CN) | RunningHub (Global)

GitHub Repository: HM-RunningHub/ComfyUI_RH_VoxCPM

✨ Features

  • Voice Design: Create unique voices from text descriptions (gender, age, tone, emotion, pace)
  • Controllable Cloning: Clone a voice with optional style guidance via reference audio
  • Ultimate Cloning: Reproduce every vocal nuance through audio continuation (VoxCPM2 only)
  • LoRA Fine-tuning: Load custom LoRA weights for personalized voice generation
  • LoRA / Full Training: Train VoxCPM LoRA (or full fine-tune) directly from a ComfyUI workflow, reusing the upstream training loop
  • Auto ASR: Automatically recognize reference audio text via FunASR SenseVoiceSmall when reference_audio_text is empty
  • Reference Denoising: Optional ZipEnhancer denoising for reference audio before cloning

πŸ› οΈ Installation

Method 1: Clone from GitHub

cd ComfyUI/custom_nodes
git clone https://github.com/HM-RunningHub/ComfyUI_RH_VoxCPM.git
cd ComfyUI_RH_VoxCPM
pip install -r requirements.txt

Method 2: ComfyUI Manager

Search for ComfyUI_RH_VoxCPM in ComfyUI Manager and install.

πŸ“¦ Model Download & Installation

VoxCPM Models (required, pick one)

Model Params Size Recommended
VoxCPM2 2B ~4.6 GB βœ… Best quality
VoxCPM1.5 800M ~1.9 GB Good balance
VoxCPM-0.5B 640M ~1.5 GB Lightweight

Method 1: Download from HuggingFace (Recommended)

huggingface-cli download openbmb/VoxCPM2 --local-dir ComfyUI/models/voxcpm/VoxCPM2

Method 2: Download from ModelScope (For China users)

pip install modelscope
modelscope download --model openbmb/VoxCPM2 --local_dir ComfyUI/models/voxcpm/VoxCPM2

Model Directory Structure

ComfyUI/
└── models/
    └── voxcpm/
        β”œβ”€β”€ VoxCPM2/                # Main model (required)
        β”‚   β”œβ”€β”€ config.json
        β”‚   β”œβ”€β”€ model.safetensors
        β”‚   β”œβ”€β”€ audiovae.pth
        β”‚   β”œβ”€β”€ tokenizer.json
        β”‚   β”œβ”€β”€ tokenizer_config.json
        β”‚   └── special_tokens_map.json
        β”œβ”€β”€ loras/                  # LoRA weights (optional)
        β”‚   └── my_custom_voice.pth
        └── speech_zipenhancer_ans_multiloss_16k_base/  # Denoiser (optional)

SenseVoiceSmall (required for auto ASR)

# From ModelScope
modelscope download --model iic/SenseVoiceSmall --local_dir ComfyUI/models/SenseVoice/SenseVoiceSmall

ZipEnhancer (optional, for reference audio denoising)

# From ModelScope
modelscope download --model iic/speech_zipenhancer_ans_multiloss_16k_base --local_dir ComfyUI/models/voxcpm/speech_zipenhancer_ans_multiloss_16k_base

πŸš€ Usage

Example Workflows

Download example workflows from the examples/ directory and import into ComfyUI:

  1. Basic Workflow β€” Single-speaker speech generation with voice design / cloning
  2. Multi-Speaker Workflow β€” Fixed 5-speaker multi-speaker dialogue generation with per-speaker voice control
  3. LoRA Training Workflow β€” Build a tiny dataset from two audio clips and run a LoRA fine-tune

Notes:

  • RunningHub VoxCPM Multi-Speaker is the fixed 5-speaker version
  • RunningHub VoxCPM Multi-Speaker (Dynamic Audio) uses the same script format but grows reference-audio inputs automatically
  • If the dynamic inputs do not appear after updating the plugin, refresh the ComfyUI frontend page or reopen the workflow

Three Modes

  • Voice Design: Fill control_instruction (e.g. "A warm young woman"), leave reference_audio empty. The model creates a brand-new voice from your description alone.
  • Controllable Cloning: Upload reference_audio, keep ultimate_clone OFF. Use control_instruction to steer emotion, pace, and style while preserving the reference timbre.
  • Ultimate Cloning: Upload reference_audio, turn ultimate_clone ON. The model treats the reference as a spoken prefix and continues from it, faithfully reproducing every vocal detail. control_instruction is ignored in this mode. If reference_audio_text is empty, ASR will auto-recognize it.

πŸ“ Node Reference

RunningHub VoxCPM Load Model

Load VoxCPM/VoxCPM2 model from local directory with optional LoRA weights.

Input Type Description
model_name COMBO Model directory under models/voxcpm/
optimize BOOLEAN Enable torch.compile optimization (default: off)
lora_name COMBO LoRA weights under models/voxcpm/loras/ (optional, default: None)

RunningHub VoxCPM Generate Speech

Generate speech with voice design, controllable cloning, or ultimate cloning.

Input Type Description
model VOXCPM_MODEL Model from Load Model node
text STRING Target text to synthesize
cfg_value FLOAT Guidance scale (default: 2.0)
inference_steps INT LocDiT flow-matching steps (default: 10)
seed INT Random seed for reproducibility
control_instruction STRING Voice description for voice design mode (optional)
reference_audio AUDIO Reference audio for cloning (optional)
ultimate_clone BOOLEAN Enable ultimate cloning mode (default: off)
reference_audio_text STRING Transcript of reference audio; auto ASR if empty (optional)
normalize_text BOOLEAN Text normalization (default: off)
denoise_reference BOOLEAN Denoise reference audio via ZipEnhancer (default: off)
max_len INT Maximum token length during generation (default: 4096)
retry_badcase BOOLEAN Auto-retry when output quality is poor (default: on)

RunningHub VoxCPM Multi-Speaker

Generate multi-speaker dialogue from a tagged script. Supports up to 5 speakers with individual voice control.

Input Type Description
model VOXCPM_MODEL Model from Load Model node
script STRING Tagged script, e.g. [spk1]Hello[spk2]Hi there
cfg_value FLOAT Guidance scale (default: 2.0)
inference_steps INT LocDiT flow-matching steps (default: 10)
seed INT Random seed for reproducibility
audio_1 ~ audio_5 AUDIO Reference audio for each speaker (optional)
control_1 ~ control_5 STRING Voice description for each speaker (optional)
normalize_text BOOLEAN Text normalization (default: off)
denoise_reference BOOLEAN Denoise reference audio via ZipEnhancer (default: off)
max_len INT Maximum token length during generation (default: 4096)
retry_badcase BOOLEAN Auto-retry when output quality is poor (default: on)

RunningHub VoxCPM Multi-Speaker (Dynamic Audio)

For multi-speaker reference-audio workflows. The script still uses [spk1]...[spk2]... tags, while speaker control instructions are merged into a single multiline input using the same tag format. The node shows 2 reference-audio inputs by default and automatically adds the next one when all current inputs are connected, with no fixed upper limit. At execution time, audio_1 maps to spk1, audio_2 maps to spk2, and so on, so tags like spk10 and spk20 are supported as well.

Usage tips:

  • You need to connect all currently visible audio_* inputs before the next one is added
  • This auto-growth behavior depends on the frontend extension script; if it does not update after installing a new version, refresh the page
Input Type Description
model VOXCPM_MODEL Model from Load Model node
script STRING Tagged script, e.g. [spk1]Hello[spk2]Hi there
speaker_controls STRING Multiline tagged controls, e.g. [spk1]Sichuan accent\n[spk2]Adult female, northeastern accent
cfg_value FLOAT Guidance scale (default: 2.0)
inference_steps INT LocDiT flow-matching steps (default: 10)
seed INT Random seed for reproducibility
audio_1 ~ audio_N AUDIO Dynamic reference-audio inputs mapped to spk1 ~ spkN by slot order; starts with 2, auto-grows when filled, and has no fixed upper limit
normalize_text BOOLEAN Text normalization (default: off)
denoise_reference BOOLEAN Denoise reference audio via ZipEnhancer (default: off)
max_len INT Maximum token length during generation (default: 4096)
retry_badcase BOOLEAN Auto-retry when output quality is poor (default: on)

πŸŽ“ Training Nodes (LoRA / Full Fine-tuning)

⚠️ The training nodes rely on the upstream training modules (voxcpm.training.*). They pull transformers / datasets / safetensors / argbind via requirements.txt, and require a full VoxCPM source tree to be available β€” either install the full repo, or drop a checkout next to this plugin (e.g. ComfyUI/custom_nodes/VoxCPM/src/voxcpm/training/) or inside <plugin>/voxcpm/src/.

Typical workflow:

  1. Dataset Entry wraps a single (audio, text) pair into a training sample.
  2. Dataset Build aggregates samples into a train.jsonl manifest (an existing jsonl path also works).
  3. Train LoRA / Train Full runs the training loop. Artifacts are written to ComfyUI/output/voxcpm_train/<name>_<timestamp>/; with copy_to_loras_dir enabled LoRA weights are also copied to ComfyUI/models/voxcpm/loras/ so the Load Model node picks them up after a frontend refresh.

RunningHub VoxCPM Dataset Entry

Input Type Description
audio AUDIO Training clip
text STRING Optional transcript for the clip. If left blank, funasr SenseVoiceSmall is used to auto-transcribe audio
dataset_id INT Optional dataset id for multi-dataset training (default: 0)
ref_audio AUDIO Optional voice-style reference audio. When provided it is written to the manifest as ref_audio and used by the training pipeline for voice conditioning (requires voxcpm built after 2026-04)

Returns entry (feed into Dataset Build) and text (the transcript actually used, handy for preview/reuse). Auto-ASR requires the SenseVoiceSmall model under models/SenseVoice/SenseVoiceSmall.

RunningHub VoxCPM Dataset Build

Input Type Description
entry_1, entry_2 VOXCPM_DATA_ENTRY At least two samples
entry_3 ~ entry_8 VOXCPM_DATA_ENTRY Additional samples (optional)
extra_manifest STRING Path to an existing jsonl to append (optional)
sample_rate INT Sample rate to save WAVs at; match the base model AudioVAE (default: 16000)
dataset_name STRING Output directory prefix

Outputs manifest_path (path to train.jsonl) and num_samples.

RunningHub VoxCPM Train LoRA

Input Type Description
model_name COMBO Base model directory under models/voxcpm/
train_manifest STRING Training manifest (jsonl) path (use Dataset Build output)
output_name STRING Output name prefix (the final folder is suffixed with a timestamp)
num_iters INT Total training steps (default: 500)
batch_size INT Per-step batch size (default: 1)
grad_accum_steps INT Gradient accumulation steps (default: 1)
learning_rate FLOAT Learning rate (default: 1e-4)
lora_rank INT LoRA rank (default: 32)
lora_alpha INT LoRA alpha (default: 32)
val_manifest STRING Optional validation manifest
warmup_steps INT Warmup steps (default: 100)
weight_decay FLOAT Weight decay (default: 0.01)
max_grad_norm FLOAT Gradient clipping; 0 = disabled (default: 1.0)
num_workers INT Data loader workers (default: 2)
log_interval INT Log interval in steps (default: 10)
save_interval INT Checkpoint interval; 0 = save only at the end (default: 0)
lora_dropout FLOAT LoRA dropout (default: 0.0)
enable_lm BOOLEAN Apply LoRA to the LM (default: on)
enable_dit BOOLEAN Apply LoRA to the DiT (default: on)
enable_proj BOOLEAN Apply LoRA to projection layers (default: off)
copy_to_loras_dir BOOLEAN Copy final LoRA to models/voxcpm/loras/ (default: on)

Outputs lora_path (folder containing lora_weights.safetensors + lora_config.json) and info (summary string).

RunningHub VoxCPM Train Full

Mirrors the LoRA node without LoRA-specific inputs. ⚠️ Full fine-tuning is memory-heavy; prefer the LoRA node for voice adaptation.

πŸ“„ License

This project is licensed under the Apache License 2.0.

πŸ”— Links

πŸ™ Acknowledgements

This project is based on VoxCPM, developed by OpenBMB / ModelBest.

About

This is a ComfyUI plug-in for VoxCPM

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages