vibevoice.cpp

Brought to you by the LocalAI team - the creators of LocalAI, the open-source AI engine that runs any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

A C++ inference engine for Microsoft VibeVoice, built on ggml. Supports both TTS (text-to-speech with voice cloning) and ASR (long-form transcription with diarization).

Status: TTS + ASR pipelines run end-to-end on real model weights, with classifier-free guidance for TTS and a working closed-loop self-test (TTS → ASR produces real English transcripts of the synthesized audio).

Quickstart - prebuilt models

We publish quantized GGUFs at mudler/vibevoice.cpp-models. Once the build finishes, pull them and TTS and ASR are one command each:

# `--recursive` is mandatory - third_party/ggml is a submodule; without
# it cmake configure fails with 'add_subdirectory called with empty
# directory'. If you've already cloned without it:
#     git submodule update --init --recursive
git clone --recursive https://github.com/mudler/vibevoice.cpp && cd vibevoice.cpp

cmake -B build -DVIBEVOICE_BUILD_TESTS=ON && cmake --build build -j

mkdir -p models && hf download mudler/vibevoice.cpp-models --local-dir models

# TTS
./build/bin/vibevoice-cli tts \
  --model     models/vibevoice-realtime-0.5B-q8_0.gguf \
  --tokenizer models/tokenizer.gguf \
  --voice     models/voice-en-Carter_man.gguf \
  --text "Hello from vibevoice cpp." --out hello.wav

# ASR
./build/bin/vibevoice-cli asr \
  --model     models/vibevoice-asr-q8_0.gguf \
  --tokenizer models/tokenizer.gguf \
  --audio     hello.wav

Quickstart - convert from upstream

If you want to roll your own (different quant, different voice, etc.):

# Tokenizer
hf download Qwen/Qwen2.5-0.5B --local-dir models/qwen2.5
python scripts/convert_tokenizer.py --src models/qwen2.5 --out models/tokenizer.gguf

# Realtime TTS model (~1.9 GB → ~3.8 GB fp32 gguf)
hf download microsoft/VibeVoice-Realtime-0.5B --local-dir models/vibevoice-realtime-0.5B
python scripts/convert_vibevoice_to_gguf.py \
  --src models/vibevoice-realtime-0.5B \
  --out models/vibevoice-realtime-0.5B.gguf

# (optional) quantize to Q8_0 - ~50% smaller, no quality loss in the closed-loop test
python scripts/quantize_gguf.py \
  --src models/vibevoice-realtime-0.5B.gguf \
  --out models/vibevoice-realtime-0.5B-q8_0.gguf \
  --type q8_0

# Voice prompt (one of the en-*/de-*/fr-* etc. files in the upstream demo)
curl -sL -o /tmp/voice.pt \
  https://github.com/microsoft/VibeVoice/raw/main/demo/voices/streaming_model/en-Carter_man.pt
python scripts/convert_voice_to_gguf.py --src /tmp/voice.pt --out models/voice.gguf

./build/bin/vibevoice-cli tts \
  --model models/vibevoice-realtime-0.5B-q8_0.gguf \
  --tokenizer models/tokenizer.gguf \
  --voice models/voice.gguf \
  --text "Hello from vibevoice cpp." \
  --out hello.wav \
  --cfg 3.0 --steps 20 --max-frames 40 --verbose

Closed-loop sanity (TTS → ASR)

# 1. synthesize
./build/bin/vibevoice-cli tts \
    --model models/vibevoice-realtime-0.5B-q8_0.gguf \
    --voice models/voice-en-Carter_man.gguf \
    --tokenizer models/tokenizer.gguf \
    --text "Hello world this is a test of the synthesis system." \
    --seed 12345 --out say.wav

# 2. transcribe back
./build/bin/vibevoice-cli asr \
    --model models/vibevoice-asr-q8_0.gguf \
    --tokenizer models/tokenizer.gguf \
    --audio say.wav
# -> [{"Start":0,"End":2.8,"Speaker":0,"Content":"Hello world, this is a test of the synthesis system."}]

This is the same roundtrip codified as tests/test_closed_loop.cpp - see docs/conversion.md for how to wire it into ctest.
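
The comparison step of the roundtrip can be sketched in a few lines of Python. This is a minimal illustration, not the logic in tests/test_closed_loop.cpp: the README does not define its recall metric, so the bag-of-words recall below is an assumption, and `normalize` and `word_recall` are hypothetical helper names.

```python
import json
import re


def normalize(text):
    """Lowercase and drop punctuation so 'world,' matches 'world'."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_recall(reference, asr_json):
    """Fraction of reference words that reappear in the ASR transcript.

    Assumed metric: bag-of-words recall, ignoring word order.
    """
    segments = json.loads(asr_json)
    heard = normalize(" ".join(seg["Content"] for seg in segments))
    wanted = normalize(reference)
    if not wanted:
        return 1.0
    remaining = list(heard)
    hits = 0
    for word in wanted:
        if word in remaining:
            remaining.remove(word)  # each transcript word can match only once
            hits += 1
    return hits / len(wanted)


# Text passed to `tts` and the JSON printed by `asr` in the example above.
reference = "Hello world this is a test of the synthesis system."
asr_output = (
    '[{"Start":0,"End":2.8,"Speaker":0,'
    '"Content":"Hello world, this is a test of the synthesis system."}]'
)
print(word_recall(reference, asr_output))  # 1.0
```

Punctuation that the ASR model adds (the comma after "world") is stripped before matching, so it does not count as an error.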

Quickstart - voice cloning (1.5B)

The microsoft/VibeVoice-1.5B model conditions on a raw reference WAV at synthesis time — no separate voice gguf needed. Hand it ~5 s of any speaker and it'll synthesize new text in that voice.

hf download microsoft/VibeVoice-1.5B --local-dir models/vibevoice-1.5B
python scripts/convert_vibevoice_to_gguf.py \
  --src models/vibevoice-1.5B \
  --out models/vibevoice-1.5B.gguf
# shrink the gguf. Two recommended profiles for the 1.5B path:
#
#   1) Q8_0 across the board: 11 GB -> 6.8 GB, no measurable recall
#      hit on the closed-loop benchmark.
./build/bin/vibevoice-quantize \
  --src  models/vibevoice-1.5B.gguf \
  --out  models/vibevoice-1.5B-q8_0.gguf \
  --type q8_0
#
#   2) Mixed: 11 GB -> 6.5 GB, same recall as fp32. FFN at Q6_K, attn
#      at Q5_K, lm_head at Q8_0. Plain Q5_K across the board collapses
#      this model (recall drops to 22%) — FFN weights are the most
#      quant-sensitive piece, attention tolerates Q5_K well.
./build/bin/vibevoice-quantize \
  --src           models/vibevoice-1.5B.gguf \
  --out           models/vibevoice-1.5B-mixed.gguf \
  --type          q6_k    \
  --attn-type     q5_k    \
  --lm-head-type  q8_0

./build/bin/vibevoice-cli tts \
  --model     models/vibevoice-1.5B-q8_0.gguf \
  --tokenizer models/tokenizer.gguf \
  --ref-audio reference-voice.wav \
  --text      "Hello world, this is a test of voice cloning." \
  --out       cloned.wav

The same tts subcommand handles both model families: pass --voice <voice.gguf> for the realtime-0.5B path, or --ref-audio <wav> for runtime voice cloning on the 1.5B path. The CLI dispatches based on the loaded gguf's variant metadata; the two flags are mutually exclusive.

Multi-speaker dialog (1.5B only): repeat --ref-audio per speaker and tag the lines with Speaker N: markers. The model uses each WAV's encoded features as the voice for the corresponding speaker.

./build/bin/vibevoice-cli tts \
  --model     models/vibevoice-1.5B-q8_0.gguf \
  --tokenizer models/tokenizer.gguf \
  --ref-audio voice-carter.wav \
  --ref-audio voice-emma.wav \
  --text "$(printf ' Speaker 0: Hello, I am Carter.\n Speaker 1: And I am Emma.\n Speaker 0: Nice to meet you.')" \
  --out dialog.wav

Note: voice cloning only works with the 1.5B variant. The realtime-0.5B weights ship without encoders, so they can't process a reference WAV at runtime — they only consume pre-baked voice gguf files (see scripts/convert_voice_to_gguf.py).
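
When driving the CLI from a script, the flag rules above can be checked before the process is launched. The sketch below uses only flags shown in this README. The helper names are hypothetical, and the positional mapping (Speaker 0 uses the first --ref-audio, Speaker 1 the second) is an assumption drawn from the dialog example, as is the leading space on each line.

```python
def format_dialog(turns):
    """Render (speaker_index, line) pairs as ' Speaker N: ...' lines.

    Mirrors the printf layout in the README example, including the
    leading space on each line.
    """
    return "\n".join(f" Speaker {spk}: {line}" for spk, line in turns)


def check_speakers(turns, ref_audio):
    """Assumption: speaker N is voiced by the N-th --ref-audio file."""
    for spk, _ in turns:
        if spk < 0 or spk >= len(ref_audio):
            raise ValueError(f"Speaker {spk} has no matching --ref-audio")


def build_tts_argv(model, tokenizer, text, out, voice=None, ref_audio=()):
    """Build the argument list for `vibevoice-cli tts`."""
    if voice and ref_audio:
        raise ValueError("--voice and --ref-audio are mutually exclusive")
    argv = ["./build/bin/vibevoice-cli", "tts",
            "--model", model, "--tokenizer", tokenizer]
    if voice:
        argv += ["--voice", voice]        # realtime-0.5B path
    for wav in ref_audio:
        argv += ["--ref-audio", wav]      # 1.5B path, one flag per speaker
    argv += ["--text", text, "--out", out]
    return argv


turns = [(0, "Hello, I am Carter."), (1, "And I am Emma."), (0, "Nice to meet you.")]
refs = ["voice-carter.wav", "voice-emma.wav"]
check_speakers(turns, refs)
argv = build_tts_argv(
    model="models/vibevoice-1.5B-q8_0.gguf",
    tokenizer="models/tokenizer.gguf",
    text=format_dialog(turns),
    out="dialog.wav",
    ref_audio=refs,
)
print(argv)
```

Passing the list to `subprocess.run(argv, check=True)` avoids shell quoting problems with the multi-line --text value.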

Quickstart - ASR

# ASR model (~14 GB safetensors → ~33 GB fp32 gguf - needs lots of disk)
hf download microsoft/VibeVoice-ASR --local-dir models/vibevoice-asr
python scripts/convert_vibevoice_to_gguf.py \
  --src models/vibevoice-asr \
  --out models/vibevoice-asr.gguf

./build/bin/vibevoice-cli asr \
  --model models/vibevoice-asr.gguf \
  --tokenizer models/tokenizer.gguf \
  --audio my-clip.wav
# -> [{"Start":0.0,"End":6.0,"Speaker":0,"Content":"..."}]
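
The JSON array is straightforward to post-process. As a small sketch, the snippet below sums talk time per diarized speaker. It relies only on the Start, End, and Speaker fields shown above; the two-segment sample and the helper name `talk_time_by_speaker` are made up for illustration.

```python
import json
from collections import defaultdict


def talk_time_by_speaker(asr_json):
    """Sum segment durations (End - Start, in seconds) per speaker id."""
    totals = defaultdict(float)
    for seg in json.loads(asr_json):
        totals[seg["Speaker"]] += seg["End"] - seg["Start"]
    return dict(totals)


# Hypothetical two-speaker output in the same shape the CLI prints.
sample = (
    '[{"Start":0.0,"End":6.0,"Speaker":0,"Content":"first"},'
    '{"Start":6.0,"End":8.5,"Speaker":1,"Content":"second"}]'
)
print(talk_time_by_speaker(sample))  # {0: 6.0, 1: 2.5}
```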

Benchmarks

Long-form ASR on the multi-speaker Microsoft VibeVoice samples, through vibevoice-cli asr with the Q4_K-quantized 7B model. RTF is inference time / audio duration; lower is faster.

| Sample | Audio duration | Backend / hardware | Model | Load time | Inference | RTF |
|---|---|---|---|---|---|---|
| 2p_argument | 68.5 s | CPU - AMD Ryzen 9950X3D | Q4_K | 5.9 s | 150.4 s | 2.195 |
| 2p_argument | 68.5 s | CUDA - NVIDIA GB10 | Q4_K | 2.2 s | 28.0 s | 0.408 |
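
The RTF column follows directly from the definition above. The sketch below recomputes it from the rounded times in the table, so the third decimal can differ slightly from the published figure. `real_time_factor` is an illustrative name.

```python
def real_time_factor(inference_s, audio_s):
    """RTF = inference time / audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return inference_s / audio_s


# Inference times and audio duration taken from the benchmark table.
for backend, inference_s in [("CPU", 150.4), ("CUDA", 28.0)]:
    rtf = real_time_factor(inference_s, 68.5)
    print(f"{backend}: RTF {rtf:.2f}, {1 / rtf:.2f}x real time")
```

Load time is excluded, matching the table, where it is reported in its own column.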

Sample transcripts and timing are produced by vibevoice-cli asr ... 2>&1 | grep "asr: timing" (timing breakdown was added in the same release as the backend dispatch). Reproduce locally:

hf download mudler/vibevoice.cpp-models --local-dir models
ffmpeg -i sample.mp3 -ac 1 -ar 24000 sample.wav
VIBEVOICE_BACKEND=cuda ./build/bin/vibevoice-cli asr \
    --model models/vibevoice-asr-q4_k.gguf \
    --tokenizer models/tokenizer.gguf \
    --audio sample.wav --max-new-tokens 8192
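
The ffmpeg line converts the sample to mono 24 kHz before transcription. The README does not say whether the CLI rejects other formats, so treat that requirement as an assumption. Under it, a quick pre-flight check with Python's standard `wave` module might look like this. It also returns the duration, which is the denominator of the RTF figure. The helper names are hypothetical.

```python
import io
import wave


def describe_wav(source):
    """Return (channels, sample_rate, duration_s) for a PCM WAV path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnchannels(), rate, wav.getnframes() / rate


def is_mono_24k(source):
    """True if the file matches what `ffmpeg -ac 1 -ar 24000` produces."""
    channels, rate, _ = describe_wav(source)
    return channels == 1 and rate == 24000


# Demo on an in-memory WAV: one second of 16-bit silence, mono, 24 kHz.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(24000)
    out.writeframes(b"\x00\x00" * 24000)
buf.seek(0)
print(is_mono_24k(buf))  # True
```

For a real file, pass the path instead of a buffer, for example `is_mono_24k("sample.wav")`.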

Tests

ctest --test-dir build --output-on-failure   # 21 ctest targets

For the real-weight tests:

VIBEVOICE_MODEL=models/vibevoice-realtime-0.5B.gguf \
VIBEVOICE_TOKENIZER=models/tokenizer.gguf \
  ctest --test-dir build --output-on-failure -j 2

Embedding from Go (purego)

vibevoice.cpp ships a flat C ABI in include/vibevoice_capi.h designed for dlopen / purego.RegisterLibFunc consumers. It mirrors the layout LocalAI's qwen3-tts-cpp Go backend uses:

int  vv_capi_load(const char* tts_model, const char* asr_model,
                  const char* tokenizer, const char* voice, int n_threads);
// `voice_path` is for realtime-0.5B, `ref_audio_path` is for 1.5B
// (runtime voice cloning); pass NULL for whichever the loaded model
// doesn't need.
int  vv_capi_tts(const char* text, const char* voice_path, const char* ref_audio_path,
                 const char* dst_wav, int steps, float cfg, int max_speech_frames,
                 uint32_t seed);
int  vv_capi_asr(const char* src_wav, char* out_json, size_t out_capacity,
                 int max_new_tokens);
void vv_capi_unload(void);

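Build the shared library with cmake -DVIBEVOICE_SHARED=ON, then bind the exported symbols from Go:

import "github.com/ebitengine/purego"

// Function variables that purego fills in; the signatures mirror
// include/vibevoice_capi.h.
var (
    Load     func(tts, asr, tok, voice string, threads int) int
    TTS      func(text, voice, refAudio, dst string, steps int, cfg float32, maxFrames int, seed uint32) int
    ASR      func(src string, out []byte, outCap uint64, maxTok int) int
    Unload   func()
)

// bindVibeVoice (an illustrative name) opens the shared library and
// registers the four C ABI entry points. Check the Dlopen error: a wrong
// path would otherwise surface later as a failed symbol lookup.
func bindVibeVoice(path string) error {
    lib, err := purego.Dlopen(path, purego.RTLD_NOW|purego.RTLD_GLOBAL)
    if err != nil {
        return err
    }
    purego.RegisterLibFunc(&Load,   lib, "vv_capi_load")
    purego.RegisterLibFunc(&TTS,    lib, "vv_capi_tts")
    purego.RegisterLibFunc(&ASR,    lib, "vv_capi_asr")
    purego.RegisterLibFunc(&Unload, lib, "vv_capi_unload")
    return nil
}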

Why

VibeVoice's official runtime is Python + Transformers + a vLLM plugin. vibevoice.cpp provides:

  • A native CPU runtime with no Python at inference time
  • Free CUDA / Metal / Vulkan support via ggml backends
  • A single .gguf weight file + a single binary
  • A flat C ABI (include/vibevoice_capi.h) for embedding via dlopen / purego / cgo

Build

git clone --recursive https://github.com/mudler/vibevoice.cpp
cd vibevoice.cpp
cmake -B build -DVIBEVOICE_BUILD_TESTS=ON
cmake --build build -j
ctest --test-dir build --output-on-failure

Layout

See docs/ and the project plan for the full architecture.

Author

Ettore Di Giacinto (@mudler) - also the maintainer of LocalAI. PRs welcome.

License

MIT - see LICENSE. Copyright © 2026 Ettore Di Giacinto. The model weights remain under their upstream license (microsoft/VibeVoice).
