Local Voice Quality Enhancement — a compact neural model for joint acoustic echo cancellation (AEC), noise suppression, and dereverberation of 16 kHz speech, designed to run on commodity CPUs in real time.
- 1.3 M parameters (~5 MB F32)
- ~1.56 ms per 16 ms frame on Zen4 (4 threads) — ≈10× realtime
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via GGML; PyTorch reference included for verification and research
Try it: https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo.
LocalVQE is a derivative of DeepVQE
(Indenbom et al., Interspeech 2023) —
smaller, GGML-native, and tuned for streaming CPU inference. The
architecture, training recipe, and reproducibility checks are
documented in the technical report under
paper/; this README covers building and running the published
weights only.
Picture a video call from a laptop. Your microphone picks up three things alongside your voice:
- The remote participant's voice, played back through your speakers and caught again by your mic — this is the echo. Without cancellation they hear themselves a fraction of a second later.
- Your own voice bouncing off walls, desk, and monitor before reaching the mic — this is reverberation, the "tunnel" or "bathroom" sound that makes you feel far away from the listener.
- A fan, keyboard clatter, a dog barking, or traffic outside — plain background noise.
LocalVQE removes all three in a single causal pass, frame by frame, on the CPU, so only your voice reaches the far end.
Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per frame and remain a strong baseline when the acoustic path is benign. LocalVQE is interesting when you want:
- Robustness to non-linear echo paths (small loudspeakers, handheld devices, plastic laptop chassis) where linear AEC leaves residual echo.
- Non-stationary noise suppression (babble, keyboards, fans changing speed) that energy-based noise estimators struggle with.
- One model, many conditions — no per-device tuning of step sizes, forgetting factors, or VAD thresholds.
- A single deterministic causal pass — no double-talk detector, no adaptation state that can diverge.
The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE ~1–2 ms/frame. On anything larger than a microcontroller that's still a small fraction of a real-time budget.
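As a rough check on that budget claim, the arithmetic can be sketched as below. The per-frame costs are the approximate figures quoted above, not measurements, and `budget_fraction` is an illustrative helper name rather than anything in the repository.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def budget_fraction(cost_ms: float) -> float:
    """Fraction of one hop's real-time budget consumed by `cost_ms` of compute."""
    hop_budget_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE_HZ  # 16 ms per hop
    return cost_ms / hop_budget_ms


# Approximate per-frame costs taken from the paragraph above.
for label, cost_ms in [
    ("classical DSP stack", 0.1),
    ("LocalVQE (low end)", 1.0),
    ("LocalVQE (high end)", 2.0),
]:
    print(f"{label}: {budget_fraction(cost_ms):.1%} of the hop budget")
```

With these inputs LocalVQE lands at roughly 6 to 13 percent of the 16 ms hop budget, against well under 1 percent for the classical stack.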
Microsoft never released DeepVQE — no weights, no reference implementation, no streaming runtime. We re-implemented it from the paper as a GGML graph at richiejp/deepvqe-ggml (the full-width ~7.5 M-parameter version) before starting LocalVQE. LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters (~5 MB F32), small enough to run on commodity CPUs in real time.
Pre-trained weights are published on Hugging Face at LocalAI-io/LocalVQE:
| File | Description |
|---|---|
| localvqe-v1.2-1.3M-f32.gguf | F32 GGUF: what the C++ engine loads. |
| localvqe-v1.2-1.3M.pt | PyTorch checkpoint for verification, ablation, and downstream research. |
| localvqe-v1.1-1.3M-f32.gguf | Previous release. |
| localvqe-v1-1.3M-f32.gguf | Original release. |
The current release is v1.2. It doubles the supported delay window from 500 ms to 1 second at a per-hop performance cost of roughly 20%. It also avoids oversuppressing voices that sit near the noise floor.
All timings are per hop. At 16 kHz with a 256-sample hop, the real-time
budget is 16 ms per hop. Each hop is a full
ggml_backend_graph_compute. Run any of these locally with the
bench-run cmake target (see Benchmark below). Each row covers
30 iterations × 625 hops per iteration = 18 750 hops.
| Hardware | Backend | Threads | Hop p50 | Hop p99 | Hop max | RT factor |
|---|---|---|---|---|---|---|
| Ryzen 9 7900 (Zen4 desktop) | CPU | 1 | 4.15 ms | 4.53 ms | 6.23 ms | 3.83× |
| Ryzen 9 7900 (Zen4 desktop) | CPU | 4 | 1.56 ms | 1.73 ms | 4.57 ms | 10.16× |
| Ryzen 9 7900 (Zen4 desktop) | CPU | 8 | 1.89 ms | 2.15 ms | 6.91 ms ‡ | 8.58× |
| Ryzen 9 7900 (Zen4 desktop) | CPU | 16 | 2.12 ms | 2.17 ms | 6.43 ms ‡ | 7.54× |
| Ryzen 9 7900 + RADV iGPU (Raphael) | Vulkan | — | 4.88 ms | 5.06 ms | 6.24 ms | 3.28× |
| Ryzen 9 7900 + RTX 5070 Ti (dGPU) | Vulkan | — | 1.79 ms | 3.42 ms | 5.42 ms | 8.58× |
The wider echo-search window costs ~20–25 % per hop on CPU vs v1.1. Re-runs of v1.2 on Apple M4 and Alder Lake aren't published yet. The v1.1 rows below remain representative in shape, so expect the same multiplier on those platforms.
| Hardware | Backend | Threads | Hop p50 | Hop p99 | Hop max | RT factor |
|---|---|---|---|---|---|---|
| Ryzen 9 7900 (Zen4 desktop) | CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms | 4.7× |
| Ryzen 9 7900 (Zen4 desktop) | CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms | 7.7× |
| Ryzen 9 7900 (Zen4 desktop) | CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms ‡ | 12.0× |
| Ryzen 9 7900 + RADV iGPU (Raphael) | Vulkan | — | 4.43 ms | 4.62 ms | 5.07 ms | 3.60× |
| Ryzen 9 7900 + RTX 5070 Ti (dGPU) | Vulkan | — | 1.79 ms | 3.41 ms | 4.14 ms | 8.63× |
| Apple M4 (4P + 6E, macOS 25.3) | CPU | 1 | 2.98 ms | 3.16 ms | 19.11 ms ‡ | 5.4× |
| Apple M4 (4P + 6E, macOS 25.3) | CPU | 2 | 1.82 ms | 1.93 ms | 3.17 ms | 8.8× |
| Apple M4 (4P + 6E, macOS 25.3) | CPU | 4 | 1.11 ms | 1.81 ms | 10.41 ms ‡ | 14.4× |
| Core i5-14500 (Alder Lake-S) | CPU | 1 | 3.25 ms | 3.53 ms | 6.73 ms | 4.93× |
| Core i5-14500 (Alder Lake-S) | CPU | 2 | 2.55 ms | 2.81 ms | 5.20 ms | 6.23× |
| Core i5-14500 (Alder Lake-S) | CPU | 3 | 2.26 ms | 3.09 ms | 3.85 ms | 7.06× |
| Core i5-14500 (Alder Lake-S) | CPU | 4 | 2.02 ms | 2.89 ms | 3.59 ms | 7.79× |
| Core i5-14500 + Arc A770 (dGPU) | Vulkan | — | 10.90 ms | 12.00 ms | 13.38 ms | 1.48× |
| Core i5-14500 + UHD 770 (iGPU) | Vulkan | — | 9.02 ms | 11.77 ms | 17.93 ms | 1.74× |
Adding cores hits diminishing returns quickly: the model is small enough that thread-launch and synchronisation overhead start to dominate beyond ≈4 threads on these CPUs. The Zen4 v1.2 sweep shows it plainly — the 1→4 thread step gives a 2.66× speedup, but 4→8 is a regression and 8→16 worse still. For deployment, four threads is the sweet spot on Zen4.
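The scaling figures quoted above follow directly from the p50 column. A minimal sketch, with the p50 values copied by hand from the Zen4 v1.2 rows (the dictionary and helper name are illustrative, not part of the bench tool):

```python
# p50 hop times in ms, copied from the Zen4 v1.2 CPU rows of the table above.
P50_MS = {1: 4.15, 4: 1.56, 8: 1.89, 16: 2.12}


def speedup(from_threads: int, to_threads: int) -> float:
    """Ratio of p50 hop times; above 1.0 means the second thread count is faster."""
    return P50_MS[from_threads] / P50_MS[to_threads]


for a, b in [(1, 4), (4, 8), (8, 16)]:
    print(f"{a} -> {b} threads: {speedup(a, b):.2f}x")
```

Only the 1 to 4 step comes out above 1.0 (about 2.66×); both later steps fall below 1.0, which is the regression described above.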
‡ Outliers are single hops early in the first iteration (cold caches); p99 is representative of steady-state.
Vulkan p50/p95/p99 are typically tight, but worst-case single-hop latency on a shared desktop is sensitive to external GPU clients (display compositor, browser). On a dedicated embedded device with no compositor contending for the queue, expect the quieter end of the range.
The bench binary prints the top-10 slowest hops with
(iteration, hop-in-iteration) coordinates so you can check whether
outliers cluster at post-localvqe_reset() boundaries (cold path)
or scatter through the stream (external contention). In practice we
see the latter.
Full 800-clip eval on the ICASSP 2022 AEC Challenge blind test set (real recordings, not synthetic mixes):
| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|---|---|---|---|---|---|
| doubletalk | 115 | 4.72 | 2.37 | 8.4 dB | 2.83 |
| doubletalk-with-movement | 185 | 4.65 | 2.30 | 8.1 dB | 2.79 |
| farend-singletalk | 107 | 3.78 | 4.91 | 45.7 dB | 1.80 |
| farend-singletalk-with-movement | 193 | 4.12 | 4.96 | 40.6 dB | 1.75 |
| nearend-singletalk | 200 | 5.00 | 4.16 | 2.1 dB | 3.17 |
v1.2 vs v1.1 deltas: echo MOS +0.80 / +0.72 on far-end single-talk (FE-ST) and FE-ST-with-movement (the main release goal), near-end deg MOS +0.11, doubletalk (DT) echo MOS roughly unchanged. FE-ST-with-movement ERLE drops 4.4 dB: the v1.2 model is less aggressive when the echo path is moving, which trades raw cancellation for fewer near-end gating artefacts.
- AECMOS (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC quality predictor. "Echo" rates how well echo was removed; "degradation" rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
- Blind ERLE is 10·log10(E[mic²] / E[enh²]). Only meaningful on far-end single-talk where the input is echo-only; on scenes with active near-end speech it understates echo removal because both numerator and denominator are dominated by speech.
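The blind ERLE definition above can be sketched in a few lines of Python. This is an illustration of the formula only, not the evaluation script behind the table; the function name and the small `eps` guard against division by zero are assumptions.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """Return 10*log10(E[mic^2] / E[enh^2]) for two equal-length sample sequences."""
    if len(mic) != len(enh) or len(mic) == 0:
        raise ValueError("mic and enh must be non-empty and the same length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    # eps (an assumption of this sketch) keeps the ratio finite for silent output.
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Toy example: the enhanced signal has one tenth of the mic amplitude.
mic = [1.0, -1.0] * 50
enh = [0.1, -0.1] * 50
print(f"{blind_erle_db(mic, enh):.1f} dB")
```

A tenfold amplitude reduction is a hundredfold power reduction, so the toy input gives about 20 dB.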
PyTorch checkpoint integrity (SHA256):
ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77 localvqe-v1.2-1.3M.pt
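One way to check a downloaded checkpoint against that digest is sketched below, using only the Python standard library. The helper names are illustrative. The expected digest is the one listed above.

```python
import hashlib

# Digest published above for localvqe-v1.2-1.3M.pt.
EXPECTED_SHA256 = "ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 matches the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_checkpoint("localvqe-v1.2-1.3M.pt")` should return `True` for an intact download.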
ggml/ C++ streaming inference (GGML graph, CLI, C API, tests)
pytorch/ PyTorch reference implementation (model definition only)
ARCHITECTURE.md
CITATION.cff
LICENSE
flake.nix
Requires CMake ≥ 3.20 and a C++17 compiler. A Nix flake is provided:
git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE
# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

Binaries land in ggml/build/bin/. The CPU build produces multiple
libggml-cpu-*.so variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
Keep the binaries and .so files together.
Add -DLOCALVQE_VULKAN=ON to the configure step. This composes with the
CPU build — an additional libggml-vulkan.so is produced in
ggml/build/bin/ and the runtime loader picks it up when a Vulkan ICD is
present, otherwise it falls back to the CPU variants.
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)

The Nix flake's dev shell already includes vulkan-loader,
vulkan-headers, and shaderc. Without Nix, install the equivalents
from your distro (Debian: libvulkan-dev vulkan-headers glslc/shaderc).
./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
--in-wav mic.wav ref.wav \
--out-wav enhanced.wav

Expects 16 kHz mono PCM for both mic and far-end reference.
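Before running the CLI it can help to confirm that both inputs meet that format. The sketch below uses Python's standard `wave` module. The function name is hypothetical, and it checks only sample rate and channel count because the text above does not state a required bit depth.

```python
import wave


def check_wav_format(path, expected_rate=16_000, expected_channels=1):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if wav.getnchannels() != expected_channels:
            problems.append(
                f"{wav.getnchannels()} channel(s), expected {expected_channels}"
            )
    return problems
```

For example, `check_wav_format("mic.wav")` and `check_wav_format("ref.wav")` should both return an empty list before the files are passed to `--in-wav`.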
The bench-run cmake target is the turnkey path: it builds bench,
downloads the released F32 model and a doubletalk mic/ref WAV pair from
HuggingFace into ggml/build/bench_assets/, and runs the benchmark.
# Configure once (Vulkan optional but recommended for GPU runs)
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
# Discover backends + device indices
cmake --build ggml/build --target bench-list-devices
# Run on the default backend (CPU device 0, 10 iterations)
cmake --build ggml/build --target bench-run

To pick a specific backend or device, set the cache variables at configure time and rebuild the target:
# Vulkan device 0 (e.g. dGPU) with 30 iterations
cmake -S ggml -B ggml/build -DBENCH_BACKEND=Vulkan -DBENCH_DEVICE=0 -DBENCH_ITERS=30
cmake --build ggml/build --target bench-run
# Vulkan device 1 (e.g. iGPU)
cmake -S ggml -B ggml/build -DBENCH_DEVICE=1
cmake --build ggml/build --target bench-run

Sweeping every backend/device on the box is just a shell loop over the
indices bench-list-devices printed:
for dev in 0 1; do
cmake -S ggml -B ggml/build -DBENCH_BACKEND=Vulkan -DBENCH_DEVICE=$dev
cmake --build ggml/build --target bench-run
done

Or invoke the binary directly against your own WAV pair:
./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
--backend Vulkan --device 0 \
--in-wav mic.wav ref.wav --iters 10 --profile

To build a shared library instead:

cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)

Produces liblocalvqe.so with the API in ggml/localvqe_api.h. See
ggml/example_purego_test.go for a Go / purego integration.
ggml/tests/test_regression.cpp is an end-to-end check: it runs
localvqe_process_f32 on a fixed seeded input through each published
.gguf and compares against a committed reference output, mirroring
the PyTorch suite under pytorch/tests/. Build, fetch both released
GGUFs from HuggingFace, and run via CTest:
cmake --build ggml/build --target test_regression regression-assets
ctest --test-dir ggml/build --output-on-failure

regression-assets reuses the same SHA256-verified download path as
bench-assets. Missing GGUFs make the corresponding test entry SKIP
rather than fail, so CI without network access still runs cleanly.
To refresh a reference output after an intentional graph change:
python ggml/tests/regenerate_fixtures.py \
--gguf ggml/build/bench_assets/localvqe-v1.2-1.3M-f32.gguf

Calibrated Q4_K / Q8_0 weights are not yet published. The quantize
tool in the C++ build can produce GGUF variants from the F32 reference
for experimentation:
./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8.gguf Q8_0

Expect end-to-end quality loss until proper per-tensor selection and calibration have been worked through.
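To see where that loss comes from, here is a minimal sketch of Q8_0-style block quantization: weights are grouped into blocks of 32, each block shares one scale, and values are rounded to int8. It illustrates the general technique and is not the code the quantize tool runs. In particular, it keeps the scale in full precision, which is a simplification of this sketch.

```python
import random

BLOCK_SIZE = 32  # Q8_0-style: 32 weights share one scale
INT8_MAX = 127


def q8_0_style_roundtrip(weights):
    """Quantize each block to int8 with a shared scale, then dequantize to floats."""
    restored = []
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        peak = max(abs(w) for w in block)
        scale = peak / INT8_MAX if peak > 0 else 1.0
        for w in block:
            q = max(-INT8_MAX, min(INT8_MAX, round(w / scale)))
            restored.append(q * scale)
    return restored


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(1024)]  # synthetic stand-in weights
error = max_abs_error(weights, q8_0_style_roundtrip(weights))
print(f"worst-case rounding error: {error:.2e}")
```

The per-weight error is bounded by half a quantization step in each block, so a block with one large outlier rounds its small weights more coarsely. That is one reason per-tensor selection matters.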
pytorch/ contains the model definition used to train and export the
weights. It's provided for verification, ablation, and downstream research
— not for end-user inference, which should go through the GGML build.
cd pytorch
pip install -r requirements.txt
python -c "
import yaml, torch
from localvqe.model import LocalVQE
cfg = yaml.safe_load(open('configs/default.yaml'))
model = LocalVQE(**cfg['model'], n_freqs=cfg['audio']['n_freqs'])
print(sum(p.numel() for p in model.parameters()))
"

If you use LocalVQE in academic work, please cite the repository via the
CITATION.cff file at the root — GitHub renders a "Cite this repository"
button that produces APA and BibTeX entries automatically.
For a DOI, we recommend citing a specific release via Zenodo, which mints a DOI per GitHub release. Please also cite the upstream DeepVQE paper:
@inproceedings{indenbom2023deepvqe,
title = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
author = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
and Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v{z}}vin, Jegor and Cutler, Ross},
booktitle = {Interspeech},
year = {2023},
doi = {10.21437/Interspeech.2023-2176}
}

Published weights are trained on data from the ICASSP 2023 Deep Noise Suppression Challenge (Microsoft, CC BY 4.0) and fine-tuned on the ICASSP 2022/2023 Acoustic Echo Cancellation Challenge.
Training data was filtered by DNSMOS perceived-quality scores, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate or distort such signals and must not be relied upon for emergency call or safety-critical applications.
Apache License 2.0 — see LICENSE.