LocalVQE

Local Voice Quality Enhancement — a compact neural model for joint acoustic echo cancellation (AEC), noise suppression, and dereverberation of 16 kHz speech, designed to run on commodity CPUs in real time.

1.3 M parameters (~5 MB F32)
~1.56 ms per 16 ms frame on Zen4 (4 threads) — ≈10× realtime
Causal, streaming: 256-sample hop, 16 ms algorithmic latency
F32 reference inference in C++ via GGML; PyTorch reference included for verification and research

Try it: https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo

LocalVQE is a derivative of DeepVQE (Indenbom et al., Interspeech 2023) — smaller, GGML-native, and tuned for streaming CPU inference. The architecture, training recipe, and reproducibility checks are documented in the technical report under paper/; this README covers building and running the published weights only.

A concrete example

Picture a video call from a laptop. Your microphone picks up three things alongside your voice:

The remote participant's voice, played back through your speakers and caught again by your mic — this is the echo. Without cancellation they hear themselves a fraction of a second later.
Your own voice bouncing off walls, desk, and monitor before reaching the mic — this is reverberation, the "tunnel" or "bathroom" sound that makes you feel far away from the listener.
A fan, keyboard clatter, a dog barking, or traffic outside — plain background noise.

LocalVQE removes all three in a single causal pass, frame by frame, on the CPU, so only your voice reaches the far end.

Why this, and not a classical AEC/NS stack?

Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per frame and remain a strong baseline when the acoustic path is benign. LocalVQE is interesting when you want:

Robustness to non-linear echo paths (small loudspeakers, handheld devices, plastic laptop chassis) where linear AEC leaves residual echo.
Non-stationary noise suppression (babble, keyboards, fans changing speed) that energy-based noise estimators struggle with.
One model, many conditions — no per-device tuning of step sizes, forgetting factors, or VAD thresholds.
A single deterministic causal pass — no double-talk detector, no adaptation state that can diverge.

The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE ~1–2 ms/frame. On anything larger than a microcontroller that's still a small fraction of a real-time budget.

Why this, and not DeepVQE?

Microsoft never released DeepVQE — no weights, no reference implementation, no streaming runtime. We re-implemented it from the paper as a GGML graph at richiejp/deepvqe-ggml (the full-width ~7.5 M-parameter version) before starting LocalVQE. LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters (~5 MB F32), small enough to run on commodity CPUs in real time.
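As a quick sanity check on the sizes quoted above, the F32 footprint follows directly from the parameter count at 4 bytes per weight. The sketch below is only that arithmetic; the function name is illustrative and not part of the repository, and GGUF metadata overhead is ignored.

```python
BYTES_PER_F32 = 4  # one 32-bit float per parameter


def f32_size_mb(n_params: float) -> float:
    """Approximate weight storage in MB (10**6 bytes) for F32 parameters."""
    return n_params * BYTES_PER_F32 / 1e6


localvqe_mb = f32_size_mb(1.3e6)  # LocalVQE, ~1.3 M parameters
deepvqe_mb = f32_size_mb(7.5e6)   # full-width DeepVQE re-implementation, ~7.5 M

print(f"LocalVQE F32 weights: ~{localvqe_mb:.1f} MB")
print(f"DeepVQE  F32 weights: ~{deepvqe_mb:.1f} MB")
```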

Model Weights

Pre-trained weights are published on Hugging Face at LocalAI-io/LocalVQE:

File	Description
`localvqe-v1.2-1.3M-f32.gguf`	F32 GGUF — what the C++ engine loads.
`localvqe-v1.2-1.3M.pt`	PyTorch checkpoint — for verification, ablation, and downstream research.
`localvqe-v1.1-1.3M-f32.gguf`	Previous release.
`localvqe-v1-1.3M-f32.gguf`	Original release.

The current release is v1.2. It doubles the echo-search (delay) window from 512 ms to 1024 ms at a per-hop CPU cost of roughly 20–25 %. It also avoids over-suppressing voices that sit close to the noise floor.

Streaming latency

All figures are per hop. At 16 kHz with a 256-sample hop, each hop has a 16 ms real-time budget. Each hop is one full ggml_backend_graph_compute call. Run any of these locally with the bench-run cmake target (see Benchmark below). Each row covers 30 iterations × 625 hops per iteration = 18 750 hops.
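The budget arithmetic behind these numbers can be sketched in a few lines. The helper names are illustrative only. Note that the RT factor column in the tables is not exactly 16 ms divided by p50, so it is presumably derived from a different statistic such as the mean hop time; that is an assumption, not something stated by the benchmark.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_budget_ms(sample_rate_hz: int = SAMPLE_RATE_HZ,
                  hop_samples: int = HOP_SAMPLES) -> float:
    """Wall-clock duration of one hop of audio, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(hop_compute_ms: float, budget_ms: float) -> float:
    """How many times faster than real time a hop is processed."""
    return budget_ms / hop_compute_ms


budget = hop_budget_ms()                        # 256 / 16000 s = 16 ms
hops_per_row = 30 * 625                         # iterations x hops per iteration
audio_seconds_per_iter = 625 * budget / 1000.0  # audio covered by one iteration
window_hops_v12 = 1024 / budget                 # 1024 ms echo-search window in hops

print(f"budget per hop: {budget} ms")
print(f"hops per row: {hops_per_row}")
print(f"audio per iteration: {audio_seconds_per_iter} s")
print(f"v1.2 echo-search window: {window_hops_v12:.0f} hops")
print(f"RT factor at 1.56 ms/hop: {realtime_factor(1.56, budget):.2f}x")
```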

v1.2 (current — 1024 ms echo-search window)

Hardware	Backend	Threads	Hop p50	Hop p99	Hop max	RT factor
Ryzen 9 7900 (Zen4 desktop)	CPU	1	4.15 ms	4.53 ms	6.23 ms	3.83×
Ryzen 9 7900 (Zen4 desktop)	CPU	4	1.56 ms	1.73 ms	4.57 ms	10.16×
Ryzen 9 7900 (Zen4 desktop)	CPU	8	1.89 ms	2.15 ms	6.91 ms ‡	8.58×
Ryzen 9 7900 (Zen4 desktop)	CPU	16	2.12 ms	2.17 ms	6.43 ms ‡	7.54×
Ryzen 9 7900 + RADV iGPU (Raphael)	Vulkan	—	4.88 ms	5.06 ms	6.24 ms	3.28×
Ryzen 9 7900 + RTX 5070 Ti (dGPU)	Vulkan	—	1.79 ms	3.42 ms	5.42 ms	8.58×

The wider echo-search window costs ~20–25 % per hop on CPU compared with v1.1. Re-runs of v1.2 on Apple M4 and Alder Lake are not published yet; the v1.1 rows below remain representative of the overall shape, so expect a similar multiplier.

v1.1 (previous — 512 ms echo-search window)

Hardware	Backend	Threads	Hop p50	Hop p99	Hop max	RT factor
Ryzen 9 7900 (Zen4 desktop)	CPU	1	3.40 ms	3.57 ms	5.06 ms	4.7×
Ryzen 9 7900 (Zen4 desktop)	CPU	2	2.07 ms	2.25 ms	3.65 ms	7.7×
Ryzen 9 7900 (Zen4 desktop)	CPU	4	1.32 ms	1.57 ms	6.91 ms ‡	12.0×
Ryzen 9 7900 + RADV iGPU (Raphael)	Vulkan	—	4.43 ms	4.62 ms	5.07 ms	3.60×
Ryzen 9 7900 + RTX 5070 Ti (dGPU)	Vulkan	—	1.79 ms	3.41 ms	4.14 ms	8.63×
Apple M4 (4P + 6E, macOS 25.3)	CPU	1	2.98 ms	3.16 ms	19.11 ms ‡	5.4×
Apple M4 (4P + 6E, macOS 25.3)	CPU	2	1.82 ms	1.93 ms	3.17 ms	8.8×
Apple M4 (4P + 6E, macOS 25.3)	CPU	4	1.11 ms	1.81 ms	10.41 ms ‡	14.4×
Core i5-14500 (Alder Lake-S)	CPU	1	3.25 ms	3.53 ms	6.73 ms	4.93×
Core i5-14500 (Alder Lake-S)	CPU	2	2.55 ms	2.81 ms	5.20 ms	6.23×
Core i5-14500 (Alder Lake-S)	CPU	3	2.26 ms	3.09 ms	3.85 ms	7.06×
Core i5-14500 (Alder Lake-S)	CPU	4	2.02 ms	2.89 ms	3.59 ms	7.79×
Core i5-14500 + Arc A770 (dGPU)	Vulkan	—	10.90 ms	12.00 ms	13.38 ms	1.48×
Core i5-14500 + UHD 770 (iGPU)	Vulkan	—	9.02 ms	11.77 ms	17.93 ms	1.74×

Adding cores hits diminishing returns quickly: the model is small enough that thread-launch and synchronisation overhead start to dominate beyond ≈4 threads on these CPUs. The Zen4 v1.2 sweep shows it plainly — the 1→4 thread step gives a 2.66× speedup, but 4→8 is a regression and 8→16 worse still. For deployment, four threads is the sweet spot on Zen4.
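The scaling claim can be reproduced from the p50 column of the Zen4 v1.2 rows. This sketch simply divides the single-thread p50 by each multi-thread p50 and reports parallel efficiency (speedup divided by thread count); the dictionary and function names are illustrative.

```python
# Hop p50 in ms for the Zen4 v1.2 CPU rows, copied from the table above.
P50_MS = {1: 4.15, 4: 1.56, 8: 1.89, 16: 2.12}


def speedup(base_threads: int, threads: int) -> float:
    """Ratio of p50 hop times; values below 1.0 mean a regression."""
    return P50_MS[base_threads] / P50_MS[threads]


for threads in sorted(P50_MS):
    s = speedup(1, threads)
    print(f"{threads:>2} threads: {s:.2f}x vs 1 thread, "
          f"efficiency {s / threads:.0%}")

# Thread count with the lowest median hop time.
best = min(P50_MS, key=P50_MS.get)
print(f"lowest p50 at {best} threads")
```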

‡ Outliers are single hops early in the first iteration (cold caches); p99 is representative of steady-state.

Vulkan p50/p95/p99 are typically tight, but worst-case single-hop latency on a shared desktop is sensitive to external GPU clients (display compositor, browser). On a dedicated embedded device with no compositor contending for the queue, expect the quieter end of the range.

The bench binary prints the top-10 slowest hops with (iteration, hop-in-iteration) coordinates so you can check whether outliers cluster at post-localvqe_reset() boundaries (cold path) or scatter through the stream (external contention). In practice we see the latter.
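To apply the same analysis to your own timings, a minimal sketch follows. It assumes you have a flat list of per-hop times in milliseconds and know the number of hops per iteration; it does not parse the bench binary's output, and the nearest-rank percentile used here may differ from whatever method the bench uses.

```python
import math


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of an ascending list (pct is an integer 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def summarize(hop_ms, hops_per_iter, top_n=10):
    """Return p50/p99/max plus the slowest hops as (iteration, hop, ms)."""
    ordered = sorted(hop_ms)
    slowest = sorted(range(len(hop_ms)),
                     key=lambda i: hop_ms[i], reverse=True)[:top_n]
    return {
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        "slowest": [(i // hops_per_iter, i % hops_per_iter, hop_ms[i])
                    for i in slowest],
    }


# Synthetic timings: 3 iterations of 100 hops, with one spike at the very
# start (cold path) and one in the middle of the stream (contention).
timings = [1.5] * 300
timings[0] = 6.0     # iteration 0, hop 0
timings[250] = 4.0   # iteration 2, hop 50

stats = summarize(timings, hops_per_iter=100, top_n=2)
print(stats)
```

Spikes whose hop index is at or near 0 point at the cold path; spikes at arbitrary hop indices point at external contention.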

Validation Results

Full 800-clip eval on the ICASSP 2022 AEC Challenge blind test set (real recordings, not synthetic mixes):

Scenario	n	AECMOS echo ↑	AECMOS deg ↑	blind ERLE ↑	DNSMOS OVRL ↑
doubletalk	115	4.72	2.37	8.4 dB	2.83
doubletalk-with-movement	185	4.65	2.30	8.1 dB	2.79
farend-singletalk	107	3.78	4.91	45.7 dB	1.80
farend-singletalk-with-movement	193	4.12	4.96	40.6 dB	1.75
nearend-singletalk	200	5.00	4.16	2.1 dB	3.17

v1.2 vs v1.1 deltas: echo MOS +0.80 / +0.72 on farend-singletalk (FE-ST) and FE-ST-with-movement (the main release goal), near-end deg MOS +0.11, doubletalk (DT) echo MOS roughly unchanged. FE-ST-with-movement ERLE drops 4.4 dB: the v1.2 model is less aggressive when the echo path is moving, which trades raw cancellation for fewer near-end gating artefacts.

AECMOS (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC quality predictor. "Echo" rates how well echo was removed; "degradation" rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
Blind ERLE is 10·log10(E[mic²] / E[enh²]). Only meaningful on far-end single-talk where the input is echo-only; on scenes with active near-end speech it understates echo removal because both numerator and denominator are dominated by speech.
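The blind ERLE formula above is straightforward to compute from a mic/enhanced pair. The following is a minimal sketch rather than the evaluation script used for the table; the function name and the small `eps` guard against division by zero are assumptions.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """10*log10(E[mic^2] / E[enh^2]) over two equal-rate sample sequences."""
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(y * y for y in enh) / len(enh)
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Synthetic far-end single-talk case: one second of a 440 Hz "echo" at 16 kHz,
# attenuated to 1 % of its amplitude by a hypothetical enhancer.
mic = [math.sin(2 * math.pi * 440 * n / 16_000) for n in range(16_000)]
enh = [0.01 * x for x in mic]

erle = blind_erle_db(mic, enh)
print(f"blind ERLE: {erle:.1f} dB")
```

An amplitude ratio of 0.01 is a power ratio of 10⁻⁴, so the result is about 40 dB. On clips with active near-end speech the same function returns small values even when echo removal is good, for the reason given above.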

PyTorch checkpoint integrity (SHA256):

ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77  localvqe-v1.2-1.3M.pt
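One way to check a downloaded checkpoint against the digest above is with Python's standard hashlib. This is a minimal sketch; the helper names are illustrative, and the file path in the usage comment assumes the checkpoint is in the current directory.

```python
import hashlib

# Digest published above for localvqe-v1.2-1.3M.pt.
EXPECTED = "ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77"


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 of a file, read in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.strip().lower()


# Usage (assumes the checkpoint has been downloaded next to this script):
#   verify("localvqe-v1.2-1.3M.pt")
```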

Repository Layout

ggml/        C++ streaming inference (GGML graph, CLI, C API, tests)
pytorch/     PyTorch reference implementation (model definition only)
ARCHITECTURE.md
CITATION.cff
LICENSE
flake.nix

Building the C++ Inference Engine

Requires CMake ≥ 3.20 and a C++17 compiler. A Nix flake is provided:

git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE

# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

Binaries land in ggml/build/bin/. The CPU build produces multiple libggml-cpu-*.so variants (SSE4.2 / AVX2 / AVX-512) selected at runtime. Keep the binaries and .so files together.

Vulkan backend (embedded / integrated-GPU targets)

Add -DLOCALVQE_VULKAN=ON to the configure step. This composes with the CPU build — an additional libggml-vulkan.so is produced in ggml/build/bin/ and the runtime loader picks it up when a Vulkan ICD is present, otherwise it falls back to the CPU variants.

cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)

The Nix flake's dev shell already includes vulkan-loader, vulkan-headers, and shaderc. Without Nix, install the equivalents from your distro (Debian: libvulkan-dev vulkan-headers glslc/shaderc).

Running Inference

CLI

./ggml/build/bin/localvqe localvqe-v1.2-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav \
    --out-wav enhanced.wav

Expects 16 kHz mono PCM for both mic and far-end reference.
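Before running the CLI it can be useful to confirm that both inputs really are 16 kHz mono. The sketch below uses Python's standard wave module; `check_wav` is an illustrative helper, not part of LocalVQE, and it does not check bit depth because the requirement above does not state one.

```python
import wave


def check_wav(path, expected_rate=16_000):
    """Return a list of problems; an empty list means 16 kHz mono PCM."""
    try:
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
            rate = wav.getframerate()
    except wave.Error as exc:
        # The wave module raises for files it cannot parse as PCM WAV.
        return [f"not readable as PCM WAV: {exc}"]

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems


# Usage:
#   for name in ("mic.wav", "ref.wav"):
#       print(name, check_wav(name) or "ok")
```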

Benchmark

The bench-run cmake target is the turnkey path: it builds bench, downloads the released F32 model and a doubletalk mic/ref WAV pair from HuggingFace into ggml/build/bench_assets/, and runs the benchmark.

# Configure once (Vulkan optional but recommended for GPU runs)
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON

# Discover backends + device indices
cmake --build ggml/build --target bench-list-devices

# Run on the default backend (CPU device 0, 10 iterations)
cmake --build ggml/build --target bench-run

To pick a specific backend or device, set the cache variables at configure time and rebuild the target:

# Vulkan device 0 (e.g. dGPU) with 30 iterations
cmake -S ggml -B ggml/build -DBENCH_BACKEND=Vulkan -DBENCH_DEVICE=0 -DBENCH_ITERS=30
cmake --build ggml/build --target bench-run

# Vulkan device 1 (e.g. iGPU)
cmake -S ggml -B ggml/build -DBENCH_DEVICE=1
cmake --build ggml/build --target bench-run

Sweeping every backend/device on the box is just a shell loop over the indices bench-list-devices printed:

for dev in 0 1; do
    cmake -S ggml -B ggml/build -DBENCH_BACKEND=Vulkan -DBENCH_DEVICE=$dev
    cmake --build ggml/build --target bench-run
done

Or invoke the binary directly against your own WAV pair:

./ggml/build/bin/bench localvqe-v1.2-1.3M-f32.gguf \
    --backend Vulkan --device 0 \
    --in-wav mic.wav ref.wav --iters 10 --profile

Shared Library (C API)

cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)

Produces liblocalvqe.so with the API in ggml/localvqe_api.h. See ggml/example_purego_test.go for a Go / purego integration.

Regression test

ggml/tests/test_regression.cpp is an end-to-end check: it runs localvqe_process_f32 on a fixed seeded input through each published .gguf and compares against a committed reference output, mirroring the PyTorch suite under pytorch/tests/. Build, fetch both released GGUFs from HuggingFace, and run via CTest:

cmake --build ggml/build --target test_regression regression-assets
ctest --test-dir ggml/build --output-on-failure

regression-assets reuses the same SHA256-verified download path as bench-assets. Missing GGUFs make the corresponding test entry SKIP rather than fail, so CI without network access still runs cleanly.

To refresh a reference output after an intentional graph change:

python ggml/tests/regenerate_fixtures.py \
    --gguf ggml/build/bench_assets/localvqe-v1.2-1.3M-f32.gguf

Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The quantize tool in the C++ build can produce GGUF variants from the F32 reference for experimentation:

./ggml/build/bin/quantize localvqe-v1.2-1.3M-f32.gguf localvqe-v1.2-1.3M-q8.gguf Q8_0

Expect end-to-end quality loss until proper per-tensor selection and calibration have been worked through.
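For a rough idea of what quantization could save, the sketch below estimates file sizes from GGML's block layouts: Q8_0 stores 32 weights in 34 bytes and Q4_K stores 256 weights in 144 bytes. It assumes every parameter is quantized, which is a best case; quantizers commonly keep some tensors in F32, and GGUF metadata adds overhead, so real files are likely larger. Whether LocalVQE's quantize tool keeps any tensors in F32 is not stated here.

```python
# (bytes per block, weights per block) for each storage format.
BLOCK_LAYOUTS = {
    "F32": (4, 1),
    "Q8_0": (34, 32),    # fp16 scale + 32 int8 values
    "Q4_K": (144, 256),  # super-block of 256 4-bit values plus scales
}


def bits_per_weight(fmt: str) -> float:
    block_bytes, block_weights = BLOCK_LAYOUTS[fmt]
    return 8 * block_bytes / block_weights


def approx_size_mb(n_params: float, fmt: str) -> float:
    """Best-case weight storage in MB if every parameter used this format."""
    return n_params * bits_per_weight(fmt) / 8 / 1e6


for fmt in BLOCK_LAYOUTS:
    print(f"{fmt:>5}: {bits_per_weight(fmt):4.1f} bits/weight, "
          f"~{approx_size_mb(1.3e6, fmt):.2f} MB for 1.3 M parameters")
```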

PyTorch Reference

pytorch/ contains the model definition used to train and export the weights. It's provided for verification, ablation, and downstream research — not for end-user inference, which should go through the GGML build.

cd pytorch
pip install -r requirements.txt
python -c "
import yaml, torch
from localvqe.model import LocalVQE
cfg = yaml.safe_load(open('configs/default.yaml'))
model = LocalVQE(**cfg['model'], n_freqs=cfg['audio']['n_freqs'])
print(sum(p.numel() for p in model.parameters()))
"

Citing LocalVQE

If you use LocalVQE in academic work, please cite the repository via the CITATION.cff file at the root — GitHub renders a "Cite this repository" button that produces APA and BibTeX entries automatically.

For a DOI, we recommend citing a specific release via Zenodo, which mints a DOI per GitHub release. Please also cite the upstream DeepVQE paper:

@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
               and Saabas, Ando and P{\"a}rnamaa, Tanel
               and Gu{\v{z}}vin, Jegor and Cutler, Ross},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}

Dataset Attribution

Published weights are trained on data from the ICASSP 2023 Deep Noise Suppression Challenge (Microsoft, CC BY 4.0) and fine-tuned on the ICASSP 2022/2023 Acoustic Echo Cancellation Challenge.

Safety Note

Training data was filtered by DNSMOS perceived-quality scores, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate or distort such signals and must not be relied upon for emergency call or safety-critical applications.

License

Apache License 2.0 — see LICENSE.
