feat(vad): add configurable VAD policy #111

Open: JMLX42 wants to merge 14 commits into itsmontoya:main from lx-industries:feat/configurable-vad-policy

JMLX42 (Contributor) commented Jan 20, 2026
Summary

  • Implements Default for VadPolicy
  • Exports VadPolicy from the public API
  • Adds vad_policy field to Opts for custom VAD configuration
  • Adds VadProcessor::with_policy constructor
  • Uses vad_policy from Opts when creating VAD processor
  • Reduces VAD window size from 2s to 500ms for more responsive speech detection
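As a rough illustration of how the pieces in this summary might fit together, here is a minimal, self-contained sketch. The type names `VadPolicy`, `Opts`, and `VadProcessor::with_policy` and the field names `vad_policy` and `enable_voice_activity_detection` come from this PR; everything else is an assumption made for the example, including the `threshold` and `min_silence_ms` fields, their default values, and the `processor_from_opts` helper. It does not reflect the crate's actual definitions.

```rust
/// Hypothetical stand-in for the crate's `VadPolicy`.
/// The real fields are not shown in this PR description.
#[derive(Debug, Clone, PartialEq)]
pub struct VadPolicy {
    /// Assumed field: speech probability above which a window counts as speech.
    pub threshold: f32,
    /// Assumed field: silence required before an utterance is considered over.
    pub min_silence_ms: u32,
}

impl Default for VadPolicy {
    fn default() -> Self {
        // Placeholder values, not the crate's real defaults.
        Self { threshold: 0.5, min_silence_ms: 1000 }
    }
}

/// Hypothetical subset of `Opts`, reduced to the VAD-related fields.
#[derive(Debug, Clone, Default)]
pub struct Opts {
    pub enable_voice_activity_detection: bool,
    pub vad_policy: VadPolicy,
}

/// Hypothetical stand-in for `VadProcessor`; it only stores the policy.
pub struct VadProcessor {
    policy: VadPolicy,
}

impl VadProcessor {
    /// Builds a processor with the default policy.
    pub fn new() -> Self {
        Self::with_policy(VadPolicy::default())
    }

    /// Builds a processor with a caller-supplied policy.
    pub fn with_policy(policy: VadPolicy) -> Self {
        Self { policy }
    }

    pub fn policy(&self) -> &VadPolicy {
        &self.policy
    }
}

/// Hypothetical helper: creates the processor from the policy carried in `Opts`.
pub fn processor_from_opts(opts: &Opts) -> VadProcessor {
    VadProcessor::with_policy(opts.vad_policy.clone())
}
```

With a shape like this, a consumer could override a single setting with struct-update syntax, e.g. `VadPolicy { threshold: 0.7, ..VadPolicy::default() }`, and leave the rest at their defaults.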

Dependencies

Context

This feature is used by the Friday project.


This PR was created with the assistance of an AI assistant (Claude).

JMLX42 added 14 commits January 12, 2026 21:16
Add pass-through features for GPU backends:
- cuda: NVIDIA CUDA
- metal: Apple Metal
- hipblas: AMD ROCm
- vulkan: Cross-platform Vulkan
- coreml: Apple CoreML

This allows consumers to enable GPU acceleration by adding
the appropriate feature to their Cargo.toml, e.g.:

    scribble = { version = "0.5", features = ["cuda"] }

…cases

By default, the incremental transcriber waits for 2+ segments before
emitting, treating the last segment as potentially incomplete. This
adds latency for short utterances like voice assistant commands.

The new `emit_single_segments` option (default: false) allows emitting
single segments immediately when detected. This is useful for:
- Voice assistants
- Real-time transcription
- Any application where low latency is more important than waiting
  for natural sentence boundaries
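The emission rule described above can be sketched as a small decision function. This is an illustration only: `emit_count` is a hypothetical helper, not a function from the crate, and it assumes that with two or more segments the last one is still held back even when `emit_single_segments` is enabled, which the commit message does not spell out.

```rust
/// Returns how many leading segments may be emitted right now.
///
/// Hypothetical helper illustrating the rule from the commit message:
/// by default the last segment is held back as potentially incomplete,
/// so nothing is emitted until at least two segments exist.
pub fn emit_count(segment_count: usize, emit_single_segments: bool) -> usize {
    match segment_count {
        // Nothing transcribed yet.
        0 => 0,
        // Opt-in low-latency path: emit a lone segment immediately.
        1 if emit_single_segments => 1,
        // Default path: emit everything except the last segment.
        // (Assumed to apply for 2+ segments regardless of the flag.)
        n => n - 1,
    }
}
```

With the flag off, a single short command such as "lights on" yields `emit_count(1, false) == 0` and stays buffered; with the flag on, `emit_count(1, true) == 1` and it is emitted at once.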

When enabled, single segments are emitted as soon as Whisper produces
them, rather than waiting for a second segment or the 30-second
force-flush timeout.

When VAD detects no speech in an audio window, skip forwarding it to
Whisper entirely. This prevents hallucinations like "Merci" or "Thank
you for watching" that Whisper produces from silence with high
confidence.

Changes:
- process_ready_windows(): skip windows where VAD returns false
- flush(): only forward final buffer if VAD detects speech
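A minimal sketch of the gating idea behind these two changes follows. The function names `forward_speech_windows` and `flush_tail` are hypothetical, and `energy_vad` is a simple energy threshold used only so the example runs on its own; it is a stand-in for the real detector, not how the crate decides whether a window contains speech.

```rust
/// Stand-in detector for the sketch: mean signal energy above a fixed threshold.
/// The threshold value is arbitrary.
pub fn energy_vad(window: &[f32]) -> bool {
    if window.is_empty() {
        return false;
    }
    let energy = window.iter().map(|s| s * s).sum::<f32>() / window.len() as f32;
    energy > 1e-4
}

/// Keeps only the windows the detector marks as speech, so that silent
/// windows never reach the transcriber.
pub fn forward_speech_windows<'a>(
    windows: &'a [Vec<f32>],
    has_speech: impl Fn(&[f32]) -> bool,
) -> Vec<&'a [f32]> {
    let mut forwarded = Vec::new();
    for window in windows {
        if has_speech(window.as_slice()) {
            forwarded.push(window.as_slice());
        }
    }
    forwarded
}

/// On flush, forwards the remaining buffer only if it contains speech.
pub fn flush_tail<'a>(
    tail: &'a [f32],
    has_speech: impl Fn(&[f32]) -> bool,
) -> Option<&'a [f32]> {
    if !tail.is_empty() && has_speech(tail) {
        Some(tail)
    } else {
        None
    }
}
```

Passing the detector in as a closure keeps the gating logic independent of the VAD model, which also makes it easy to test with synthetic silence.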

Also fixes pre-existing test compilation (missing emit_single_segments
field) and formatting issues.

Move VAD filtering from the high-level Scribble API into the backend
stream. This ensures VAD works regardless of which API consumers use
(direct backend access or high-level Scribble::transcribe).

Changes:
- WhisperStream now optionally wraps audio with VadStream when
  enable_voice_activity_detection is true
- Remove VAD wrapping from Scribble::transcribe_with_encoder() to
  avoid double-filtering
- Export VadProcessor, VadStream, VadStreamReceiver publicly
- Make VadStream methods public for use in backend

This fixes the issue where friday-daemon's direct backend usage
bypassed VAD filtering entirely.

…etection

VAD windows at 2 seconds meant last_speech_instant() only updated every
2 seconds during continuous speech. With typical silence thresholds of
1 second, this caused premature end-of-utterance detection.

Reducing to 500ms (8000 samples at 16kHz) means the last-speech instant updates
4x more frequently, enabling accurate silence gap measurement. Silero
VAD works reliably with windows down to ~250ms.
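The arithmetic behind this change can be checked with a short sketch. The constant and function names below are hypothetical, and `may_cut_off_speech` is a deliberately simplified model that assumes the last-speech instant is refreshed exactly once per VAD window during continuous speech.

```rust
/// Sample rate assumed throughout the commit message.
pub const SAMPLE_RATE_HZ: u32 = 16_000;

/// Number of samples in a VAD window of the given length.
pub const fn window_samples(window_ms: u32) -> u32 {
    SAMPLE_RATE_HZ / 1000 * window_ms
}

/// Simplified model: if the last-speech instant is refreshed only once per
/// window, its age can grow to a full window length during continuous speech.
/// When that age can reach the silence threshold, end-of-utterance may be
/// declared while the speaker is still talking.
pub const fn may_cut_off_speech(window_ms: u32, silence_threshold_ms: u32) -> bool {
    window_ms >= silence_threshold_ms
}
```

Under this model, a 2000 ms window with a 1000 ms silence threshold can trigger a false end of utterance, while a 500 ms window (8000 samples) cannot.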