feat(vad): add configurable VAD policy #111
Open
JMLX42 wants to merge 14 commits into
Add pass-through features for GPU backends:
- cuda: NVIDIA CUDA
- metal: Apple Metal
- hipblas: AMD ROCm
- vulkan: Cross-platform Vulkan
- coreml: Apple CoreML
This allows consumers to enable GPU acceleration by adding
the appropriate feature to their Cargo.toml, e.g.:
scribble = { version = "0.5", features = ["cuda"] }
…cases

By default, the incremental transcriber waits for 2+ segments before emitting, treating the last segment as potentially incomplete. This adds latency for short utterances like voice assistant commands.

The new `emit_single_segments` option (default: false) allows emitting single segments immediately when detected. This is useful for:
- Voice assistants
- Real-time transcription
- Any application where low latency is more important than waiting for natural sentence boundaries

When enabled, single segments are emitted as soon as Whisper produces them, rather than waiting for a second segment or the 30-second force-flush timeout.
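The emission rule described in this commit can be sketched as a small decision function. This is only an illustration under stated assumptions: `emittable_count` and its parameters are hypothetical names, not the crate's API, and the sketch assumes that with two or more segments the last one is still held back even when `emit_single_segments` is on.

```rust
/// Hypothetical helper (not part of the crate): how many of the currently
/// decoded segments may be emitted right now.
///
/// Assumptions of this sketch:
/// - by default the last segment is held back as potentially incomplete;
/// - `emit_single_segments` only changes the "exactly one segment" case;
/// - a force flush (e.g. the 30-second timeout) releases everything.
fn emittable_count(decoded: usize, emit_single_segments: bool, force_flush: bool) -> usize {
    match decoded {
        // Nothing decoded, nothing to emit.
        0 => 0,
        // A force flush emits every pending segment.
        _ if force_flush => decoded,
        // Low-latency mode: a lone segment goes out immediately.
        1 if emit_single_segments => 1,
        // Default: keep the last segment until another one follows it.
        n => n - 1,
    }
}
```

With the default setting a lone segment yields `0` here, which is the latency the option is meant to remove for short voice commands.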
When VAD detects no speech in an audio window, skip forwarding it to Whisper entirely. This prevents hallucinations like "Merci" or "Thank you for watching" that Whisper produces from silence with high confidence.

Changes:
- process_ready_windows(): skip windows where VAD returns false
- flush(): only forward final buffer if VAD detects speech

Also fixes pre-existing test compilation (missing emit_single_segments field) and formatting issues.
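A minimal sketch of the gating idea, assuming windows arrive as `f32` sample buffers. `forward_speech_windows` and `energy_vad` are hypothetical names for illustration; the energy check is a toy stand-in for the real detector, and the closure stands in for handing audio to Whisper.

```rust
/// Forward only the windows in which the detector reports speech and
/// return how many windows were skipped.
/// `has_speech` stands in for the real VAD call and `forward` for the
/// hand-off to the decoder; both are assumptions of this sketch.
fn forward_speech_windows<V, F>(windows: &[Vec<f32>], mut has_speech: V, mut forward: F) -> usize
where
    V: FnMut(&[f32]) -> bool,
    F: FnMut(&[f32]),
{
    let mut skipped = 0;
    for window in windows {
        if has_speech(window.as_slice()) {
            forward(window.as_slice());
        } else {
            // Silence never reaches the decoder, so it cannot be
            // "transcribed" into hallucinated text.
            skipped += 1;
        }
    }
    skipped
}

/// Toy mean-square energy detector, used only to make the sketch runnable.
fn energy_vad(window: &[f32]) -> bool {
    if window.is_empty() {
        return false;
    }
    let mean_sq = window.iter().map(|s| s * s).sum::<f32>() / window.len() as f32;
    mean_sq > 1e-4
}
```

The same check would apply to the final buffer in a flush path: forward it only if the detector reports speech.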
Move VAD filtering from the high-level Scribble API into the backend stream. This ensures VAD works regardless of which API consumers use (direct backend access or high-level Scribble::transcribe).

Changes:
- WhisperStream now optionally wraps audio with VadStream when enable_voice_activity_detection is true
- Remove VAD wrapping from Scribble::transcribe_with_encoder() to avoid double-filtering
- Export VadProcessor, VadStream, VadStreamReceiver publicly
- Make VadStream methods public for use in backend

This fixes the issue where friday-daemon's direct backend usage bypassed VAD filtering entirely.
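The design choice here, putting the filter inside the backend stream so every caller gets it, can be illustrated with a small wrapper sketch. None of these types are the crate's: `WindowSink`, `Decoder`, `VadGate` and `build_backend_stream` are hypothetical, and the amplitude check is a placeholder for a real VAD.

```rust
/// Hypothetical sink for audio windows (not the crate's API).
trait WindowSink {
    fn push(&mut self, window: &[f32]);
    /// Number of windows that actually reached the decoder.
    fn forwarded(&self) -> usize;
}

/// Stand-in for the decoder side of the backend stream.
struct Decoder {
    forwarded: usize,
}

impl WindowSink for Decoder {
    fn push(&mut self, _window: &[f32]) {
        self.forwarded += 1;
    }
    fn forwarded(&self) -> usize {
        self.forwarded
    }
}

/// Wrapper that drops windows without speech before they reach `inner`.
struct VadGate<S: WindowSink> {
    inner: S,
}

impl<S: WindowSink> WindowSink for VadGate<S> {
    fn push(&mut self, window: &[f32]) {
        // Placeholder detector: any sample above a small amplitude.
        if window.iter().any(|s| s.abs() > 0.01) {
            self.inner.push(window);
        }
    }
    fn forwarded(&self) -> usize {
        self.inner.forwarded()
    }
}

/// The backend decides once whether to wrap, so callers of the backend
/// and callers of a higher-level API go through the same filter.
fn build_backend_stream(enable_voice_activity_detection: bool) -> Box<dyn WindowSink> {
    let decoder = Decoder { forwarded: 0 };
    if enable_voice_activity_detection {
        Box::new(VadGate { inner: decoder })
    } else {
        Box::new(decoder)
    }
}
```

Because the wrapping happens in one place, a higher-level layer must not wrap again, which matches the removal of the second wrapper described above.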
…etection

VAD windows at 2 seconds meant last_speech_instant() only updated every 2 seconds during continuous speech. With typical silence thresholds of 1 second, this caused premature end-of-utterance detection.

Reducing to 500ms (8000 samples at 16kHz) means speech instant updates 4x more frequently, enabling accurate silence gap measurement. Silero VAD works reliably with windows down to ~250ms.
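The arithmetic behind these numbers, as a short sketch. The constant and function names are hypothetical, and `may_end_utterance_early` is a deliberately simplified model: it assumes the speech instant is refreshed once per window, so its value can be up to one window old.

```rust
/// Sample rate assumed by the commit message (16 kHz).
const SAMPLE_RATE_HZ: usize = 16_000;

/// Number of samples in a VAD window of the given length.
const fn window_samples(window_ms: usize) -> usize {
    SAMPLE_RATE_HZ * window_ms / 1000
}

/// Simplified model: if the speech instant can be a full window old, a
/// window at least as long as the silence threshold can look like a
/// silence gap even during continuous speech.
const fn may_end_utterance_early(window_ms: usize, silence_threshold_ms: usize) -> bool {
    window_ms >= silence_threshold_ms
}
```

Under this model a 2000 ms window with a 1000 ms threshold can trigger early, while a 500 ms window cannot, and 2000 / 500 gives the 4x update rate quoted above.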
Summary
- Implement `Default` for `VadPolicy`
- Export `VadPolicy` from the public API
- Add `vad_policy` field to `Opts` for custom VAD configuration
- Add `VadProcessor::with_policy` constructor
- Use `vad_policy` from Opts when creating VAD processor

Dependencies
… (`feat/expose-vad-speech-instant`)

Context
This feature is used by the Friday project.
This PR was created with the assistance of an AI assistant (Claude).
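For readers unfamiliar with the shape of the change, the items in the Summary could fit together roughly as follows. This is a sketch, not the PR's code: the fields of `VadPolicy`, their default values, the non-optional `vad_policy` field, and the `processor_from_opts` helper are all assumptions made for illustration. Only the names `VadPolicy`, `Opts`, `vad_policy`, `enable_voice_activity_detection` and `VadProcessor::with_policy` come from the PR text.

```rust
/// Tunable VAD behaviour. Both fields are hypothetical placeholders;
/// the PR text does not list the real ones.
#[derive(Debug, Clone, PartialEq)]
pub struct VadPolicy {
    /// Speech probability above which a window counts as speech (assumed).
    pub threshold: f32,
    /// Silence needed before an utterance is considered finished (assumed).
    pub min_silence_ms: u32,
}

impl Default for VadPolicy {
    fn default() -> Self {
        // Illustrative defaults only.
        Self { threshold: 0.5, min_silence_ms: 1000 }
    }
}

/// Transcription options carrying the policy (other fields omitted).
#[derive(Debug, Clone, Default)]
pub struct Opts {
    pub enable_voice_activity_detection: bool,
    pub vad_policy: VadPolicy,
}

pub struct VadProcessor {
    policy: VadPolicy,
}

impl VadProcessor {
    /// Build a processor with the default policy.
    pub fn new() -> Self {
        Self::with_policy(VadPolicy::default())
    }

    /// Build a processor with a caller-supplied policy.
    pub fn with_policy(policy: VadPolicy) -> Self {
        Self { policy }
    }

    pub fn policy(&self) -> &VadPolicy {
        &self.policy
    }
}

/// Hypothetical glue: the policy in `Opts` is what the processor uses.
pub fn processor_from_opts(opts: &Opts) -> VadProcessor {
    VadProcessor::with_policy(opts.vad_policy.clone())
}
```

A consumer that never touches `vad_policy` keeps the defaults, which is the usual reason to pair a new options field with a `Default` implementation.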