Build agents that know when the user is actually talking to them. Save on tokens.
Python sample app for attention labs real-time selective auditory attention (SAA).
Every voice pipeline has the same problem: the microphone hears everything, but your ASR should only process speech directed at the device. Wake words solve this with a rigid trigger phrase. SAA solves it without one.
That makes SAA useful for robots, smart displays, TVs, desktop agents, AR/VR interfaces, and other voice AI systems that need to ignore background conversation while still feeling natural.
attenlabs-saa streams mic and webcam data to the SAA inference server over WebSocket and emits typed events: attention predictions, voice activity, conversation state, and speech audio.
Mic + Webcam
|
v
attenlabs-saa SDK
|
| WebSocket
v
SAA inference server
|
+-- prediction events (ConvoStatus 0 / 1 / 2)
+-- conversation state events
+-- turn_ready audio + optional frames
|
v
Optional ASR + LLM / Agent
|
v
Speaker playback
This sample app uses OpenAI Realtime Voice as the LLM stage, but the bridge is part of the demo, not the SDK. Swap in whichever provider you like.
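One way to keep the LLM stage swappable is to hide it behind a small interface. The sketch below is illustrative only: `LLMBridge`, `EchoBridge`, and `handle_turn` are hypothetical names, not part of the attenlabs-saa SDK or the demo, and it assumes a turn arrives as raw audio bytes.

```python
from typing import Protocol


class LLMBridge(Protocol):
    """Hypothetical interface: turns one captured turn into reply audio."""

    def respond(self, turn_audio: bytes) -> bytes: ...


class EchoBridge:
    """Stand-in provider for local testing; returns the input audio unchanged."""

    def respond(self, turn_audio: bytes) -> bytes:
        return turn_audio


def handle_turn(bridge: LLMBridge, turn_audio: bytes) -> bytes:
    # The pipeline depends only on the LLMBridge shape, so any provider
    # that implements respond() can be dropped in.
    return bridge.respond(turn_audio)
```

A bridge for OpenAI Realtime or any other provider would implement the same `respond` method, and the rest of the pipeline would not need to change.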
Follow the steps in order. Should take about 5 minutes.
Sign up at the attention labs dashboard and copy your token.
You only need an OpenAI API key if you want the LLM to talk back. Use a key with Realtime API access. Skip this step to just see live SAA predictions in the terminal.
- Python 3.10+
- A microphone and webcam
- Grant your terminal microphone + camera permissions. On macOS: System Settings → Privacy & Security → Microphone / Camera → enable for Terminal (or iTerm, VS Code, etc). The app will fail silently without this.
macOS:

    brew install portaudio

Linux (Debian/Ubuntu):

    sudo apt-get install -y libportaudio2 libasound2-dev

    git clone https://github.com/attenlabs/saa-sdk.git
    cd saa-sdk/examples/python
    pip install attenlabs-saa cv2-enumerate-cameras simpleaudio

With the LLM stage enabled:

    python main.py --token YOUR_SAA_API_KEY --openai-key sk-...

Without an LLM (just see live predictions):

    python main.py --token YOUR_SAA_API_KEY --no-llm

Or use env vars:

    export SAA_API_KEY=...
    export OPENAI_API_KEY=sk-...
    python main.py

After a short warmup, the app prints a live status panel:
╔══════════════════════════════════════════════════════════════════════════════╗
║ ATTENTION LABS :: CONVERSATION INTELLIGENCE v1.0 ║
║ Press Ctrl+C to stop ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ CURRENT MODE : TALKING TO COMPUTER (94%) ║
║ BUFFER : [0, 0, 0, 2, 2, 2, 2, 2, 2, 2] ║
║ LLM STATE : listening ║
║ PROCESSING : 16.2s ║
╚══════════════════════════════════════════════════════════════════════════════╝
- CURRENT MODE: the latest ConvoStatus prediction and confidence
- BUFFER: rolling window of the last 10 predictions
- LLM STATE: idle → listening → processing → speaking
Once enough consecutive 2s land in the buffer, the LLM state flips to listening. When you stop talking (or turn to talk to someone else) the captured audio is sent to OpenAI Realtime, and the reply plays through your speakers.
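The rolling window idea can be sketched with a bounded deque. This is only an illustration of the concept: actual turn detection happens in the SAA server/SDK, and `REQUIRED_RUN`, `trailing_run`, and `should_listen` are hypothetical names and values chosen for the example.

```python
from collections import deque

DEVICE_DIRECTED = 2  # ConvoStatus value for human-to-device speech
BUFFER_LEN = 10      # matches the 10-slot BUFFER row in the status panel
REQUIRED_RUN = 3     # hypothetical: how many trailing 2s count as "enough"


def trailing_run(buffer, value):
    """Count how many of the most recent entries equal `value`."""
    count = 0
    for item in reversed(buffer):
        if item != value:
            break
        count += 1
    return count


def should_listen(buffer, required=REQUIRED_RUN):
    """True once the newest `required` predictions are all device-directed."""
    return trailing_run(buffer, DEVICE_DIRECTED) >= required


buffer = deque([0] * BUFFER_LEN, maxlen=BUFFER_LEN)
for prediction in [0, 0, 0, 2, 2, 2, 2, 2, 2, 2]:
    buffer.append(prediction)  # the oldest entry drops off automatically

print(list(buffer), should_listen(buffer))
```

Counting only the trailing run means a single state-0 or state-1 prediction resets the streak, which is one simple way to avoid reacting to a stray state-2 prediction.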
Every prediction event carries a ConvoStatus value:
| State | Meaning |
|---|---|
| 0 | Silence, no speech detected |
| 1 | Human-to-human, people are talking to each other |
| 2 | Human-to-device, someone is talking to the computer |
Your pipeline only needs to act on state 2. States 0 and 1 let you skip ASR entirely and avoid sending irrelevant audio to your LLM.
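As a minimal sketch of that gating, the three states can be modeled as an enum and checked before any ASR work is done. The member names and `should_run_asr` are assumptions made for this example; the SDK may expose its own types with different names.

```python
from enum import IntEnum


class ConvoStatus(IntEnum):
    # Member names are hypothetical; the values follow the table above.
    SILENCE = 0          # no speech detected
    HUMAN_TO_HUMAN = 1   # people are talking to each other
    HUMAN_TO_DEVICE = 2  # someone is talking to the computer


def should_run_asr(status: int) -> bool:
    """Only device-directed speech is worth sending to ASR or an LLM."""
    return ConvoStatus(status) is ConvoStatus.HUMAN_TO_DEVICE
```

Passing a value outside 0, 1, or 2 raises `ValueError`, which surfaces unexpected states early instead of silently ignoring them.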
- --threshold (default 0.70): minimum confidence for a state-2 prediction to count as device-directed.
- Turn detection is handled by the SAA server/SDK (on_turn_ready); the demo just forwards the captured turn. _BUFFER_LEN in main.py only sizes the on-screen prediction history, not the trigger.
Lower the threshold for more sensitive triggering; raise it for fewer false starts.
Try three thresholds with --threshold and keep the best:
- with video enabled: 0.75, 0.85
- with --no-video: 0.65, 0.75
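To see how the threshold changes triggering, the sketch below counts how many predictions would pass at a few candidate values. The sample pairs are made up for illustration and are not real SAA output, the helper names are hypothetical, and treating the comparison as inclusive (`>=`) is an assumption.

```python
DEFAULT_THRESHOLD = 0.70  # default value of --threshold


def is_device_directed(status, confidence, threshold=DEFAULT_THRESHOLD):
    """A state-2 prediction counts only if its confidence meets the threshold."""
    return status == 2 and confidence >= threshold


def count_triggers(predictions, threshold):
    """Number of (status, confidence) pairs that would count as device-directed."""
    return sum(is_device_directed(s, c, threshold) for s, c in predictions)


# Illustrative (status, confidence) pairs, not real SAA output.
sample = [(2, 0.94), (2, 0.72), (1, 0.88), (2, 0.66), (0, 0.99), (2, 0.81)]

for threshold in (0.65, 0.75, 0.85):
    print(threshold, count_triggers(sample, threshold))
```

On this sample, a higher threshold lets fewer state-2 predictions through, which is the trade-off between sensitivity and false starts. Confident state-0 and state-1 predictions never count, whatever the threshold.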
--token SAA auth token (required; or SAA_API_KEY env var)
--url Override the default SAA server URL (for a private or enterprise SAA endpoint)
--openai-key OpenAI API key; falls back to OPENAI_API_KEY env var
--camera-index Webcam device index (skip the picker)
--mic-device Mic device name or numeric index (skip the picker)
--threshold Device-class trigger threshold 0..1 (default 0.70)
--no-video Disable webcam capture
--no-audio Disable mic capture
--no-llm Disable LLM stage even if a key is set
--log-level DEBUG, INFO, WARNING, ERROR (default WARNING)
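The env var fallback for the two keys can be sketched with argparse. This is not the demo's actual main.py: `build_parser` and `resolve` are hypothetical helpers, only a few of the options are shown, and the error message is invented for the example.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="SAA demo options (sketch)")
    # CLI values win; otherwise fall back to the environment variables.
    parser.add_argument("--token", default=os.environ.get("SAA_API_KEY"))
    parser.add_argument("--openai-key", default=os.environ.get("OPENAI_API_KEY"))
    parser.add_argument("--threshold", type=float, default=0.70)
    parser.add_argument("--no-llm", action="store_true")
    return parser


def resolve(argv):
    """Parse argv and decide whether the LLM stage should run."""
    args = build_parser().parse_args(argv)
    if not args.token:
        raise SystemExit("missing SAA token: pass --token or set SAA_API_KEY")
    # --no-llm disables the LLM stage even if a key is available.
    llm_enabled = bool(args.openai_key) and not args.no_llm
    return args, llm_enabled
```

For example, `resolve(["--token", "t", "--openai-key", "k", "--no-llm"])` returns parsed arguments with the LLM stage disabled, matching the description of `--no-llm`.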
The full API reference (constructor, methods, events, types, threading model) lives in the Python SDK reference.
This demo reads the OpenAI API key from a CLI arg or env var and uses it directly from the local process. Fine for personal use. For multi-user deployments, proxy the Realtime connection through a server you control so keys never leave your backend.
Apache-2.0