One decision per utterance: only addressee speech reaches your STT, LLM, and TTS. No wake word.
Drop-in for the voice-agent stack you already use.
A voice agent's microphone hears every voice in the room: yours, a coworker's, the kids, a podcast playing on the laptop, the agent's own TTS bleeding back through the speakers. Most pipelines respond to any of it, paying STT for every transcribed second and triggering the LLM on speech that was never directed at the device.
SAA (Selective Auditory Attention) is a hosted classifier that runs before STT and decides, per utterance, whether the speech was directed at the device. Side talk, background media, and the agent's own playback are filtered out, so your STT / LLM / TTS only see audio meant for the agent.
- No wake word. SAA decides per-utterance from the audio (and optionally low-rate video) stream.
- Hosted. A real-time WebSocket to attention labs' cloud; the model, weights, and inference run server-side, and these Apache-2.0 client SDKs are thin: they capture, encode, and stream audio (and optional low-rate video). Because it gates before STT, only addressed speech reaches the STT, LLM, and TTS you already run, so your downstream services and logs see less audio, not more. On-device deployment is a separate enterprise licence.
| Shape | Package | Use it when |
|---|---|---|
| Streaming SDK | @attenlabs/saa-js, attenlabs-saa | your app captures the audio/video itself and you want typed attention events to gate your own pipeline. Good for web agents, mall kiosks, drive-through agents, and robots. |
| LiveKit | saa-livekit-client | you run a LiveKit Agents voice agent. SAA joins your room and gates the session. |
| Pipecat (Daily) | saa-pipecat-client | you run a Pipecat voice agent on Daily. SAA joins your Daily room and gates the pipeline through the "saa" app-message topic. |
| ElevenLabs | attenlabs-saa | you run an ElevenLabs Conversational AI agent. SAA gates it via the streaming SDK's feed_audio (its room is sealed, so SAA can't join it directly). |
| Twilio | attenlabs-saa | you run a Twilio Media Streams telephony agent. SAA gates inbound/outbound call audio (μ-law 8 kHz resampled to PCM16) via the streaming SDK's feed_audio. |
```sh
npm install @attenlabs/saa-js    # JavaScript / browser
pip install attenlabs-saa        # Python (streaming SDK)
pip install saa-livekit-client   # Python (LiveKit)
pip install saa-pipecat-client   # Python (Pipecat on Daily)
```

Get an API key at attentionlabs.ai/dashboard.
You capture the media; SAA emits typed events. The key event is turnReady (JavaScript) / turn_ready (Python): one device-directed utterance, captured and ready to forward to your STT or LLM.
```js
import { AttentionClient } from "@attenlabs/saa-js";

const client = new AttentionClient({ token: process.env.SAA_API_KEY });

// fires once per device-directed turn; turn.audioBase64 is PCM16 @ 16 kHz
client.on("turnReady", (turn) => yourSTT.send(turn.audioBase64));

await client.start({ videoElement: document.querySelector("video") });
```

```python
import os
from saa import AttentionClient

client = AttentionClient(token=os.environ["SAA_API_KEY"])

@client.on_turn_ready
def _(turn):
    # turn.audio_base64: PCM16 @ 16 kHz mono; turn.audio_pcm16: np.int16 array
    your_stt.send(turn.audio_base64)

client.start()
```

For audio-only deployments, omit videoElement (browser) or pass enable_video=False (Python).
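If only the base64 payload is available downstream (for example after the turn has crossed a queue), the samples can be recovered with the standard library. This is a minimal sketch, assuming the payload is raw 16-bit mono PCM at 16 kHz with no container header. The little-endian byte order is an assumption, and `decode_turn_audio` is a hypothetical helper, not part of the SDK.

```python
import base64
import sys
from array import array

SAMPLE_RATE_HZ = 16_000  # per the SDK comment: PCM16 @ 16 kHz mono


def decode_turn_audio(audio_base64: str) -> tuple[array, float]:
    """Decode a base64 PCM16 payload into samples and a duration in seconds.

    Hypothetical helper; assumes raw little-endian int16 mono, no WAV header.
    """
    raw = base64.b64decode(audio_base64)
    if len(raw) % 2:
        raise ValueError("PCM16 payload must have an even number of bytes")
    samples = array("h")        # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":  # payload is assumed little-endian
        samples.byteswap()
    return samples, len(samples) / SAMPLE_RATE_HZ
```

The duration is handy for logging how many addressed seconds actually reach your STT.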
Both SDKs also emit prediction, vad, state, interrupt, config, and stats events, and expose mute() / unmute(), setThreshold() / set_threshold(), and markResponding() / mark_responding(). See packages/saa-js and packages/saa-py.
Runnable end-to-end demos are in examples/web/ (browser) and examples/python/ (terminal).
For LiveKit Agents, saa-livekit-client brings SAA into your room to run the classifier and publish events on the "saa" data topic. Your agent consumes them through AttentionEngine and gates the session.
```python
from saa_livekit_client import AttentionEngine, attention_agent_token, start_attention_session

saa = await start_attention_session(
    api_key=SAA_API_KEY, livekit_url=LIVEKIT_URL,
    agent_token=attention_agent_token(api_key=LK_KEY, api_secret=LK_SECRET, room_name=ctx.room.name),
    room_name=ctx.room.name, participant_identity=user.identity,
)
engine = AttentionEngine(ctx.room, agent_identity=saa.agent_identity)

@engine.on_prediction
def _(p):
    session.input.set_audio_enabled(p.aligned_class == 2)  # the gate

await engine.start()
```

Two runnable samples, an OpenAI Realtime agent and a vanilla-JS web client, are in examples/livekit/.
For Pipecat voice agents running on Daily, saa-pipecat-client brings SAA into your Daily room and publishes events on Daily's app-message channel under the "saa" topic. Your bot consumes them through AttentionEngine (which subscribes via your DailyTransport) and gates the pipeline.
```python
from saa_pipecat_client import AttentionEngine, attention_agent_token, start_attention_session

saa = await start_attention_session(
    api_key=SAA_API_KEY, room_url=ROOM_URL,
    agent_token=attention_agent_token(daily_api_key=DAILY_API_KEY, room_name=room_name),
    participant_identity=human_identity,
)
engine = AttentionEngine(transport, agent_identity=saa.agent_identity)
engine.bind_task(task)

@engine.on_prediction
def _(p):
    addressee_gate.suppressed = (p.aligned_class == 1 and p.confidence > 0.7)

await engine.start()
```

A runnable web-client sample is in examples/pipecat/.
ElevenLabs Conversational AI runs its agent in a sealed WebRTC room, so SAA can't join it directly. Instead the streaming SDK runs in feed mode: you hand it the agent's microphone audio through feed_audio and gate the agent on SAA's prediction events.
```python
from saa import AttentionClient

# feed mode: the SDK captures nothing itself; you supply the audio
saa = AttentionClient(token=SAA_API_KEY, enable_audio=False, enable_video=False)

@saa.on_prediction
def _(p):
    mic_to_agent.enabled = (p.aligned_class == 2)  # 2 = addressed to the device

saa.start()

# in ElevenLabs' AudioInterface input callback:
saa.feed_audio(mic_pcm16)
```

A runnable sample is in examples/elevenlabs/.
For Twilio Media Streams telephony agents, the streaming SDK runs in feed mode over the call audio. The adapter transcodes Twilio's μ-law 8 kHz frames to PCM16 16 kHz, feeds them to SAA, and forwards only device-directed turns to your bridge, so side talk, hold music, and the agent's own TTS echo are gated out.
```python
from saa import AttentionClient

saa = AttentionClient(token=SAA_API_KEY, enable_audio=False, enable_video=False)

@saa.on_turn_ready
def _(turn):
    bridge.on_speech(turn.audio_base64)  # only device-directed call audio continues

saa.start()

# in the Twilio media handler, after decoding μ-law -> PCM16:
saa.feed_audio(pcm16_frames)
```

A runnable Media Streams bridge (codec, paced outbound, automatic mark_responding) is in examples/twilio/.
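The transcoding step mentioned above (μ-law 8 kHz to PCM16 16 kHz) can be sketched in pure Python. This is an illustration of the general technique, not the adapter's actual code: it uses the standard G.711 μ-law expansion and a naive 2x upsample by linear interpolation, where a production bridge would more likely use a filtered resampler. Both function names are hypothetical.

```python
import struct


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF                   # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw8k_to_pcm16_16k(frame: bytes) -> bytes:
    """Decode a mu-law 8 kHz frame and upsample 2x by linear interpolation.

    Returns little-endian PCM16 with twice as many samples as input bytes.
    """
    samples = [ulaw_byte_to_pcm16(b) for b in frame]
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)  # midpoint between neighbouring samples
    return struct.pack(f"<{len(out)}h", *out)
```

Interpolating frame by frame repeats the last sample at each frame boundary; carrying the previous frame's final sample across calls would smooth that seam.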
The streaming SDKs expose markResponding(true) / mark_responding(True) so the agent can signal when it is the one speaking; SAA suppresses the gate during the agent's own TTS and resumes once the tail clears. The LiveKit and Pipecat bridges expose the same lifecycle through engine.responding_start() / engine.responding_stop().
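One way to keep the start and stop calls paired is a small context manager around TTS playback. This is a sketch, not SDK code: `responding` is a hypothetical helper, and it assumes that calling mark_responding(False) ends the responding window, which the text above does not state explicitly.

```python
from contextlib import contextmanager


@contextmanager
def responding(client):
    """Mark the agent as speaking for the duration of the with-block."""
    client.mark_responding(True)
    try:
        yield
    finally:
        # Assumption: passing False signals that the agent stopped speaking.
        # The finally clause runs even if playback raises, so the gate
        # is never left suppressed by accident.
        client.mark_responding(False)
```

Usage would look like `with responding(client): play_tts(reply_audio)`, where `play_tts` stands in for whatever blocking playback call your agent uses.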
SAA is a model-agnostic addressee decision that sits between your VAD and STT. It answers a different question than VAD (is anyone speaking), speaker diarization (which voice it is), turn detection (have they finished), or a wake word (did they say the phrase), so it composes with those layers and can make a wake word optional.
These Apache-2.0 client SDKs stream to the SAA cloud. For deployments where audio must stay on the device (telephony, embedded systems, wearables, robotics, kiosks), request on-device and embedded access at attentionlabs.ai.
- packages/saa-js/README.md, packages/saa-py/README.md: streaming SDK reference.
- packages/saa-livekit-client/README.md: the LiveKit client.
- packages/saa-pipecat-client/README.md: the Pipecat-on-Daily client.
- examples/README.md: runnable examples.
- examples/twilio/README.md: the Twilio Media Streams bridge.
| Layer | The question it answers | What it does not tell you |
|---|---|---|
| Voice activity detection (VAD) | Is anyone speaking right now? | Whether that speech was meant for the agent |
| Speaker diarization / speaker ID | Who is speaking? | Whether that person is addressing the agent |
| Turn detection / endpointing | Has the speaker finished their turn? | Whether the turn was directed at the agent |
| Wake word | Did they say the trigger phrase? | Natural address with no keyword (and it degrades across multi-turn / hands-busy use) |
| Noise suppression / audio enhancement | Is the audio clean? | Whether the clean audio was directed at the agent |
| SAA (this SDK) | Was this utterance directed at the device? | It does not transcribe, identify the speaker, or denoise; it gates, then hands addressed audio to the layers you already run |
SAA composes with every layer above; it does not replace them (it can make a wake word optional where wake-word UX is the bottleneck).
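The ordering of the layers can be sketched as a generator that applies VAD first and the addressee check second. This is only a schematic: the real SDK is event-driven and streams to a hosted service, so `is_addressed` here is a synchronous stand-in, and every name in the block is hypothetical.

```python
from typing import Callable, Iterable, Iterator

Utterance = bytes  # one VAD-segmented chunk of PCM audio


def gate_utterances(
    utterances: Iterable[Utterance],
    is_speech: Callable[[Utterance], bool],
    is_addressed: Callable[[Utterance], bool],
) -> Iterator[Utterance]:
    """Yield only utterances that pass VAD and then the addressee check."""
    for utt in utterances:
        if not is_speech(utt):      # VAD: is anyone speaking?
            continue
        if not is_addressed(utt):   # addressee gate: was it meant for the device?
            continue
        yield utt                   # only these continue to STT, LLM, and TTS
```

Diarization, endpointing, and noise suppression would slot in as further stages; the point is only that the addressee decision sits after VAD and before STT.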
Isn't this just VAD? No. VAD answers "is anyone speaking?" and fires on any voice: a coworker, the TV, the person next to your user. SAA answers "was this speech directed at the device?" and runs after VAD, so the pipeline only wakes on audio actually meant for the agent.
Is this speaker identification or diarization? No. Speaker ID answers "who is speaking" and diarization answers "who said what." SAA answers "who is being spoken to." An enrolled user can address the person beside them (not for the agent); an unenrolled stranger can speak straight to the device (for the agent). It is an orthogonal primitive: no enrollment, no speaker profiles.
Does it run on-device? No. SAA is a hosted service. The npm and PyPI packages are Apache-2.0 thin clients that capture, encode, and stream audio (and optional low-rate video) to the hosted inference endpoint; the model and weights run server-side. On-device and embedded delivery is a separate enterprise programme; it is not part of the public SDK.
Does it add latency? Yes: one hosted round-trip before STT. SAA does not make addressed speech faster. What it removes is the full STT + LLM + TTS pipeline firing on audio that was never addressed to the agent. Whether that is net-positive depends on how much non-directed audio your environment carries.
Why bother? What does it save me? In multi-talker environments, a large share of what your VAD flags was never meant for the agent. SAA drops that audio before STT is invoked, so you transcribe and reason over fewer non-directed seconds and get fewer ghost responses. The cheapest token is the one you never spend.
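A back-of-the-envelope way to size the saving for your own deployment is sketched below. The function is hypothetical and idealised: it assumes a perfect gate and ignores the hosted round-trip. The directed fraction is something you would have to measure in your environment; any value plugged in here is illustrative, not a benchmark.

```python
def stt_seconds_saved(vad_seconds: float, directed_fraction: float) -> float:
    """Seconds of VAD-flagged audio that never reach STT.

    vad_seconds: total speech your VAD flags over some period.
    directed_fraction: share of that speech addressed to the agent (0..1),
        measured in your own environment.
    """
    if not 0.0 <= directed_fraction <= 1.0:
        raise ValueError("directed_fraction must be between 0 and 1")
    return vad_seconds * (1.0 - directed_fraction)
```

For example, if one hour of flagged speech were only one-quarter directed, `stt_seconds_saved(3600, 0.25)` gives 2700.0 seconds that are never transcribed.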
Will the agent respond to its own voice? Not if you tell it when it is talking. Call markResponding(true) / mark_responding(True) when your TTS starts; SAA suppresses the gate during playback and resumes once the tail clears. No mute-mic hack.
How do I trade off false positives vs. false negatives? setThreshold() / set_threshold() moves the operating point on the precision-recall curve. When the model is uncertain it fails closed: audio is not forwarded unless the system is confident the speech is addressed to the device. So the conservative default errs toward a possibly-missed command rather than a spurious trigger. Loosen the threshold to forward more readily; tighten it to suppress more aggressively.
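The fail-closed rule can be illustrated with a small client-side decision function. This is a sketch of the idea, not SAA's internal logic. The field names aligned_class and confidence and the class id 2 follow the integration samples in this README; the `Prediction` stand-in, the 0-to-1 confidence scale, and the handling of a missing score are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

ADDRESSED = 2  # class id used in the README samples: 2 = addressed to the device


@dataclass
class Prediction:
    """Stand-in for an SDK prediction event (hypothetical type)."""
    aligned_class: int
    confidence: Optional[float]


def should_forward(p: Prediction, threshold: float) -> bool:
    """Fail closed: forward only on a confident 'addressed' prediction."""
    if p.confidence is None:  # no score means uncertain, so do not forward
        return False
    return p.aligned_class == ADDRESSED and p.confidence >= threshold
```

Lowering `threshold` forwards more readily (fewer missed commands, more spurious triggers); raising it does the opposite.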
What about my language? The acoustic model is designed to be language-agnostic, but cross-lingual validation is ongoing and cross-lingual recall is a known limitation under active work. English is where confidence is highest today; test carefully in other languages and in heavy-overlap (simultaneous-speech) environments before relying on it.
How accurate is it? On the held-out evaluation in the technical report (arXiv:2604.08412): 0.86 F1 audio-only, 0.95 F1 audio + video fusion. Two caveats up front: it fails closed under distribution shift, and cross-lingual recall is a known limitation. The paper figures demonstrate the approach; real-world performance varies by acoustic environment.
Is it open source? The client SDKs are Apache-2.0. The model, weights, and inference are a hosted service, not open source; think of SAA as a hosted API with permissively-licensed client libraries, not a self-hostable model.
Apache-2.0 across the repo: each package and the examples ship under it (see each subtree's LICENSE). The hosted cloud service is governed by the attention labs Terms of Service.
SECURITY.md · CONTRIBUTING.md · CODE_OF_CONDUCT.md · CHANGELOG.md · NOTICE · CITATION.cff
An attention labs project. © 2026 Socero Inc.


