
attenlabs/saa-sdk

SAA: Selective Auditory Attention

Tells your voice agent which speech is actually for it.

One decision per utterance: only addressee speech reaches your STT, LLM, and TTS. No wake word.


Drop-in for the voice-agent stack you already use:

Pipecat    LiveKit    ElevenLabs Conversational AI    Twilio Media Streams

What is SAA?

A voice agent's microphone hears every voice in the room: yours, a coworker's, the kids, a podcast playing on the laptop, the agent's own TTS bleeding back through the speakers. Most pipelines respond to any of it, paying STT for every transcribed second and triggering the LLM on speech that was never directed at the device.

Demos:

  • Single device · robot: a Pollen Robotics Reachy robot listening. Only addressed speech wakes the robot.
  • Single device · laptop: a live laptop session in which SAA gates speech in a real browser tab. The pill flips green only when the user is addressing the laptop; ambient room speech stays gray.
  • Multi-device · two robots: two Pollen Robotics Reachy robots in the same room, hearing the same audio. Only the addressed robot acts; the other stays still.

SAA (Selective Auditory Attention) is a hosted classifier that runs before STT and decides, per utterance, whether the speech was directed at the device. Side talk, background media, and the agent's own playback are filtered out, so your STT / LLM / TTS only see audio meant for the agent.

  • No wake word. SAA decides per-utterance from the audio (and optionally low-rate video) stream.
  • Hosted. A real-time WebSocket to attention labs' cloud; the model, weights, and inference run server-side, and these Apache-2.0 client SDKs are thin: they capture, encode, and stream audio (and optional low-rate video). Because it gates before STT, only addressed speech reaches the STT, LLM, and TTS you already run, so your downstream services and logs see less audio, not more. On-device deployment is a separate enterprise licence.

Ways to integrate

Shape | Package | Use it when
Streaming SDK | @attenlabs/saa-js, attenlabs-saa | your app captures the audio/video itself and you want typed attention events to gate your own pipeline. Good for web agents, mall kiosks, drive-through agents, and robots.
LiveKit | saa-livekit-client | you run a LiveKit Agents voice agent. SAA joins your room and gates the session.
Pipecat (Daily) | saa-pipecat-client | you run a Pipecat voice agent on Daily. SAA joins your Daily room and gates the pipeline through the "saa" app-message topic.
ElevenLabs | attenlabs-saa | you run an ElevenLabs Conversational AI agent. SAA gates it via the streaming SDK's feed_audio (its room is sealed, so SAA can't join it directly).
Twilio | attenlabs-saa | you run a Twilio Media Streams telephony agent. SAA gates inbound/outbound call audio (μ-law 8 kHz resampled to PCM16) via the streaming SDK's feed_audio.

Install

npm install @attenlabs/saa-js     # JavaScript / browser
pip install attenlabs-saa          # Python (streaming SDK)
pip install saa-livekit-client     # Python (LiveKit)
pip install saa-pipecat-client     # Python (Pipecat on Daily)

Get an API key at attentionlabs.ai/dashboard.

Streaming SDK

You capture the media; SAA emits typed events. The key event is turnReady (JavaScript) / turn_ready (Python): one device-directed utterance, captured and ready to forward to your STT or LLM.

import { AttentionClient } from "@attenlabs/saa-js";

const client = new AttentionClient({ token: process.env.SAA_API_KEY });

// fires once per device-directed turn; turn.audioBase64 is PCM16 @ 16 kHz
client.on("turnReady", (turn) => yourSTT.send(turn.audioBase64));

await client.start({ videoElement: document.querySelector("video") });

# Python
import os
from saa import AttentionClient

client = AttentionClient(token=os.environ["SAA_API_KEY"])

@client.on_turn_ready
def _(turn):
    # turn.audio_base64, PCM16 @ 16 kHz mono; turn.audio_pcm16, np.int16 array
    your_stt.send(turn.audio_base64)

client.start()

For audio-only deployments, omit videoElement (browser) or pass enable_video=False (Python).

Both SDKs also emit prediction, vad, state, interrupt, config, and stats events, and expose mute() / unmute(), setThreshold() / set_threshold(), and markResponding() / mark_responding(). See packages/saa-js and packages/saa-py.

Runnable end-to-end demos are in examples/web/ (browser) and examples/python/ (terminal).
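The turn_ready payload described above is base64-encoded PCM16 at 16 kHz mono. As a minimal sketch of what that implies for downstream code, the helper below decodes such a payload and computes its duration using only the standard library. The function names are hypothetical and not part of the SDK, little-endian byte order is an assumption, and in Python the SDK already exposes turn.audio_pcm16, so this is mainly useful for logging or for consumers that only receive the base64 string.

```python
import base64
import struct

SAMPLE_RATE_HZ = 16_000  # from the README: PCM16 @ 16 kHz mono


def decode_turn_audio(audio_base64: str) -> list[int]:
    """Decode a base64 PCM16 payload into a list of signed 16-bit samples."""
    raw = base64.b64decode(audio_base64)
    count = len(raw) // 2  # two bytes per sample
    # Assumption: little-endian byte order, which the README does not state.
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))


def duration_seconds(samples: list[int]) -> float:
    """Length of the turn in seconds at the stated sample rate."""
    return len(samples) / SAMPLE_RATE_HZ


# Synthetic payload: one second of silence, standing in for turn.audio_base64.
fake_payload = base64.b64encode(bytes(2 * SAMPLE_RATE_HZ)).decode("ascii")
samples = decode_turn_audio(fake_payload)
print(duration_seconds(samples))  # 1.0
```

Inside a real handler you would pass turn.audio_base64 in place of fake_payload.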

LiveKit

For LiveKit Agents, saa-livekit-client brings SAA into your room to run the classifier and publish events on the "saa" data topic. Your agent consumes them through AttentionEngine and gates the session.

from saa_livekit_client import AttentionEngine, attention_agent_token, start_attention_session

saa = await start_attention_session(
    api_key=SAA_API_KEY, livekit_url=LIVEKIT_URL,
    agent_token=attention_agent_token(api_key=LK_KEY, api_secret=LK_SECRET, room_name=ctx.room.name),
    room_name=ctx.room.name, participant_identity=user.identity,
)
engine = AttentionEngine(ctx.room, agent_identity=saa.agent_identity)

@engine.on_prediction
def _(p):
    session.input.set_audio_enabled(p.aligned_class == 2)   # the gate: 2 = addressed to the device

await engine.start()

Two runnable samples, an OpenAI Realtime agent and a vanilla-JS web client, are in examples/livekit/.

Pipecat on Daily

For Pipecat voice agents running on Daily, saa-pipecat-client brings SAA into your Daily room and publishes events on Daily's app-message channel under the "saa" topic. Your bot consumes them through AttentionEngine (which subscribes via your DailyTransport) and gates the pipeline.

from saa_pipecat_client import AttentionEngine, attention_agent_token, start_attention_session

saa = await start_attention_session(
    api_key=SAA_API_KEY, room_url=ROOM_URL,
    agent_token=attention_agent_token(daily_api_key=DAILY_API_KEY, room_name=room_name),
    participant_identity=human_identity,
)
engine = AttentionEngine(transport, agent_identity=saa.agent_identity)
engine.bind_task(task)

@engine.on_prediction
def _(p):
    addressee_gate.suppressed = (p.aligned_class == 1 and p.confidence > 0.7)

await engine.start()

A runnable web-client sample is in examples/pipecat/.

ElevenLabs

ElevenLabs Conversational AI runs its agent in a sealed WebRTC room, so SAA can't join it directly. Instead the streaming SDK runs in feed mode: you hand it the agent's microphone audio through feed_audio and gate the agent on SAA's prediction events.

from saa import AttentionClient

# feed mode: the SDK captures nothing itself; you supply the audio
saa = AttentionClient(token=SAA_API_KEY, enable_audio=False, enable_video=False)

@saa.on_prediction
def _(p):
    mic_to_agent.enabled = (p.aligned_class == 2)   # 2 = addressed to the device

saa.start()
# in ElevenLabs' AudioInterface input callback:
saa.feed_audio(mic_pcm16)

A runnable sample is in examples/elevenlabs/.

Twilio

For Twilio Media Streams telephony agents, the streaming SDK runs in feed mode over the call audio. The adapter transcodes Twilio's μ-law 8 kHz frames to PCM16 16 kHz, feeds them to SAA, and forwards only device-directed turns to your bridge, so side talk, hold music, and the agent's own TTS echo are gated out.

from saa import AttentionClient

saa = AttentionClient(token=SAA_API_KEY, enable_audio=False, enable_video=False)

@saa.on_turn_ready
def _(turn):
    bridge.on_speech(turn.audio_base64)   # only device-directed call audio continues

saa.start()
# in the Twilio media handler, after decoding μ-law -> PCM16:
saa.feed_audio(pcm16_frames)

A runnable Media Streams bridge (codec, paced outbound, automatic mark_responding) is in examples/twilio/.
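To make the transcoding step concrete, here is a minimal standard-library sketch of decoding Twilio's base64 μ-law 8 kHz payload to PCM16 at 16 kHz. It is not the adapter shipped in examples/twilio/: the function names are hypothetical, the μ-law decode follows the standard G.711 expansion, and the 2x upsampler is plain linear interpolation, where a production bridge would likely use a properly filtered resampler.

```python
import base64
import struct

ULAW_BIAS = 0x84  # G.711 mu-law bias (132)


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if u & 0x80 else magnitude


def upsample_2x(samples: list[int]) -> list[int]:
    """Naive 8 kHz -> 16 kHz upsampling by linear interpolation."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)   # midpoint between neighbours
    return out


def twilio_payload_to_pcm16(payload_b64: str) -> bytes:
    """Base64 mu-law @ 8 kHz in, little-endian PCM16 @ 16 kHz out."""
    ulaw = base64.b64decode(payload_b64)
    pcm_16k = upsample_2x([ulaw_to_linear(b) for b in ulaw])
    return struct.pack(f"<{len(pcm_16k)}h", *pcm_16k)


# One 20 ms Twilio frame is 160 mu-law bytes; 0xFF decodes to silence.
frame = base64.b64encode(bytes([0xFF] * 160)).decode("ascii")
print(len(twilio_payload_to_pcm16(frame)))  # 640
```

Whether feed_audio expects raw bytes or an int16 array is not stated here, so treat the bytes return type as an assumption and adapt it to what the SDK accepts.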

Proactive agents (speak first)

The streaming SDKs expose markResponding(true) / mark_responding(True) so the agent can signal when it is the one speaking. SAA suppresses the gate during the agent's own TTS and resumes once the tail clears. The LiveKit and Pipecat bridges expose the same lifecycle through engine.responding_start() / engine.responding_stop().
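One way to keep the responding flag from getting stuck is to wrap TTS playback in a context manager. The sketch below is an assumption-laden illustration, not SDK code: responding() and FakeClient are hypothetical names, and it assumes that calling mark_responding(False) is how the flag is cleared, which this README does not confirm.

```python
from contextlib import contextmanager


@contextmanager
def responding(client):
    """Mark the agent as speaking for the duration of the with-block."""
    client.mark_responding(True)
    try:
        yield
    finally:
        # Assumption: mark_responding(False) clears the flag.
        # Runs even if TTS playback raises, so the gate is never left suppressed.
        client.mark_responding(False)


class FakeClient:
    """Stand-in for the SDK client so this sketch runs without a network."""

    def __init__(self):
        self.calls = []

    def mark_responding(self, flag):
        self.calls.append(flag)


client = FakeClient()
with responding(client):
    pass  # play TTS audio here
print(client.calls)  # [True, False]
```

With a real AttentionClient you would pass it in place of FakeClient and play your TTS audio inside the with-block.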

How it composes

SAA is the model-agnostic addressee decision between your VAD and STT. It answers a different question than VAD (is anyone speaking), speaker diarization (which voice it is), turn detection (have they finished), or a wake word (did they say the phrase), so it composes with those layers and can make the wake word optional.

Where SAA sits in your voice stack: noise suppression and VAD upstream, SAA addressee gate, then STT → LLM → TTS downstream
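The ordering in that diagram can be sketched as a toy pipeline in which the addressee gate sits in front of STT. Everything here is a stand-in: run_pipeline, fake_stt, and the "addressed" label on each utterance are hypothetical, and in a real deployment the decision comes from SAA's events rather than a field on the data.

```python
def run_pipeline(utterances, is_addressed, transcribe):
    """Forward only addressed utterances to STT; drop the rest before transcription."""
    transcripts = []
    for utterance in utterances:
        if is_addressed(utterance):          # the addressee gate
            transcripts.append(transcribe(utterance))
    return transcripts


stt_calls = []


def fake_stt(utterance):
    """Pretend STT that records how often it is invoked."""
    stt_calls.append(utterance["text"])
    return utterance["text"].upper()


# Utterances as they might leave the VAD; the labels are illustrative only.
utterances = [
    {"text": "turn on the lights", "addressed": True},
    {"text": "did you see the game", "addressed": False},
    {"text": "what time is it", "addressed": True},
]

transcripts = run_pipeline(utterances, lambda u: u["addressed"], fake_stt)
print(transcripts)      # ['TURN ON THE LIGHTS', 'WHAT TIME IS IT']
print(len(stt_calls))   # 2
```

The point of the ordering is visible in the call count: the side-talk utterance never reaches STT, so nothing downstream runs for it.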

On-device deployment

These Apache-2.0 client SDKs stream to the SAA cloud. For deployments where audio must stay on the device (telephony, embedded systems, wearables, robotics, kiosks), request on-device and embedded access at attentionlabs.ai.

Documentation

How SAA compares

Layer | The question it answers | What it does not tell you
Voice activity detection (VAD) | Is anyone speaking right now? | Whether that speech was meant for the agent
Speaker diarization / speaker ID | Who is speaking? | Whether that person is addressing the agent
Turn detection / endpointing | Has the speaker finished their turn? | Whether the turn was directed at the agent
Wake word | Did they say the trigger phrase? | Natural address with no keyword (and it degrades across multi-turn / hands-busy use)
Noise suppression / audio enhancement | Is the audio clean? | Whether the clean audio was directed at the agent
SAA (this SDK) | Was this utterance directed at the device? | It does not transcribe, identify the speaker, or denoise; it gates, then hands addressed audio to the layers you already run

SAA composes with every layer above; it does not replace them (it can make a wake word optional where wake-word UX is the bottleneck).

FAQ

Isn't this just VAD? No. VAD answers "is anyone speaking?" and fires on any voice: a coworker, the TV, the person next to your user. SAA answers "was this speech directed at the device?" and runs after VAD, so the pipeline only wakes on audio actually meant for the agent.

Is this speaker identification or diarization? No. Speaker ID answers "who is speaking" and diarization answers "who said what." SAA answers "who is being spoken to." An enrolled user can address the person beside them (not for the agent); an unenrolled stranger can speak straight to the device (for the agent). It is an orthogonal primitive: no enrollment, no speaker profiles.

Does it run on-device? No. SAA is a hosted service. The npm and PyPI packages are Apache-2.0 thin clients that capture, encode, and stream audio (and optional low-rate video) to the hosted inference endpoint; the model and weights run server-side. On-device and embedded delivery is a separate enterprise programme; it is not part of the public SDK.

Does it add latency? Yes: one hosted round-trip before STT. SAA does not make addressed speech faster. What it removes is the full STT + LLM + TTS pipeline firing on audio that was never addressed to the agent. Whether that is net-positive depends on how much non-directed audio your environment carries.

Why bother? What does it save me? In multi-talker environments, a large share of what your VAD flags was never meant for the agent. SAA drops that audio before STT is invoked, so you transcribe and reason over fewer non-directed seconds and get fewer ghost responses. The cheapest token is the one you never spend.

Will the agent respond to its own voice? Not if you tell SAA when the agent is talking. Call markResponding(true) / mark_responding(True) when your TTS starts; SAA suppresses the gate during playback and resumes once the tail clears. No mute-mic hack.

How do I trade off false positives vs. false negatives? setThreshold() / set_threshold() moves the operating point on the precision-recall curve. When the model is uncertain it fails closed: audio is not forwarded unless the system is confident the speech is addressed to the device. So the conservative default errs toward a possibly-missed command rather than a spurious trigger. Loosen the threshold to forward more readily; tighten it to suppress more aggressively.
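The fail-closed behaviour can be illustrated with a small client-side sketch. This is not how set_threshold() is implemented, which this README does not describe; Prediction and should_forward are hypothetical names that mirror the aligned_class and confidence fields used in the examples above, and the prediction values are made up for illustration.

```python
from dataclasses import dataclass

ADDRESSED = 2  # the README examples treat aligned_class == 2 as "addressed to the device"


@dataclass
class Prediction:
    aligned_class: int
    confidence: float


def should_forward(p: Prediction, threshold: float) -> bool:
    """Fail closed: forward only confident 'addressed' predictions."""
    return p.aligned_class == ADDRESSED and p.confidence >= threshold


# Illustrative values only, not real model output.
predictions = [
    Prediction(2, 0.95),   # clearly addressed
    Prediction(2, 0.60),   # addressed, but uncertain
    Prediction(1, 0.99),   # confidently some other class
    Prediction(2, 0.40),   # weak evidence
]

loose = [should_forward(p, 0.5) for p in predictions]
tight = [should_forward(p, 0.9) for p in predictions]
print(loose)  # [True, True, False, False]
print(tight)  # [True, False, False, False]
```

Raising the threshold from 0.5 to 0.9 drops the uncertain second utterance: fewer spurious triggers, at the cost of a possibly missed command.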

What about my language? The acoustic model is designed to be language-agnostic, but cross-lingual validation is ongoing and cross-lingual recall is a known limitation under active work. English is where confidence is highest today; test carefully in other languages and in heavy-overlap (simultaneous-speech) environments before relying on it.

How accurate is it? On the held-out evaluation in the technical report (arXiv:2604.08412): 0.86 F1 audio-only, 0.95 F1 audio + video fusion. Two caveats up front: it fails closed under distribution shift, and cross-lingual recall is a known limitation. The paper figures demonstrate the approach; real-world performance varies by acoustic environment.

Is it open source? The client SDKs are Apache-2.0. The model, weights, and inference are a hosted service, not open source; think of SAA as a hosted API with permissively-licensed client libraries, not a self-hostable model.

License

Apache-2.0 across the repo: each package and the examples ship under it (see each subtree's LICENSE). The hosted cloud service is governed by the attention labs Terms of Service.

SECURITY.md · CONTRIBUTING.md · CODE_OF_CONDUCT.md · CHANGELOG.md · NOTICE · CITATION.cff


An attention labs project. © 2026 Socero Inc.

About

Addressee detection for voice agents: device-directed speech detection that runs before STT, so background speech, side conversations, and the agent's own TTS echo never trigger it. No wake word, model-agnostic, drop-in for LiveKit, Pipecat, ElevenLabs, Twilio, and OpenAI. The layer your VAD and turn detection are missing.
