A voice AI assistant that answers calls over SIP, transcribes speech with Sber SaluteSpeech (STT), generates replies with any OpenAI-compatible LLM, and speaks them back using Silero TTS. It can serve as a real-time voice channel for OpenClaw AI agents via OpenClaw's OpenAI-compatible API.
Callers speak to the assistant over a regular phone line (SIP). The assistant transcribes their speech with Sber STT, generates a response with any OpenAI-compatible LLM (including OpenClaw's built-in API), and speaks it back using Silero TTS — all in real time.

    Caller ◄──SIP──► LiveKit SIP Trunk ◄──WebRTC──► Agent (LiveKit SDK)
                                                      │
                                             ┌────────┼────────┐
                                             ▼        ▼        ▼
                                         Sber STT    LLM   Silero TTS
                                                      │
                                           OpenClaw   │   OpenAI-compatible
                                           AI Agent   │   (Ollama, GPT, ...)

- 🤖 OpenClaw AI voice channel — use any OpenClaw agent as the LLM brain via its OpenAI-compatible API
- 📞 SIP telephony — inbound + outbound calls via LiveKit SIP Trunk
- 🎙️ Sber SaluteSpeech STT — gRPC streaming speech recognition (Russian language)
- 🗣️ Silero TTS — self-hosted neural text-to-speech (HTTP microservice)
- 🧠 Any LLM — OpenAI-compatible API (OpenClaw, Ollama, GPT, Claude, etc.)
- 🔇 No VAD needed — server-side endpointing via Sber's EOU detection
- 🛡️ Confirmation phrases — instant "one moment" playback while LLM thinks (no silence gaps)
- 🔌 Modular services — STT, TTS, auth all run as separate microservices
- 🐳 Docker Compose: single `up -d` to start everything
- ⚠️ Resilience: configurable retry, apology playback on LLM errors, silence timeout → hangup
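
The resilience behaviour listed above (retry, then an apology when the LLM keeps failing) could look roughly like the sketch below. It is an illustration only, not the project's code: `ask_llm_with_retry`, `call_llm`, `max_attempts`, `backoff_s`, and the apology text are hypothetical names chosen for the example.

```python
import time
from typing import Callable

# Hypothetical fallback phrase; the real wording is configured by the project.
APOLOGY = "Sorry, something went wrong. Please try again."


def ask_llm_with_retry(
    call_llm: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    backoff_s: float = 0.0,
) -> str:
    """Call the LLM up to max_attempts times; return an apology if all fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_llm(prompt)
        except Exception:
            # Wait a little longer after each failure before retrying.
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)
    return APOLOGY
```

The caller would pass the returned text to TTS either way, so the line never goes silent after a failed LLM turn.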

To start the stack:

    # 1. Copy and fill in environment
    cp .env.example .env
    # 2. Start all services
    docker compose up -d

| Variable | Description |
|---|---|
| `LIVEKIT_URL` | `ws://<host>:7880` |
| `LIVEKIT_API_KEY` | From `livekit.yaml` |
| `LIVEKIT_API_SECRET` | From `livekit.yaml` |
| `EXTERNAL_IP` | Server public IP |
| `LLM_BASE_URL` | OpenAI-compatible LLM endpoint (e.g. `http://openclaw:8080/v1` for OpenClaw) |
| `LLM_API_KEY` | LLM API key (use `ollama` for Ollama, `openclaw` for OpenClaw) |
| `SBER_CLIENT_ID` | Sber RCE key Client ID |
| `SBER_CLIENT_SECRET` | Sber RCE key secret (base64) |
| `SIP_OUTBOUND_TRUNK_ID` | LiveKit outbound trunk ID (for outbound calls) |

See .env.example for the full list.
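
As a rough illustration of how a service might validate these variables at startup, here is a minimal sketch. The `Settings` and `load_settings` names are hypothetical, and treating `SIP_OUTBOUND_TRUNK_ID` as optional is an assumption based on its "for outbound calls" description; the project's own `config.py` may work differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variables from the table above that this sketch assumes are mandatory.
REQUIRED = (
    "LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET", "EXTERNAL_IP",
    "LLM_BASE_URL", "LLM_API_KEY", "SBER_CLIENT_ID", "SBER_CLIENT_SECRET",
)


@dataclass(frozen=True)
class Settings:
    values: Mapping[str, str]
    sip_outbound_trunk_id: Optional[str]  # only needed for outbound calls


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast with one clear error listing every missing variable."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        values={name: env[name] for name in REQUIRED},
        sip_outbound_trunk_id=env.get("SIP_OUTBOUND_TRUNK_ID") or None,
    )
```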
The LiveKit agent runs in two modes, determined automatically from dispatch metadata:

| Mode | Trigger | Behaviour |
|---|---|---|
| inbound | metadata has no `phone_number` | Agent waits for an incoming SIP call |
| outbound | metadata has `"phone_number": "+7..."` | Agent dials out and starts speaking |

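
The mode selection can be pictured as a small function over the dispatch metadata string. This is a sketch under assumptions: `resolve_mode` is a hypothetical helper, and falling back to inbound mode for empty or unparsable metadata is a guess rather than documented behaviour.

```python
import json
from typing import Optional, Tuple


def resolve_mode(metadata: str) -> Tuple[str, Optional[str]]:
    """Return ("outbound", number) if metadata has phone_number, else ("inbound", None)."""
    try:
        data = json.loads(metadata) if metadata else {}
    except json.JSONDecodeError:
        data = {}  # assumption: unparsable metadata is treated as inbound
    number = data.get("phone_number") if isinstance(data, dict) else None
    return ("outbound", number) if number else ("inbound", None)
```

For example, `resolve_mode('{"phone_number": "+71234567890"}')` selects outbound mode with that number, while an empty string selects inbound mode.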

    ┌──────────┐     ┌──────────────┐     ┌───────────┐      ┌──────────┐     ┌───────────┐
    │  Caller  │ SIP │ LiveKit SIP  │ WS  │   Agent   │ gRPC │ Sber STT │     │    LLM    │
    │          │────►│    Trunk     │────►│ (LiveKit  │─────►│ (Salute- │     │(OpenClaw /│
    │          │     │              │     │   SDK)    │      │  Speech) │     │ Ollama /  │
    │          │◄────│              │◄────│           │◄─────│          │     │   GPT)    │
    │          │ SIP │              │ WS  │           │ HTTP │          │     │           │
    └──────────┘     └──────────────┘     │           │      └──────────┘     └───────────┘
                                          │           │◄────── HTTP ─────────┐
                                          │           │                      │
                                          └───────────┘          ┌───────────┴───────────┐
                                                                 │      Silero TTS       │
                                                                 │     (tts-service)     │
                                                                 └───────────────────────┘

- Call arrives — SIP provider rings LiveKit SIP Trunk
- Agent joins — LiveKit dispatches the call to the agent
- Listening — agent opens a gRPC stream to Sber STT and listens for speech
- User speaks — audio is streamed to Sber, which detects end-of-utterance (EOU)
- Confirmation — agent instantly plays a short "one moment" phrase via Silero TTS
- LLM turn — transcript is sent to the LLM (OpenClaw, Ollama, GPT, etc.); the response is streamed back
- Response spoken — LLM text is synthesised by Silero TTS and played to the caller
- Loop — agent returns to listening state for the next turn
Turn handling is strict — no overlap between user and agent speech:
- `allow_interruptions=False`: Sber transcripts received during agent TTS are ignored
- `discard_audio_if_uninterruptible=False`: no audio filtering, Sber decides
- Server-side endpointing via Sber's EOU signal, no client-side VAD

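
Conceptually, the first rule behaves like a gate in front of the transcript handler. The sketch below models that idea only; `TurnGate` is a hypothetical class for illustration and is not part of the LiveKit SDK or this project.

```python
from typing import List, Optional


class TurnGate:
    """Drops transcripts that arrive while the agent is speaking."""

    def __init__(self) -> None:
        self.agent_speaking = False
        self.accepted: List[str] = []

    def on_transcript(self, text: str) -> Optional[str]:
        if self.agent_speaking:
            return None  # ignored: the agent cannot be interrupted
        self.accepted.append(text)
        return text
```

Setting `agent_speaking` to `True` while TTS audio plays and back to `False` afterwards would reproduce the strict, non-overlapping turns described above.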

| Service | Container | Role |
|---|---|---|
| livekit | livekit | WebRTC SFU (signalling + media), v1.12 |
| redis | redis | LiveKit coordination |
| lk-tts | tts-service | Silero TTS HTTP microservice |
| lk-auth | auth-service | Sber OAuth 2.0 token management |
| lk-inbound | agent | LiveKit Agent: SIP ↔ LLM orchestration |

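
Token management of the kind the auth service performs usually amounts to caching a token and refreshing it shortly before it expires. The sketch below shows that pattern in general terms. `TokenCache`, `fetch`, and `refresh_margin_s` are hypothetical names, and the actual Sber OAuth request is deliberately left to the injected `fetch` callable.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Caches a token and refetches it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # returns (token, expires_at in epoch seconds)
        refresh_margin_s: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh when there is no token yet or it is inside the safety margin.
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, self._expires_at = self._fetch()
        return self._token
```

Injecting the clock keeps the class easy to test without waiting for real tokens to expire.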

SIP setup with the `lk` CLI:

    lk sip inbound create inbound-trunk.json
    lk sip inbound list    # save trunk_id

    MANGO_PASSWORD=$MANGO_PASSWORD lk sip outbound create outbound-trunk.json
    lk sip outbound list   # save trunk_id → SIP_OUTBOUND_TRUNK_ID

    lk sip dispatch create dispatch-rule.json

    lk dispatch create \
      --new-room \
      --agent-name sber-voice-assistant \
      --metadata '{"phone_number": "+71234567890"}'

Project layout:

    livekit-agent/
    ├── my_agent/                     # LiveKit agent package
    │   ├── session.py                # CallSession: turn orchestration
    │   ├── plugin_stt.py             # WebSocket STT plugin (→ Sber)
    │   ├── plugin_tts.py             # HTTP TTS plugin (→ Silero)
    │   ├── plugin_tts_transforms.py  # Text transforms (digits → words)
    │   ├── sentence_splitter.py      # Aggressive sentence tokenizer for TTS
    │   ├── http_api.py               # FastAPI (/call, /hangup)
    │   └── config.py                 # Centralised configuration
    ├── stt_service/                  # STT microservice
    │   ├── server.py                 # HTTP/WebSocket entrypoint
    │   ├── sber_stt.py               # gRPC streaming client (Sber v2)
    │   └── token_manager.py          # Sber OAuth 2.0 token management
    ├── tts_service/                  # TTS microservice
    │   ├── server.py                 # HTTP entrypoint
    │   ├── tts_engine.py             # Silero TTS wrapper
    │   └── translit.py               # Latin → Cyrillic transliteration
    └── tests/                        # Pytest suite (52 tests)

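
To give a feel for what a sentence tokenizer in front of TTS does, here is a minimal sketch that turns streamed LLM text chunks into complete sentences. It is not the contents of `sentence_splitter.py`: `split_stream` is a hypothetical function, and splitting only on `.`, `!`, `?`, and `…` followed by whitespace is an assumption made for the example.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at terminal punctuation followed by whitespace.
_BOUNDARY = re.compile(r"[.!?…]+\s+")


def split_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in the chunk stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = _BOUNDARY.search(buffer)
        while match:
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
            match = _BOUNDARY.search(buffer)
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends
```

Emitting each sentence as soon as it is complete lets synthesis start before the LLM has finished its whole reply.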
Contributions are welcome! Please see CONTRIBUTING.md.
See SECURITY.md for our security policy and vulnerability reporting process.
This project is licensed under the MIT License — see the LICENSE file for details.
This project uses several open-source components with different licenses. See NOTICE.md for full attribution and license information, including:
- Apache 2.0 — livekit-agents, livekit-plugins-openai, requests, grpcio, protobuf
- LGPL — num2words
- CC BY-NC-SA 4.0 — Silero TTS model weights (non-commercial use only)