Fully local AI voice assistant for macOS. Speaks into any live call in your cloned voice — no cloud APIs required.
Saymo composes short, natural speech from optional data sources (tracker, notes, text files), synthesizes it with voice cloning, and routes audio into the active call through a virtual microphone. Everything — language model, speech-to-text, text-to-speech — runs on-device.
- Local: Ollama + faster-whisper + Coqui XTTS v2 (or Piper / macOS `say` as fallback).
- Voice cloning: 5-minute sample → your voice, fine-tuning optional.
- Routing: BlackHole virtual mic → any browser-based call app.
- Call automation: Chrome-driven mute/unmute for 8 providers (Glip, Zoom, Google Meet, MS Teams, Telegram, Yandex Telemost, VK Teams, MTS Link).
- Listening mode: auto-detects when your name is called, answers questions from provided context.
- User-configurable prompts and vocabulary — no source edits required.
Project status: early public alpha. Expect rough edges. Contributions welcome.
- macOS with Apple Silicon (M1/M2/M3/M4), arm64 terminal, not Rosetta
- Python 3.11+
- Homebrew
- Google Chrome
- ~10 GB free disk space
git clone https://github.com/acme/saymo && cd saymo
./setup.sh

setup.sh is the master orchestrator. It walks you through:
- Saymo core (uv venv, Ollama, Whisper, BlackHole)
- F5-TTS Russian voice cloning (recommended)
- XTTS+RVC pipeline (optional alternative)
- Interactive wizard for ~/.saymo/config.yaml
Each step asks before doing anything heavy. Re-runnable; skips what's already installed. Total time on a fresh Mac: ~30 minutes (most spent on model downloads).
For the full walkthrough see docs/QUICK-START.md.
After setup.sh finishes, run:
saymo test-tts "Привет, это тест" # Check that TTS works
saymo test-devices # Verify audio devices

To re-configure later:
saymo setup # Interactive: name, devices, TTS engine
saymo record-voice -d 12 # Record a fresh ~12s voice reference

┌─────────────────────────────────────────────────────────────┐
│  Audio MIDI Setup                                           │
│  Create "Multi-Output Device":                              │
│    ✓ Your headphones (master, no drift correction)          │
│    ✓ BlackHole 16ch (drift correction ON)                   │
│                                                             │
│  In your call app:                                          │
│    Microphone → BlackHole 2ch                               │
│    Speakers → Multi-Output Device                           │
└─────────────────────────────────────────────────────────────┘
# Before the call: prepare text + cached audio
saymo prepare -p personal
saymo prepare-responses # pre-synthesize the Q&A library for live mode
saymo rehearsal -p personal --text "Your Name, what is the status?"
# safe dry-run: trigger + handbrake + playback gates
saymo auto-preflight -p personal
saymo review # optional: check generated audio
# During the call
saymo speak -p personal # manual trigger, instant playback
saymo auto -p personal # listen for your name, speak when called
saymo auto -p personal --mic # same, but from laptop mic (for testing)
saymo trigger-capture -p personal --window 8
# save call windows classified for trigger training
# Extras
saymo dashboard # interactive TUI

./scripts/add_hotkeys.py
saymo hotkey-doctor -p personal
saymo takeover-check -p personal

Default hotkeys:

| Hotkey | Action |
|---|---|
| Cmd+Shift+S | Speak the prepared cached standup immediately in saymo auto |
| Cmd+Shift+U | Manual takeover: stop Saymo playback, pause auto-mode, switch the call mic to your real mic when the provider supports it; press again to return to BlackHole 2ch and resume |
| Cmd+Shift+X | Stop current Saymo playback |
| Cmd+Shift+M | Toggle Auto Answer on/off |
| Cmd+Shift+A | Optional: approved draft speak request |
| Cmd+Shift+K | Optional: skip current candidate |
| Cmd+Shift+C | Optional: add a capture marker for later review |
| Cmd+Shift+I | Optional: print live-control status |
| Cmd+Shift+H | Optional: hard-disable auto-answer and stop playback |
Mic switching is automatic for providers that implement switch_mic()
(glip, mts_link). For other providers, Saymo still pauses/resumes; switch
the call microphone manually in the meeting UI.
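The split between providers that implement switch_mic() and those that do not can be pictured as a capability check. The sketch below is illustrative only: the class names, the `takeover` helper, and the `switch_mic(device_name)` signature are assumptions, not Saymo's actual provider API.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsMicSwitch(Protocol):
    """Hypothetical capability: the provider can change the call microphone."""

    def switch_mic(self, device_name: str) -> bool: ...


class SwitchingProvider:
    """Stand-in for a provider such as glip or mts_link (assumed behavior)."""

    def __init__(self) -> None:
        self.current_mic = "BlackHole 2ch"

    def switch_mic(self, device_name: str) -> bool:
        self.current_mic = device_name
        return True


class MuteOnlyProvider:
    """Stand-in for a provider that only supports mute/unmute."""


def takeover(provider: object, real_mic: str) -> str:
    """Return 'switched' if the mic was changed, else 'manual'."""
    if isinstance(provider, SupportsMicSwitch) and provider.switch_mic(real_mic):
        return "switched"
    # Playback is still paused by the caller; the user changes the mic by hand.
    return "manual"
```

With this shape, a provider gains automatic switching simply by defining the method; nothing else in the takeover path needs to know which provider is active.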
Use saymo takeover-check -p personal before a call to verify mic switching
against the active Chrome tab.
Hotkeys are global after saymo auto starts, so the terminal does not need to
be focused. The Saymo process must still be running, and macOS must allow
Accessibility access for the terminal app. For a visible button-style panel,
run saymo live-control -p personal in another terminal. See
docs/LIVE-CONTROL-HANDBRAKE.md.
For scriptable controls, use saymo live-control-action -p personal --action stop.
Before trusting the live loop, run saymo rehearsal -p personal --text "..."
to create a local dry-run artifact under ~/.saymo/rehearsals/<profile>/.
The rehearsal reports pass/warn/fail readiness, trigger/addressing action,
whether Saymo would speak, and why it would block.
Use trigger-capture when you want to collect real call phrases for improving
trigger detection:
saymo trigger-capture -p personal
saymo trigger-capture -p personal --session daily-2026-05-20
saymo trigger-capture -p personal --device "MacBook Pro Microphone" --duration 60
saymo trigger-eval -p personal

By default it listens on audio.capture_device, normally BlackHole 16ch.
Each window is saved as WAV plus JSON metadata under
~/.saymo/trigger_samples/<profile>/ and separated into:
asked_to_speak, mentioned_me, question, speech, and optional
silence. Add --save-silence only when you also need silent windows for
debugging. Named sessions let you review one meeting run later.
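For a quick look at how many windows landed in each bucket, a few lines of Python are enough. This is a minimal sketch that assumes one subfolder per category with one JSON file per window, as the sample paths in this README suggest; `count_samples` is a hypothetical helper, not part of the Saymo CLI.

```python
from collections import Counter
from pathlib import Path

# Category folder names as listed above; "silence" only exists with --save-silence.
CATEGORIES = ("asked_to_speak", "mentioned_me", "question", "speech", "silence")


def count_samples(profile_dir: Path) -> Counter:
    """Count JSON metadata files per category folder under a profile directory."""
    counts: Counter = Counter()
    for category in CATEGORIES:
        folder = profile_dir / category
        if folder.is_dir():
            counts[category] = sum(1 for _ in folder.glob("*.json"))
    return counts
```

Usage would look like `count_samples(Path.home() / ".saymo" / "trigger_samples" / "personal")`; missing folders simply count as zero.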
Review the saved windows without opening JSON by hand:
saymo trigger-sessions list -p personal
saymo trigger-sessions summary -p personal --session daily-2026-05-20
saymo trigger-samples list -p personal
saymo trigger-samples list -p personal --session daily-2026-05-20 --speaker other
saymo trigger-samples label ~/.saymo/trigger_samples/personal/question/<sample>.json --speaker other
saymo trigger-samples decision ~/.saymo/trigger_samples/personal/question/<sample>.json --decision rejected
saymo trigger-samples category ~/.saymo/trigger_samples/personal/question/<sample>.json --category mentioned_me
saymo trigger-samples review -p personal --session daily-2026-05-20
saymo trigger-samples replay ~/.saymo/trigger_samples/personal/asked_to_speak/<sample>.json
saymo trigger-eval -p personal --promote ~/.saymo/trigger_samples/personal/asked_to_speak/<sample>.json
saymo trigger-classifier train -p personal
saymo trigger-classifier readiness -p personal
saymo trigger-classifier evaluate -p personal
saymo trigger-classifier live-assist enable -p personal
saymo trigger-classifier live-assist status -p personal
saymo trigger-classifier inspect -p personal
saymo trigger-eval -p personal --classifier-shadow
saymo trigger-classifier delete -p personal --yes
saymo trigger-samples report -p personal -o ~/.saymo/trigger_samples/personal-report.md

trigger-eval compares stored and current classification, reports misses and
false positives, groups results by speaker label (me, other, unknown),
and can promote a heard name variant into vocabulary.fuzzy_expansions before
re-running the evaluation. Use trigger-samples label to correct who spoke in a
saved window; old samples without a label are treated as unknown. Use
trigger-samples category and trigger-samples review to correct buckets and
labels after a call. Use trigger-samples decision to mark whether Saymo's
answer decision was accepted or rejected; then trigger-classifier train
writes a local JSON artifact under ~/.saymo/models/trigger_classifier/.
trigger-classifier readiness and evaluate show whether labels are balanced
enough for local assist. trigger-classifier live-assist enable is guarded by
readiness and model-fingerprint checks; in live mode it can only downgrade an
already deterministic answer candidate to skip. trigger-eval --classifier-shadow and trigger-check --classifier-shadow remain available
for non-live diagnostics. Use trigger-classifier inspect or delete to audit
or remove the local artifact. Reports omit raw audio and transcript text.
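The "downgrade only" guarantee of live assist can be stated as a tiny decision function. The sketch below is an illustration under assumptions: the action labels "answer" and "skip" and the function name are hypothetical, and the real gate also depends on the readiness and fingerprint checks described above.

```python
def apply_live_assist(deterministic_action: str, classifier_approves: bool) -> str:
    """Combine the deterministic trigger decision with the classifier's vote.

    The classifier may veto an "answer" candidate, but it can never turn
    any other action into "answer".
    """
    if deterministic_action == "answer" and not classifier_approves:
        return "skip"
    return deterministic_action
```

Because the function returns either the deterministic action or "skip", a badly trained classifier can at worst make Saymo quieter, never more talkative.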
saymo auto works with all Chrome-based call apps — the provider is
picked by meetings.<profile>.provider in config:
| provider: | Service |
|---|---|
| glip (default) | RingCentral Glip |
| zoom | Zoom |
| google_meet | Google Meet |
| ms_teams | Microsoft Teams |
| telegram | Telegram calls (web) |
| telemost | Yandex Telemost |
| vk_teams | VK Teams |
| mts_link | MTS Link |
Measure provider latency against an active Chrome call:
saymo provider-latency -p personal --text "John, что по статусу?"
saymo provider-latency -p personal --audio ~/.saymo/audio_cache/$(date +%F).wav

The probe reports capture, transcription, trigger decision, provider unmute,
playback start, playback duration, and mute recovery. It writes JSON and
Markdown history under ~/.saymo/provider_latency/<profile>/<provider>/ so you
can compare recurring provider bottlenecks.
Run saymo list-plugins to see everything available in your install.
When your name is called and the surrounding transcript looks like a
question, auto consults a pre-synthesised response library and plays
the best-matching cached variant — no network hop, no synthesis lag.
Populate the library once with saymo prepare-responses. Built-in
intents cover status (как дела), blockers, ETA, testing stage, review.
Extend with your own wording via config.responses.library.
On cache miss, you can opt into a live fallback: Ollama composes an answer from your standup summary + JIRA context, the TTS engine synthesizes it, and Saymo plays it back. This adds a few seconds of latency but covers any question. Enable it in config:
responses:
  live_fallback: true

Without live_fallback (default), a cache miss falls back to the
generic standup audio — quiet, reliable, no LLM dependency.
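The three outcomes described above (cached variant, live fallback, generic standup audio) amount to a short routing decision. The following sketch assumes a naive keyword match; the `LIBRARY` contents, `match_intent`, and `route` are hypothetical names for illustration, and Saymo's real matcher may work differently.

```python
from typing import Optional

# Hypothetical intent library: intent name -> phrases that signal it.
LIBRARY = {
    "status": ["как дела", "status"],
    "blockers": ["blocker", "блокер"],
    "review": ["review", "ревью"],
}


def match_intent(transcript: str, library: dict[str, list[str]]) -> Optional[str]:
    """Return the first intent whose keyword occurs in the transcript."""
    text = transcript.lower()
    for intent, keywords in library.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def route(transcript: str, library: dict[str, list[str]],
          live_fallback: bool = False) -> tuple[str, Optional[str]]:
    """Pick the playback path: cached clip, live synthesis, or generic standup."""
    intent = match_intent(transcript, library)
    if intent is not None:
        return ("cached", intent)
    if live_fallback:
        return ("live", None)
    return ("generic_standup", None)
```

Substring matching is deliberately crude here; it is enough to show that the cache is consulted first and that the LLM path is reached only when live_fallback is enabled.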
All LLM prompts are templates loaded from config.yaml → prompts.* at runtime, with sensible generic defaults in source. To customize voice/tone:
prompts:
  standup_ru: |
    Ты — помощник для ежедневных встреч. Составь отчёт на русском...
    {yesterday_notes}
    {today_notes}
  qa_system_ru: |
    Ты — {user_name}, {user_role}. Отвечай кратко, 1-3 предложения...

See config.example.yaml for all available keys and the default set.
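The "config overrides, source defaults" behavior can be sketched in a few lines. This assumes the templates are filled with Python's `str.format`, which matches the `{placeholder}` syntax shown above but is not confirmed by the source; `DEFAULT_PROMPTS` and `render_prompt` are hypothetical names.

```python
# Hypothetical built-in defaults; the real ones live in Saymo's source.
DEFAULT_PROMPTS = {
    "standup_ru": "Summarize:\n{yesterday_notes}\n{today_notes}",
    "qa_system_ru": "You are {user_name}, {user_role}. Answer briefly.",
}


def render_prompt(name: str, user_prompts: dict[str, str], **values: str) -> str:
    """Use the template from config if present, otherwise the built-in default."""
    template = user_prompts.get(name, DEFAULT_PROMPTS[name])
    return template.format(**values)
```

A template that references a placeholder the caller does not supply raises `KeyError`, so a typo in a custom prompt surfaces immediately rather than producing a silently broken prompt.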
Adding your own abbreviations or fuzzy name expansions to the TTS normalizer is done through config, not source:
vocabulary:
  abbreviations:
    MYAPI: "май-эй-пи-ай"
    K8S: "кубернетес"
  fuzzy_expansions:
    Alex: ["Alex", "Алекс", "Саша", "Саня"]

To check whether a live phrase will trigger Saymo before joining a call:
saymo trigger-check -p personal --text "John, что по статусу?"
saymo trigger-check -p personal --text "John, что по статусу?" --classifier-shadow
saymo trigger-check -p personal --mic
saymo auto-preflight -p personal
saymo trigger-setup -p personal --heard "Jon, что по статусу?"
saymo trigger-learn -p personal --heard "Jon"

The diagnostic shows trigger match, whether the mention is addressed to you,
question detection, confirmation behavior, auto-mode action, and response-cache
routing. auto-preflight checks prepared audio, devices, provider readiness,
profile triggers, response-cache coverage, fallback mode, and live tuning.
Use trigger-setup when Whisper consistently hears your name as a different
spelling; it updates vocabulary.fuzzy_expansions and verifies the learned
variant immediately. You can paste the whole transcribed phrase; Saymo extracts
the likely name variant before saving it.
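To see why adding a variant to vocabulary.fuzzy_expansions changes what triggers, it helps to picture the variants as one whole-word, case-insensitive pattern. This is a minimal sketch of that idea; `find_trigger` is a hypothetical helper and Saymo's own matcher may be more elaborate.

```python
import re
from typing import Optional


def find_trigger(transcript: str,
                 fuzzy_expansions: dict[str, list[str]]) -> Optional[str]:
    """Return the first name variant found as a whole word, or None."""
    variants = {v for names in fuzzy_expansions.values() for v in names}
    if not variants:
        return None
    # Longest variants first so a longer spelling wins over its own prefix.
    ordered = sorted(variants, key=len, reverse=True)
    alternation = "|".join(re.escape(v) for v in ordered)
    # Lookarounds keep "Alex" from matching inside "Alexander".
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    match = pattern.search(transcript)
    return match.group(0) if match else None
```

With the example config above, "Саша, что по статусу?" matches on "Саша", while a transcript that only contains "Alexander" does not match at all.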
When safety.require_confirmation is enabled, auto-mode waits for a second
trigger mention within safety.confirmation_timeout_seconds before speaking;
this helps suppress accidental mentions in live calls.
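The confirmation rule is a small state machine: the first mention opens a window, and only a second mention inside that window triggers speech. The class below is a sketch under that reading; `ConfirmationGate` is a hypothetical name, and timestamps are passed in explicitly so the logic is easy to test.

```python
from typing import Optional


class ConfirmationGate:
    """Require two trigger mentions within a timeout before speaking."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._first_mention_at: Optional[float] = None

    def on_trigger(self, now: float) -> bool:
        """Return True if this mention confirms an earlier one in time."""
        first = self._first_mention_at
        if first is not None and now - first <= self.timeout_seconds:
            self._first_mention_at = None  # Consume the pending mention.
            return True
        # No pending mention, or it expired: this mention opens a new window.
        self._first_mention_at = now
        return False
```

In a live loop, `now` would typically come from `time.monotonic()`, and `timeout_seconds` from safety.confirmation_timeout_seconds.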
┌───────────────┐   ┌──────────────┐   ┌────────────────┐   ┌──────────────┐
│ Source plugin │──▶│ LLM composer │──▶│ Text normalizer│──▶│  TTS engine  │
│  (optional)   │   │   (Ollama)   │   │   (abbrevs,    │   │ (XTTS clone  │
│               │   │              │   │    numbers)    │   │  / Piper)    │
└───────────────┘   └──────────────┘   └────────────────┘   └──────┬───────┘
                                                                   │
┌──────────────┐   ┌──────────────┐   ┌────────────────┐           │
│Call provider │◀──│ Auto trigger │◀──│ STT (Whisper)  │      Audio bytes
│(mute/unmute) │   │(name detect) │   │ (capture call) │           │
└──────┬───────┘   └──────────────┘   └────────────────┘           │
       │                                                           │
       ▼                                                           ▼
 BlackHole 2ch ──────────────────────────────────────── Audio output + monitor
 (virtual mic)
Details in docs/PRD.md and ADRs under docs/adr/.
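As one concrete stage of the pipeline, the text normalizer's abbreviation handling can be approximated as a whole-word substitution driven by vocabulary.abbreviations. This is a sketch, not the project's normalizer: `expand_abbreviations` is a hypothetical name, matching is assumed to be case-sensitive, and number handling is left out.

```python
import re


def expand_abbreviations(text: str, abbreviations: dict[str, str]) -> str:
    """Replace each configured abbreviation with its spoken form."""
    if not abbreviations:
        return text
    # Longest keys first so overlapping abbreviations resolve predictably.
    keys = sorted(abbreviations, key=len, reverse=True)
    alternation = "|".join(re.escape(key) for key in keys)
    # Lookarounds restrict matches to whole tokens (K8S, but not K8SX).
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return pattern.sub(lambda m: abbreviations[m.group(0)], text)
```

Running the text through a step like this before synthesis is what lets the TTS engine pronounce "K8S" as a word instead of spelling it out.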
Multiple paths for getting your own voice on calls — pick the one that matches your patience and quality bar:
| Tier | Setup | Time | Subjective similarity | Doc |
|---|---|---|---|---|
| Zero-shot XTTS | saymo record-voice | 5 min | ~5/10 | — |
| Fine-tuned XTTS | + saymo train-voice | 2-3 h | ~7-8/10 | docs/VOICE-TRAINING.md |
| Fine-tuned XTTS + RVC | + scripts/install_rvc.sh | +1-2 h | 9-10/10 | docs/RVC-VOICE-CLONING.md |
| F5-TTS Russian (alt path) | scripts/install_f5tts.sh | ~10 min | 9-10/10 | docs/F5TTS-VOICE-CLONING.md |
If your voice "sounds close but not quite you" after XTTS fine-tune, that's the XTTS speaker-encoder ceiling. RVC swaps the timbre on top to break through. F5-TTS is a one-stage alternative — Russian-purpose-built model, no second pass, simpler pipeline.
- Everything runs on-device by default. Cloud TTS / STT providers are optional and disabled in the example config.
- Voice samples and secrets are listed in .gitignore, so they never leave your machine.
- Prompts, vocabulary, and trigger phrases all live in your config file, so the source stays generic.
- docs/QUICK-START.md: start here if you're new
- docs/OVERVIEW.md: what Saymo is, how it's wired
- CONTRIBUTING.md: dev setup, conventions, PR workflow
- CHANGELOG.md: version history (Keep a Changelog)
- CODE_OF_CONDUCT.md: Contributor Covenant 2.1
- SECURITY.md: vulnerability reporting + threat model
- docs/: voice training, RVC, F5-TTS, ADRs, PRDs
Bug? Idea? Use the issue templates under .github/ISSUE_TEMPLATE/.
MIT — see LICENSE.
- Coqui TTS for XTTS v2.
- Ollama for local LLM hosting.
- faster-whisper for transcription.
- BlackHole for virtual audio routing.