Skip to content

Swarek/voicecode-mcp

Repository files navigation

VoiceCode MCP

Make your coding agent call you, talk through the task, and keep working from the transcript.

Voice-native agentic coding through a single MCP.

VoiceCode MCP turns voice into a first-class part of the agent loop.
Instead of switching to another app, dictating a prompt, or losing context between tools, your agent can start a real voice conversation, get the transcript back, and continue the task immediately.

It works for the simple case — one coding agent you can just talk to — and for the serious case — multiple agent sessions on the same machine that need voice without creating chaos.


Why this exists

Typing is not always the best interface for coding agents.

Sometimes the fastest way to unblock an agent is to speak for 30 seconds:

  • clarify what you actually want
  • explain a bug in real-world terms
  • answer implementation questions quickly
  • choose between two valid directions
  • review what was built and say what to change

Most current setups treat voice as a side tool, a dictation layer, or a separate app.

VoiceCode MCP keeps voice inside the same agent loop.
The agent talks to you, receives the transcript, and keeps going.


What makes it different

Not another voice app

VoiceCode MCP is not trying to become a separate assistant UI.
It is an MCP for coding agents.

Not just speech-to-text

The point is not "replace typing with dictation."
The point is to let the agent launch a focused live conversation when voice is the fastest way to get aligned.

Built for real agent workflows

VoiceCode MCP is designed for:

  • single-session workflows — one agent you can just speak to
  • multi-session workflows — multiple agent sessions sharing voice safely on the same machine
  • task-scoped alignment — kickoff, clarification, decision, bug triage, and review
  • agent continuation — the transcript comes back into the same task loop

How it works

The happy path is intentionally simple:

  1. The host agent decides voice is the best next step.
  2. It calls voice_call(...) with a HiCharlie agent_id and optional MCP runtime variables.
  3. A live voice conversation starts.
  4. The call ends.
  5. The transcript is returned directly to the host agent.
  6. The host agent continues the task from the transcript.

If the host UI drops the tool result, the transcript can still be recovered locally.


Core features

  • One MCP, one mental model
  • Real live voice calls started by an agent
  • Transcript returned directly to the agent
  • Local transcript recovery if the host loses the result
  • Local completion inbox for delayed queued rooms
  • Shared FIFO queue across concurrent sessions on the same machine
  • Provider WebRTC-first interface
  • Works well for both voice-assisted and voice-first workflows
  • Minimal surface area with a strong default path

Best use cases

VoiceCode MCP is most useful when the blocker is human alignment, not code generation.

Great fits

  • ambiguous feature kickoff
  • fast clarification during implementation
  • bug reproduction interviews
  • architecture or product decision checkpoints
  • review / acceptance feedback after a change
  • multi-session coding setups where several agents may need voice in sequence

Bad fits

  • tiny deterministic edits
  • questions the repo or docs already answer
  • mechanical refactors
  • situations where one short text message is faster than a call

The best workflows are usually short, focused, task-scoped voice rounds.


Built for serious agentic coding

VoiceCode MCP is designed around a simple idea:

Use voice when the fastest next step is to align with the human.

That means voice can be used at the beginning, middle, or end of a task:

  • to clarify intent
  • to collect constraints
  • to resolve tradeoffs
  • to correct misunderstandings
  • to validate what was built

It does not assume that every task needs voice.
It does not force a separate workflow.
It gives coding agents a clean way to escalate to live conversation when that is the highest-signal move.


Single session or many sessions

VoiceCode MCP is useful in both modes.

Single-session

You have one coding agent.
It can call you when voice is the best way to unblock the task.

Multi-session

You have several coding sessions running in parallel.
VoiceCode MCP adds a local FIFO queue so that:

  • one call is active at a time
  • other calls wait cleanly
  • the next call starts when the current one ends

That queue is local to this MCP process family. This matters a lot in real workflows: the scarce resource is the human, not the model.

Without queueing, multi-session voice becomes chaos very quickly.

The queue is deliberately treated as a backend boundary. Today the backend is local SQLite; the long-term product surface should be a HiCharlie-hosted queue that can be opened from another device. See docs/developer/HOSTED_QUEUE_BRIDGE.md for the API shape needed to move the room list from localhost to HiCharlie.


MCP API

Main tool

voice_call(
    agent_id,
    first_message=None,
    call_tag=None,
    correlation_id=None,
    client_instance_id=None,
    return_channel="auto",
    language="fr",
    max_duration_seconds=300,
    agent_variables=None,
) -> transcript_text

In the normal flow, this is enough:

  • start the call
  • get the transcript back
  • continue the task

For delayed queue flows, pass a stable call_tag or correlation_id. VoiceCode stores the final outcome in the local completion inbox, keyed by correlation_id, call_tag, and client_instance_id. return_channel="auto" targets the current Codex thread when CODEX_THREAD_ID or VOICECODE_CODEX_THREAD_ID is available; otherwise it falls back to the local inbox.

agent_id is required for every call. The MCP does not support agentless voice sessions. First call voice_agents() to inspect the agent's declared MCP runtime variables, then pass values through agent_variables. The MCP rejects unknown variable names before creating the HiCharlie session.

For saved agents, the only conversation-level inputs are:

  • agent_variables, mapped directly to HiCharlie template_variables;
  • first_message, sent only when explicitly provided.

The MCP does not send system_prompt, system_prompt_append, objective, conversation_context, context, or any default greeting. The saved HiCharlie agent owns its prompt, opening behavior, voice, STT, TTS, LLM, and runtime configuration.

voice_agents() -> json_text

voice_call(
    agent_id="905a39f7-75e7-496a-82d2-d83c411861dd",
    agent_variables={"contexte": "Contexte utile pour cet appel."},
)

Live session tools

Use these when the controlling agent must read the transcript while the call is still running instead of waiting for voice_call to return.

voice_session_start(
    agent_id,
    first_message=None,
    call_tag=None,
    language="fr",
    max_duration_seconds=300,
    agent_variables=None,
    visual_board_mode="off",
) -> json_text

voice_session_events(
    session_id,
    since_event_id=0,
    max_events=100,
    wait_seconds=0.0,
) -> json_text

voice_session_reply(
    session_id,
    call_id,
    message,
    action_id="voicecode_control_agent",
    ok=True,
) -> json_text

voice_session_context(
    session_id,
    context,
    ttl="next_turn",
    priority="normal",
    label="codex_context_update",
) -> json_text

voice_session_stop(session_id) -> json_text

voice_session_start starts the same direct HiCharlie WebRTC path as voice_call, but it returns immediately with a session_id. Poll voice_session_events with the returned next_event_id to receive only new events. Set wait_seconds for simple long-polling. Transcript text arrives as transcript.delta events with role and text fields. When the session is done, the response has done: true and the final transcript is persisted for get_voice_transcript.

Live sessions also register a HiCharlie client_action named voicecode_control_agent. When the voice agent asks the controlling coding agent for context, voice_session_events emits a control.client_action_request event with call_id, action_id, arguments, and text. Reply with voice_session_reply. The observed HiCharlie widget protocol supports this client_action.request / client_action.result exchange over the WebRTC control data channel. A generic unsolicited "push this instruction into the voice agent" event is not exposed by that widget protocol, so VoiceCode does not claim proactive instruction injection as a stable provider feature.

When voice_session_start(..., visual_board_mode="on_demand") is used, VoiceCode sends a session-scoped HiCharlie runtime_actions entry named voicecode_visual_board. This action is ephemeral: it exists only for that session and does not modify the saved HiCharlie agent. If the voice agent calls it, VoiceCode handles the request locally by opening/updating the visual runtime room, then replies over the existing client_action.result bridge. Use visual_board_mode="auto" when the caller should receive a visual_room_url immediately at session start, before the agent makes its first visual action. Supported operations are open_board, add_block, update_block, connect_blocks, focus, highlight, set_status, and commit.

voice_session_stop requests local shutdown for a running live session.

voice_session_context pushes runtime context into an active HiCharlie voice session after it has started. Pass the MCP session_id returned by voice_session_start; VoiceCode maps it to the active HiCharlie provider session internally. Use it when Codex is reading voice_session_events, detects something useful from the live transcript or repo, and wants the voice agent to use that information on the next LLM turn. It calls HiCharlie POST /api/v1/sessions/{session_id}/context through the same API key and base URL used for the session.

Example:

voice_session_context(
    session_id="live_abc123",
    context=(
        "L'utilisateur parle du bug de prononciation sur le mot cote. "
        "Ne clôture pas encore; demande ce qu'il a entendu exactement."
    ),
    ttl="next_turn",
)

Timing and limits:

  • ttl="next_turn" applies once to the next HiCharlie LLM turn.
  • ttl="session" persists until the session ends.
  • priority accepts low, normal, or high.
  • context must be non-empty and at most 4000 characters.
  • The tool does not interrupt speech already being generated.
  • It does not edit the saved agent prompt.
  • It does not add this runtime context to the visible transcript.
  • If HiCharlie returns 409 provider unsupported, the MCP reports that dynamic context injection is unavailable for that provider/session.

Recovery tool

get_voice_transcript(
    conversation_id=None,
    call_tag=None,
    latest=False,
) -> transcript_text

Use this when the call happened but the host UI lost the returned result.

Local completion inbox

Use this when a queued room may finish minutes or hours after the creating agent asked for it. VoiceCode writes a local completion record when the call is queued, marks it active when the local queue slot is acquired, and marks it completed or failed when the conversation ends.

voice_completions(
    completion_id=None,
    correlation_id=None,
    call_tag=None,
    client_instance_id=None,
    status=None,
    include_delivered=False,
    limit=20,
) -> json_text

voice_completion_ack(
    completion_id=None,
    correlation_id=None,
) -> json_text

voice_dispatch_completions(
    limit=10,
    dry_run=False,
) -> json_text

voice_completions returns pending local records with status, timestamps, provider, agent id, transcript text when available, and a clear error message for failed or empty-transcript calls. By default, already acknowledged records are hidden. Call voice_completion_ack after the creating agent has consumed the transcript so the same completion is not delivered repeatedly.

When a record has a Codex return channel, VoiceCode can deliver it proactively instead of waiting for polling. The return channel format is codex:<thread_id>. Delivery runs:

codex exec resume <thread_id> -

with a compact handoff prompt containing the final status, metadata, error if any, and transcript. The completion is acknowledged only after the Codex command exits successfully. Failed deliveries stay pending with retry metadata and can be retried by calling voice_dispatch_completions(...) or by running the watcher:

voicecode-delivery-dispatch --watch

voice_dispatch_completions(dry_run=True) shows due records without transcript text. VOICECODE_DELIVERY_STALE_SECONDS lets the dispatcher reclaim a delivering record after a crash or restart.

This is intentionally local for now. It does not require a HiCharlie product change. Transcript content is only sent to the configured return channel; hosted queue proposals should keep room lists metadata-only.


Install

Run the installer from the repo checkout:

./scripts/install.sh

It installs the uv environment, links the VoiceCode skill, and writes .mcp.local.json with an absolute project path. If HICHARLIE_API_KEY is already present in the shell or .env.local, the generated local config uses that value directly. Use that generated file for MCP clients that launch servers from another working directory.


Why call_tag matters

For real usage, call_tag should be treated as standard, not optional.

The standalone CLI accepts the same recovery tag:

uv run voicecode-call \
  --agent-id 905a39f7-75e7-496a-82d2-d83c411861dd \
  --agent-variable contexte="Clarifier le point bloquant" \
  --call-tag task-123

It gives you:

  • reliable transcript recovery
  • clean task-level tracking
  • safer multi-session behavior
  • easier debugging when multiple calls happen on the same machine

A good pattern is:

vc:<repo_slug>:<task_key>:<stage>:r<round>:<host>:<session4>:<utc>

Example:

vc:payments-api:auth-refresh:kickoff:r1:codex:7f3c:20260411T1510Z

A practical example

A coding agent starts implementing a feature but the request is still fuzzy.

It does a quick repo orientation pass, then starts a voice call:

  • “What exactly should happen in this flow?”
  • “What matters most here: speed, simplicity, or backwards compatibility?”
  • “What should definitely stay unchanged?”

The user answers in natural language.
The transcript comes back.
The agent updates its task understanding and keeps coding.

Later, the same task may trigger another short voice round to resolve one remaining decision or review the result.

This is where VoiceCode MCP shines:
not as constant chatter, but as precise voice checkpoints inside real coding work.


Provider

Current provider:

  • HiCharlie

Provider selection stays behind the same MCP interface. The public tool resolves to HiCharlie when HICHARLIE_API_KEY is configured. Each call must provide an explicit agent_id; there is no default-agent or agentless fallback path. Per-call customization goes through declared agent_variables and optional first_message.

For HiCharlie, the default transport is provider WebRTC:

  1. VoiceCode creates a HiCharlie session with transport="webrtc".
  2. It opens a shared local UI at http://localhost:8765/ by default.
  3. The per-call page joins the provider WebRTC session through the HiCharlie widget.
  4. HiCharlie owns the session, signaling endpoint, ICE servers, and agent runtime.
  5. VoiceCode waits for the call to finish, fetches a non-empty transcript, stores it locally, and returns text to the agent.

This keeps the MCP thin and makes active calls observable from one stable local dashboard. Set VOICECODE_LOCAL_UI_PORT to change the fixed localhost port. Set VOICECODE_WEBRTC_MODE=direct for headless local testing or diagnostics. The direct path still uses local microphone/speaker processes as the machine audio bridge.

Live WebRTC proof

To check local setup without creating a HiCharlie session, run:

uv run python scripts/verify_hicharlie_webrtc.py --check-setup

That verifies the local MCP config, stdio launch, required environment, and writable queue and transcript stores. It writes status: "setup_pass" or status: "setup_fail", but it is not an end-to-end proof.

After setting a HICHARLIE_API_KEY that can create sessions and read transcripts, run:

uv run python scripts/verify_hicharlie_webrtc.py \
  --agent-id 905a39f7-75e7-496a-82d2-d83c411861dd

To check credentials before involving microphone/speaker audio, run:

uv run python scripts/verify_hicharlie_webrtc.py --preflight-only \
  --agent-id 905a39f7-75e7-496a-82d2-d83c411861dd

Preflight only proves that HiCharlie accepted the key and returned a WebRTC session bootstrap. The full command above is still required for audio, transcript, and persistence proof. Preflight creates a short 30-second session unless --max-duration-seconds is passed explicitly.

The runtime loads .env.local, .env, and then .mcp.local.json from the project root, without overriding real variables already set by the shell. Empty values and MCP placeholders like ${HICHARLIE_API_KEY} or ${VOICECODE_QUEUE_DB_PATH} are treated as missing, so .env.local can still provide usable runtime settings. Reading .mcp.local.json keeps local proof commands and standalone CLI runs aligned with the config that MCP clients use after ./scripts/install.sh. Set VOICECODE_MCP_LOCAL_ENV_ENABLED=0 for hermetic test or CI subprocesses that must not consume local MCP secrets. Keep local secrets in .env.local; .env* files are ignored except for .env.example.

The verifier uses the direct headless path and does not open a browser or localhost page. It requires the WebRTC connection to reach connected, requires the WebRTC control channel to open, requires local microphone frames, requires remote audio frames/playback, requires transcribed text, verifies local transcript recovery, and writes a JSON proof under artifacts/. If audio evidence is incomplete, the failure JSON lists the missing signals explicitly. A successful full verifier run writes status: "pass" and e2e: true; preflight and failure proofs are not accepted as end-to-end proof. Transcript-stage failures include a provider_transcript summary when HiCharlie returned a payload without usable text, so the failure is distinguishable from generic call launch failures.

Local runtime settings

VoiceCode stores its local queue and transcripts under the user cache directory by default:

  • macOS: ~/Library/Caches/voicecode-mcp/
  • Linux: ~/.cache/voicecode-mcp/

The normal environment variables are:

# Required: set a real key with sessions:create and sessions:transcript:read.
HICHARLIE_API_KEY=
VOICECODE_QUEUE_ENABLED=1
VOICECODE_QUEUE_BACKEND=local
VOICECODE_QUEUE_DB_PATH=~/Library/Caches/voicecode-mcp/queue.sqlite3
VOICECODE_LOCAL_UI_ENABLED=1
VOICECODE_LOCAL_UI_AUTO_OPEN=1
VOICECODE_CALL_HANDOFF_MODE=manual
VOICECODE_LOCAL_UI_CONFIG_PATH=~/Library/Caches/voicecode-mcp/local_ui_config.json
VOICECODE_TRANSCRIPT_STORE_ENABLED=1
VOICECODE_TRANSCRIPT_DB_PATH=~/Library/Caches/voicecode-mcp/transcripts.sqlite3
VOICECODE_COMPLETION_INBOX_ENABLED=1
VOICECODE_COMPLETION_INBOX_DB_PATH=~/Library/Caches/voicecode-mcp/completions.sqlite3
# Optional stable id for this MCP client/machine. Generated locally when unset.
VOICECODE_CLIENT_INSTANCE_ID=
VOICECODE_DELIVERY_ENABLED=1
VOICECODE_DELIVERY_AUTO_DISPATCH=1
VOICECODE_DELIVERY_RETRY_SECONDS=30
VOICECODE_DELIVERY_MAX_ATTEMPTS=5
VOICECODE_DELIVERY_STALE_SECONDS=600
# Optional explicit Codex delivery target. Otherwise CODEX_THREAD_ID is used
# when the MCP host provides it.
VOICECODE_CODEX_THREAD_ID=
# Optional Codex binary override. Defaults to codex from PATH.
VOICECODE_CODEX_BIN=

VOICECODE_LOCAL_UI_ENABLED=0 runs ordinary calls without opening the shared localhost UI. VOICECODE_LOCAL_UI_AUTO_OPEN=0 keeps the shared UI available but only lists new rooms instead of opening a new browser tab per call. VOICECODE_CALL_HANDOFF_MODE=manual shows the next queued room and waits for a click; auto_chain opens the next ready room automatically when the current call reaches a terminal state.

VOICECODE_QUEUE_BACKEND=local is the only implemented queue backend today. The reserved future value is hicharlie_hosted, once HiCharlie exposes a hosted MCP voice-room queue API.

Older VOCAL_AGENT_* variables are still accepted as a compatibility fallback. VoiceCode marks HiCharlie session creation with session_origin="voicecode_mcp". The transcript is part of the voice_call(...) contract: if HiCharlie returns no transcribed text, local persistence is disabled, or persistence fails, the call fails instead of returning an empty or unrecoverable result. The completion inbox mirrors those failures so a delayed creator can see why a queued room did not produce a usable transcript. When return_channel resolves to codex:<thread_id>, the dispatcher sends the same failed or expired completion to Codex so the original thread can decide the next step.


Experimental Visual Surface Runtime

VoiceCode also includes an isolated local prototype for the next layer above voice: a visual room where Codex or another agent can use a Miro/Mixboard-like board to create cards, text, images, links, focus the user's viewport, record decisions, and commit a final handoff.

Run it locally:

uv run voicecode-visual-runtime --open

If apps/visual-board/dist exists, the Python runtime serves the React/Excalidraw prototype on the same localhost route. Build it with:

cd apps/visual-board
npm install
npm run build

The runtime serves one shared localhost site, defaulting to:

http://localhost:8798/visual

It stores prototype state in SQLite at:

~/Library/Caches/voicecode-mcp/visual-runtime.sqlite3

Useful environment variables:

VOICECODE_VISUAL_RUNTIME_PORT=8798
VOICECODE_VISUAL_RUNTIME_DB_PATH=~/Library/Caches/voicecode-mcp/visual-runtime.sqlite3

The visual tools are intentionally separate from the voice tools but exposed by the same MCP facade:

visual_room_open(title=None)
visual_surface_create(room_id, title=None, initial_state=None, surface_type="board")
visual_surface_add_block(room_id=None, surface_id=None, kind="note", title="New block", text="", block_id=None, image_url=None, x=None, y=None, width=None, height=None)
visual_surface_update_block(room_id=None, surface_id=None, block_id, kind=None, title=None, text=None, image_url=None, x=None, y=None, width=None, height=None)
visual_surface_connect_blocks(room_id=None, surface_id=None, source_block_id, target_block_id, label="", edge_id=None, edge_type="relates_to", auto_layout=False, layout_spacing=160, layout_direction="right")
visual_surface_get_state(room_id=None, surface_id=None)
visual_surface_view(action, room_id=None, surface_id=None, target_block_id=None, focus_zoom=None)
visual_surface_propose_action(room_id=None, surface_id=None, target_block_id=None, after_text=None, title=None, explanation=None)
visual_surface_apply_action(proposal_id, decision="apply")
visual_surface_record_decision(room_id, text)
visual_surface_get_events(room_id, since_event_id=None, limit=100)
visual_surface_commit(room_id, surface_id=None)
visual_room_close(room_id, reason=None)

edge_type accepts relates_to, next, and, or, condition, supports, blocks, or causes. layout_direction accepts right, left, down, or up when auto_layout=True.

The Excalidraw prototype has polished dev controls for the agent-facing primitives: add cards, free text, images, connectors, frames, focus, highlight, status badges, selection updates, snapshot export, and recent event inspection. The legacy inline UI remains as a fallback when the React build is absent.

This is not a hosted HiCharlie product surface yet. It is a local testbed for the future room model: typed surfaces, structured actions, event log, proposals, confirmation, and final Codex handoff.


Designed to fit modern coding agents

VoiceCode MCP is meant to work with MCP-compatible coding environments, including workflows built around tools like:

  • Codex
  • Claude Code
  • Cursor
  • OpenClaw
  • other MCP clients

The goal is not to be tied to one host.
The goal is to make voice a clean primitive that any serious coding agent can use.


Voice-assisted or voice-first

VoiceCode MCP supports both styles.

Voice-assisted

Voice is used only when it clearly beats text.
This will be the right default for many developers.

Voice-first

Voice becomes the main user-facing interface for important checkpoints:

  • kickoff
  • clarification
  • decision
  • validation
  • reprioritization

The agent still works autonomously between calls.
This is not about talking constantly.
It is about making voice the control plane when that feels better than typing.


Project principles

  • Open source first
  • One MCP instead of another app
  • Transcript back into the same agent loop
  • Minimal surface area
  • Local-first concurrency control
  • Provider WebRTC-first design
  • Useful in real developer workflows, not just demos

Repository status

This repository is the new public home for VoiceCode MCP.

What is going into this repo:

  • the MCP server
  • HiCharlie WebRTC integration
  • local queueing for concurrent sessions
  • transcript recovery support
  • examples for MCP clients
  • install and setup instructions
  • contribution guidelines

If you are early: perfect. This is exactly the moment to star the repo, try it, and help shape the workflow.


Why this project matters

Coding agents are already good at reading code, planning, editing, and testing.
What they still lack is a clean, native way to talk to the human when human alignment is the real blocker.

VoiceCode MCP fills that gap.

It gives agents a simple new move:

“Let me call you, get aligned, and continue.”

That sounds small.
In practice, it changes how natural agentic coding feels.


Contributing

Contributions, feedback, issues, and workflow ideas are welcome.

Especially useful feedback:

  • host integration examples
  • voice workflow patterns that worked well
  • multi-session edge cases
  • WebRTC session and transcript edge cases
  • transcript recovery failures
  • README and onboarding improvements

If you try VoiceCode MCP in a real coding workflow, we want to hear about it.


Built by

VoiceCode MCP is built by the team at HiCharlie.

VoiceCode MCP is intentionally a HiCharlie-only MVP for now. HiCharlie owns the runtime stack behind the session, including whichever STT/LLM/TTS providers the agent selects internally. The MCP should not add separate ElevenLabs or provider-generic logic until there is a clear product need.

one open MCP that makes coding agents natural to talk to.

About

Voice-native agentic coding through a single MCP.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors