Make your coding agent call you, talk through the task, and keep working from the transcript.
Voice-native agentic coding through a single MCP.
VoiceCode MCP turns voice into a first-class part of the agent loop.
Instead of switching to another app, dictating a prompt, or losing context between tools, your agent can start a real voice conversation, get the transcript back, and continue the task immediately.
It works for the simple case — one coding agent you can just talk to — and for the serious case — multiple agent sessions on the same machine that need voice without creating chaos.
Typing is not always the best interface for coding agents.
Sometimes the fastest way to unblock an agent is to speak for 30 seconds:
- clarify what you actually want
- explain a bug in real-world terms
- answer implementation questions quickly
- choose between two valid directions
- review what was built and say what to change
Most current setups treat voice as a side tool, a dictation layer, or a separate app.
VoiceCode MCP keeps voice inside the same agent loop.
The agent talks to you, receives the transcript, and keeps going.
VoiceCode MCP is not trying to become a separate assistant UI.
It is an MCP for coding agents.
The point is not "replace typing with dictation."
The point is to let the agent launch a focused live conversation when voice is the fastest way to get aligned.
VoiceCode MCP is designed for:
- single-session workflows — one agent you can just speak to
- multi-session workflows — multiple agent sessions sharing voice safely on the same machine
- task-scoped alignment — kickoff, clarification, decision, bug triage, and review
- agent continuation — the transcript comes back into the same task loop
The happy path is intentionally simple:
- The host agent decides voice is the best next step.
- It calls voice_call(...) with a HiCharlie agent_id and optional MCP runtime variables.
- A live voice conversation starts.
- The call ends.
- The transcript is returned directly to the host agent.
- The host agent continues the task from the transcript.
If the host UI drops the tool result, the transcript can still be recovered locally.
- One MCP, one mental model
- Real live voice calls started by an agent
- Transcript returned directly to the agent
- Local transcript recovery if the host loses the result
- Local completion inbox for delayed queued rooms
- Shared FIFO queue across concurrent sessions on the same machine
- Provider WebRTC-first interface
- Works well for both voice-assisted and voice-first workflows
- Minimal surface area with a strong default path
VoiceCode MCP is most useful when the blocker is human alignment, not code generation.
- ambiguous feature kickoff
- fast clarification during implementation
- bug reproduction interviews
- architecture or product decision checkpoints
- review / acceptance feedback after a change
- multi-session coding setups where several agents may need voice in sequence
It is usually a poor fit for:
- tiny deterministic edits
- questions the repo or docs already answer
- mechanical refactors
- situations where one short text message is faster than a call
The best workflows are usually short, focused, task-scoped voice rounds.
VoiceCode MCP is designed around a simple idea:
Use voice when the fastest next step is to align with the human.
That means voice can be used at the beginning, middle, or end of a task:
- to clarify intent
- to collect constraints
- to resolve tradeoffs
- to correct misunderstandings
- to validate what was built
It does not assume that every task needs voice.
It does not force a separate workflow.
It gives coding agents a clean way to escalate to live conversation when that is the highest-signal move.
VoiceCode MCP is useful in both modes.
Single session: you have one coding agent.
It can call you when voice is the best way to unblock the task.
Multi-session: you have several coding sessions running in parallel.
VoiceCode MCP adds a local FIFO queue so that:
- one call is active at a time
- other calls wait cleanly
- the next call starts when the current one ends
That queue is local to this MCP process family. This matters a lot in real workflows: the scarce resource is the human, not the model.
Without queueing, multi-session voice becomes chaos very quickly.
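The queue rules above (one active call, others wait, the next one starts when the current one ends) can be illustrated with a minimal single-process sketch. The class name `LocalCallQueue`, the table layout, and the status strings are hypothetical; the real SQLite schema and the cross-process locking are not described here.

```python
import sqlite3
from typing import Optional


class LocalCallQueue:
    """Hypothetical FIFO queue: at most one call is active at a time."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS calls ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "call_tag TEXT NOT NULL, "
            "status TEXT NOT NULL DEFAULT 'queued')"
        )

    def enqueue(self, call_tag: str) -> int:
        """Add a call at the back of the queue and return its row id."""
        with self.db:  # commits on success
            cur = self.db.execute(
                "INSERT INTO calls (call_tag) VALUES (?)", (call_tag,)
            )
        return cur.lastrowid

    def start_next(self) -> Optional[str]:
        """Activate the oldest queued call, unless another call is active."""
        with self.db:
            if self.db.execute(
                "SELECT 1 FROM calls WHERE status = 'active'"
            ).fetchone():
                return None  # the human is busy; everyone else waits
            row = self.db.execute(
                "SELECT id, call_tag FROM calls "
                "WHERE status = 'queued' ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None  # nothing is waiting
            self.db.execute(
                "UPDATE calls SET status = 'active' WHERE id = ?", (row[0],)
            )
            return row[1]

    def finish_active(self) -> None:
        """Mark the active call as done so the next one can start."""
        with self.db:
            self.db.execute(
                "UPDATE calls SET status = 'done' WHERE status = 'active'"
            )
```

Ordering by an autoincrement id is what makes the queue first-in, first-out. A version shared by several processes would also need a write lock around `start_next`.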
The queue is deliberately treated as a backend boundary. Today the backend is local SQLite; the long-term product surface should be a HiCharlie-hosted queue that can be opened from another device. See docs/developer/HOSTED_QUEUE_BRIDGE.md for the API shape needed to move the room list from localhost to HiCharlie.
voice_call(
agent_id,
first_message=None,
call_tag=None,
correlation_id=None,
client_instance_id=None,
return_channel="auto",
language="fr",
max_duration_seconds=300,
agent_variables=None,
) -> transcript_text
In the normal flow, this is enough:
- start the call
- get the transcript back
- continue the task
For delayed queue flows, pass a stable call_tag or correlation_id. VoiceCode
stores the final outcome in the local completion inbox, keyed by
correlation_id, call_tag, and client_instance_id. return_channel="auto"
targets the current Codex thread when CODEX_THREAD_ID or
VOICECODE_CODEX_THREAD_ID is available; otherwise it falls back to the local
inbox.
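The return_channel="auto" rule can be read as a small resolution function. This is a sketch under assumptions: the function name and the "inbox" label for the local fallback are hypothetical, and the explicit VOICECODE_CODEX_THREAD_ID is checked first because the environment notes in this README call it the explicit delivery target.

```python
from typing import Mapping


def resolve_return_channel(return_channel: str, env: Mapping[str, str]) -> str:
    """Resolve "auto" to a Codex thread channel or the local inbox."""
    if return_channel != "auto":
        return return_channel  # caller chose a channel explicitly
    # Assumption: the explicit VoiceCode variable wins over the host one.
    thread_id = env.get("VOICECODE_CODEX_THREAD_ID") or env.get("CODEX_THREAD_ID")
    if thread_id:
        return f"codex:{thread_id}"
    return "inbox"  # hypothetical label for the local completion inbox
```

In a real process, `env` would typically be `os.environ`.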
agent_id is required for every call. The MCP does not support agentless voice
sessions. First call voice_agents() to inspect the agent's declared MCP
runtime variables, then pass values through agent_variables. The MCP rejects
unknown variable names before creating the HiCharlie session.
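The rejection of unknown variable names amounts to a set difference between what the caller passes and what the agent declares. A minimal sketch, assuming the declared names have already been extracted from the voice_agents() result (the helper name is hypothetical):

```python
from typing import Iterable, Mapping


def check_agent_variables(declared: Iterable[str], provided: Mapping[str, str]) -> None:
    """Raise ValueError if any provided variable is not declared by the agent."""
    unknown = sorted(set(provided) - set(declared))
    if unknown:
        raise ValueError(f"unknown agent variables: {', '.join(unknown)}")
```

Running this check on the caller side before voice_call gives the same early failure without a round trip.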
For saved agents, the only conversation-level inputs are:
- agent_variables, mapped directly to HiCharlie template_variables;
- first_message, sent only when explicitly provided.
The MCP does not send system_prompt, system_prompt_append, objective,
conversation_context, context, or any default greeting. The saved HiCharlie
agent owns its prompt, opening behavior, voice, STT, TTS, LLM, and runtime
configuration.
voice_agents() -> json_text
voice_call(
agent_id="905a39f7-75e7-496a-82d2-d83c411861dd",
agent_variables={"contexte": "Contexte utile pour cet appel."},
)
Use the following live session tools when the controlling agent must read the transcript while the call is
still running instead of waiting for voice_call to return.
voice_session_start(
agent_id,
first_message=None,
call_tag=None,
language="fr",
max_duration_seconds=300,
agent_variables=None,
visual_board_mode="off",
) -> json_text
voice_session_events(
session_id,
since_event_id=0,
max_events=100,
wait_seconds=0.0,
) -> json_text
voice_session_reply(
session_id,
call_id,
message,
action_id="voicecode_control_agent",
ok=True,
) -> json_text
voice_session_context(
session_id,
context,
ttl="next_turn",
priority="normal",
label="codex_context_update",
) -> json_text
voice_session_stop(session_id) -> json_text
voice_session_start starts the same direct HiCharlie WebRTC path as
voice_call, but it returns immediately with a session_id. Poll
voice_session_events with the returned next_event_id to receive only new
events. Set wait_seconds for simple long-polling. Transcript text arrives as
transcript.delta events with role and text fields. When the session is
done, the response has done: true and the final transcript is persisted for
get_voice_transcript.
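The polling contract described above can be sketched as a loop that feeds next_event_id back in until done is true. The JSON layout is an assumption: this sketch expects an "events" list whose items carry a "type" field, next to the next_event_id and done fields named above. `fetch_events` stands in for however the host invokes voice_session_events.

```python
import json
from typing import Callable, List, Tuple


def collect_transcript(
    fetch_events: Callable[..., str],
    session_id: str,
    wait_seconds: float = 5.0,
) -> List[Tuple[str, str]]:
    """Long-poll a live session and return (role, text) pairs in order."""
    turns: List[Tuple[str, str]] = []
    since = 0
    while True:
        payload = json.loads(
            fetch_events(
                session_id=session_id,
                since_event_id=since,
                wait_seconds=wait_seconds,
            )
        )
        for event in payload.get("events", []):
            # Other event types (for example control events) are skipped here.
            if event.get("type") == "transcript.delta":
                turns.append((event["role"], event["text"]))
        # Ask only for events newer than the ones already seen.
        since = payload.get("next_event_id", since)
        if payload.get("done"):
            return turns
```

A non-zero wait_seconds keeps the loop from spinning while the session is quiet.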
Live sessions also register a HiCharlie client_action named
voicecode_control_agent. When the voice agent asks the controlling coding
agent for context, voice_session_events emits a control.client_action_request
event with call_id, action_id, arguments, and text. Reply with
voice_session_reply. The observed HiCharlie widget protocol supports this
client_action.request / client_action.result exchange over the WebRTC
control data channel. A generic unsolicited "push this instruction into the
voice agent" event is not exposed by that widget protocol, so VoiceCode does not
claim proactive instruction injection as a stable provider feature.
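The request and reply exchange can be sketched as a filter over polled events. The event field names call_id, action_id, arguments, and text come from the description above; the "type" key, the helper name, and the `answer` callback are assumptions, and `reply` stands in for however the host invokes voice_session_reply.

```python
from typing import Any, Callable, Dict, Iterable


def answer_client_actions(
    events: Iterable[Dict[str, Any]],
    session_id: str,
    answer: Callable[[str, Dict[str, Any]], str],
    reply: Callable[..., Any],
) -> int:
    """Reply to every client action request in a batch; return how many."""
    handled = 0
    for event in events:
        if event.get("type") != "control.client_action_request":
            continue
        try:
            message = answer(event.get("text", ""), event.get("arguments", {}))
            ok = True
        except Exception as exc:  # report the failure instead of going silent
            message = f"Could not answer: {exc}"
            ok = False
        reply(
            session_id=session_id,
            call_id=event["call_id"],
            message=message,
            action_id=event.get("action_id", "voicecode_control_agent"),
            ok=ok,
        )
        handled += 1
    return handled
```

Replying with ok=False on failure keeps the voice agent from waiting on a request that will never be answered.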
When voice_session_start(..., visual_board_mode="on_demand") is used,
VoiceCode sends a session-scoped HiCharlie runtime_actions entry named
voicecode_visual_board. This action is ephemeral: it exists only for that
session and does not modify the saved HiCharlie agent. If the voice agent calls
it, VoiceCode handles the request locally by opening/updating the visual runtime
room, then replies over the existing client_action.result bridge. Use
visual_board_mode="auto" when the caller should receive a visual_room_url
immediately at session start, before the agent makes its first visual action.
Supported operations are open_board, add_block, update_block,
connect_blocks, focus, highlight, set_status, and commit.
voice_session_stop requests local shutdown for a running live session.
voice_session_context pushes runtime context into an active HiCharlie voice
session after it has started. Pass the MCP session_id returned by
voice_session_start; VoiceCode maps it to the active HiCharlie provider
session internally. Use it when Codex is reading
voice_session_events, detects something useful from the live transcript or
repo, and wants the voice agent to use that information on the next LLM turn.
It calls HiCharlie POST /api/v1/sessions/{session_id}/context through the
same API key and base URL used for the session.
Example:
voice_session_context(
session_id="live_abc123",
context=(
"L'utilisateur parle du bug de prononciation sur le mot cote. "
"Ne clôture pas encore; demande ce qu'il a entendu exactement."
),
ttl="next_turn",
)
Timing and limits:
- ttl="next_turn" applies once to the next HiCharlie LLM turn.
- ttl="session" persists until the session ends.
- priority accepts low, normal, or high.
- context must be non-empty and at most 4000 characters.
- The tool does not interrupt speech already being generated.
- It does not edit the saved agent prompt.
- It does not add this runtime context to the visible transcript.
- If HiCharlie returns 409 provider unsupported, the MCP reports that dynamic context injection is unavailable for that provider/session.
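The limits listed above can be checked on the caller side before invoking the tool. A minimal sketch; the helper and constant names are hypothetical, and the accepted values are the ones listed above:

```python
MAX_CONTEXT_CHARS = 4000
VALID_TTLS = {"next_turn", "session"}
VALID_PRIORITIES = {"low", "normal", "high"}


def check_context_update(
    context: str, ttl: str = "next_turn", priority: str = "normal"
) -> None:
    """Raise ValueError if a context update breaks the documented limits."""
    if not context:
        raise ValueError("context must be non-empty")
    if len(context) > MAX_CONTEXT_CHARS:
        raise ValueError(
            f"context has {len(context)} characters; the limit is {MAX_CONTEXT_CHARS}"
        )
    if ttl not in VALID_TTLS:
        raise ValueError(f"ttl must be one of {sorted(VALID_TTLS)}")
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(VALID_PRIORITIES)}")
```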
get_voice_transcript(
conversation_id=None,
call_tag=None,
latest=False,
) -> transcript_text
Use this when the call happened but the host UI lost the returned result.
Use the completion inbox tools when a queued room may finish minutes or hours after the creating agent asked for it. VoiceCode writes a local completion record when the call is queued, marks it active when the local queue slot is acquired, and marks it completed or failed when the conversation ends.
voice_completions(
completion_id=None,
correlation_id=None,
call_tag=None,
client_instance_id=None,
status=None,
include_delivered=False,
limit=20,
) -> json_text
voice_completion_ack(
completion_id=None,
correlation_id=None,
) -> json_text
voice_dispatch_completions(
limit=10,
dry_run=False,
) -> json_text
voice_completions returns pending local records with status, timestamps,
provider, agent id, transcript text when available, and a clear error message
for failed or empty-transcript calls. By default, already acknowledged records
are hidden. Call voice_completion_ack after the creating agent has consumed
the transcript so the same completion is not delivered repeatedly.
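The consume-then-acknowledge order matters: acknowledging only after the transcript has been handled means a crash in between leaves the record pending instead of losing it. A sketch under assumptions: the JSON is taken to be an object with a "completions" list whose records carry completion_id and status, and the status strings follow the lifecycle described above. `list_completions` and `ack` stand in for the two tools.

```python
import json
from typing import Any, Callable, Dict


def consume_completions(
    list_completions: Callable[..., str],
    ack: Callable[..., Any],
    handle: Callable[[Dict[str, Any]], None],
    correlation_id: str,
) -> int:
    """Handle finished completions, then acknowledge each one."""
    payload = json.loads(list_completions(correlation_id=correlation_id))
    consumed = 0
    for record in payload.get("completions", []):
        if record.get("status") not in ("completed", "failed"):
            continue  # still queued or active; check again later
        handle(record)  # if this raises, the record stays unacknowledged
        ack(completion_id=record["completion_id"])
        consumed += 1
    return consumed
```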
When a record has a Codex return channel, VoiceCode can deliver it proactively
instead of waiting for polling. The return channel format is
codex:<thread_id>. Delivery runs:
codex exec resume <thread_id> -
with a compact handoff prompt containing the final status, metadata, error if
any, and transcript. The completion is acknowledged only after the Codex command
exits successfully. Failed deliveries stay pending with retry metadata and can
be retried by calling voice_dispatch_completions(...) or by running the
watcher:
voicecode-delivery-dispatch --watch
voice_dispatch_completions(dry_run=True) shows due records without transcript
text. VOICECODE_DELIVERY_STALE_SECONDS lets the dispatcher reclaim a
delivering record after a crash or restart.
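One way to picture the retry and stale-reclaim behaviour is a predicate that decides whether a record is due for another delivery attempt. This is a guess at the shape, not the dispatcher's actual logic: the record fields, the fixed retry interval, and the way the attempt cap applies are assumptions. The defaults mirror the VOICECODE_DELIVERY_* values shown in this README.

```python
from dataclasses import dataclass


@dataclass
class DeliveryRecord:
    status: str             # assumed values: "pending", "delivering", "delivered"
    attempts: int           # delivery attempts made so far
    last_attempt_at: float  # Unix seconds of the last attempt, 0.0 if none


def is_due(
    record: DeliveryRecord,
    now: float,
    retry_seconds: float = 30.0,
    max_attempts: int = 5,
    stale_seconds: float = 600.0,
) -> bool:
    """Return True if the dispatcher should try to deliver this record."""
    if record.attempts >= max_attempts:
        return False  # give up; leave the record for manual inspection
    if record.status == "pending":
        never_tried = record.attempts == 0
        return never_tried or now - record.last_attempt_at >= retry_seconds
    if record.status == "delivering":
        # A delivery that has been in flight this long probably died
        # with its process, so it can be reclaimed.
        return now - record.last_attempt_at >= stale_seconds
    return False
```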
This is intentionally local for now. It does not require a HiCharlie product change. Transcript content is only sent to the configured return channel; hosted queue proposals should keep room lists metadata-only.
Run the installer from the repo checkout:
./scripts/install.sh
It installs the uv environment, links the VoiceCode skill, and writes
.mcp.local.json with an absolute project path. If HICHARLIE_API_KEY is
already present in the shell or .env.local, the generated local config uses
that value directly. Use that generated file for MCP clients that launch
servers from another working directory.
For real usage, call_tag should be treated as standard, not optional.
The standalone CLI accepts the same recovery tag:
uv run voicecode-call \
--agent-id 905a39f7-75e7-496a-82d2-d83c411861dd \
--agent-variable contexte="Clarifier le point bloquant" \
--call-tag task-123
A consistent call_tag gives you:
- reliable transcript recovery
- clean task-level tracking
- safer multi-session behavior
- easier debugging when multiple calls happen on the same machine
A good pattern is:
vc:<repo_slug>:<task_key>:<stage>:r<round>:<host>:<session4>:<utc>
Example:
vc:payments-api:auth-refresh:kickoff:r1:codex:7f3c:20260411T1510Z
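A tag in this pattern can be assembled with a small helper. The function name is hypothetical, and reading `<session4>` as the first four characters of a session id is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def build_call_tag(
    repo_slug: str,
    task_key: str,
    stage: str,
    round_number: int,
    host: str,
    session_id: str,
    when: Optional[datetime] = None,
) -> str:
    """Build a call_tag following the vc:<repo_slug>:...:<utc> pattern."""
    # `when` should be timezone-aware; it defaults to the current UTC time.
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%MZ")
    return ":".join(
        ["vc", repo_slug, task_key, stage, f"r{round_number}", host, session_id[:4], stamp]
    )


tag = build_call_tag(
    "payments-api", "auth-refresh", "kickoff", 1, "codex", "7f3c9a12",
    when=datetime(2026, 4, 11, 15, 10, tzinfo=timezone.utc),
)
print(tag)  # vc:payments-api:auth-refresh:kickoff:r1:codex:7f3c:20260411T1510Z
```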
A coding agent starts implementing a feature but the request is still fuzzy.
It does a quick repo orientation pass, then starts a voice call:
- “What exactly should happen in this flow?”
- “What matters most here: speed, simplicity, or backwards compatibility?”
- “What should definitely stay unchanged?”
The user answers in natural language.
The transcript comes back.
The agent updates its task understanding and keeps coding.
Later, the same task may trigger another short voice round to resolve one remaining decision or review the result.
This is where VoiceCode MCP shines:
not as constant chatter, but as precise voice checkpoints inside real coding work.
Current provider:
- HiCharlie
Provider selection stays behind the same MCP interface. The public tool resolves
to HiCharlie when HICHARLIE_API_KEY is configured. Each call must provide an
explicit agent_id; there is no default-agent or agentless fallback path. Per-call
customization goes through declared agent_variables and optional
first_message.
For HiCharlie, the default transport is provider WebRTC:
- VoiceCode creates a HiCharlie session with transport="webrtc".
- It opens a shared local UI at http://localhost:8765/ by default.
- The per-call page joins the provider WebRTC session through the HiCharlie widget.
- HiCharlie owns the session, signaling endpoint, ICE servers, and agent runtime.
- VoiceCode waits for the call to finish, fetches a non-empty transcript, stores it locally, and returns text to the agent.
This keeps the MCP thin and makes active calls observable from one stable local
dashboard. Set VOICECODE_LOCAL_UI_PORT to change the fixed localhost port.
Set VOICECODE_WEBRTC_MODE=direct for headless local testing or diagnostics.
The direct path still uses local microphone/speaker processes as the machine
audio bridge.
To check local setup without creating a HiCharlie session, run:
uv run python scripts/verify_hicharlie_webrtc.py --check-setup
That verifies the local MCP config, stdio launch, required environment, and
writable queue and transcript stores. It writes status: "setup_pass" or
status: "setup_fail", but it is not an end-to-end proof.
After setting a HICHARLIE_API_KEY that can create sessions and read
transcripts, run:
uv run python scripts/verify_hicharlie_webrtc.py \
--agent-id 905a39f7-75e7-496a-82d2-d83c411861dd
To check credentials before involving microphone/speaker audio, run:
uv run python scripts/verify_hicharlie_webrtc.py --preflight-only \
--agent-id 905a39f7-75e7-496a-82d2-d83c411861dd
Preflight only proves that HiCharlie accepted the key and returned a WebRTC
session bootstrap. The full command above is still required for audio,
transcript, and persistence proof. Preflight creates a short 30-second session
unless --max-duration-seconds is passed explicitly.
The runtime loads .env.local, .env, and then .mcp.local.json from the
project root, without overriding real variables already set by the shell. Empty
values and MCP placeholders like ${HICHARLIE_API_KEY} or
${VOICECODE_QUEUE_DB_PATH} are treated as missing, so .env.local can still
provide usable runtime settings. Reading .mcp.local.json keeps local proof
commands and standalone CLI runs aligned with the config that MCP clients use
after ./scripts/install.sh.
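The loading rule above can be sketched as a layered merge in which empty values and ${NAME} placeholders never count as set. Two points are assumptions rather than documented behaviour: earlier files win over later ones, and an empty shell value is treated as missing like any other. The function names are hypothetical.

```python
import re
from typing import Dict, Mapping, Optional, Sequence

# Matches an unexpanded MCP placeholder such as ${HICHARLIE_API_KEY}.
_PLACEHOLDER = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*\}$")


def is_missing(value: Optional[str]) -> bool:
    """True for None, blank strings, and unexpanded ${NAME} placeholders."""
    if value is None:
        return True
    stripped = value.strip()
    return stripped == "" or bool(_PLACEHOLDER.match(stripped))


def merge_env(
    shell: Mapping[str, str], file_layers: Sequence[Mapping[str, str]]
) -> Dict[str, str]:
    """Merge file layers under the shell; the first usable value wins."""
    resolved = {k: v for k, v in shell.items() if not is_missing(v)}
    for layer in file_layers:  # e.g. .env.local, .env, .mcp.local.json
        for key, value in layer.items():
            if key not in resolved and not is_missing(value):
                resolved[key] = value
    return resolved
```

With this rule, a placeholder left in .mcp.local.json does not mask a real key stored in .env.local.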
Set VOICECODE_MCP_LOCAL_ENV_ENABLED=0 for hermetic test or CI subprocesses
that must not consume local MCP secrets.
Keep local secrets in .env.local; .env* files are ignored except for
.env.example.
The verifier uses the direct headless path and does not open a browser or
localhost page. It requires the WebRTC
connection to reach connected, requires the WebRTC control channel to open,
requires local microphone frames, requires remote audio frames/playback,
requires transcribed text, verifies local transcript recovery, and writes a JSON
proof under artifacts/. If audio evidence is incomplete, the failure JSON
lists the missing signals explicitly. A successful full verifier run writes
status: "pass" and e2e: true; preflight and failure proofs are not accepted
as end-to-end proof. Transcript-stage failures include a provider_transcript
summary when HiCharlie returned a payload without usable text, so the failure is
distinguishable from generic call launch failures.
VoiceCode stores its local queue and transcripts under the user cache directory by default:
- macOS: ~/Library/Caches/voicecode-mcp/
- Linux: ~/.cache/voicecode-mcp/
The normal environment variables are:
# Required: set a real key with sessions:create and sessions:transcript:read.
HICHARLIE_API_KEY=
VOICECODE_QUEUE_ENABLED=1
VOICECODE_QUEUE_BACKEND=local
VOICECODE_QUEUE_DB_PATH=~/Library/Caches/voicecode-mcp/queue.sqlite3
VOICECODE_LOCAL_UI_ENABLED=1
VOICECODE_LOCAL_UI_AUTO_OPEN=1
VOICECODE_CALL_HANDOFF_MODE=manual
VOICECODE_LOCAL_UI_CONFIG_PATH=~/Library/Caches/voicecode-mcp/local_ui_config.json
VOICECODE_TRANSCRIPT_STORE_ENABLED=1
VOICECODE_TRANSCRIPT_DB_PATH=~/Library/Caches/voicecode-mcp/transcripts.sqlite3
VOICECODE_COMPLETION_INBOX_ENABLED=1
VOICECODE_COMPLETION_INBOX_DB_PATH=~/Library/Caches/voicecode-mcp/completions.sqlite3
# Optional stable id for this MCP client/machine. Generated locally when unset.
VOICECODE_CLIENT_INSTANCE_ID=
VOICECODE_DELIVERY_ENABLED=1
VOICECODE_DELIVERY_AUTO_DISPATCH=1
VOICECODE_DELIVERY_RETRY_SECONDS=30
VOICECODE_DELIVERY_MAX_ATTEMPTS=5
VOICECODE_DELIVERY_STALE_SECONDS=600
# Optional explicit Codex delivery target. Otherwise CODEX_THREAD_ID is used
# when the MCP host provides it.
VOICECODE_CODEX_THREAD_ID=
# Optional Codex binary override. Defaults to codex from PATH.
VOICECODE_CODEX_BIN=
VOICECODE_LOCAL_UI_ENABLED=0 runs ordinary calls without opening the shared
localhost UI. VOICECODE_LOCAL_UI_AUTO_OPEN=0 keeps the shared UI available but
only lists new rooms instead of opening a new browser tab per call.
VOICECODE_CALL_HANDOFF_MODE=manual shows the next queued room and waits for a
click; auto_chain opens the next ready room automatically when the current
call reaches a terminal state.
VOICECODE_QUEUE_BACKEND=local is the only implemented queue backend today. The
reserved future value is hicharlie_hosted, once HiCharlie exposes a hosted MCP
voice-room queue API.
Older VOCAL_AGENT_* variables are still accepted as a compatibility fallback.
VoiceCode marks HiCharlie session creation with session_origin="voicecode_mcp".
The transcript is part of the voice_call(...) contract: if HiCharlie returns
no transcribed text, local persistence is disabled, or persistence fails, the
call fails instead of returning an empty or unrecoverable result.
The completion inbox mirrors those failures so a delayed creator can see why a
queued room did not produce a usable transcript.
When return_channel resolves to codex:<thread_id>, the dispatcher sends the
same failed or expired completion to Codex so the original thread can decide the
next step.
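The transcript contract above is a fail-closed rule: a call only succeeds if there is text and that text has been stored. A minimal sketch, with a hypothetical exception type and a `persist` callback standing in for the local transcript store:

```python
from typing import Callable, Optional


class VoiceCallError(RuntimeError):
    """Hypothetical error raised when a call cannot return a usable transcript."""


def finalize_transcript(
    transcript: Optional[str],
    persistence_enabled: bool,
    persist: Callable[[str], None],
) -> str:
    """Return the transcript only if it is non-empty and safely stored."""
    if not transcript or not transcript.strip():
        raise VoiceCallError("provider returned no transcribed text")
    if not persistence_enabled:
        raise VoiceCallError("local transcript persistence is disabled")
    try:
        persist(transcript)
    except Exception as exc:
        raise VoiceCallError(f"transcript persistence failed: {exc}") from exc
    return transcript
```

Failing here, rather than returning an empty string, is what lets the creating agent tell a silent call apart from a lost result.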
VoiceCode also includes an isolated local prototype for the next layer above voice: a visual room where Codex or another agent can use a Miro/Mixboard-like board to create cards, text, images, links, focus the user's viewport, record decisions, and commit a final handoff.
Run it locally:
uv run voicecode-visual-runtime --open
If apps/visual-board/dist exists, the Python runtime serves the React/Excalidraw
prototype on the same localhost route. Build it with:
cd apps/visual-board
npm install
npm run build
The runtime serves one shared localhost site, defaulting to:
http://localhost:8798/visual
It stores prototype state in SQLite at:
~/Library/Caches/voicecode-mcp/visual-runtime.sqlite3
Useful environment variables:
VOICECODE_VISUAL_RUNTIME_PORT=8798
VOICECODE_VISUAL_RUNTIME_DB_PATH=~/Library/Caches/voicecode-mcp/visual-runtime.sqlite3
The visual tools are intentionally separate from the voice tools but exposed by the same MCP facade:
visual_room_open(title=None)
visual_surface_create(room_id, title=None, initial_state=None, surface_type="board")
visual_surface_add_block(room_id=None, surface_id=None, kind="note", title="New block", text="", block_id=None, image_url=None, x=None, y=None, width=None, height=None)
visual_surface_update_block(room_id=None, surface_id=None, block_id, kind=None, title=None, text=None, image_url=None, x=None, y=None, width=None, height=None)
visual_surface_connect_blocks(room_id=None, surface_id=None, source_block_id, target_block_id, label="", edge_id=None, edge_type="relates_to", auto_layout=False, layout_spacing=160, layout_direction="right")
visual_surface_get_state(room_id=None, surface_id=None)
visual_surface_view(action, room_id=None, surface_id=None, target_block_id=None, focus_zoom=None)
visual_surface_propose_action(room_id=None, surface_id=None, target_block_id=None, after_text=None, title=None, explanation=None)
visual_surface_apply_action(proposal_id, decision="apply")
visual_surface_record_decision(room_id, text)
visual_surface_get_events(room_id, since_event_id=None, limit=100)
visual_surface_commit(room_id, surface_id=None)
visual_room_close(room_id, reason=None)
edge_type accepts relates_to, next, and, or, condition, supports,
blocks, or causes. layout_direction accepts right, left, down, or
up when auto_layout=True.
The Excalidraw prototype has polished dev controls for the agent-facing primitives: add cards, free text, images, connectors, frames, focus, highlight, status badges, selection updates, snapshot export, and recent event inspection. The legacy inline UI remains as a fallback when the React build is absent.
This is not a hosted HiCharlie product surface yet. It is a local testbed for the future room model: typed surfaces, structured actions, event log, proposals, confirmation, and final Codex handoff.
VoiceCode MCP is meant to work with MCP-compatible coding environments, including workflows built around tools like:
- Codex
- Claude Code
- Cursor
- OpenClaw
- other MCP clients
The goal is not to be tied to one host.
The goal is to make voice a clean primitive that any serious coding agent can use.
VoiceCode MCP supports both styles.
Voice-assisted: voice is used only when it clearly beats text.
This will be the right default for many developers.
Voice-first: voice becomes the main user-facing interface for important checkpoints:
- kickoff
- clarification
- decision
- validation
- reprioritization
The agent still works autonomously between calls.
This is not about talking constantly.
It is about making voice the control plane when that feels better than typing.
- Open source first
- One MCP instead of another app
- Transcript back into the same agent loop
- Minimal surface area
- Local-first concurrency control
- Provider WebRTC-first design
- Useful in real developer workflows, not just demos
This repository is the new public home for VoiceCode MCP.
What is going into this repo:
- the MCP server
- HiCharlie WebRTC integration
- local queueing for concurrent sessions
- transcript recovery support
- examples for MCP clients
- install and setup instructions
- contribution guidelines
If you are early: perfect. This is exactly the moment to star the repo, try it, and help shape the workflow.
Coding agents are already good at reading code, planning, editing, and testing.
What they still lack is a clean, native way to talk to the human when human alignment is the real blocker.
VoiceCode MCP fills that gap.
It gives agents a simple new move:
“Let me call you, get aligned, and continue.”
That sounds small.
In practice, it changes how natural agentic coding feels.
Contributions, feedback, issues, and workflow ideas are welcome.
Especially useful feedback:
- host integration examples
- voice workflow patterns that worked well
- multi-session edge cases
- WebRTC session and transcript edge cases
- transcript recovery failures
- README and onboarding improvements
If you try VoiceCode MCP in a real coding workflow, we want to hear about it.
VoiceCode MCP is built by the team at HiCharlie.
VoiceCode MCP is intentionally a HiCharlie-only MVP for now. HiCharlie owns the runtime stack behind the session, including whichever STT/LLM/TTS providers the agent selects internally. The MCP should not add separate ElevenLabs or provider-generic logic until there is a clear product need.
The goal: one open MCP that makes coding agents natural to talk to.