Camera/audio/cmd web bridges + Mac+LM-Studio agentic compose #2277
tfius wants to merge 29 commits into
…d capture-viewer tool Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the custom Go2FullRecorder with the stock go2-memory recorder for the capture demo (camera + lidar + odom is enough for the frames+trajectory viewer). Point the capture-viewer at recording_go2.db and ignore recording sidecars + the MuJoCo runtime log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New TakePictureSkill subscribes to color_image, caches the latest frame, and on
take_picture() JPEG-encodes it and POSTs to robomoo's /api/robot/frame with a
shared bearer token (ROBOMOO_URL / ROBOT_INGEST_TOKEN from env). Wired into
unitree_go2_agentic_gemini and registered as take-picture-skill so the agent can
call it ("take a picture").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
take_picture now attaches the robot's odom pose (poseX/poseY) + label so the web can pin captures on the map. New MapUploader subscribes global_costmap, renders it with turbo_image, and POSTs the PNG + grid metadata to robomoo /api/robot/map (throttled). Both wired into unitree_go2_agentic_gemini. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the uncommitted Gemini layer onto the branch:
- Gemini VL model (dimos/models/vl/gemini.py) wired into create()/types
- --detection-model CLI knob + prefetch
- Navigation/PersonFollow vl_model_name wiring
- explore_and_capture skill gated on FrontierExplorerSpec
- Gemini image/text embeddings raise instead of returning random vectors
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ueprint The dog has no onboard brain — all compute runs on the Mac. Moondream is unusable there (~6 min/inference + Metal crash, which aborted the blueprint at startup and looked like a connection failure). Route every VL path to Gemini: PersonFollow vl_model_name moondream->gemini, plus .global_config(detection_model=gemini) for look_out_for / PerceiveLoop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GeminiSpeakSkill called GeminiTTSNode.consume_text() on every speak(), which spawned a fresh worker thread + subscription each call (the repeated 'Starting GeminiTTSNode' log) and leaked the old ones. Wire the TTS pipeline ONCE in start() against a long-lived Subject; each speak() just pushes onto it and the node drains FIFO. Default speak() to blocking=False so the agent isn't stalled on synthesis; blocking=True still waits, matching on the emitted utterance so a concurrent non-blocking speak can't trip the wait. Make consume_text idempotent as defense-in-depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
take_picture blocked the agent on a synchronous httpx.post (up to 30s) before returning. Snapshot the current frame+pose synchronously, then dispatch the JPEG encode + POST to a daemon thread and return at once. Refactor the upload into _upload_frame(frame, pose, ...) shared by both the skill and the explore capture loop (via a thin _upload_current wrapper); track outstanding upload threads and join them in stop(). Trade-off: the skill no longer returns the storage key or surfaces HTTP errors to the agent — upload failures are logged only. That is the intended cost of returning fast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
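The snapshot-then-dispatch pattern described in this commit could look roughly like the sketch below. `BackgroundUploader` and `upload_fn` are hypothetical names standing in for the skill and its `_upload_frame` helper; the real module's structure may differ.

```python
import threading


class BackgroundUploader:
    """Run slow uploads on daemon threads; join whatever is left in stop()."""

    def __init__(self, upload_fn):
        # upload_fn is a hypothetical callable that does the JPEG encode + POST.
        self._upload_fn = upload_fn
        self._threads = []
        self._lock = threading.Lock()

    def submit(self, frame, pose):
        # The caller has already snapshotted frame and pose; only the slow
        # work is handed to the thread, so this returns immediately.
        t = threading.Thread(target=self._run, args=(frame, pose), daemon=True)
        with self._lock:
            # Drop finished threads so the tracking list stays small.
            self._threads = [x for x in self._threads if x.is_alive()]
            t.start()
            self._threads.append(t)
        return "upload started"

    def _run(self, frame, pose):
        try:
            self._upload_fn(frame, pose)
        except Exception as exc:
            # Failures are logged only, never surfaced to the caller.
            print(f"upload failed: {exc}")

    def stop(self, timeout=5.0):
        with self._lock:
            pending = list(self._threads)
        for t in pending:
            t.join(timeout)
```

The cost noted in the commit shows up directly: `submit()` has nothing useful to return except an acknowledgement, because the HTTP result does not exist yet.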
The Go2 camera is body-fixed; the only way to look up/down is to pitch the body
with the Euler sport command (api_id 1007), which was not exposed and needs a
parameter payload execute_sport_command can't send. Add a tilt_body(pitch_deg,
roll_deg, yaw_deg) skill that publishes Euler to SPORT_MOD with the
{"parameter": {"data": {x,y,z}}} payload (same publish_request path
execute_sport_command uses), converting degrees->radians and clamping to the
safe standing envelope (±0.75/±0.75/±0.6 rad). Negative pitch looks up.
Note: pitch sign and the {data:{x,y,z}} vs flat {x,y,z} payload form should be
confirmed on-robot; the clamp is the safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
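As a rough illustration of the degrees-to-radians conversion and clamp, here is a minimal sketch. The axis mapping (x=roll, y=pitch, z=yaw), the per-axis assignment of the three limits, and the `euler_payload` name are assumptions; the commit itself notes that the payload form still needs on-robot confirmation.

```python
import math

# Clamp limits in radians, from the commit message (0.75 / 0.75 / 0.6).
# Assumption: they apply to roll, pitch, yaw in that order.
MAX_ROLL, MAX_PITCH, MAX_YAW = 0.75, 0.75, 0.6


def clamp(value, limit):
    """Restrict value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


def euler_payload(pitch_deg=0.0, roll_deg=0.0, yaw_deg=0.0):
    """Build the nested {"parameter": {"data": {x, y, z}}} payload."""
    roll = clamp(math.radians(roll_deg), MAX_ROLL)
    pitch = clamp(math.radians(pitch_deg), MAX_PITCH)
    yaw = clamp(math.radians(yaw_deg), MAX_YAW)
    # Assumption: x=roll, y=pitch, z=yaw.
    return {"parameter": {"data": {"x": roll, "y": pitch, "z": yaw}}}
```

With these limits, a request for 90 degrees of pitch is clamped to 0.75 rad, which is the safety net the commit refers to.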
On a Mac (no CUDA) follow_person falls back to 'redetect', which ran a full synchronous Gemini detection every control cycle — capping the loop at ~0.5-1 Hz so the robot acted on stale velocity commands (the laggy feel). Decouple the two: a background thread re-detects every _redetect_period (0.8s) to anchor a cheap local OpenCV tracker, while the control loop runs at _frequency (20 Hz), updating the tracker locally and publishing a fresh twist each cycle. _create_tracker() picks the best available tracker (CSRT > KCF > MIL, across the main and legacy cv2 namespaces) so it auto-upgrades to CSRT where opencv-contrib exists and falls back to MIL (base OpenCV) here. Lost handling keeps the existing _lost_timeout semantics. The EdgeTAM (CUDA) path and auto mode-resolution are untouched, so GPU machines are unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
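The tracker preference order (CSRT > KCF > MIL, across the main and legacy cv2 namespaces) might be resolved along these lines. This is a sketch, not the PR's `_create_tracker()`: `pick_tracker_factory` is a hypothetical helper, and it takes the cv2 module as an argument so the lookup can be exercised without OpenCV installed.

```python
from types import SimpleNamespace

TRACKER_PRIORITY = ("CSRT", "KCF", "MIL")


def pick_tracker_factory(cv2_module):
    """Return (name, factory) for the best tracker the given cv2 build offers."""
    namespaces = [cv2_module, getattr(cv2_module, "legacy", None)]
    for name in TRACKER_PRIORITY:
        for ns in namespaces:
            if ns is None:
                continue
            factory = getattr(ns, f"Tracker{name}_create", None)
            if factory is not None:
                return name, factory
    raise RuntimeError("no supported OpenCV tracker found")


# Stand-in for a base OpenCV build that only ships MIL.
fake_base_cv2 = SimpleNamespace(TrackerMIL_create=lambda: "mil-tracker")
name, factory = pick_tracker_factory(fake_base_cv2)
```

On a real install the call would be `pick_tracker_factory(cv2)`, which would pick CSRT when opencv-contrib provides it in either namespace.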
…el in one call Tilting then capturing as separate agent calls is racy: tilt_body returns before the body settles and take_picture snapshots instantly, so the photo can catch a mid-tilt view. Add a deterministic tilt_and_capture(pitch_deg=-20, note, settle_s=1.0) that runs the whole sequence in the background — tilt, wait to settle, snapshot + upload the tilted view, then re-level — and returns at once. Reuses the existing ModuleRef-spec pattern: a new TiltSpec resolves structurally to UnitreeSkillContainer.tilt_body (which @Skill exposes over rpc), so the capture module can aim the body-fixed camera without owning the WebRTC connection. Negative pitch looks up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
map_uploader now ships a value-preserving grayscale occupancy PNG (free=0, occupied=1..100, unknown=255) instead of a baked turbo image, so the web can recolor + overlay client-side. Add scripts/export_recording.py: reads a memory2 SqliteStore .db and pushes a top-down lidar occupancy map, downsampled odom trajectory, and CLIP-embedded keyframes (+thumbnails+pose) to robomoo's /api/robot/*. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
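The value-preserving encoding (free=0, occupied=1..100, unknown=255) can be sketched as a per-cell mapping. The input convention is an assumption here: cells are taken to be integers where 0..100 is an occupancy value and anything else (for example -1) means unknown. `encode_occupancy` is a hypothetical name.

```python
def encode_occupancy(cells):
    """Map occupancy cells to grayscale bytes: 0..100 kept as-is, unknown -> 255."""
    out = bytearray()
    for value in cells:
        if 0 <= value <= 100:
            out.append(int(value))  # free (0) and occupied (1..100) are preserved
        else:
            out.append(255)         # assumption: out-of-range means unknown
    return bytes(out)
```

Because the byte values are the occupancy values, a client can recolor or threshold the PNG without knowing which colormap the robot used.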
Wifi/ROBOT_IP sanity check, NTP sync, venv bootstrap, and `exec dimos run <blueprint>` (default `unitree-go2-basic`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CameraMjpegModule (`camera-mjpeg-module`) republishes the `color_image` stream as an HTTP MJPEG feed and a single-JPEG snapshot, with CORS open:
  GET /video_feed/color_image   multipart/x-mixed-replace
  GET /snapshot/color_image     image/jpeg
Default port 7780. Compose with any blueprint that publishes `color_image` (sim or real Go2):
  dimos --simulation run unitree-go2-basic camera-mjpeg-module
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
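For readers unfamiliar with `multipart/x-mixed-replace`, each JPEG is sent as one part delimited by a boundary line. The sketch below shows the framing only; the boundary string `frame` and the helper name `mjpeg_part` are assumptions, not taken from the module.

```python
BOUNDARY = "frame"  # hypothetical boundary token
CONTENT_TYPE = f"multipart/x-mixed-replace; boundary={BOUNDARY}"


def mjpeg_part(jpeg: bytes) -> bytes:
    """Wrap one encoded JPEG as a single multipart part."""
    header = (
        f"--{BOUNDARY}\r\n"
        "Content-Type: image/jpeg\r\n"
        f"Content-Length: {len(jpeg)}\r\n"
        "\r\n"
    ).encode("ascii")
    return header + jpeg + b"\r\n"
```

A streaming handler would set the response `Content-Type` to `CONTENT_TYPE` and yield `mjpeg_part(...)` for every new frame; the browser replaces the displayed image each time a part arrives.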
Adds full bidirectional audio support for the real Go2:
- UnitreeWebRTCConnection.audio_stream() emits AudioMessage
(int16 PCM + sample_rate + channels) by hooking the existing
WebRTC audio transceiver and activating it via switchAudioChannel.
Float frames are scaled before int16 cast.
- UnitreeWebRTCConnection.play_wav_bytes(wav) uploads + plays a
WAV through the audiohub megaphone, then exits megaphone mode
after the clip's duration so subsequent commands work normally.
- GO2Connection exposes audio: Out[AudioMessage] and
audio_in: In[bytes], plus two skills:
play_wav(wav_path) — local-filesystem WAV
play_wav_b64(wav_b64) — base64 WAV for remote MCP clients
- New AudioWsModule (`audio-ws-module`) bridges to the browser:
    WebSocket /audio_out   binary PCM frames
    GET /audio_info        sample_rate + channels (initial WS frame also broadcasts {"event":"format", ...})
    POST /play             WAV body for robot speaker
Simulation has no audio; the wiring is hasattr-guarded so
unitree-go2-basic still works in sim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
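The "float frames are scaled before int16 cast" step could be sketched as follows. The scale factor (32767) and the clipping to [-1.0, 1.0] are assumptions about a typical conversion, not confirmed details of `audio_stream()`; `float_to_int16_pcm` is a hypothetical name.

```python
import array


def float_to_int16_pcm(samples):
    """Convert float samples in [-1.0, 1.0] to native-endian int16 PCM bytes."""
    pcm = array.array("h")
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip so the cast cannot overflow
        pcm.append(int(round(s * 32767)))   # assumption: full-scale is 32767
    return pcm.tobytes()
```

Casting unscaled floats straight to int16 would collapse nearly every sample to 0, which is why the scaling has to happen first.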
…ATION mode
- BRIDGES=1 (default) appends `camera-mjpeg-module audio-ws-module` to the dimos run argv so the web endpoints come up automatically.
- SIMULATION=1 swaps in `--simulation`, skips wifi/ping/NTP checks, installs the `sim` extra on first venv bootstrap.
- EXTRA="…" lets callers tack on additional modules.
- Banner prints every endpoint that will be available (command center, MJPEG, snapshot, audio WS, audio play).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rapper
go2-start.sh
- LMSTUDIO=1 routes McpClient to LM Studio's OpenAI-compatible server
at http://127.0.0.1:1234/v1 (LMSTUDIO_MODEL=qwen/qwen3-8b by default).
- MLXVLM=1 routes to the mlxvlm Gemma-4 server at
http://127.0.0.1:8080/v1 (MLXVLM_MODEL defaults to gemma-4-E4B-it-MLX-4bit).
- Probes /v1/models before launch; bails with a useful hint if the
backend isn't up.
- Both presets set OPENAI_BASE_URL + OPENAI_API_KEY and pass
`-o mcpclient.model=openai:<model>` to dimos run.
- Warns when an LLM preset is set against a non-agentic blueprint.
sim-with-llm.sh
- One-liner: `./sim-with-llm.sh mlxvlm` runs the sim with the agentic
blueprint pointed at mlxvlm; `lmstudio` and `ollama` variants too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
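The launcher does its `/v1/models` probe in shell; an equivalent check from Python might look like the sketch below. `models_url` and `backend_is_up` are hypothetical helpers, and treating only HTTP 200 as "up" is an assumption.

```python
import urllib.error
import urllib.request


def models_url(base_url):
    """Append /models to an OpenAI-compatible base URL such as .../v1."""
    return base_url.rstrip("/") + "/models"


def backend_is_up(base_url, timeout=2.0):
    """Return True if GET <base_url>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False
```

For the presets above, the base URLs would be `http://127.0.0.1:1234/v1` (LM Studio) and `http://127.0.0.1:8080/v1` (mlxvlm).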
unitree-go2-agentic pulls in SecurityModule (EdgeTAM/CUDA), local Moondream VL (crashes Metal), and OpenAI-hardcoded TTS — none of which boot on Apple Silicon. unitree-go2-agentic-gemini is the existing Mac skeleton that disables those and uses Gemini for VL/embeddings/TTS. The chat LLM override (LMSTUDIO/MLXVLM) still applies — it's just the McpClient. Surface a GOOGLE_API_KEY warning early since Gemini VL/embed/TTS fail at runtime without it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ompose
unitree-go2-agentic-gemini imports GeminiSpeakSkill at module load (before any --disable can apply), and google-genai isn't installed, so the gemini blueprint crashed at import. Switch the default compose to:
  unitree-go2-basic (camera + viz, no Google imports)
  + mcp-server + mcp-client (MCP agent loop)
  + unitree-skill-container (wait / current_time / sport / tilt_body)
  + camera-mjpeg-module + audio-ws-module (from BRIDGES=1)
For chat: LMSTUDIO=1 or MLXVLM=1 reroutes McpClient as before. For ollama: keep the existing unitree-go2-agentic-ollama (clean Mac compose).
Trade-off: no relative_move skill (needs the nav stack). Publish to /cmd_vel directly if movement is needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UnitreeSkillContainer requires a NavigationInterfaceSpec module (replanning-a-star-planner + nav-stack chain). Without it, build fails: "No module met that spec." Minimal default is now just:
  unitree-go2-basic + mcp-server + mcp-client + bridges
Skills available: observe, play_wav, play_wav_b64.
Header documents how to layer in unitree-skill-container + replanning-a-star-planner when movement skills are needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document the blueprint compatibility matrix on Apple Silicon, the skill availability per compose, and the VL backend gotcha (qwen is Alibaba cloud, not local MLX; moondream crashes Metal; gemini is Google cloud). Notes the missing piece: a Mac-local VL backend (mlxvlm/openai_compat) that would unlock the richer NavigationSkill / PersonFollow containers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous start() did `loop.run_until_complete(server.serve())` from
a daemon thread and used `run_coroutine_threadsafe` to broadcast audio
frames, which races the loop startup and silently raises
"Event loop stopped before Future completed" when uvicorn shuts down
or fails to bind.
Now:
- uvicorn.Server.run() owns its own loop and lifecycle.
- Subscriber thread enqueues frames; a coroutine inside the uvicorn
loop (spawned at FastAPI startup) drains the queue and fans out
to WebSocket clients.
- Bind errors (port already in use) are logged with the actual
OSError and a kill hint, instead of a generic "loop stopped".
- Drops oldest frame on overflow so latency stays bounded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
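The drop-oldest overflow policy can be illustrated with a bounded queue. This sketch uses a thread-safe `queue.Queue`; which queue type the module actually uses is not stated in the commit, and `put_drop_oldest` is a hypothetical helper.

```python
import queue


def put_drop_oldest(q, frame):
    """Enqueue without blocking; when the queue is full, discard the oldest frame."""
    while True:
        try:
            q.put_nowait(frame)
            return
        except queue.Full:
            try:
                q.get_nowait()  # make room by dropping the oldest entry
            except queue.Empty:
                pass            # a consumer drained it first; just retry
```

With a producer that never blocks and a queue of fixed size, the worst-case latency a WebSocket client sees is bounded by the queue depth rather than growing with backlog.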
… brief
cmd-bridge-module exposes the robot's drive interface over HTTP so an external VLM (mlxvlm Gemma-4) can run a perceive-act loop:
  POST /cmd_vel   one Twist + duration
  POST /path      sequence of Twist steps (raw or semantic forward/left/degrees)
  POST /stop      cancel any in-flight /path
  GET /pose       current base_link pose in world frame
Open-loop (no SLAM, no obstacle avoidance) so it works in sim and on the bare-metal robot without the nav stack. The VLM is expected to re-plan each iteration from a fresh camera frame.
journal/2026-05-27-mlxvlm-robot-integration-prompt.md is the brief for the Claude working on the mlxvlm side: endpoint contract, the `/api/robot/navigate` perceive-act loop, JSON schema for the model's per-step reply, failure modes, and sim-only testing instructions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
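A client for `POST /cmd_vel` might build its request as sketched below. The body fields (`linear`, `angular`, `duration`) follow the `TwistRequest` model quoted in the review on this PR; the helper name `cmd_vel_request` and the choice to expose only `linear_x` and `angular_z` are assumptions for illustration.

```python
import json
import urllib.request


def cmd_vel_request(base_url, linear_x=0.0, angular_z=0.0, duration=0.5):
    """Build (but do not send) a POST /cmd_vel request with one Twist + duration."""
    body = {
        "linear": [linear_x, 0.0, 0.0],
        "angular": [0.0, 0.0, angular_z],
        "duration": duration,  # seconds to hold the command before stopping
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/cmd_vel",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = cmd_vel_request("http://127.0.0.1:7782", linear_x=0.3, duration=1.0)
# To actually drive the robot: urllib.request.urlopen(req, timeout=5)
```

Since the bridge is open-loop, a caller would typically send short durations and re-plan from a fresh camera frame between requests.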
BRIDGES=1 now appends camera-mjpeg-module + audio-ws-module + cmd-bridge-module, and the banner lists the new cmd_vel/path/stop/pose endpoints alongside the camera and audio URLs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mode Real-robot LLM launches (LMSTUDIO=1 ./go2-start.sh) used to forget the McpClient, so the -o mcpclient.model=... override silently bound to nothing and LM Studio sat unused. go2-start.sh now mirrors what sim-with-llm.sh was doing: when an LLM preset is set against a non-agentic blueprint, mcp-server + mcp-client are appended to EXTRA. The check is skipped when the blueprint already includes an agent (e.g. unitree-go2-agentic-ollama). sim-with-llm.sh becomes a thin wrapper: it only picks the backend preset and forwards to go2-start.sh with SIMULATION=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
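The argv composition rules spread across the launcher commits (SIMULATION, BRIDGES, EXTRA, and MCP auto-injection) can be summarized in one function. This is a Python paraphrase of shell logic, not the script itself: `compose_argv` is hypothetical, the module ordering is a guess, the "blueprint name contains `agentic`" test is an assumed stand-in for the real agent check, and the `-o mcpclient.model=...` override is left out.

```python
def compose_argv(blueprint, bridges=True, simulation=False, llm_preset=None, extra=()):
    """Assemble a `dimos run` command line from launcher-style switches."""
    argv = ["dimos"]
    if simulation:
        argv.append("--simulation")
    argv += ["run", blueprint]

    modules = list(extra)
    # Assumption: a blueprint counts as agentic if its name says so.
    if llm_preset and "agentic" not in blueprint:
        modules += ["mcp-server", "mcp-client"]
    if bridges:
        modules += ["camera-mjpeg-module", "audio-ws-module", "cmd-bridge-module"]
    return argv + modules
```

The case this commit fixes is the second branch: an LLM preset against `unitree-go2-basic` now pulls in `mcp-server` and `mcp-client`, so the model override has a client to bind to.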
Greptile Summary
This PR adds three HTTP/WebSocket bridges (MJPEG camera, PCM audio, cmd_vel) to expose robot I/O over plain HTTP for external VLM/LLM processes, adds Go2 audio I/O via WebRTC, several new skills (Gemini TTS, macOS
Confidence Score: 3/5
Safe to merge with one fix: the megaphone-exit bug in play_wav_bytes must be addressed before deploying on a real Go2, as it can leave the robot unresponsive to all motion commands until manually reset.
The audio WebRTC path in connection.py enters megaphone mode and then calls exit_megaphone() outside any finally block. A coroutine cancellation during active playback will leave the robot in megaphone mode until it is manually reset, blocking every subsequent motion command. The rest of the PR (the three web bridges, Gemini TTS skill, take-picture skill, and launcher scripts) is well-structured and largely correct.
dimos/robot/unitree/connection.py (play_wav_bytes / _upload_play_exit) requires the exit_megaphone finally-block fix before real-robot use. go2-start.sh has a minor hardcoded path to address.
Sequence Diagram
sequenceDiagram
participant ExtClient as External VLM/Client
participant MjpegMod as CameraMjpegModule (port 7780)
participant AudioMod as AudioWsModule (port 7781)
participant CmdMod as CmdBridgeModule (port 7782)
participant GO2Conn as GO2Connection
participant UnitreeConn as UnitreeWebRTCConnection
GO2Conn->>MjpegMod: color_image stream
GO2Conn->>AudioMod: audio (AudioMessage) stream
UnitreeConn->>GO2Conn: video_stream / audio_stream
ExtClient->>MjpegMod: GET /video_feed/color_image (MJPEG)
MjpegMod-->>ExtClient: multipart/x-mixed-replace JPEG frames
ExtClient->>AudioMod: WS /audio_out
AudioMod-->>ExtClient: binary int16 PCM frames
ExtClient->>AudioMod: POST /play (WAV bytes)
AudioMod->>GO2Conn: audio_in (bytes)
GO2Conn->>UnitreeConn: play_wav_bytes(wav)
UnitreeConn->>UnitreeConn: upload_megaphone, enter, sleep, exit
ExtClient->>CmdMod: POST /cmd_vel or /path
CmdMod->>GO2Conn: cmd_vel (Twist)
ExtClient->>CmdMod: GET /pose
CmdMod-->>ExtClient: x, y, z, theta
Reviews (1): Last reviewed commit: "fix(go2-start): auto-inject mcp-server/m..."
    await hub.upload_megaphone(tmp_path)
    await hub.enter_megaphone()
    # Hold megaphone for the clip's duration, plus a small flush margin,
    # then release so other commands work normally.
    await asyncio.sleep(duration + 0.5)
    await hub.exit_megaphone()
If asyncio.sleep() is cancelled (e.g. the event loop shuts down mid-play) or any awaitable after enter_megaphone() raises, exit_megaphone() is never called and the robot stays stuck in megaphone mode until manually reset, rendering all movement commands ineffective.
Suggested change:
-   await hub.upload_megaphone(tmp_path)
-   await hub.enter_megaphone()
-   # Hold megaphone for the clip's duration, plus a small flush margin,
-   # then release so other commands work normally.
-   await asyncio.sleep(duration + 0.5)
-   await hub.exit_megaphone()
+   await hub.upload_megaphone(tmp_path)
+   await hub.enter_megaphone()
+   try:
+       # Hold megaphone for the clip's duration, plus a small flush margin,
+       # then release so other commands work normally.
+       await asyncio.sleep(duration + 0.5)
+   finally:
+       await hub.exit_megaphone()
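The effect of the try/finally form under cancellation can be checked in isolation with a stand-in hub. `FakeHub`, `play`, and `cancel_mid_playback` are hypothetical test scaffolding, not code from the PR; only the enter/sleep/exit shape mirrors the reviewed function.

```python
import asyncio


class FakeHub:
    """Records megaphone calls instead of talking to a robot."""

    def __init__(self):
        self.calls = []

    async def enter_megaphone(self):
        self.calls.append("enter")

    async def exit_megaphone(self):
        self.calls.append("exit")


async def play(hub, duration):
    await hub.enter_megaphone()
    try:
        await asyncio.sleep(duration)
    finally:
        # Runs on normal completion, on exceptions, and on cancellation.
        await hub.exit_megaphone()


async def cancel_mid_playback():
    hub = FakeHub()
    task = asyncio.create_task(play(hub, 60.0))
    await asyncio.sleep(0)  # let the task start and reach its sleep
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return hub.calls


calls = asyncio.run(cancel_mid_playback())
```

Here `calls` ends up containing both the enter and the exit, whereas the original straight-line version would record only the enter when cancelled during the sleep.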
    if [[ "$LMSTUDIO" == "1" ]]; then
        c_red " start LM Studio's Local Server (Cmd-Shift-2) and load a tool-capable model"
    else
        c_red " start mlxvlm: cd /Users/tex/repos/ai/mlx/mlxvlm && scripts/start-all.sh"
This error message contains a hardcoded path from the developer's local machine (/Users/tex/repos/ai/mlx/mlxvlm). Every other user will see an incorrect and confusing instruction.
Suggested change:
-   c_red " start mlxvlm: cd /Users/tex/repos/ai/mlx/mlxvlm && scripts/start-all.sh"
+   c_red " start mlxvlm server and ensure it is listening on port 8080"
    class TwistRequest(BaseModel):
        linear: list[float] = Field(default=[0.0, 0.0, 0.0], min_length=3, max_length=3)
        angular: list[float] = Field(default=[0.0, 0.0, 0.0], min_length=3, max_length=3)
        duration: float = 0.5  # seconds to hold the command before stopping
duration has no upper bound. A caller can send duration: 999999, blocking the uvicorn thread-pool worker (and the _drive_lock) for hours, making /stop unable to interrupt ongoing execution and preventing subsequent commands from acquiring the lock.
Suggested change:
-   duration: float = 0.5  # seconds to hold the command before stopping
+   duration: float = Field(default=0.5, ge=0.0, le=30.0)  # seconds to hold the command before stopping
Summary
Adds three web bridges and a local-LLM hackathon quickstart, exposing the
robot's I/O over plain HTTP/WS so external processes (e.g. an MLX VLM running
elsewhere on the same Mac) can drive perception + control without speaking
any internal dimos APIs.
New modules (Mac- and sim-friendly, no CUDA)
- camera-mjpeg-module (port 7780)
  - GET /video_feed/color_image: MJPEG (multipart/x-mixed-replace)
  - GET /snapshot/color_image: single JPEG
- audio-ws-module (port 7781)
  - ws://…/audio_out: binary int16 PCM frames (with a JSON {event:"format",…} hello)
  - GET /audio_info: reports rate/channels
  - POST /play: accepts WAV bytes → robot speaker (megaphone path)
- cmd-bridge-module (port 7782)
  - POST /cmd_vel, POST /path (raw Twist or semantic forward/left/degrees), POST /stop, GET /pose

Go2 audio I/O (real robot only; sim has no audio)
- UnitreeWebRTCConnection.audio_stream() → AudioMessage(data, sample_rate, channels), hooks the existing WebRTC audio transceiver
- UnitreeWebRTCConnection.play_wav_bytes(wav): uploads via audiohub.upload_megaphone, enters megaphone, sleeps for the WAV's duration, then exits cleanly
- GO2Connection: play_wav(path), play_wav_b64(base64)
- audio: Out[AudioMessage], audio_in: In[bytes]

Launcher scripts
- go2-start.sh: hackathon quickstart with wifi/ping/NTP checks and env-based LLM presets (LMSTUDIO=1, MLXVLM=1); auto-injects mcp-server + mcp-client when an LLM preset is set against a non-agentic blueprint
- sim-with-llm.sh: thin wrapper that sets SIMULATION=1 and forwards to go2-start.sh
- /v1/models is probed before launch, and a useful error is surfaced if the backend isn't up

Docs
- journal/2026-05-27-camera-audio-bridges.md: implementation notes, review fixes, Mac+local-LLM agentic landscape
- journal/2026-05-27-mlxvlm-robot-integration-prompt.md: drop-in brief for an external Claude/agent that consumes these endpoints

Includes (from earlier feat/gemini-go2-2245 work by @grmkris, not by me)
- feat(go2): all-Gemini VL for the Mac-only agentic blueprint
- feat(take_picture), feat(follow_person), feat(go2): tilt_body, save/load map, room_scan, etc.

Test plan
- ./sim-with-llm.sh lmstudio boots; agent reaches the endpoint
- ./sim-with-llm.sh mlxvlm reaches mlxvlm at :8080
- http://127.0.0.1:7780/video_feed/color_image: MJPEG renders
- curl -X POST -d '{"steps":[{"forward":0.5,"duration":2}]}' http://127.0.0.1:7782/path drives the sim Go2 forward
- dimos mcp list-tools shows observe, play_wav, play_wav_b64
- audio_out streams PCM; POST /play plays through the dog's speaker and exits megaphone cleanly

🤖 Generated with Claude Code