Camera/audio/cmd web bridges + Mac+LM-Studio agentic compose#2277

Open
tfius wants to merge 29 commits into dimensionalOS:main from tfius:feat/web-bridges-llm-presets


@tfius tfius commented May 28, 2026

Summary

Adds three web bridges and a local-LLM hackathon quickstart, exposing the
robot's I/O over plain HTTP/WS so external processes (e.g. an MLX VLM running
elsewhere on the same Mac) can drive perception + control without speaking
any internal dimos APIs.

New modules (Mac- and sim-friendly, no CUDA)

  • camera-mjpeg-module — port 7780
    • GET /video_feed/color_image MJPEG (multipart/x-mixed-replace)
    • GET /snapshot/color_image single JPEG
    • CORS open
  • audio-ws-module — port 7781
    • ws://…/audio_out binary int16 PCM frames (with a JSON {event:"format",…} hello)
    • GET /audio_info reports rate/channels
    • POST /play accepts WAV bytes → robot speaker (megaphone path)
  • cmd-bridge-module — port 7782
    • POST /cmd_vel, POST /path (raw Twist or semantic forward/left/degrees),
      POST /stop, GET /pose
    • Open-loop; designed for a VLM that re-plans each iteration
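
As a rough illustration of how an external process might talk to these bridges, the sketch below builds the snapshot and `/path` requests using only the standard library. The helper names (`snapshot_request`, `path_request`, `send`) and the loopback host are assumptions for illustration; the ports, routes, and the `steps` payload shape are taken from this description and its test plan.

```python
import json
import urllib.request

CAMERA_BASE = "http://127.0.0.1:7780"  # camera-mjpeg-module port from this PR
CMD_BASE = "http://127.0.0.1:7782"     # cmd-bridge-module port from this PR


def snapshot_request(base: str = CAMERA_BASE) -> urllib.request.Request:
    """Build a GET for a single JPEG frame."""
    return urllib.request.Request(f"{base}/snapshot/color_image", method="GET")


def path_request(steps: list, base: str = CMD_BASE) -> urllib.request.Request:
    """Build a POST /path request carrying a list of steps as JSON."""
    body = json.dumps({"steps": steps}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/path",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 5.0) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


# Same payload as the curl call in the test plan (hypothetical usage):
req = path_request([{"forward": 0.5, "duration": 2}])
```

Calling `send(req)` would only succeed with the bridges running; building the request separately keeps the payload easy to inspect first.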

Go2 audio I/O (real robot only — sim has no audio)

  • UnitreeWebRTCConnection.audio_stream() emits AudioMessage(data, sample_rate, channels)
    by hooking the existing WebRTC audio transceiver
  • UnitreeWebRTCConnection.play_wav_bytes(wav) — uploads via audiohub.upload_megaphone,
    enters megaphone, sleeps for the WAV's duration, then exits cleanly
  • New skills on GO2Connection: play_wav(path), play_wav_b64(base64)
  • New streams: audio: Out[AudioMessage], audio_in: In[bytes]

Launcher scripts

  • go2-start.sh — hackathon quickstart: wifi/ping/NTP checks, env-based LLM
    presets (LMSTUDIO=1, MLXVLM=1), auto-injects mcp-server + mcp-client
    when an LLM preset is set against a non-agentic blueprint
  • sim-with-llm.sh — thin wrapper that sets SIMULATION=1 and forwards
  • Both probe /v1/models before launch and surface a useful error if the
    backend isn't up
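
The pre-launch probe is done in shell by the launcher scripts; the Python sketch below only illustrates the same idea. The function names are hypothetical, and treating any non-200 answer or connection error as "backend down" is an assumption, not necessarily what the scripts do.

```python
import urllib.request


def models_url(base_url: str) -> str:
    """Join an OpenAI-style base URL (ending in /v1) with the models route."""
    return base_url.rstrip("/") + "/models"


def backend_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

A launcher could call `backend_is_up("http://127.0.0.1:1234/v1")` and print a hint about starting the local server when it returns `False`.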

Docs

  • journal/2026-05-27-camera-audio-bridges.md — implementation notes,
    review fixes, Mac+local-LLM agentic landscape
  • journal/2026-05-27-mlxvlm-robot-integration-prompt.md — drop-in brief
    for an external Claude/agent that consumes these endpoints

Includes (from earlier feat/gemini-go2-2245 work by @grmkris, not by me)

  • feat(go2): all-Gemini VL for the Mac-only agentic blueprint
  • feat(take_picture), feat(follow_person), feat(go2): tilt_body,
    save/load map, room_scan, etc.

Test plan

  • ./sim-with-llm.sh lmstudio boots; agent reaches the endpoint
  • ./sim-with-llm.sh mlxvlm reaches mlxvlm at :8080
  • Open http://127.0.0.1:7780/video_feed/color_image — MJPEG renders
  • curl -X POST -d '{"steps":[{"forward":0.5,"duration":2}]}' http://127.0.0.1:7782/path
    drives the sim Go2 forward
  • dimos mcp list-tools shows observe, play_wav, play_wav_b64
  • On real Go2: WebSocket audio_out streams PCM; POST /play plays
    through the dog's speaker and exits megaphone cleanly

🤖 Generated with Claude Code

bogwi and others added 29 commits May 25, 2026 13:46
…d capture-viewer tool

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the custom Go2FullRecorder with the stock go2-memory recorder for the
capture demo (camera + lidar + odom is enough for the frames+trajectory viewer).
Point the capture-viewer at recording_go2.db and ignore recording sidecars +
the MuJoCo runtime log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New TakePictureSkill subscribes to color_image, caches the latest frame, and on
take_picture() JPEG-encodes it and POSTs to robomoo's /api/robot/frame with a
shared bearer token (ROBOMOO_URL / ROBOT_INGEST_TOKEN from env). Wired into
unitree_go2_agentic_gemini and registered as take-picture-skill so the agent can
call it ("take a picture").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
take_picture now attaches the robot's odom pose (poseX/poseY) + label so the web
can pin captures on the map. New MapUploader subscribes global_costmap, renders
it with turbo_image, and POSTs the PNG + grid metadata to robomoo /api/robot/map
(throttled). Both wired into unitree_go2_agentic_gemini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the uncommitted Gemini layer onto the branch:
- Gemini VL model (dimos/models/vl/gemini.py) wired into create()/types
- --detection-model CLI knob + prefetch
- Navigation/PersonFollow vl_model_name wiring
- explore_and_capture skill gated on FrontierExplorerSpec
- Gemini image/text embeddings raise instead of returning random vectors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ueprint

The dog has no onboard brain — all compute runs on the Mac. Moondream is
unusable there (~6 min/inference + Metal crash, which aborted the blueprint
at startup and looked like a connection failure). Route every VL path to
Gemini: PersonFollow vl_model_name moondream->gemini, plus
.global_config(detection_model=gemini) for look_out_for / PerceiveLoop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GeminiSpeakSkill called GeminiTTSNode.consume_text() on every speak(), which
spawned a fresh worker thread + subscription each call (the repeated 'Starting
GeminiTTSNode' log) and leaked the old ones. Wire the TTS pipeline ONCE in
start() against a long-lived Subject; each speak() just pushes onto it and the
node drains FIFO. Default speak() to blocking=False so the agent isn't stalled
on synthesis; blocking=True still waits, matching on the emitted utterance so a
concurrent non-blocking speak can't trip the wait. Make consume_text idempotent
as defense-in-depth.
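
The wire-once pattern described above can be sketched with a plain queue and a single worker thread. This is a simplified stand-in, not the GeminiSpeakSkill code: `QueuedSpeaker` and its `synthesize` callable are hypothetical, and a `queue.Queue` replaces the Subject used in the commit.

```python
import queue
import threading


class QueuedSpeaker:
    """One long-lived worker drains utterances in FIFO order."""

    def __init__(self, synthesize):
        self._synthesize = synthesize   # callable(text); stands in for the TTS node
        self._queue = queue.Queue()
        self._worker = None

    def start(self) -> None:
        if self._worker is not None:    # idempotent: never spawn a second worker
            return
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def speak(self, text: str, blocking: bool = False) -> None:
        done = threading.Event()
        self._queue.put((text, done))
        if blocking:
            done.wait()                 # waits for this utterance only

    def stop(self) -> None:
        if self._worker is None:
            return
        self._queue.put(None)           # sentinel ends the worker loop
        self._worker.join()
        self._worker = None

    def _drain(self) -> None:
        while (item := self._queue.get()) is not None:
            text, done = item
            try:
                self._synthesize(text)
            finally:
                done.set()
```

Each utterance carries its own `Event`, so a blocking call waits on exactly the utterance it submitted, even when non-blocking calls are queued around it.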

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
take_picture blocked the agent on a synchronous httpx.post (up to 30s) before
returning. Snapshot the current frame+pose synchronously, then dispatch the
JPEG encode + POST to a daemon thread and return at once. Refactor the upload
into _upload_frame(frame, pose, ...) shared by both the skill and the explore
capture loop (via a thin _upload_current wrapper); track outstanding upload
threads and join them in stop().

Trade-off: the skill no longer returns the storage key or surfaces HTTP errors
to the agent — upload failures are logged only. That is the intended cost of
returning fast.
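
A minimal sketch of that shape, assuming a hypothetical `BackgroundUploader` whose `upload` callable stands in for the JPEG encode plus HTTP POST. It shows the three pieces the commit names: a synchronous snapshot, a tracked daemon thread, and a join on stop.

```python
import threading


class BackgroundUploader:
    """Snapshot synchronously, upload on a tracked daemon thread."""

    def __init__(self, upload):
        self._upload = upload           # callable(frame, pose); may block on HTTP
        self._threads = []
        self._lock = threading.Lock()

    def capture(self, frame, pose) -> str:
        # frame and pose are bound here, before returning, so a newer frame
        # cannot replace what the caller asked for.
        t = threading.Thread(target=self._safe_upload, args=(frame, pose), daemon=True)
        with self._lock:
            self._threads = [x for x in self._threads if x.is_alive()]
            t.start()
            self._threads.append(t)
        return "upload started"

    def _safe_upload(self, frame, pose) -> None:
        try:
            self._upload(frame, pose)
        except Exception as exc:        # failures are logged, not returned
            print(f"upload failed: {exc}")

    def stop(self, timeout: float = 5.0) -> None:
        with self._lock:
            pending, self._threads = self._threads, []
        for t in pending:
            t.join(timeout)
```

The return value carries no storage key, which mirrors the trade-off stated above: the caller learns that an upload started, not whether it succeeded.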

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Go2 camera is body-fixed; the only way to look up/down is to pitch the body
with the Euler sport command (api_id 1007), which was not exposed and needs a
parameter payload execute_sport_command can't send. Add a tilt_body(pitch_deg,
roll_deg, yaw_deg) skill that publishes Euler to SPORT_MOD with the
{"parameter": {"data": {x,y,z}}} payload (same publish_request path
execute_sport_command uses), converting degrees->radians and clamping to the
safe standing envelope (±0.75/±0.75/±0.6 rad). Negative pitch looks up.

Note: pitch sign and the {data:{x,y,z}} vs flat {x,y,z} payload form should be
confirmed on-robot; the clamp is the safety net.
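
The degree-to-radian conversion and clamp can be sketched as follows. The limits and the nested payload form come from the commit message; the axis mapping (x=roll, y=pitch, z=yaw) and the function names are assumptions, and the commit itself flags the payload form as unconfirmed on-robot.

```python
import math

# Limits in radians from the commit message: ±0.75 / ±0.75 / ±0.6.
LIMITS_RAD = {"roll": 0.75, "pitch": 0.75, "yaw": 0.6}


def clamp(value: float, limit: float) -> float:
    """Restrict value to the closed interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def euler_payload(pitch_deg: float = 0.0, roll_deg: float = 0.0,
                  yaw_deg: float = 0.0) -> dict:
    """Convert degrees to clamped radians and wrap them in the nested payload.

    Assumption: x=roll, y=pitch, z=yaw. The commit does not state the mapping.
    """
    roll = clamp(math.radians(roll_deg), LIMITS_RAD["roll"])
    pitch = clamp(math.radians(pitch_deg), LIMITS_RAD["pitch"])
    yaw = clamp(math.radians(yaw_deg), LIMITS_RAD["yaw"])
    return {"parameter": {"data": {"x": roll, "y": pitch, "z": yaw}}}
```

Clamping after the conversion means an out-of-range request from the agent degrades to the envelope edge instead of being rejected.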

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On a Mac (no CUDA) follow_person falls back to 'redetect', which ran a full
synchronous Gemini detection every control cycle — capping the loop at ~0.5-1 Hz
so the robot acted on stale velocity commands (the laggy feel). Decouple the two:
a background thread re-detects every _redetect_period (0.8s) to anchor a cheap
local OpenCV tracker, while the control loop runs at _frequency (20 Hz),
updating the tracker locally and publishing a fresh twist each cycle.

_create_tracker() picks the best available tracker (CSRT > KCF > MIL, across the
main and legacy cv2 namespaces) so it auto-upgrades to CSRT where opencv-contrib
exists and falls back to MIL (base OpenCV) here. Lost handling keeps the existing
_lost_timeout semantics. The EdgeTAM (CUDA) path and auto mode-resolution are
untouched, so GPU machines are unaffected.
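
The preference-order lookup might look roughly like the sketch below. It is not the `_create_tracker()` implementation: `pick_tracker_factory` is a hypothetical helper that takes the namespaces as arguments so it can be exercised without OpenCV installed.

```python
from types import SimpleNamespace

# Preference order from the commit message: CSRT > KCF > MIL.
PREFERRED = ("TrackerCSRT_create", "TrackerKCF_create", "TrackerMIL_create")


def pick_tracker_factory(namespaces):
    """Return (name, factory) for the best tracker found in any namespace."""
    for name in PREFERRED:
        for ns in namespaces:
            factory = getattr(ns, name, None) if ns is not None else None
            if callable(factory):
                return name, factory
    raise RuntimeError("no supported OpenCV tracker found")


# Stand-in for a base OpenCV build that only ships MIL.
base_only = SimpleNamespace(TrackerMIL_create=lambda: "mil-tracker")
name, make = pick_tracker_factory([base_only, None])
```

With OpenCV available, the namespaces would be `[cv2, getattr(cv2, "legacy", None)]`, so a contrib build resolves to CSRT while a base build falls through to MIL.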

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…el in one call

Tilting then capturing as separate agent calls is racy: tilt_body returns before
the body settles and take_picture snapshots instantly, so the photo can catch a
mid-tilt view. Add a deterministic tilt_and_capture(pitch_deg=-20, note,
settle_s=1.0) that runs the whole sequence in the background — tilt, wait to
settle, snapshot + upload the tilted view, then re-level — and returns at once.

Reuses the existing ModuleRef-spec pattern: a new TiltSpec resolves structurally
to UnitreeSkillContainer.tilt_body (which @Skill exposes over rpc), so the
capture module can aim the body-fixed camera without owning the WebRTC
connection. Negative pitch looks up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
map_uploader now ships a value-preserving grayscale occupancy PNG
(free=0, occupied=1..100, unknown=255) instead of a baked turbo image,
so the web can recolor + overlay client-side.
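
For reference, the value-preserving encoding amounts to a per-cell mapping like the one sketched here. The output values follow the commit message; the input convention (-1 for unknown, 0..100 for occupancy) and the function names are assumptions.

```python
def encode_cell(value: int) -> int:
    """Map one occupancy value to a grayscale byte.

    Output: free=0, occupied=1..100, unknown=255 (per the commit message).
    Assumption: input uses -1 for unknown and 0..100 for occupancy.
    """
    if value < 0:
        return 255
    return min(value, 100)


def encode_grid(grid: list) -> bytes:
    """Flatten a row-major grid into raw 8-bit grayscale pixels."""
    return bytes(encode_cell(v) for row in grid for v in row)
```

Because occupied cells keep their original 1..100 value, a client can apply its own colormap or threshold without guessing what a baked palette meant.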

Add scripts/export_recording.py: reads a memory2 SqliteStore .db and
pushes a top-down lidar occupancy map, downsampled odom trajectory, and
CLIP-embedded keyframes (+thumbnails+pose) to robomoo's /api/robot/*.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wifi/ROBOT_IP sanity check, NTP sync, venv bootstrap, and
`exec dimos run <blueprint>` (default `unitree-go2-basic`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CameraMjpegModule (`camera-mjpeg-module`) republishes the `color_image`
stream as an HTTP MJPEG feed and a single-JPEG snapshot, with CORS open:

  GET /video_feed/color_image   multipart/x-mixed-replace
  GET /snapshot/color_image     image/jpeg

Default port 7780. Compose with any blueprint that publishes
`color_image` (sim or real Go2):

  dimos --simulation run unitree-go2-basic camera-mjpeg-module
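
For readers unfamiliar with MJPEG over HTTP, each frame is sent as one part of a `multipart/x-mixed-replace` response. The sketch below shows generic framing only; the boundary name `frame` and the helper names are assumptions, not values taken from CameraMjpegModule.

```python
def mjpeg_part(jpeg: bytes, boundary: str = "frame") -> bytes:
    """Wrap one JPEG in a multipart/x-mixed-replace part."""
    header = (
        f"--{boundary}\r\n"
        "Content-Type: image/jpeg\r\n"
        f"Content-Length: {len(jpeg)}\r\n"
        "\r\n"
    ).encode("ascii")
    return header + jpeg + b"\r\n"


def mjpeg_content_type(boundary: str = "frame") -> str:
    """Value for the response's Content-Type header."""
    return f"multipart/x-mixed-replace; boundary={boundary}"
```

A streaming handler would yield `mjpeg_part(...)` for every new frame; the browser replaces the displayed image each time a part arrives.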

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds full bidirectional audio support for the real Go2:

  - UnitreeWebRTCConnection.audio_stream() emits AudioMessage
    (int16 PCM + sample_rate + channels) by hooking the existing
    WebRTC audio transceiver and activating it via switchAudioChannel.
    Float frames are scaled before int16 cast.

  - UnitreeWebRTCConnection.play_wav_bytes(wav) uploads + plays a
    WAV through the audiohub megaphone, then exits megaphone mode
    after the clip's duration so subsequent commands work normally.

  - GO2Connection exposes audio: Out[AudioMessage] and
    audio_in: In[bytes], plus two skills:
      play_wav(wav_path)   — local-filesystem WAV
      play_wav_b64(wav_b64) — base64 WAV for remote MCP clients

  - New AudioWsModule (`audio-ws-module`) bridges to the browser:
      WebSocket /audio_out  binary PCM frames
      GET       /audio_info sample_rate + channels (initial WS frame
                            also broadcasts {"event":"format", ...})
      POST      /play       WAV body for robot speaker

Simulation has no audio; the wiring is hasattr-guarded so
unitree-go2-basic still works in sim.
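
The "scaled before int16 cast" step can be illustrated with a small standalone function. This is a sketch under assumptions: float samples in [-1.0, 1.0], a scale factor of 32767, and little-endian output. The commit does not state which of these the real code uses, and `float_to_int16_pcm` is a hypothetical name.

```python
import struct


def float_to_int16_pcm(samples) -> bytes:
    """Clip floats to [-1.0, 1.0], scale to int16, pack as little-endian PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)
```

Casting floats straight to int16 without scaling would collapse nearly every sample to 0, which is why the scale step matters.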

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ATION mode

- BRIDGES=1 (default) appends `camera-mjpeg-module audio-ws-module`
  to the dimos run argv so the web endpoints come up automatically.
- SIMULATION=1 swaps in `--simulation`, skips wifi/ping/NTP checks,
  installs the `sim` extra on first venv bootstrap.
- EXTRA="…" lets callers tack on additional modules.
- Banner prints every endpoint that will be available
  (command center, MJPEG, snapshot, audio WS, audio play).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rapper

go2-start.sh
  - LMSTUDIO=1 routes McpClient to LM Studio's OpenAI-compatible server
    at http://127.0.0.1:1234/v1 (LMSTUDIO_MODEL=qwen/qwen3-8b by default).
  - MLXVLM=1 routes to the mlxvlm Gemma-4 server at
    http://127.0.0.1:8080/v1 (MLXVLM_MODEL defaults to gemma-4-E4B-it-MLX-4bit).
  - Probes /v1/models before launch; bails with a useful hint if the
    backend isn't up.
  - Both presets set OPENAI_BASE_URL + OPENAI_API_KEY and pass
    `-o mcpclient.model=openai:<model>` to dimos run.
  - Warns when an LLM preset is set against a non-agentic blueprint.

sim-with-llm.sh
  - One-liner: `./sim-with-llm.sh mlxvlm` runs the sim with the agentic
    blueprint pointed at mlxvlm; `lmstudio` and `ollama` variants too.
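
Expressed as data, the two presets reduce to a base URL and a default model. The Python sketch below mirrors what the commit says the shell script does; `llm_preset` is a hypothetical helper, and the API key placeholder is an assumption because the value the script sets is not shown here.

```python
from typing import Optional

PRESETS = {
    # Defaults named in the commit message.
    "lmstudio": ("http://127.0.0.1:1234/v1", "qwen/qwen3-8b"),
    "mlxvlm": ("http://127.0.0.1:8080/v1", "gemma-4-E4B-it-MLX-4bit"),
}


def llm_preset(name: str, model: Optional[str] = None):
    """Return (env, extra_argv) for a named preset."""
    base_url, default_model = PRESETS[name]
    env = {
        "OPENAI_BASE_URL": base_url,
        "OPENAI_API_KEY": "placeholder",  # assumption: actual value not shown in the PR
    }
    argv = ["-o", f"mcpclient.model=openai:{model or default_model}"]
    return env, argv
```

For example, `llm_preset("mlxvlm")` yields the `:8080` base URL and an override argument naming the Gemma model.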

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
unitree-go2-agentic pulls in SecurityModule (EdgeTAM/CUDA), local
Moondream VL (crashes Metal), and OpenAI-hardcoded TTS — none of
which boot on Apple Silicon.

unitree-go2-agentic-gemini is the existing Mac skeleton that
disables those and uses Gemini for VL/embeddings/TTS. The chat LLM
override (LMSTUDIO/MLXVLM) still applies — it's just the McpClient.

Surface a GOOGLE_API_KEY warning early since Gemini VL/embed/TTS
fail at runtime without it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ompose

unitree-go2-agentic-gemini imports GeminiSpeakSkill at module load
(before any --disable can apply), and google-genai isn't installed,
so the gemini blueprint crashed at import.

Switch the default compose to:

  unitree-go2-basic                  (camera + viz, no Google imports)
  + mcp-server + mcp-client          (MCP agent loop)
  + unitree-skill-container          (wait / current_time / sport / tilt_body)
  + camera-mjpeg-module + audio-ws-module  (from BRIDGES=1)

For chat: LMSTUDIO=1 or MLXVLM=1 reroutes McpClient as before.
For ollama: keep the existing unitree-go2-agentic-ollama (clean Mac compose).

Trade-off: no relative_move skill (needs the nav stack).
Publish to /cmd_vel directly if movement is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
UnitreeSkillContainer requires a NavigationInterfaceSpec module
(replanning-a-star-planner + nav-stack chain). Without it, build
fails: "No module met that spec."

Minimal default is now just:
  unitree-go2-basic + mcp-server + mcp-client + bridges

Skills available: observe, play_wav, play_wav_b64. Header documents
how to layer in unitree-skill-container + replanning-a-star-planner
when movement skills are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Document the blueprint compatibility matrix on Apple Silicon, the
skill availability per compose, and the VL backend gotcha (qwen is
Alibaba cloud, not local MLX; moondream crashes Metal; gemini is
Google cloud). Notes the missing piece: a Mac-local VL backend
(mlxvlm/openai_compat) that would unlock the richer NavigationSkill
/ PersonFollow containers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous start() did `loop.run_until_complete(server.serve())` from
a daemon thread and used `run_coroutine_threadsafe` to broadcast audio
frames, which races the loop startup and silently raises
"Event loop stopped before Future completed" when uvicorn shuts down
or fails to bind.

Now:
  - uvicorn.Server.run() owns its own loop and lifecycle.
  - Subscriber thread enqueues frames; a coroutine inside the uvicorn
    loop (spawned at FastAPI startup) drains the queue and fans out
    to WebSocket clients.
  - Bind errors (port already in use) are logged with the actual
    OSError and a kill hint, instead of a generic "loop stopped".
  - Drops oldest frame on overflow so latency stays bounded.
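
The drop-oldest policy on the producer side can be sketched with a bounded `queue.Queue`. This covers only the enqueue half of the handoff; the coroutine that drains the queue inside the uvicorn loop is not shown, and `put_drop_oldest` is a hypothetical name.

```python
import queue


def put_drop_oldest(q: queue.Queue, item) -> None:
    """Enqueue item; if the bounded queue is full, discard the oldest entry first."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()   # drop the oldest frame to keep latency bounded
            except queue.Empty:
                pass             # a consumer emptied it first; just retry
```

Dropping the oldest entry favors freshness over completeness, which suits live audio where a late frame is worth less than a current one.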

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… brief

cmd-bridge-module exposes the robot's drive interface over HTTP so an
external VLM (mlxvlm Gemma-4) can run a perceive-act loop:

  POST /cmd_vel   one Twist + duration
  POST /path      sequence of Twist steps (raw or semantic forward/left/degrees)
  POST /stop      cancel any in-flight /path
  GET  /pose      current base_link pose in world frame

Open-loop (no SLAM, no obstacle avoidance) so it works in sim and on the
bare-metal robot without the nav stack. The VLM is expected to re-plan
each iteration from a fresh camera frame.
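
One way to make an open-loop hold cancellable is to wait on a `threading.Event` instead of sleeping. The sketch below is an assumption about how such a bridge could behave, not the cmd-bridge-module code; `hold_twist`, `run_path`, and the tuple-based twist are hypothetical.

```python
import threading

ZERO = ((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))  # (linear, angular) stop command


def hold_twist(publish, twist, duration: float, cancel: threading.Event) -> bool:
    """Publish twist, hold for duration unless cancelled, then always publish zero.

    Returns True if the hold ran to completion, False if it was cancelled.
    """
    publish(twist)
    try:
        cancelled = cancel.wait(timeout=max(0.0, duration))
    finally:
        publish(ZERO)   # never leave the last command latched
    return not cancelled


def run_path(publish, steps, cancel: threading.Event) -> int:
    """Run (twist, duration) steps in order; return how many completed."""
    done = 0
    for twist, duration in steps:
        if not hold_twist(publish, twist, duration, cancel):
            break
        done += 1
    return done
```

With this shape, a `/stop` handler only needs to set the event; the in-flight step wakes immediately and publishes a zero twist.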

journal/2026-05-27-mlxvlm-robot-integration-prompt.md is the brief for
the Claude working on the mlxvlm side: endpoint contract, the
`/api/robot/navigate` perceive-act loop, JSON schema for the model's
per-step reply, failure modes, and sim-only testing instructions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BRIDGES=1 now appends camera-mjpeg-module + audio-ws-module +
cmd-bridge-module, and the banner lists the new cmd_vel/path/stop/pose
endpoints alongside the camera and audio URLs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mode

Real-robot LLM launches (LMSTUDIO=1 ./go2-start.sh) used to forget the
McpClient, so the -o mcpclient.model=... override silently bound to
nothing and LM Studio sat unused.

go2-start.sh now mirrors what sim-with-llm.sh was doing: when an LLM
preset is set against a non-agentic blueprint, mcp-server + mcp-client
are appended to EXTRA. The check is skipped when the blueprint already
includes an agent (e.g. unitree-go2-agentic-ollama).

sim-with-llm.sh becomes a thin wrapper: it only picks the backend
preset and forwards to go2-start.sh with SIMULATION=1.
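
The injection rule reads roughly as the following Python sketch. It is illustrative only: the real logic lives in go2-start.sh, the test for "agentic" by substring is an assumption, and the module ordering is a guess.

```python
def compose_modules(blueprint, llm_preset=False, bridges=True, extra=()):
    """Build the module list that would be passed to `dimos run`.

    Assumption: a blueprint counts as agentic when its name contains "agentic".
    """
    modules = [blueprint]
    if llm_preset and "agentic" not in blueprint:
        modules += ["mcp-server", "mcp-client"]   # give the model override a target
    if bridges:
        modules += ["camera-mjpeg-module", "audio-ws-module", "cmd-bridge-module"]
    modules += list(extra)
    return modules
```

Under these assumptions, `compose_modules("unitree-go2-basic", llm_preset=True)` gains the MCP pair, while `unitree-go2-agentic-ollama` is left alone.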

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tfius tfius requested a review from leshy as a code owner May 28, 2026 04:39

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds three HTTP/WebSocket bridges (MJPEG camera, PCM audio, cmd_vel) to expose robot I/O over plain HTTP for external VLM/LLM processes, adds Go2 audio I/O via WebRTC, several new skills (Gemini TTS, macOS say, take_picture, map_uploader), a Gemini-all blueprint, and two hackathon launcher scripts.

  • Web bridges (mjpeg_module, audio_ws_module, cmd_bridge_module): each runs uvicorn in a daemon thread, uses proper cross-thread handoffs, and binds to loopback only.
  • Audio I/O (UnitreeWebRTCConnection): audio_stream() correctly uses a Subject + finally_action teardown; play_wav_bytes() fires a coroutine into the existing event loop — but exit_megaphone() is missing from the finally block, which can leave the robot unable to move.
  • Launchers (go2-start.sh, sim-with-llm.sh): pre-flight wifi/ping/NTP/LLM checks with LLM auto-injection; a hardcoded developer filesystem path appears in one error message.

Confidence Score: 3/5

Safe to merge with one fix: the megaphone-exit bug in play_wav_bytes must be addressed before deploying on a real Go2, as it can leave the robot unresponsive to all motion commands until manually reset.

The audio WebRTC path in connection.py enters megaphone mode and then calls exit_megaphone() outside any finally block. A coroutine cancellation during active playback will leave the robot permanently in megaphone mode, blocking every subsequent motion command. The rest of the PR — the three web bridges, Gemini TTS skill, take-picture skill, and launcher scripts — is well-structured and largely correct.

dimos/robot/unitree/connection.py (play_wav_bytes / _upload_play_exit) requires the exit_megaphone finally-block fix before real-robot use. go2-start.sh has a minor hardcoded path to address.

Important Files Changed

| Filename | Overview |
| --- | --- |
| dimos/robot/unitree/connection.py | Adds AudioMessage dataclass, audio_stream() observable, and play_wav_bytes(). The exit_megaphone() call is not wrapped in a finally block, so cancellation leaves the robot stuck in megaphone mode. |
| dimos/web/audio_ws_module.py | New WebSocket bridge for robot audio (mic out + WAV playback). Thread/async handoff via a bounded queue is correctly implemented; _clients is only accessed from the same asyncio loop so no race. Clean. |
| dimos/web/cmd_bridge_module.py | New HTTP bridge for robot motion commands. Lock-and-cancel design is correct; duration field lacks an upper bound that could stall the thread-pool worker indefinitely. |
| dimos/web/mjpeg_module.py | New MJPEG + snapshot HTTP module. Frame encoding is correct (RGB to BGR before imencode), snapshot is lock-protected, CORS is open. No issues. |
| dimos/robot/unitree/go2/connection.py | Adds audio/audio_in streams and play_wav/play_wav_b64 skills to GO2Connection. Uses hasattr guards for sim compatibility. Clean. |
| go2-start.sh | New hackathon launcher with wifi/ping/NTP/LLM-endpoint checks. Contains a hardcoded developer local path (/Users/tex/repos/…) in the mlxvlm error message. |
| dimos/agents/skills/gemini_speak_skill.py | New Gemini TTS-backed speak skill. Idempotent consumer wiring, fire-and-forget queue, blocking wait with text matching, and resource cleanup are all handled correctly. |
| dimos/agents/skills/take_picture_skill.py | New skill for one-shot and explore-and-capture photo upload. Background threads tracked and joined on stop; tilt always re-leveled in finally. Clean. |
| sim-with-llm.sh | Thin wrapper around go2-start.sh that routes the backend argument (mlxvlm/lmstudio/ollama) and sets SIMULATION=1. No issues. |
| dimos/stream/audio/tts/node_gemini.py | New Gemini TTS pipeline node. Worker thread and subscription lifecycle are correctly managed; dispose() drains the queue and joins the thread before completing subjects. |

Sequence Diagram

sequenceDiagram
    participant ExtClient as External VLM/Client
    participant MjpegMod as CameraMjpegModule (port 7780)
    participant AudioMod as AudioWsModule (port 7781)
    participant CmdMod as CmdBridgeModule (port 7782)
    participant GO2Conn as GO2Connection
    participant UnitreeConn as UnitreeWebRTCConnection

    GO2Conn->>MjpegMod: color_image stream
    GO2Conn->>AudioMod: audio (AudioMessage) stream
    UnitreeConn->>GO2Conn: video_stream / audio_stream

    ExtClient->>MjpegMod: GET /video_feed/color_image (MJPEG)
    MjpegMod-->>ExtClient: multipart/x-mixed-replace JPEG frames

    ExtClient->>AudioMod: WS /audio_out
    AudioMod-->>ExtClient: binary int16 PCM frames

    ExtClient->>AudioMod: POST /play (WAV bytes)
    AudioMod->>GO2Conn: audio_in (bytes)
    GO2Conn->>UnitreeConn: play_wav_bytes(wav)
    UnitreeConn->>UnitreeConn: upload_megaphone, enter, sleep, exit

    ExtClient->>CmdMod: POST /cmd_vel or /path
    CmdMod->>GO2Conn: cmd_vel (Twist)

    ExtClient->>CmdMod: GET /pose
    CmdMod-->>ExtClient: x, y, z, theta

Reviews (1): Last reviewed commit: "fix(go2-start): auto-inject mcp-server/m..."

Comment on lines +502 to +507
await hub.upload_megaphone(tmp_path)
await hub.enter_megaphone()
# Hold megaphone for the clip's duration, plus a small flush margin,
# then release so other commands work normally.
await asyncio.sleep(duration + 0.5)
await hub.exit_megaphone()

P1 If asyncio.sleep() is cancelled (e.g. the event loop shuts down mid-play) or any awaitable after enter_megaphone() raises, exit_megaphone() is never called and the robot is permanently stuck in megaphone mode until manually reset, rendering all movement commands ineffective.

Suggested change
- await hub.upload_megaphone(tmp_path)
- await hub.enter_megaphone()
- # Hold megaphone for the clip's duration, plus a small flush margin,
- # then release so other commands work normally.
- await asyncio.sleep(duration + 0.5)
- await hub.exit_megaphone()
+ await hub.upload_megaphone(tmp_path)
+ await hub.enter_megaphone()
+ try:
+     # Hold megaphone for the clip's duration, plus a small flush margin,
+     # then release so other commands work normally.
+     await asyncio.sleep(duration + 0.5)
+ finally:
+     await hub.exit_megaphone()
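
To see why the `finally` matters, the sketch below cancels a playback task mid-sleep and checks that the exit call still runs. `FakeHub` is a hypothetical stand-in that records calls; it is not the audiohub API, and the upload step is omitted.

```python
import asyncio


class FakeHub:
    """Stand-in for the audiohub; records calls instead of talking to a robot."""

    def __init__(self):
        self.calls = []

    async def enter_megaphone(self):
        self.calls.append("enter")

    async def exit_megaphone(self):
        self.calls.append("exit")


async def play(hub, duration: float) -> None:
    await hub.enter_megaphone()
    try:
        await asyncio.sleep(duration + 0.5)
    finally:
        await hub.exit_megaphone()   # runs on normal exit, error, or cancellation


async def cancel_mid_play() -> list:
    hub = FakeHub()
    task = asyncio.create_task(play(hub, duration=60.0))
    await asyncio.sleep(0.01)        # let the task enter megaphone mode
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return hub.calls
```

One caveat: an `await` inside `finally` can itself be interrupted if the task is cancelled a second time, so a real exit call that does network I/O may need extra protection.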

Comment thread go2-start.sh
if [[ "$LMSTUDIO" == "1" ]]; then
  c_red " start LM Studio's Local Server (Cmd-Shift-2) and load a tool-capable model"
else
  c_red " start mlxvlm: cd /Users/tex/repos/ai/mlx/mlxvlm && scripts/start-all.sh"

P2 This error message contains a hardcoded path from the developer's local machine (/Users/tex/repos/ai/mlx/mlxvlm). Every other user will see an incorrect and confusing instruction.

Suggested change
- c_red " start mlxvlm: cd /Users/tex/repos/ai/mlx/mlxvlm && scripts/start-all.sh"
+ c_red " start mlxvlm server and ensure it is listening on port 8080"

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

class TwistRequest(BaseModel):
    linear: list[float] = Field(default=[0.0, 0.0, 0.0], min_length=3, max_length=3)
    angular: list[float] = Field(default=[0.0, 0.0, 0.0], min_length=3, max_length=3)
    duration: float = 0.5  # seconds to hold the command before stopping

P2 duration has no upper bound. A caller can send duration: 999999, blocking the uvicorn thread-pool worker (and the _drive_lock) for hours, making /stop unable to interrupt ongoing execution and preventing subsequent commands from acquiring the lock.

Suggested change
- duration: float = 0.5 # seconds to hold the command before stopping
+ duration: float = Field(default=0.5, ge=0.0, le=30.0) # seconds to hold the command before stopping

@leshy leshy added the hackaton label May 29, 2026

4 participants