From fe668e6450fa87041cbb742987db2fe36879e981 Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 00:17:47 +0800
Subject: [PATCH 1/8] Add Fetch hackathon submission (Team Pivot)

---
 hackathon/fetch/README.md | 91 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 hackathon/fetch/README.md

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
new file mode 100644
index 0000000000..3a37c8c387
--- /dev/null
+++ b/hackathon/fetch/README.md
@@ -0,0 +1,91 @@
+# Fetch
+
+Hackathon submission from **Team Pivot** — Philip Seifi ([@seifip](https://github.com/seifip)), Wenjie Fu ([@Wenjix](https://github.com/Wenjix)), and GuoZi ([@GuoZhuoRan](https://github.com/GuoZhuoRan)).
+
+**A Unitree Go2 robot dog that trades ice-cold Cokes for instant photos.**
+
+## If You Only Have 60 Seconds
+
+1. Watch the demo: _publishing to YouTube — link coming shortly._
+2. Read the project: https://github.com/seifip/robodog-fetch
+3. Run it yourself (zero hardware — phone or laptop browser camera):
+   https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself
+
+## One Sentence
+
+Fetch works a crowd like a tiny, soda-carrying street performer: a vision LLM
+"reads the room" on every camera frame and decides where to move, what to say, and
+when to snap the photo — all running as a single FastAPI + WebSocket server you can
+try from a phone browser **before any robot is involved**.
+
+```text
+camera frame (+depth)
+  -> vision LLM (OpenAI / Gemini)
+  -> decision (state, cmd_vel, line, photo?)
+  -> act (move Go2, speak, snap photo)
+  ^------------------ ~1s scan loop ------------------v
+```
+
+## What Matters
+
+- **DimOS is the runtime.** Fetch reuses the DimOS teleop web pattern — a FastAPI
+  server serves an HTTPS phone UI, the phone streams camera frames over a WebSocket,
+  and the server returns motion / speech / photo decisions. On the dog it drives
+  DimOS Unitree WebRTC control + LiDAR.
+- **Fetch is the behavior layer.** The vision-LLM decision loop, persona, approach/
+  trade/photo state machine, and voice all sit on top of DimOS primitives.
+- **Three camera sources, one loop:** phone browser camera (zero hardware), Record3D
+  USB RGBD (real iPhone LiDAR depth), and a live Unitree Go2 over Wi-Fi.
+
+## What We Built
+
+- A vision-LLM that evaluates each frame and emits a structured decision: target,
+  `cmd_vel`, spoken line, and photo-framing readiness.
+- The full interaction flow: scan for a relaxed guest → obstacle-aware approach →
+  wave + personalized one-liner → "grab a Coke, pose" → snap the instant photo + dance.
+- A single-page phone UI (camera feed, previews, decision display, audio routing,
+  photo flow) backed by a FastAPI + WebSocket server.
+- Runtime-switchable **voice**: one-way TTS across Cartesia / Gemini Live / OpenAI,
+  plus an opt-in **two-way Gemini Live** conversation that drives the dog through
+  tool calls (`accept_offer`, `take_photo`, `celebrate`, `do_trick`, `stop_and_reset`).
+- Safety/privacy guardrails: humor constrained to visible, non-sensitive context;
+  LiDAR/depth-enforced `<4m` stop and obstacle avoidance.
+
+## Reviewer Map
+
+| Question | Open this |
+| --- | --- |
+| What is the demo? | _Publishing to YouTube — link coming shortly._ |
+| Where is the full source? | https://github.com/seifip/robodog-fetch |
+| How do I run it (no hardware)? | https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself |
+| How does the decision loop work? | https://github.com/seifip/robodog-fetch#how-it-works-at-a-glance |
+| What's the DimOS integration? | https://github.com/seifip/robodog-fetch#built-on-dimos |
+| Where's the DimOS runtime? | https://github.com/dimensionalOS/dimos |
+
+## How to Run
+
+Zero-hardware path (phone or laptop browser camera), from the DimOS monorepo root:
+
+```bash
+python -m dimos.experimental.fetch.iphone_middleware --host 0.0.0.0 --port 8455
+```
+
+Open `https://127.0.0.1:8455/fetch` and tap **Record** to start the ~1-second scan
+loop. The full quickstart (Record3D USB and live Go2 paths, provider keys, and the
+voice modes) is in the project README.
+
+## Scope Boundary
+
+This PR is a hackathon **submission pointer**: the full source, demo video, and
+assets are hosted externally at https://github.com/seifip/robodog-fetch. It adds a
+single markdown file under `hackathon/` and **does not vendor Fetch into DimOS or
+modify any DimOS runtime code**. (Fetch is designed to live at
+`dimos/experimental/fetch/` in the monorepo; that vendoring is intentionally out of
+scope for this submission.)
+
+## Validation
+
+- `pytest -q` in the project repo is green — all provider calls are mocked, so there
+  are no real API calls (covers the 26 middleware tests plus the policy, conversation,
+  and TTS suites).
+- This submission adds markdown only; no DimOS code is touched.

From aed8d7ed72435fa664b8065ee63690aa54f5358b Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 00:27:21 +0800
Subject: [PATCH 2/8] Enrich Fetch submission README with technical detail

---
 hackathon/fetch/README.md | 63 +++++++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 3a37c8c387..9735b44826 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -7,7 +7,7 @@ Hackathon submission from **Team Pivot** — Philip Seifi ([@seifip](https://git
 ## If You Only Have 60 Seconds
 
 1. Watch the demo: _publishing to YouTube — link coming shortly._
-2. Read the project: https://github.com/seifip/robodog-fetch
+2. Full source: https://github.com/seifip/robodog-fetch
 3. Run it yourself (zero hardware — phone or laptop browser camera):
    https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself
 
@@ -28,28 +28,44 @@ camera frame (+depth)
 
 ## What Matters
 
-- **DimOS is the runtime.** Fetch reuses the DimOS teleop web pattern — a FastAPI
-  server serves an HTTPS phone UI, the phone streams camera frames over a WebSocket,
-  and the server returns motion / speech / photo decisions. On the dog it drives
-  DimOS Unitree WebRTC control + LiDAR.
+- **DimOS is the runtime.** Fetch reuses the DimOS teleop web pattern (HTTPS phone
+  UI, WebSocket camera frames, motion/speech/photo decisions) and drives a **real
+  Unitree Go2 over DimOS's WebRTC stack** — with selectable connection modes
+  (`auto` / `local_ap` / `local_sta`) so it reaches the dog on its local-AP network
+  at `192.168.12.1` as well as standard Wi-Fi.
 - **Fetch is the behavior layer.** The vision-LLM decision loop, persona, approach/
   trade/photo state machine, and voice all sit on top of DimOS primitives.
-- **Three camera sources, one loop:** phone browser camera (zero hardware), Record3D
-  USB RGBD (real iPhone LiDAR depth), and a live Unitree Go2 over Wi-Fi.
+- **Real-time by design.** A ~1-second scan loop and low-latency speech (Cartesia
+  Sonic by default) keep the interaction feeling live, not turn-based.
 
 ## What We Built
 
-- A vision-LLM that evaluates each frame and emits a structured decision: target,
-  `cmd_vel`, spoken line, and photo-framing readiness.
-- The full interaction flow: scan for a relaxed guest → obstacle-aware approach →
-  wave + personalized one-liner → "grab a Coke, pose" → snap the instant photo + dance.
-- A single-page phone UI (camera feed, previews, decision display, audio routing,
-  photo flow) backed by a FastAPI + WebSocket server.
+- A vision-LLM that turns each frame into a structured decision: target, `cmd_vel`,
+  spoken line, and photo-framing readiness.
+- The full interaction flow: scan for a relaxed guest → obstacle-aware approach
+  (turning to keep the subject in frame) → wave + personalized one-liner →
+  "grab a Coke, pose" → snap the instant photo + dance.
+- **Instant photo → the guest's phone.** Shots save locally and can mirror to an
+  iCloud or Google Drive folder (`FETCH_PHOTO_MIRROR_DIRS`) so the demo phone syncs
+  the picture seconds after it's taken.
+- A single-page phone UI (camera feed, previews, live decision display, audio
+  routing, photo flow) backed by FastAPI + WebSocket.
 - Runtime-switchable **voice**: one-way TTS across Cartesia / Gemini Live / OpenAI,
-  plus an opt-in **two-way Gemini Live** conversation that drives the dog through
-  tool calls (`accept_offer`, `take_photo`, `celebrate`, `do_trick`, `stop_and_reset`).
-- Safety/privacy guardrails: humor constrained to visible, non-sensitive context;
-  LiDAR/depth-enforced `<4m` stop and obstacle avoidance.
+  plus opt-in **two-way Gemini Live** conversation that drives the dog through tool
+  calls (`accept_offer`, `take_photo`, `celebrate`, `do_trick`, `stop_and_reset`).
+- Safety/privacy guardrails: humor limited to visible, non-sensitive context;
+  LiDAR/depth-enforced `<4 m` stop and obstacle avoidance.
+
+## Under the Hood (for the technically curious)
+
+| Piece | What it does |
+| --- | --- |
+| **Camera sources** | One loop, three inputs: phone browser camera (zero hardware), Record3D USB RGBD (real iPhone LiDAR depth), and a live Go2 over WebRTC. |
+| **Vision policy** | `FetchPolicy.analyze_frame()` sends image + prompt to the vision LLM and normalizes the JSON into a decision dict (default OpenAI `gpt-5-mini`; `--vision-provider gemini` for `gemini-3.5-flash`). |
+| **Go2 transport** | DimOS Unitree WebRTC; `--robot-connection-method auto\|local_ap\|local_sta` (default `local_ap`) + `--robot-ip` select how to reach the dog. |
+| **Voice** | Provider-switchable TTS at runtime (no restart) plus an optional persistent Gemini Live session with server-side VAD / barge-in. |
+| **Photos** | Capture to `static/captures/`, optionally mirrored to iCloud/Drive folders via `FETCH_PHOTO_MIRROR_DIRS`. |
+| **Tests** | **76 passing tests**, all providers mocked — no live API calls needed to review (policy, middleware routes, TTS, conversation tools, photo saving). |
 
 ## Reviewer Map
 
@@ -71,8 +87,9 @@ python -m dimos.experimental.fetch.iphone_middleware --host 0.0.0.0 --port 8455
 ```
 
 Open `https://127.0.0.1:8455/fetch` and tap **Record** to start the ~1-second scan
-loop. The full quickstart (Record3D USB and live Go2 paths, provider keys, and the
-voice modes) is in the project README.
+loop. To drive a real dog, add `--robot-ip 192.168.12.1 --robot-connection-method
+local_ap`. The full quickstart (Record3D USB, live Go2, provider keys, and voice
+modes) is in the project README.
 
 ## Scope Boundary
 
@@ -81,11 +98,11 @@ assets are hosted externally at https://github.com/seifip/robodog-fetch. It adds
 single markdown file under `hackathon/` and **does not vendor Fetch into DimOS or
 modify any DimOS runtime code**. (Fetch is designed to live at
 `dimos/experimental/fetch/` in the monorepo; that vendoring is intentionally out of
-scope for this submission.)
+scope for this pointer.)
 
 ## Validation
 
-- `pytest -q` in the project repo is green — all provider calls are mocked, so there
-  are no real API calls (covers the 26 middleware tests plus the policy, conversation,
-  and TTS suites).
+- `pytest -q` in the project repo: **76 passed**, providers mocked (no real API
+  calls) — covers policy normalization, middleware routes, TTS, conversation tools,
+  and photo saving.
 - This submission adds markdown only; no DimOS code is touched.

From 616be61d3c8fee30aba4eeb7393c05a4a7001dae Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 00:31:01 +0800
Subject: [PATCH 3/8] Note branded photos and Gemini 2.5 Flash-Lite demo config

---
 hackathon/fetch/README.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 9735b44826..912d6307cf 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -35,8 +35,9 @@ camera frame (+depth)
   at `192.168.12.1` as well as standard Wi-Fi.
 - **Fetch is the behavior layer.** The vision-LLM decision loop, persona, approach/
   trade/photo state machine, and voice all sit on top of DimOS primitives.
-- **Real-time by design.** A ~1-second scan loop and low-latency speech (Cartesia
-  Sonic by default) keep the interaction feeling live, not turn-based.
+- **Real-time by design.** A ~1-second scan loop, a low-latency vision model (the
+  live demo runs on **Gemini 2.5 Flash-Lite**), and fast speech (Cartesia Sonic by
+  default) keep the interaction feeling live, not turn-based.
 
 ## What We Built
 
@@ -45,9 +46,10 @@ camera frame (+depth)
 - The full interaction flow: scan for a relaxed guest → obstacle-aware approach
   (turning to keep the subject in frame) → wave + personalized one-liner →
   "grab a Coke, pose" → snap the instant photo + dance.
-- **Instant photo → the guest's phone.** Shots save locally and can mirror to an
-  iCloud or Google Drive folder (`FETCH_PHOTO_MIRROR_DIRS`) so the demo phone syncs
-  the picture seconds after it's taken.
+- **Instant, Fetch-branded photo → the guest's phone.** Each capture is composited
+  with the Fetch logo in a Polaroid-style branded photo view (with a print sound),
+  saved locally and optionally mirrored to an iCloud or Google Drive folder
+  (`FETCH_PHOTO_MIRROR_DIRS`) so the demo phone syncs it seconds after it's taken.
 - A single-page phone UI (camera feed, previews, live decision display, audio
   routing, photo flow) backed by FastAPI + WebSocket.
 - Runtime-switchable **voice**: one-way TTS across Cartesia / Gemini Live / OpenAI,
@@ -61,10 +63,10 @@ camera frame (+depth)
 | Piece | What it does |
 | --- | --- |
 | **Camera sources** | One loop, three inputs: phone browser camera (zero hardware), Record3D USB RGBD (real iPhone LiDAR depth), and a live Go2 over WebRTC. |
-| **Vision policy** | `FetchPolicy.analyze_frame()` sends image + prompt to the vision LLM and normalizes the JSON into a decision dict (default OpenAI `gpt-5-mini`; `--vision-provider gemini` for `gemini-3.5-flash`). |
+| **Vision policy** | `FetchPolicy.analyze_frame()` sends image + prompt to a provider-selectable vision LLM (OpenAI or Gemini) and normalizes the JSON into a decision dict; the live demo runs on **Gemini 2.5 Flash-Lite** for the lowest latency. |
 | **Go2 transport** | DimOS Unitree WebRTC; `--robot-connection-method auto\|local_ap\|local_sta` (default `local_ap`) + `--robot-ip` select how to reach the dog. |
 | **Voice** | Provider-switchable TTS at runtime (no restart) plus an optional persistent Gemini Live session with server-side VAD / barge-in. |
-| **Photos** | Capture to `static/captures/`, optionally mirrored to iCloud/Drive folders via `FETCH_PHOTO_MIRROR_DIRS`. |
+| **Photos** | Fetch-branded capture (logo composited via ``) to `static/captures/`, optionally mirrored to iCloud/Drive folders via `FETCH_PHOTO_MIRROR_DIRS`. |
 | **Tests** | **76 passing tests**, all providers mocked — no live API calls needed to review (policy, middleware routes, TTS, conversation tools, photo saving). |
 
 ## Reviewer Map

From d6b078c3f18ca34b23d19d2d7eb388b483ef9aac Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 00:48:15 +0800
Subject: [PATCH 4/8] Add published YouTube demo link

---
 hackathon/fetch/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 912d6307cf..52cc11e404 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -6,7 +6,7 @@ Hackathon submission from **Team Pivot** — Philip Seifi ([@seifip](https://git
 
 ## If You Only Have 60 Seconds
 
-1. Watch the demo: _publishing to YouTube — link coming shortly._
+1. Watch the demo: https://www.youtube.com/watch?v=8hHYE1239wg
 2. Full source: https://github.com/seifip/robodog-fetch
 3. Run it yourself (zero hardware — phone or laptop browser camera):
    https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself
@@ -73,7 +73,7 @@ camera frame (+depth)
 
 | Question | Open this |
 | --- | --- |
-| What is the demo? | _Publishing to YouTube — link coming shortly._ |
+| What is the demo? | https://www.youtube.com/watch?v=8hHYE1239wg |
 | Where is the full source? | https://github.com/seifip/robodog-fetch |
 | How do I run it (no hardware)? | https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself |
 | How does the decision loop work? | https://github.com/seifip/robodog-fetch#how-it-works-at-a-glance |

From 16f94b8a36ead0c19b252d1f73df2eb537c526a2 Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 00:49:21 +0800
Subject: [PATCH 5/8] Reframe reviewer path as 90 seconds

---
 hackathon/fetch/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 52cc11e404..00fa90d5d0 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -4,7 +4,7 @@ Hackathon submission from **Team Pivot** — Philip Seifi ([@seifip](https://git
 
 **A Unitree Go2 robot dog that trades ice-cold Cokes for instant photos.**
 
-## If You Only Have 60 Seconds
+## If You Only Have 90 Seconds
 
 1. Watch the demo: https://www.youtube.com/watch?v=8hHYE1239wg
 2. Full source: https://github.com/seifip/robodog-fetch

From b5fce1b1e6221008af7c593b3d569dc304d091d6 Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 01:43:25 +0800
Subject: [PATCH 6/8] Add why-a-beach, UX latency, and roadmap/business sections

---
 hackathon/fetch/README.md | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 00fa90d5d0..0e530ce3a9 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -35,9 +35,21 @@ camera frame (+depth)
   at `192.168.12.1` as well as standard Wi-Fi.
 - **Fetch is the behavior layer.** The vision-LLM decision loop, persona, approach/
   trade/photo state machine, and voice all sit on top of DimOS primitives.
-- **Real-time by design.** A ~1-second scan loop, a low-latency vision model (the
-  live demo runs on **Gemini 2.5 Flash-Lite**), and fast speech (Cartesia Sonic by
-  default) keep the interaction feeling live, not turn-based.
+- **Real-time by design.** We benchmarked round-trip latency across vision and speech
+  models (`scripts/latency_bench.py`) and run the fastest combo — **Gemini 2.5
+  Flash-Lite** vision + **Cartesia Sonic** speech; camera frames are downscaled
+  (≤640 px) before analysis, and the whole scan loop lands around one second.
+
+## Why a beach?
+
+Quadrupeds earn their keep on terrain wheels can't handle, so we built Fetch around
+that. We chose **sand** for a form-factor reason: the Go2's camera sits low and looks
+*up* at standing people, but on a beach people sit or lie on the sand — dropping into
+the dog's natural eye-line and making the interaction feel natural. And it's feasible
+today: quadrupeds already run on sand
+([RaiBo](https://techxplore.com/news/2023-01-raibo-versatile-robo-dog-sandy-beach.html)
+at 3 m/s) and [sand-walking foot
+adaptations](https://www.popsci.com/technology/robot-moose/) cut foot sinkage ~46%.
 
 ## What We Built
 
@@ -93,6 +105,17 @@ loop. To drive a real dog, add `--robot-ip 192.168.12.1 --robot-connection-metho
 local_ap`. The full quickstart (Record3D USB, live Go2, provider keys, and voice
 modes) is in the project README.
 
+## What's Next
+
+- **Sense the trade.** The Go2 EDU's [foot-force sensors](https://www.unitree.com/go2/foot/)
+  could detect a Coke lifted from the back via the change in total load — closing the
+  loop without the camera's framing check.
+- **Real sand.** Fit sand-walking foot adaptations for an outdoor beach deployment.
+
+**The bigger picture:** Fetch is an autonomous **brand ambassador** and **mobile
+vendor** (Coca-Cola here), pointing toward fleets of autonomous robot-dog vendors that
+self-resupply at beachside bars/vendors or autonomous resupply stations.
+
 ## Scope Boundary
 
 This PR is a hackathon **submission pointer**: the full source, demo video, and

From b39f7d4e2d53bd84e9ffe99b6705b9a3e0c8797f Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 01:52:09 +0800
Subject: [PATCH 7/8] Promote business framing to early 'The opportunity' section

---
 hackathon/fetch/README.md | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 0e530ce3a9..856fadab4b 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -26,6 +26,14 @@ camera frame (+depth)
   ^------------------ ~1s scan loop ------------------v
 ```
 
+## The opportunity
+
+Fetch is an autonomous **brand ambassador** and **mobile vendor** — here for Coca-Cola
+— that hands out product, creates a memorable branded moment, and walks away with the
+guest's photo. The longer-term vision: fleets of autonomous robot-dog vendors that roam
+the beach and **self-resupply** at beachside bars and vendors, or at dedicated
+autonomous resupply stations.
+
 ## What Matters
 
 - **DimOS is the runtime.** Fetch reuses the DimOS teleop web pattern (HTTPS phone
@@ -112,10 +120,6 @@ modes) is in the project README.
   loop without the camera's framing check.
 - **Real sand.** Fit sand-walking foot adaptations for an outdoor beach deployment.
 
-**The bigger picture:** Fetch is an autonomous **brand ambassador** and **mobile
-vendor** (Coca-Cola here), pointing toward fleets of autonomous robot-dog vendors that
-self-resupply at beachside bars/vendors or autonomous resupply stations.
-
 ## Scope Boundary
 
 This PR is a hackathon **submission pointer**: the full source, demo video, and

From a8f545b9d54de4f043f894b95757bf4f6af136d4 Mon Sep 17 00:00:00 2001
From: "Wenjie F."
Date: Fri, 29 May 2026 10:32:06 +0800
Subject: [PATCH 8/8] Document camera/LiDAR capture, iCloud/Drive sync, and Xiaomi-printer print pipeline

---
 hackathon/fetch/README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/hackathon/fetch/README.md b/hackathon/fetch/README.md
index 856fadab4b..7e8ffa9930 100644
--- a/hackathon/fetch/README.md
+++ b/hackathon/fetch/README.md
@@ -66,10 +66,11 @@ adaptations](https://www.popsci.com/technology/robot-moose/) cut foot sinkage ~4
 - The full interaction flow: scan for a relaxed guest → obstacle-aware approach
   (turning to keep the subject in frame) → wave + personalized one-liner →
   "grab a Coke, pose" → snap the instant photo + dance.
-- **Instant, Fetch-branded photo → the guest's phone.** Each capture is composited
-  with the Fetch logo in a Polaroid-style branded photo view (with a print sound),
-  saved locally and optionally mirrored to an iCloud or Google Drive folder
-  (`FETCH_PHOTO_MIRROR_DIRS`) so the demo phone syncs it seconds after it's taken.
+- **Instant photo → the guest's hands.** Captured from the Go2's camera + LiDAR and
+  composited with the Fetch logo (Polaroid-style branded view + print sound), the shot —
+  plus our demo recordings — syncs to iCloud / Google Drive via mirror folders
+  (`FETCH_PHOTO_MIRROR_DIRS`); at the event, a synced phone sends it to a **Xiaomi
+  mini-printer** through the printer's app for an instant physical print.
 - A single-page phone UI (camera feed, previews, live decision display, audio
   routing, photo flow) backed by FastAPI + WebSocket.
 - Runtime-switchable **voice**: one-way TTS across Cartesia / Gemini Live / OpenAI,
@@ -86,7 +87,7 @@ adaptations](https://www.popsci.com/technology/robot-moose/) cut foot sinkage ~4
 | **Vision policy** | `FetchPolicy.analyze_frame()` sends image + prompt to a provider-selectable vision LLM (OpenAI or Gemini) and normalizes the JSON into a decision dict; the live demo runs on **Gemini 2.5 Flash-Lite** for the lowest latency. |
 | **Go2 transport** | DimOS Unitree WebRTC; `--robot-connection-method auto\|local_ap\|local_sta` (default `local_ap`) + `--robot-ip` select how to reach the dog. |
 | **Voice** | Provider-switchable TTS at runtime (no restart) plus an optional persistent Gemini Live session with server-side VAD / barge-in. |
-| **Photos** | Fetch-branded capture (logo composited via ``) to `static/captures/`, optionally mirrored to iCloud/Drive folders via `FETCH_PHOTO_MIRROR_DIRS`. |
+| **Photos** | Fetch-branded capture (logo composited via ``) to `static/captures/`, mirrored to iCloud/Drive folders via `FETCH_PHOTO_MIRROR_DIRS`; a synced phone then prints it on a Xiaomi mini-printer via the printer's app. |
 | **Tests** | **76 passing tests**, all providers mocked — no live API calls needed to review (policy, middleware routes, TTS, conversation tools, photo saving). |
 
 ## Reviewer Map
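
Reviewer note (illustrative only, not part of the patch): the README says `FetchPolicy.analyze_frame()` "normalizes the JSON into a decision dict" with the fields `state`, `cmd_vel`, `line`, and `photo`. The sketch below shows one way such a normalization step could look. Everything in it is an assumption made for illustration: the `normalize_decision` helper, the `linear_x` / `angular_z` keys, the state names, and the speed limits are hypothetical and are not taken from the Fetch source.

```python
import json
from typing import Any

# Hypothetical limits and state names, chosen for illustration only.
MAX_LINEAR = 0.5   # m/s
MAX_ANGULAR = 1.0  # rad/s
KNOWN_STATES = {"scan", "approach", "trade", "photo"}


def _safe_stop() -> dict[str, Any]:
    """Return a fresh 'do nothing' decision used whenever the reply is unusable."""
    return {
        "state": "scan",
        "cmd_vel": {"linear_x": 0.0, "angular_z": 0.0},
        "line": "",
        "photo": False,
    }


def _clamp(value: Any, limit: float) -> float:
    """Coerce value to float and clamp it to [-limit, limit]; bad input becomes 0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0.0
    if number != number:  # NaN never equals itself
        return 0.0
    return max(-limit, min(limit, number))


def normalize_decision(raw: Any) -> dict[str, Any]:
    """Turn a raw model reply into a decision dict with known keys and safe values."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return _safe_stop()
    if not isinstance(data, dict):
        return _safe_stop()

    vel = data.get("cmd_vel")
    if not isinstance(vel, dict):
        vel = {}
    state = data.get("state")
    line = data.get("line")
    return {
        "state": state if isinstance(state, str) and state in KNOWN_STATES else "scan",
        "cmd_vel": {
            "linear_x": _clamp(vel.get("linear_x"), MAX_LINEAR),
            "angular_z": _clamp(vel.get("angular_z"), MAX_ANGULAR),
        },
        "line": line.strip() if isinstance(line, str) else "",
        # Only a literal JSON true triggers a photo; strings like "yes" do not.
        "photo": data.get("photo") is True,
    }
```

The design choice worth noting is the fallback: a malformed or unexpected reply collapses to a zero-velocity decision rather than raising, which fits a scan loop that re-evaluates roughly every second.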