dimensionalOS · Wenjix · May 28, 2026 · May 28, 2026 · May 28, 2026 · May 28, 2026
@@ -0,0 +1,138 @@
+# Fetch
+
+Hackathon submission from **Team Pivot** — Philip Seifi ([@seifip](https://github.com/seifip)), Wenjie Fu ([@Wenjix](https://github.com/Wenjix)), and GuoZi ([@GuoZhuoRan](https://github.com/GuoZhuoRan)).
+
+**A Unitree Go2 robot dog that trades ice-cold Cokes for instant photos.**
+
+## If You Only Have 90 Seconds
+
+1. Watch the demo: https://www.youtube.com/watch?v=8hHYE1239wg
+2. Full source: https://github.com/seifip/robodog-fetch
+3. Run it yourself (zero hardware — phone or laptop browser camera):
+   https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself
+
+## One Sentence
+
+Fetch works a crowd like a tiny, soda-carrying street performer: a vision LLM
+"reads the room" on every camera frame and decides where to move, what to say, and
+when to snap the photo — all running as a single FastAPI + WebSocket server you can
+try from a phone browser **before any robot is involved**.
+
+```text
+camera frame (+depth)
+  -> vision LLM (OpenAI / Gemini)
+  -> decision (state, cmd_vel, line, photo?)
+  -> act (move Go2, speak, snap photo)
+  ^------------------ ~1s scan loop ------------------v
+```
+
+## The opportunity
+
+Fetch is an autonomous **brand ambassador** and **mobile vendor** — here for Coca-Cola
+— that hands out product, creates a memorable branded moment, and walks away with the
+guest's photo. The longer-term vision: fleets of autonomous robot-dog vendors that roam
+the beach and **self-resupply** at beachside bars and vendors, or at dedicated
+autonomous resupply stations.
+
+## What Matters
+
+- **DimOS is the runtime.** Fetch reuses the DimOS teleop web pattern (HTTPS phone
+  UI, WebSocket camera frames, motion/speech/photo decisions) and drives a **real
+  Unitree Go2 over DimOS's WebRTC stack** — with selectable connection modes
+  (`auto` / `local_ap` / `local_sta`) so it reaches the dog on its local-AP network
+  at `192.168.12.1` as well as standard Wi-Fi.
+- **Fetch is the behavior layer.** The vision-LLM decision loop, persona, approach/
+  trade/photo state machine, and voice all sit on top of DimOS primitives.
+- **Real-time by design.** We benchmarked round-trip latency across vision and speech
+  models (`scripts/latency_bench.py`) and run the fastest combo — **Gemini 2.5
+  Flash-Lite** vision + **Cartesia Sonic** speech; camera frames are downscaled
+  (≤640 px) before analysis, and the whole scan loop lands around one second.
+
+## Why a beach?
+
+Quadrupeds earn their keep on terrain wheels can't handle, so we built Fetch around
+that. We chose **sand** for a form-factor reason: the Go2's camera sits low and looks
+*up* at standing people, but on a beach people sit or lie on the sand — dropping into
+the dog's natural eye-line and making the interaction feel natural. And it's feasible
+today: quadrupeds already run on sand
+([RaiBo](https://techxplore.com/news/2023-01-raibo-versatile-robo-dog-sandy-beach.html)
+at 3 m/s) and [sand-walking foot
+adaptations](https://www.popsci.com/technology/robot-moose/) cut foot sinkage ~46%.
+
+## What We Built
+
+- A vision-LLM that turns each frame into a structured decision: target, `cmd_vel`,
+  spoken line, and photo-framing readiness.
+- The full interaction flow: scan for a relaxed guest → obstacle-aware approach
+  (turning to keep the subject in frame) → wave + personalized one-liner →
+  "grab a Coke, pose" → snap the instant photo + dance.
+- **Instant photo → the guest's hands.** Captured from the Go2's camera + LiDAR and
+  composited with the Fetch logo (Polaroid-style branded view + print sound), the shot —
+  plus our demo recordings — syncs to iCloud / Google Drive via mirror folders
+  (`FETCH_PHOTO_MIRROR_DIRS`); at the event, a synced phone sends it to a **Xiaomi
+  mini-printer** through the printer's app for an instant physical print.
+- A single-page phone UI (camera feed, previews, live decision display, audio
+  routing, photo flow) backed by FastAPI + WebSocket.
+- Runtime-switchable **voice**: one-way TTS across Cartesia / Gemini Live / OpenAI,
+  plus opt-in **two-way Gemini Live** conversation that drives the dog through tool
+  calls (`accept_offer`, `take_photo`, `celebrate`, `do_trick`, `stop_and_reset`).
+- Safety/privacy guardrails: humor limited to visible, non-sensitive context;
+  LiDAR/depth-enforced `<4 m` stop and obstacle avoidance.
+
+## Under the Hood (for the technically curious)
+
+| Piece | What it does |
+| --- | --- |
+| **Camera sources** | One loop, three inputs: phone browser camera (zero hardware), Record3D USB RGBD (real iPhone LiDAR depth), and a live Go2 over WebRTC. |
+| **Vision policy** | `FetchPolicy.analyze_frame()` sends image + prompt to a provider-selectable vision LLM (OpenAI or Gemini) and normalizes the JSON into a decision dict; the live demo runs on **Gemini 2.5 Flash-Lite** for the lowest latency. |
+| **Go2 transport** | DimOS Unitree WebRTC; `--robot-connection-method auto\|local_ap\|local_sta` (default `local_ap`) + `--robot-ip` select how to reach the dog. |
+| **Voice** | Provider-switchable TTS at runtime (no restart) plus an optional persistent Gemini Live session with server-side VAD / barge-in. |
+| **Photos** | Fetch-branded capture (logo composited via `<canvas>`) to `static/captures/`, mirrored to iCloud/Drive folders via `FETCH_PHOTO_MIRROR_DIRS`; a synced phone then prints it on a Xiaomi mini-printer via the printer's app. |
+| **Tests** | **76 passing tests**, all providers mocked — no live API calls needed to review (policy, middleware routes, TTS, conversation tools, photo saving). |
+
+## Reviewer Map
+
+| Question | Open this |
+| --- | --- |
+| What is the demo? | https://www.youtube.com/watch?v=8hHYE1239wg |
+| Where is the full source? | https://github.com/seifip/robodog-fetch |
+| How do I run it (no hardware)? | https://github.com/seifip/robodog-fetch#quickstart-run-it-yourself |
+| How does the decision loop work? | https://github.com/seifip/robodog-fetch#how-it-works-at-a-glance |
+| What's the DimOS integration? | https://github.com/seifip/robodog-fetch#built-on-dimos |
+| Where's the DimOS runtime? | https://github.com/dimensionalOS/dimos |
+
+## How to Run
+
+Zero-hardware path (phone or laptop browser camera), from the DimOS monorepo root:
+
+```bash
+python -m dimos.experimental.fetch.iphone_middleware --host 0.0.0.0 --port 8455
+```
+
+Open `https://127.0.0.1:8455/fetch` and tap **Record** to start the ~1-second scan
+loop. To drive a real dog, add `--robot-ip 192.168.12.1 --robot-connection-method
+local_ap`. The full quickstart (Record3D USB, live Go2, provider keys, and voice
+modes) is in the project README.
+
+## What's Next
+
+- **Sense the trade.** The Go2 EDU's [foot-force sensors](https://www.unitree.com/go2/foot/)
+  could detect a Coke lifted from the back via the change in total load — closing the
+  loop without the camera's framing check.
+- **Real sand.** Fit sand-walking foot adaptations for an outdoor beach deployment.
+
+## Scope Boundary
+
+This PR is a hackathon **submission pointer**: the full source, demo video, and
+assets are hosted externally at https://github.com/seifip/robodog-fetch. It adds a
+single markdown file under `hackathon/` and **does not vendor Fetch into DimOS or
+modify any DimOS runtime code**. (Fetch is designed to live at
+`dimos/experimental/fetch/` in the monorepo; that vendoring is intentionally out of
+scope for this pointer.)
+
+## Validation
+
+- `pytest -q` in the project repo: **76 passed**, providers mocked (no real API
+  calls) — covers policy normalization, middleware routes, TTS, conversation tools,
+  and photo saving.
+- This submission adds markdown only; no DimOS code is touched.