[MuShanghai DimOS Hackathon 2026] Goldie — Team Perception #2294

Open
mkuwdev wants to merge 22 commits into dimensionalOS:main from CeciliaZ030:main

mkuwdev commented May 28, 2026

Hackathon submission for the MuShanghai DimOS Hackathon 2026 by Team Perception
(Joy Munn, Yichu Lau, Cecilia Zhang, Brecht Davos, Figo Saleh).

We built Goldie, a voice-controlled guide-dog interface for the Unitree Go2, designed for blind and low-vision users. Speak a destination, and the dog confirms it out loud, then navigates you there, narrating along the way.

What's included

  • webapp/ — iPhone PWA with hold-to-speak voice control, OpenAI TTS replies, manual joystick teleop, and a full mock for development without a robot
  • Typed agent message envelopes — so the phone knows what to speak vs show
  • Direct move skill with stall recovery — enabling the Go2 to climb stairs
  • macOS support fixes for the full DimOS stack

Links

CeciliaZ030 and others added 22 commits May 27, 2026 01:47
Drops the working Goldie webapp into /webapp verbatim (Next 16 Pages Router,
Tailwind v4, on-device STT + TTS, voice/manual modes, joystick, PWA) so it
keeps working as-is. The previous App-Router stub is preserved in
webapp/SCAFFOLD-REFERENCE.md together with the monorepo backend's
agent_state/token contract for later merging. Adds webapp/AGENTS.md briefing.
Also drops the build artifacts (.next) the scaffold had committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Browser SpeechSynthesis silently drops async speak() on iOS (agent
replies arrive over SSE, outside a user gesture), so responses were
shown but never heard. Replace it with OpenAI gpt-4o-mini-tts behind a
server-side /api/tts route, played through a single <audio> element
unlocked by one in-gesture tap, which iOS allows for async playback.
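
A minimal sketch of the unlock-then-reuse pattern this commit describes, assuming a single shared audio element. `AudioLike`, `UnlockedPlayer`, and `fetchTts` are hypothetical names chosen for illustration, and the request shape of `/api/tts` is not specified here, so the fetch step is passed in as a callback.

```typescript
// Sketch only: AudioLike, UnlockedPlayer, and fetchTts are hypothetical names,
// not identifiers from the Goldie webapp.

// The subset of HTMLAudioElement this pattern needs.
interface AudioLike {
  src: string;
  play(): Promise<void>;
}

class UnlockedPlayer {
  private audio: AudioLike;
  private unlocked = false;

  constructor(audio: AudioLike) {
    this.audio = audio;
  }

  // Call synchronously from a tap/click handler so play() runs inside the gesture.
  unlock(silentSrc: string): void {
    this.audio.src = silentSrc;
    this.audio.play().then(
      () => { this.unlocked = true; },
      () => { /* still locked; a later tap can retry */ },
    );
  }

  // Can be called later from an async callback (for example an SSE handler)
  // because it reuses the element that was unlocked by the tap.
  // fetchTts resolves to a playable URL, such as an object URL built from the
  // body of the /api/tts response.
  async speak(
    text: string,
    fetchTts: (text: string) => Promise<string>,
  ): Promise<boolean> {
    if (!this.unlocked) return false; // playback would likely be blocked
    this.audio.src = await fetchTts(text);
    await this.audio.play();
    return true;
  }
}
```

In a browser, `new UnlockedPlayer(new Audio())` should satisfy the interface; the small `AudioLike` type only exists so the logic can be exercised without a DOM.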

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/api/tts imports the openai SDK; un-ignore webapp/package.json so the
dependency is reproducible from a clean clone / deploy build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…quick actions

Speak only the agent's final replies:
- classify `{kind:"ai"|"tool"|"system"}` envelopes and TTS only `kind==="ai"`;
  tool/system stay on screen but silent. Mock now emits the same envelopes for
  dev/prod parity.

TTS voice tuning:
- default to a female voice (`shimmer`); steer pace via gpt-4o-mini-tts
  `instructions` (numeric `speed` isn't honored by this model). Documented in
  `.env.example`.
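
A sketch of how the voice settings above could be assembled into a request, assuming the route reads overrides from environment variables. `TTS_VOICE`, `TTS_INSTRUCTIONS`, `buildSpeechRequest`, and the default instruction text are hypothetical; the actual names documented in `.env.example` are not shown here.

```typescript
// Sketch only: the env variable names and the default instruction text are
// assumptions, not values taken from the Goldie repository.

interface SpeechRequest {
  model: string;
  voice: string;
  input: string;
  instructions: string;
}

function buildSpeechRequest(
  text: string,
  env: Record<string, string | undefined> = {},
): SpeechRequest {
  return {
    model: "gpt-4o-mini-tts",
    // Default voice named in the commit message.
    voice: env.TTS_VOICE ?? "shimmer",
    input: text,
    // Pace is steered in prose because the commit notes that the numeric
    // `speed` parameter is not honored by this model.
    instructions:
      env.TTS_INSTRUCTIONS ?? "Speak calmly and clearly at a relaxed pace.",
  };
}

// With the official openai SDK, the resulting object would be passed to
// `client.audio.speech.create(...)` inside the server-side route.
```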

Barge-in UX:
- while the user is holding to speak, suppress incoming TTS so the previous
  turn's reply doesn't talk over them.
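
The speak-or-stay-silent decision, combining the `kind` filter with the barge-in rule, might look like the sketch below. `SpeechGate` is a hypothetical name, and the sketch assumes a suppressed reply is dropped rather than queued for later, which the commit message does not specify.

```typescript
// Sketch only: SpeechGate is a hypothetical helper, not code from the webapp.

type AgentKind = "ai" | "tool" | "system";

interface AgentEnvelope {
  kind: AgentKind;
  text: string;
}

class SpeechGate {
  private holding = false;

  // Call with true on press and false on release of the hold-to-speak button.
  setHolding(holding: boolean): void {
    this.holding = holding;
  }

  // Only final agent replies are spoken, and never while the user is talking.
  shouldSpeak(message: AgentEnvelope): boolean {
    return message.kind === "ai" && !this.holding;
  }
}
```

Messages that fail the gate would still be rendered on screen; only the audio is skipped.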

Network robustness (`dimos.ts`):
- `fetchWithRetry` (8s timeout + retries) on submit_query / unitree command /
  interrupt; timeout-only on status polling. Commands now throw on failure,
  and the UI shows "Couldn't reach the robot — try again" instead of silently
  swallowing the loss.
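
A hedged sketch of a helper along these lines, not the actual contents of `dimos.ts`. Only the 8 s timeout and the idea of retrying come from the commit message; the retry count, the option names, the injectable `fetchImpl`, and the choice to treat non-2xx responses as failures are assumptions.

```typescript
// Sketch only: option names, the default retry count, and FetchLike are
// assumptions made for illustration.

type FetchLike = (
  url: string,
  init?: { method?: string; body?: string; signal?: AbortSignal },
) => Promise<{ ok: boolean; status: number }>;

interface RetryOptions {
  timeoutMs?: number;   // per-attempt timeout
  retries?: number;     // extra attempts after the first one
  fetchImpl?: FetchLike; // injectable for tests; defaults to global fetch
}

async function fetchWithRetry(
  url: string,
  init: { method?: string; body?: string } = {},
  options: RetryOptions = {},
): Promise<{ ok: boolean; status: number }> {
  const timeoutMs = options.timeoutMs ?? 8000;
  const retries = options.retries ?? 2;
  const fetchImpl = options.fetchImpl ?? (fetch as FetchLike);

  let lastError: unknown = undefined;
  for (let attempt = 0; attempt <= retries; attempt++) {
    // A fresh controller per attempt so one timeout cannot cancel the next try.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetchImpl(url, { ...init, signal: controller.signal });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res;
    } catch (err) {
      lastError = err; // remember the failure and fall through to the next attempt
    } finally {
      clearTimeout(timer);
    }
  }
  // Throwing (instead of returning undefined) lets the UI show an error message.
  throw new Error(`Request failed after ${retries + 1} attempts: ${String(lastError)}`);
}
```

Passing `retries: 0` gives the timeout-only behaviour described for status polling. Retrying a command that may already have reached the robot can run it twice, so a real implementation may want to retry only on timeouts or network errors.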

STT hardening:
- tear down the previous SpeechRecognition session before starting a new one
  (no overlapping mic sessions / stale callbacks); `onend` ignores superseded
  sessions; start/stop wrapped against iOS state-throwing.
- map real error codes to user-facing messages (`no-speech` / `network` /
  `aborted` no longer mislabeled "check mic permission"); `aborted` stays
  silent.
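
The two hardening ideas above, ignoring callbacks from superseded sessions and mapping error codes to messages, can be sketched as follows. `RecognizerLike`, `SttController`, and the message strings are hypothetical; the error codes are the standard Web Speech API values. The generation counter is one way to detect a stale session and may differ from what the webapp actually does.

```typescript
// Sketch only: names and message strings are assumptions for illustration.

// The subset of the browser's SpeechRecognition object that this sketch uses.
interface RecognizerLike {
  onend: (() => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
  abort(): void;
}

// Returns the text to show the user, or null when the error stays silent.
function sttErrorMessage(code: string): string | null {
  switch (code) {
    case "aborted": return null;
    case "no-speech": return "I didn't hear anything. Try again.";
    case "network": return "Speech recognition lost its network connection.";
    case "not-allowed":
    case "service-not-allowed": return "Check microphone permission.";
    default: return "Speech recognition failed. Try again.";
  }
}

class SttController {
  private current: RecognizerLike | null = null;
  private generation = 0;
  private create: () => RecognizerLike;
  private notify: (message: string) => void;
  private onIdle: () => void;

  constructor(
    create: () => RecognizerLike,
    notify: (message: string) => void,
    onIdle: () => void,
  ) {
    this.create = create;
    this.notify = notify;
    this.onIdle = onIdle;
  }

  start(): void {
    this.teardown(); // never leave two mic sessions running
    const generation = ++this.generation;
    const rec = this.create();
    rec.onend = () => {
      if (generation === this.generation) this.onIdle(); // skip stale sessions
    };
    rec.onerror = (event) => {
      if (generation !== this.generation) return;
      const message = sttErrorMessage(event.error);
      if (message !== null) this.notify(message);
    };
    this.current = rec;
    try { rec.start(); } catch { /* start() can throw in an invalid state */ }
  }

  private teardown(): void {
    const old = this.current;
    if (old === null) return;
    this.current = null;
    old.onend = null;
    old.onerror = null;
    try { old.abort(); } catch { /* abort() can throw in an invalid state */ }
  }
}
```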

Quick actions:
- buttons now Sit / Jump / Stand, sent through `/submit_query` as
  natural-language commands ("sit"/"jump"/"stand up") so the agent narrates
  them like voice turns.

Dev diagnostics:
- per-frame `📩 agent-msg [kind=..]` log and `🔊 tts(SPOKEN|muted, kind=..)`
  log printed to the dev terminal; `🎤⚠️ stt-error: <code>` for STT errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
webapp: replace scaffold with Goldie (voice + manual control app)
WebInput previously created an empty `agent_responses` text stream and never
fed it. Subscribe to the agent's LCM "/agent" channel and forward each
`BaseMessage` to `agent_responses` as a typed JSON envelope:

    {"kind": "ai" | "tool" | "system", "text": "<content>"}

`human` messages (the echoed user input) are skipped. The webapp reads this
envelope and speaks only `kind == "ai"` (tool/system are shown as status).
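
On the receiving side, a defensive parser for this envelope could look like the sketch below. `parseAgentEnvelope` and `isAgentKind` are hypothetical helpers; only the `kind` and `text` fields and the three `kind` values come from the commit message.

```typescript
// Sketch only: helper names are assumptions; the envelope shape is the one
// quoted in the commit message.

type AgentKind = "ai" | "tool" | "system";

interface AgentEnvelope {
  kind: AgentKind;
  text: string;
}

function isAgentKind(value: unknown): value is AgentKind {
  return value === "ai" || value === "tool" || value === "system";
}

// Returns the envelope, or null for malformed frames so that bad input is
// never spoken or displayed.
function parseAgentEnvelope(raw: string): AgentEnvelope | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { kind, text } = data as { kind?: unknown; text?: unknown };
  if (!isAgentKind(kind) || typeof text !== "string") return null;
  return { kind, text };
}
```

A caller would then speak the text when `kind` is `"ai"` and show the other kinds as status only.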

Also: STT is now a hard dependency (the try/except around WhisperNode is
gone), and audio_subject is no longer optional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backend: forward LLM agent replies on agent_responses as typed envelopes
- Replace upstream DimOS README with Goldie hackathon submission doc
  covering story, what we built, architecture, stair-climbing achievement,
  challenges, and quick start
- Add docs/goldie-architecture.png
- Add docs/screenshots/ (splash, voice, manual mode)
- Add webapp/TECHFLOW.md (full end-to-end channel trace)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs: update README with team name, video link, numbered challenges, …
4 participants