iphone-use

Computer-use, but for the iPhone — let AI agents (and your browser) see and drive a real phone.

License: MIT · Platform: macOS 15+ · Built with Rust · Streaming: WebRTC / H.264

Controlling an iPhone from a browser — live screen plus a touch toolbar (Home, Spotlight, App Switcher, keyboard)

Remote-control your iPhone from any web browser — over macOS iPhone Mirroring, with low-latency WebRTC video and near-native touch. A Rust daemon captures the Mirroring window with ScreenCaptureKit, hardware-encodes it to H.264 with VideoToolbox, and streams it to iPhone Safari (or any browser) over WebRTC — injecting taps, swipes, scrolls, and text back as continuous system events. AI agents, scripts, and bots can drive the same phone through a simple HTTP API.

Think Chrome Remote Desktop, but for your iPhone — running entirely on your own Mac, no third-party cloud.

Features

  • 📱 Control an iPhone from a browser — live screen with tap / swipe / scroll / type, on iPhone Safari or any desktop browser.
  • Low latency — hardware H.264 (VideoToolbox) over WebRTC, not screenshot polling.
  • 🤚 Near-native touch — real scroll-wheel scrolling, keycode text input, Home / Spotlight / App-Switcher shortcuts.
  • 🤖 Agent-ready — an HTTP API (/agent/input, /agent/screenshot) lets AI agents and scripts see and drive the phone.
  • 🌐 LAN or remote — same Wi-Fi over your local network, or from anywhere via a Cloudflare tunnel + TURN.
  • 🔒 Self-hosted & authenticated — password login; runs on your own machine, your screen never leaves your control.

v2 — a full WebRTC + hardware-codec + continuous-input rebuild of the original v1 screenshot-polling server. The input + video vertical (video, tap, scroll, text, shortcuts, LAN WebRTC) is validated on real hardware.

Architecture

A Rust daemon captures the macOS iPhone Mirroring window with ScreenCaptureKit, hardware-encodes it to H.264 with VideoToolbox, and streams it over WebRTC (webrtc-rs, axum for HTTP/WS signaling). The same capture/input core serves two front-ends: a human client (iPhone Safari — live video + continuous touch) and an agent client (an HTTP control API; see Agent API). Touch is injected back as continuous CGEvents through the system HID event tap. STUN handles most NAT; optional Cloudflare TURN relays the rest.

Key input findings baked into the daemon (all hardware-validated):

  • Scroll is a wheel event. iPhone Mirroring reads a mouse-drag as a long-press / icon-reorder and never scrolls — a finger swipe must map to CGEvent scroll-wheel.
  • Text is keycodes, not Unicode. Mirroring forwards virtual keycodes (and a real Shift key), not the CGEvent Unicode payload. CJK caveat: typing sends US keycodes, so if the phone keyboard is a Chinese (Pinyin) IME, digits become candidate selectors (a1b2c3 comes out as 啊不c3). Switch the phone to the English ABC keyboard for literal text. Real CJK input needs the on-phone IME and is out of scope for this keycode path for now (the optional WebDriverAgent layer, PHONE_REMOTE_WDA_URL, handles CJK text).
  • HID taps need the Mirroring window frontmost — the daemon re-asserts focus only when another app steals it.
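
To make the "keycodes, not Unicode" point concrete, here is a minimal sketch of turning literal text into (virtual keycode, Shift) pairs. It is not the daemon's code: the function name and the tiny table are illustrative, and the numeric values are the standard macOS ANSI virtual keycodes (kVK_ANSI_*) for a handful of keys only.

```python
# Illustrative subset of macOS ANSI virtual keycodes (kVK_ANSI_*).
# A real mapping would cover the full US layout; this is only a sketch.
KEYCODES = {
    "a": 0x00, "s": 0x01, "d": 0x02, "c": 0x08, "b": 0x0B,
    "1": 0x12, "2": 0x13, "3": 0x14, " ": 0x31,
}


def text_to_keystrokes(text):
    """Return a list of (keycode, shift) pairs for text in the sketch's subset."""
    strokes = []
    for ch in text:
        lower = ch.lower()
        if lower not in KEYCODES:
            # Anything outside the US keycode table (CJK, emoji, ...) cannot
            # be expressed as a keycode and has to go through another path.
            raise ValueError(f"no keycode for {ch!r} in this sketch")
        strokes.append((KEYCODES[lower], ch.isupper()))
    return strokes
```

Because only keycodes travel to the phone, whatever keyboard is active on the phone decides what those keys mean, which is why a Pinyin IME turns the digits into candidate selectors.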

Deployment — a GUI-session LaunchAgent

ScreenCaptureKit (Screen Recording) and input injection (Accessibility) require TCC grants tied to a signed identity in the login session — an SSH-spawned binary is denied. So the daemon runs as a codesigned LaunchAgent in the desktop session, granted once; SSH shells, agents, and the iPhone Safari controller all connect to it.
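
For readers unfamiliar with LaunchAgents, the sketch below builds a property list of the general shape such a job uses. install.sh writes the real one; the label, program path, and helper name here are hypothetical. The keys themselves (Label, ProgramArguments, EnvironmentVariables, RunAtLoad, KeepAlive, LimitLoadToSessionType) are standard launchd.plist keys, and "Aqua" is the session type for the logged-in GUI session.

```python
import plistlib


def launch_agent_plist(label, program, env):
    """Serialize a launchd job definition for a GUI-session agent (sketch)."""
    job = {
        "Label": label,                       # hypothetical label
        "ProgramArguments": [program, "serve"],
        "EnvironmentVariables": dict(env),
        "RunAtLoad": True,                    # start at login
        "KeepAlive": True,                    # restart if it exits
        "LimitLoadToSessionType": "Aqua",     # GUI login session only
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    xml = launch_agent_plist(
        "com.example.iphone-use",  # hypothetical
        "/Applications/iPhoneUse.app/Contents/MacOS/iphone-use",  # hypothetical path
        {"PHONE_REMOTE_HOST": "0.0.0.0", "PHONE_REMOTE_PORT": "44321"},
    )
    print(xml.decode())
```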

Control lease — one cursor, one controller

HID-tap input drives the host Mac's one real cursor with the Mirroring window frontmost. A mandatory control lease grants that single cursor to one controller at a time (human or agent); the most recent actor holds control. Without the lease, human and agent would corrupt each other's gestures fighting over the same cursor. Viewers (WebRTC video consumers not sending input) are unaffected: last-connected-wins for input, but all viewers keep their video stream.
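
One way to picture the lease is a single holder plus a counter that changes whenever control changes hands, so events from an interrupted gesture can be recognized and dropped. This is a sketch under that assumption, not the daemon's implementation; the class, method names, and the epoch idea are illustrative.

```python
class ControlLease:
    """Single-holder lease: the most recent actor to take it holds control."""

    def __init__(self):
        self.holder = None
        self.epoch = 0  # bumped every time the holder changes

    def take(self, controller_id):
        """Give control to controller_id and return the epoch for its gesture."""
        if controller_id != self.holder:
            self.holder = controller_id
            self.epoch += 1
        return self.epoch

    def may_inject(self, controller_id, epoch):
        """Accept an event only from the current holder in the current epoch."""
        return controller_id == self.holder and epoch == self.epoch
```

A human who starts a swipe under epoch 1 and is then preempted by an agent (epoch 2) has the rest of that swipe rejected, instead of interleaving with the agent's tap on the one shared cursor.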

Requirements

  • macOS 15 Sequoia or later (iPhone Mirroring's requirement) with iPhone Mirroring set up and signed in. Validated on macOS 15 Sequoia / 26 Tahoe; see the Roadmap for macOS 27 support.
  • Rust toolchain (to build) — cargo.
  • Zero external runtime dependencies — all input (tap, scroll, text, key, shortcuts) is injected via native CGEvent directly, and screenshots use the built-in screencapture CLI. No third-party binary (cua-driver or otherwise) is required at runtime.
  • (optional) a Cloudflare TURN key for cross-network (cellular / remote) access.

Install

Build, bundle into a signed .app, and register the LaunchAgent:

cargo build --release --bin iphone-use
./scripts/make-app.sh                 # → ./iPhoneUse.app
./install.sh ./iPhoneUse.app       # signs, installs, writes the LaunchAgent

install.sh binds 0.0.0.0, generates a password (or uses $PHONE_REMOTE_PASSWORD), opens the Screen Recording + Accessibility panes to grant once, and prints the iPhone connect URL. On the iPhone (same Wi-Fi) open http://<mac-lan-ip>:44321/phone and enter the password.

Pre-built binaries are published from CI on every version tag — see the Releases page. To cut the first release: trigger the smoke-test via Actions → workflow_dispatch, then git tag v0.1.0 && git push origin v0.1.0. install.sh self-signs the app locally with codesign -s -; Gatekeeper will prompt unless the binary is notarized (optional secrets: APPLE_SIGNING_CERTIFICATE / APPLE_SIGNING_CERTIFICATE_PASSWORD / APPLE_SIGN_IDENTITY; notarization: APPLE_ID / APPLE_ID_PASSWORD / APPLE_TEAM_ID). Unsigned is the default path.

Run without installing (dev)

PHONE_REMOTE_HOST=0.0.0.0 PHONE_REMOTE_PASSWORD=secret \
  ./target/release/iphone-use serve

Upgrades

The daemon checks GitHub releases daily and reports the result in GET /agent/status (version / latest / update_available); the web client shows a banner when you're behind. Upgrading is a one-liner, and TCC grants persist because the bundle id stays the same:

curl -fsSL https://raw.githubusercontent.com/leeguooooo/iphone-use/main/install.sh | sh   # daemon
npx skills update -g                                                                      # agent skill

Disable the check with PHONE_REMOTE_NO_UPDATE_CHECK=1 (air-gapped setups).

Feedback — humans and agents alike

Rough edge? Open an issue. AI agents are explicitly invited: the bundled skill instructs agents to file structured issues (with user consent) when they hit friction using the API — misleading errors, missing capabilities, docs that lie. Complaints from the heaviest users make the product better.

Configuration (environment)

| Variable | Default | Purpose |
| --- | --- | --- |
| PHONE_REMOTE_HOST | 127.0.0.1 | Listen address (0.0.0.0 for LAN). |
| PHONE_REMOTE_PORT | 44321 | Listen port. |
| PHONE_REMOTE_PASSWORD | (none) | Shared password (cookie login + agent bearer fallback). |
| PHONE_REMOTE_AGENT_TOKEN | (none) | Dedicated agent bearer token. When set, the agent API accepts only this token (the password is no longer valid as a bearer); unset = password doubles as the bearer (legacy). |
| PHONE_REMOTE_CF_TURN_KEY_ID / _API_TOKEN | | Cloudflare TURN key → ephemeral relay creds for cross-network. |
| PHONE_REMOTE_WDA_URL | (none) | L2 element-tree control: a WebDriverAgent reachable at this URL (use http://127.0.0.1:8100 via the relay from scripts/setup-wda.sh). When set, agent text/taps auto-route through the phone-side element layer: CJK text lands cleanly, label-taps need no coordinates, nothing touches the host cursor. Unset = pure pixel path. |
| PHONE_REMOTE_TURN_URLS / _USERNAME / _CREDENTIAL | | Static TURN server (alternative to Cloudflare). |
| PHONE_REMOTE_AUTO_RESUME | (off) | 1 = experimental: a watchdog clicks the Mirroring Resume/Connect button to recover the paused screen unattended. Off by default: macOS blocks a background agent from focusing Mirroring while the phone is in use, so it can't be made reliable; mirror_state/hint tell you when to click manually instead. |
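
As an illustration of how the static TURN variables could feed a WebRTC configuration, the sketch below builds an iceServers list in the standard RTCIceServer shape (urls, username, credential). Several things are assumptions: that PHONE_REMOTE_TURN_URLS is comma-separated, that the abbreviated names expand to PHONE_REMOTE_TURN_USERNAME and PHONE_REMOTE_TURN_CREDENTIAL, and the public STUN URL, which the source does not name.

```python
def ice_servers(env):
    """Build an RTCIceServer-style list: STUN always, TURN only if configured."""
    # Assumed STUN server; the daemon's actual choice is not documented here.
    servers = [{"urls": ["stun:stun.l.google.com:19302"]}]

    raw = env.get("PHONE_REMOTE_TURN_URLS", "")
    # Assumption: multiple TURN URLs are separated by commas.
    urls = [u.strip() for u in raw.split(",") if u.strip()]
    if urls:
        servers.append({
            "urls": urls,
            "username": env.get("PHONE_REMOTE_TURN_USERNAME", ""),
            "credential": env.get("PHONE_REMOTE_TURN_CREDENTIAL", ""),
        })
    return servers
```

With no TURN variables set the list holds only the STUN entry, which matches the LAN case; the relay entry appears only when a TURN server is configured.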

Agent API

Agents drive the phone by connecting in to the running daemon (never by spawning their own input process — macOS makes a spawned child's events untrusted). Bearer auth: Authorization: Bearer <token> where token is PHONE_REMOTE_AGENT_TOKEN when set, otherwise PHONE_REMOTE_PASSWORD (legacy fallback).
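
The token rule above can be sketched as a small check. This is an illustration, not the daemon's code: the function name is hypothetical, and rejecting every request when neither secret is configured is an assumption.

```python
import hmac


def bearer_ok(header_value, agent_token, password):
    """Return True if the Authorization header carries the accepted bearer."""
    # A dedicated agent token, when set, replaces the password entirely.
    expected = agent_token or password
    if not expected or not header_value:
        return False  # assumption: no secret configured means no access
    prefix = "Bearer "
    if not header_value.startswith(prefix):
        return False
    presented = header_value[len(prefix):]
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(presented.encode(), expected.encode())
```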

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /agent/status | Auth / health probe + driveability: {ok, phone_target, wda, drivable, mirror_state, hint, mode, viewer_count, …}. |
| POST | /agent/input | One control message: tap / scroll / text / key / shortcut / keyboard (normalized [0,1] coords). |
| GET | /agent/screenshot | Current phone screen as PNG (validated frame; falls back to on-device capture). |
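
Since /agent/input takes normalized [0,1] coordinates, an agent that picks a target in screenshot pixels has to convert first. The helper below is a sketch that assumes the coordinates are fractions of the screenshot's width and height measured from the top-left corner; the function name is hypothetical.

```python
import json


def tap_payload(px, py, width, height):
    """Convert a pixel position on a screenshot into a tap message body."""
    if width <= 0 or height <= 0:
        raise ValueError("screenshot dimensions must be positive")
    # Clamp so a point slightly outside the frame still yields a valid value.
    x = min(max(px / width, 0.0), 1.0)
    y = min(max(py / height, 0.0), 1.0)
    return json.dumps({"type": "tap", "x": x, "y": y})
```

For a 400×1000 screenshot, the pixel (200, 300) becomes x = 0.5, y = 0.3.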

Gate actions on drivable, not phone_target: the Mirroring window can be up yet showing the "Connection Paused" / "iPhone in Use" interstitial, where taps land in the void. mirror_state (active/paused/in_use/offline) + hint say what to do (paused → tap Resume; in_use → lock the phone; offline → open Mirroring). human_active:true warns a person is using the Mac — in mirror mode an L3 tap steals their focus, so back off or switch to agent mode (/agent/mode, on-device).
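
The gating rule can be written as a small decision function over the /agent/status JSON. The field names come from the description above; the function name, the ordering of the checks, and the fallback string are assumptions.

```python
def next_step(status):
    """Decide what an agent should do given a parsed /agent/status response."""
    if status.get("human_active"):
        # A person is using the Mac; a mirror-mode tap would steal their focus.
        return "back off or switch to agent mode"
    if status.get("drivable"):
        return "act"
    # Not drivable: mirror_state says why, and what fixes it.
    hints = {
        "paused": "tap Resume",
        "in_use": "lock the phone",
        "offline": "open Mirroring",
    }
    return hints.get(status.get("mirror_state"), "wait and re-check /agent/status")
```

Note that phone_target is deliberately not consulted: the window can exist while showing an interstitial, and only drivable says taps will land.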

Full reference: docs/agent-api.html.

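PW='<your-token>'                  # PHONE_REMOTE_AGENT_TOKEN, or the password when no token is set
HOST="http://<mac-lan-ip>:44321"   # quoted so the shell does not parse < > as redirections
AUTH="Authorization: Bearer $PW"
curl -s -H "$AUTH" "$HOST/agent/screenshot" -o screen.png                                  # save the current screen
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"shortcut","name":"home"}'      # go Home
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"tap","x":0.5,"y":0.3}'         # tap at normalized (x, y)
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"keyboard"}'                    # dismiss the keyboard (wda)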
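
The same calls can be made from a script with only the standard library. The sketch below builds the request objects; the helper name is hypothetical, and sending Content-Type: application/json is an assumption (the curl examples above do not set it).

```python
import json
import urllib.request


def agent_request(host, token, path, message=None):
    """Build a GET (no message) or POST (with message) request for the agent API."""
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if message is not None:
        data = json.dumps(message).encode("utf-8")
        headers["Content-Type"] = "application/json"  # assumption, see lead-in
    method = "POST" if message is not None else "GET"
    url = host.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


if __name__ == "__main__":
    # Requires a running daemon; host and token here are placeholders.
    req = agent_request("http://127.0.0.1:44321", "secret", "/agent/input",
                        {"type": "shortcut", "name": "home"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(resp.status)
```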

MCP server

iphone-use-mcp is an MCP stdio server (crates/mcp) that bridges MCP clients such as Claude Desktop and Claude Code to the daemon's agent API. Nine tools: phone_status, screenshot, elements, tap, tap_label, scroll, type (CJK-clean when WDA is live), key, shortcut. Two env vars: PHONE_REMOTE_URL (default http://127.0.0.1:44321) and PHONE_REMOTE_TOKEN (optional; maps to PHONE_REMOTE_AGENT_TOKEN on the daemon side).

Add to your claude_desktop_config.json (or Claude Code MCP config):

{
  "mcpServers": {
    "iphone-use": {
      "command": "/path/to/iphone-use-mcp",
      "env": {
        "PHONE_REMOTE_URL": "http://127.0.0.1:44321",
        "PHONE_REMOTE_TOKEN": "<your-agent-token>"
      }
    }
  }
}

See crates/mcp/README.md for full tool schemas and build instructions.

Shortcuts bridge (experimental)

Beyond tapping through the UI, an agent can reach native iOS APIs — battery, Apple Health, Location, Messages, HomeKit — through one curated bridge shortcut. The daemon triggers the "iU Bridge" Shortcut by name (clipboard verb + Spotlight), the shortcut dispatches on that verb to the matching native action and POSTs structured JSON back to /agent/inbox — deterministic data instead of screen-scraping. This is an additive fast path: UI automation (tap / scroll, any app) stays the universal fallback. See shortcuts/README.md and the verb map in shortcuts/registry.json.

Agent skill

Teach any skills-capable agent (Claude Code, etc.) to drive your phone — including the vision once → script forever methodology (solve a phone task visually the first time, then freeze it into a repeatable one-command script):

npx skills add leeguooooo/iphone-use

Installing globally with -g? If the skills CLI prints PromptScript does not support global skill installation, that's a harmless partial failure — PromptScript only supports project-level skills, so its target is skipped while every other agent (Claude Code, etc.) still installs. Add -a claude to target a specific agent and silence the warning.

The skill covers the agent API, the see→act→verify loop, hardware-validated input facts (scroll direction, the keycode/IME caveat), and a worked example — a full Apple Health export (no API exists; the agent taps through the Health app and the data lands on your Mac in ~3 minutes). See skills/iphone-use/SKILL.md.

Security notes

This tool exposes live phone control over the network. Treat the URL and password like sensitive credentials.

  • A password is mandatory when binding to the LAN (install.sh enforces it).
  • HTTPS for remote access is terminated by a Cloudflare tunnel (the daemon serves plain HTTP and reads X-Forwarded-Proto); the session cookie is HttpOnly + SameSite=Lax.
  • Don't leave payment apps, private chats, or 2FA screens open while exposing access.
  • Stop / unload the LaunchAgent when not in use.
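
To show what the cookie attributes in the list above look like on the wire, here is a sketch that builds a Set-Cookie value with the standard library. The cookie name "session" and the function name are hypothetical, and adding Secure when X-Forwarded-Proto is https is an assumption about why the daemon reads that header.

```python
from http.cookies import SimpleCookie


def session_cookie(session_id, forwarded_proto):
    """Return a Set-Cookie header value for a login session (sketch)."""
    cookie = SimpleCookie()
    cookie["session"] = session_id      # hypothetical cookie name
    morsel = cookie["session"]
    morsel["httponly"] = True           # not readable from page JavaScript
    morsel["samesite"] = "Lax"          # not sent on cross-site subrequests
    morsel["path"] = "/"
    if (forwarded_proto or "").lower() == "https":
        # Assumption: behind the tunnel the original request was HTTPS,
        # so the cookie can be restricted to secure transport.
        morsel["secure"] = True
    return morsel.OutputString()
```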

Roadmap

Shipped and hardware-validated on macOS 15 Sequoia / 26 Tahoe: WebRTC video, tap, scroll, keycode text, shortcuts, frontmost-robust input, the agent HTTP API, and the LaunchAgent install. Next:

  • macOS 27 "Golden Gate" support. macOS 27 makes the iPhone Mirroring window resizable with variable aspect ratios (and can render an iPad layout) — it's no longer portrait-locked. Make window selection aspect-independent (rank by on-screen + area, not shape), re-validate capture + input on the 27 beta, and add the new Control Center shortcut. Goal: one build that runs on macOS 15 / 26 / 27.
  • MCP server wrapping the agent API, so MCP clients (Claude, etc.) get tap / type / scroll / screenshot as native tools. An initial stdio server, iphone-use-mcp, already lives in crates/mcp (see the MCP server section).
  • Cross-network validation of the Cloudflare dynamic TURN path with a real key (the minting + refresh code already ships; needs an end-to-end run off-LAN).
  • Element-tree control via WebDriverAgent (the "L2" layer) — shipped and hardware-validated (iPhone 17 / iOS 27) through the daemon's own API. WDA runs on the phone and drives iOS's accessibility tree, so the same agent API auto-routes to the best path: {"type":"text"} lands CJK cleanly (the pixel path's keycodes get eaten by the Pinyin IME), {"type":"tap","label":"…"} taps by element (no coordinates, no host cursor), GET /agent/elements returns the UI as text (an order of magnitude cheaper than vision), and screenshots fall back to on-device capture when Mirroring is gone — the agent keeps seeing and acting while a human is holding the phone. One-command setup: ./scripts/setup-wda.sh (requires Xcode); every hardware-validated pitfall is in docs/wda-setup.html.
  • Release binaries in CI + a one-line curl … install.sh | sh install.
  • A short demo (GIF / video) of an AI agent driving the phone through the API.
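
The aspect-independent window selection mentioned in the macOS 27 item ("rank by on-screen + area, not shape") could look roughly like this. It is a sketch only: the Window type and its fields are hypothetical stand-ins for whatever the capture layer reports.

```python
from dataclasses import dataclass


@dataclass
class Window:
    title: str
    on_screen: bool
    width: float
    height: float


def pick_window(candidates):
    """Prefer on-screen windows, then the largest area; ignore aspect ratio."""
    if not candidates:
        return None
    # Tuples compare element by element: on_screen first (True > False),
    # then area as the tie-breaker.
    return max(candidates, key=lambda w: (w.on_screen, w.width * w.height))
```

A landscape or iPad-shaped Mirroring window is ranked the same way as a portrait one, since shape never enters the key.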

Issues and PRs welcome.

Layout

  • crates/core — capture, encode, coordinate/geometry, input injection, control lease.
  • crates/server — the iphone-use daemon: HTTP/WS, WebRTC, signaling, agent API, TURN.
  • web/index.html — the iPhone Safari client (WebRTC viewer + touch).
  • install.sh, scripts/make-app.sh, deploy/ — packaging + LaunchAgent.
  • docs/ — design spec, runbooks, agent API reference, research notes.

License

MIT
