Evaluate Moonshine as a 4th on-device engine (streaming/live-partials play, not an accuracy rung) #9

Description

@samkeen

Finding (2026-06-14)

Evaluated moonshine-ai/moonshine as a candidate on-device engine. Verdict: a good candidate — but for live partials + tiny footprint, not as another accuracy rung. Worth adding behind the existing spine; the cost is a second inference runtime.

What it is

  • Streaming-first ASR family, MIT-licensed, designed for edge from day one. English Tiny (26M, 12.66% WER) → Base (58M) → Small Streaming (123M) → Medium Streaming (245M, 6.65% WER), plus 7 other languages. Medium Streaming beats Whisper Large v3 (7.44%) at ~1/6 the params; sub-200ms latency; native streaming with input-encoding caching.
  • Runtime: ONNX Runtime (C++ core), not MLX or CoreML. Ships an official Swift package (moonshine-ai/moonshine-swift, SPM) with an example Transcriber iOS app. A Swift API also exists in sherpa-onnx.

Why it fits

  1. The spine was built for this. Same drop-in shape as Parakeet (T2.5): case moonshine in TranscriptionEngine, a MoonshineModelStore + store(for:) arm in ModelStores, a TranscriberFactory arm, an options case, a settings section.
  2. Local-first, on-device, not-Apple-only → textbook match for the "on-device ≠ Apple-only" axis (MIT, fully offline, third-party model).
  3. Closes the one real gap: live partials. Apple Speech is currently the only engine with live partials — Whisper and Parakeet are both finalize-only ("placeholder UX while recording", planning/notes.md). Moonshine's streaming models are purpose-built to feed our existing TranscriptionSession streaming protocol with incremental text → it would be the first non-Apple engine showing words as you speak, which is the core tap→speak→see-it UX.
  4. Tiny footprint. Models are 26–245 MB vs Parakeet's 2.4 GB / Whisper's 481 MB — a big win on download size/memory; sidesteps the 8 GB jetsam concerns.
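
A minimal sketch of the drop-in shape from point 1, to make the size of the change concrete. The type names TranscriptionEngine, MoonshineModelStore, ModelStores, store(for:), and TranscriberFactory come from this issue; everything else (the ModelStore and Transcriber protocols, the other enum case names, PlaceholderStore, StubTranscriber, the supportsLivePartials flag) is a hypothetical stand-in, and the real spine almost certainly differs in detail.

```swift
// Engine identifiers. Only `moonshine` is new; the other case names are assumed.
enum TranscriptionEngine: String, CaseIterable {
    case appleSpeech, whisper, parakeet, moonshine
}

// Hypothetical minimal store interface; the real ModelStores API may differ.
protocol ModelStore {
    var engine: TranscriptionEngine { get }
    var isDownloaded: Bool { get }
}

// New store for Moonshine model files (download/cache logic omitted).
struct MoonshineModelStore: ModelStore {
    let engine = TranscriptionEngine.moonshine
    var isDownloaded = false
}

// Stand-in for the stores that already exist for the other engines.
struct PlaceholderStore: ModelStore {
    let engine: TranscriptionEngine
    var isDownloaded = true
}

struct ModelStores {
    var moonshine = MoonshineModelStore()

    // The new `store(for:)` arm is the `.moonshine` case.
    func store(for engine: TranscriptionEngine) -> ModelStore {
        switch engine {
        case .moonshine:
            return moonshine
        case .appleSpeech, .whisper, .parakeet:
            return PlaceholderStore(engine: engine)
        }
    }
}

// Hypothetical transcriber interface with a flag for live partials.
protocol Transcriber {
    var supportsLivePartials: Bool { get }
}

struct MoonshineTranscriber: Transcriber {
    let supportsLivePartials = true
}

struct StubTranscriber: Transcriber {
    let supportsLivePartials: Bool
}

enum TranscriberFactory {
    // The new factory arm is the `.moonshine` case.
    static func make(_ engine: TranscriptionEngine) -> Transcriber {
        switch engine {
        case .moonshine:
            return MoonshineTranscriber()
        case .appleSpeech:
            return StubTranscriber(supportsLivePartials: true)
        case .whisper, .parakeet:
            return StubTranscriber(supportsLivePartials: false)
        }
    }
}
```

Because every switch is exhaustive, adding the enum case makes the compiler flag each arm that still needs a Moonshine branch, which is what keeps this a mechanical change.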

Costs / risks

  1. A second inference runtime. Existing engines are raw mlx-swift ports (Parakeet = 12 hand-ported files). Moonshine via ONNX Runtime (ORT) adds a new dependency stack; the ORT iOS static lib inflates the binary (models are small, the runtime isn't). Dilutes the "MLX primary for ASR" stance slightly. Counterpoint: using moonshine-swift/sherpa-onnx avoids a Parakeet-style hand-port, so likely less integration code, just a new binary dependency.
  2. Maturity. moonshine-swift is official and actively released (v0.0.62, June 2026; 46 releases) but young/niche (~7 stars, ~57 commits). Treat like the LocalLLMClient caveat.
  3. Outside the MLX eviction logic. The "single live MLX engine" eviction in TranscriberFactory assumes MLX; Moonshine sits outside it (like Apple). Low risk given small models, but needs a co-residency check.
  4. CPU/ORT path, not GPU/ANE by default (optionally CoreML EP). Different perf profile than the Metal-backed MLX engines; sub-200ms claims are CPU-measured.
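
The co-residency concern in point 3 can be sketched as follows. This is an illustration under assumptions, not the actual TranscriberFactory logic: InferenceRuntime, ResidencyTracker, approxModelMB, and the enum case names other than moonshine are hypothetical, and the sizes are simply the figures quoted in this issue (Moonshine taken at its largest, 245 MB).

```swift
// Which inference runtime backs each engine (hypothetical classification).
enum InferenceRuntime { case system, mlx, onnxRuntime }

enum TranscriptionEngine: Hashable {
    case appleSpeech, whisper, parakeet, moonshine

    var runtime: InferenceRuntime {
        switch self {
        case .appleSpeech: return .system
        case .whisper, .parakeet: return .mlx
        case .moonshine: return .onnxRuntime
        }
    }

    // Approximate model size in MB, from the figures quoted in this issue.
    // Apple Speech is system-managed, so it is counted as 0 here.
    var approxModelMB: Int {
        switch self {
        case .appleSpeech: return 0
        case .whisper: return 481
        case .parakeet: return 2400
        case .moonshine: return 245
        }
    }
}

struct ResidencyTracker {
    private(set) var resident: Set<TranscriptionEngine> = []

    // Marks `engine` as loaded and returns the engines evicted to make room.
    // Only MLX engines evict each other ("single live MLX engine");
    // non-MLX engines neither evict nor get evicted.
    mutating func load(_ engine: TranscriptionEngine) -> Set<TranscriptionEngine> {
        var evicted: Set<TranscriptionEngine> = []
        if engine.runtime == .mlx {
            evicted = resident.filter { $0.runtime == .mlx && $0 != engine }
            resident.subtract(evicted)
        }
        resident.insert(engine)
        return evicted
    }

    // Rough sum of resident model sizes, for a co-residency sanity check.
    var approxResidentMB: Int {
        resident.reduce(0) { $0 + $1.approxModelMB }
    }
}
```

Under this policy the heaviest pairing is Parakeet plus Moonshine, roughly 2400 + 245 = 2645 MB by the numbers above, since loading Moonshine never evicts the live MLX engine. Whether that is acceptable is the co-residency check this point asks for.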

Where it slots

Not above Parakeet on raw accuracy (Parakeet 0.6b stays the ceiling). Moonshine adds a new axis: low-latency, small-footprint, live-streaming third-party ASR. Natural fit as a T2.x "streaming on-device" engine. Also a ready-made AlteredCraft post ("MLX vs ONNX Runtime on iPhone"; "the first non-Apple engine with true live partials").

Verify before committing

  • moonshine-swift streaming API actually emits incremental partials we can pipe into TranscriptionSession.
  • ORT iOS binary-size hit on the built .app.
  • Works behind the existing TranscriberFactory/eviction without co-residency surprises.
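
The first check above could be turned into a small harness along these lines. It is a sketch only: TranscriptEvent, PartialsReport, and summarize are hypothetical names, and how events are actually captured from moonshine-swift (callback, delegate, or async sequence) is exactly what still needs verifying. The idea is to record whatever the engine emits during one utterance and confirm that text arrives before the first finalized result.

```swift
// Hypothetical event shape recorded from an engine during one recording.
enum TranscriptEvent: Equatable {
    case partial(String)    // incremental text while the user is speaking
    case finalized(String)  // committed text for a finished segment
}

struct PartialsReport {
    var partialsBeforeFirstFinal = 0
    var sawFinal = false

    // A finalize-only engine (Whisper or Parakeet today, per this issue)
    // would report zero partials before its first finalized result.
    var emitsLivePartials: Bool { partialsBeforeFirstFinal > 0 }
}

// Walks the recorded events in arrival order and counts the partials
// that were delivered before the first finalized result.
func summarize(_ events: [TranscriptEvent]) -> PartialsReport {
    var report = PartialsReport()
    for event in events {
        switch event {
        case .partial:
            if !report.sawFinal { report.partialsBeforeFirstFinal += 1 }
        case .finalized:
            report.sawFinal = true
        }
    }
    return report
}
```

Partials that arrive after the first finalized result are deliberately ignored, since a multi-segment recording may legitimately start a new segment there.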

From engine-landscape evaluation, 2026-06-14.

Metadata


Labels: enhancement, question
