Finding (2026-06-14)
Evaluated litert-community/gemma-4-E2B-it-litert-lm as a candidate LLM-cleanup engine for the deferred enrichment stages. Verdict: a good candidate for the L-stages — text cleanup / categorization — but explicitly NOT a transcription engine. Keep it where the plan already has it: an optional L5 comparison engine behind LanguageModel, after the MLX LLM path is proven. Already referenced in planning/notes.md (the L5 row + Appendix A); this issue captures the concrete specs.
⚠️ Not a transcription model
It's a text-generation LLM (text in → text out). The Gemma 4 family has an optional audio encoder ("the Vision and Audio models are loaded on demand"), but that's general audio understanding bolted onto an LLM and geared to short clips. It is not a dedicated ASR engine, and it would likely lose on WER to our purpose-built engines. Transcription stays with Whisper / Parakeet (and Qwen3-ASR remains the MLX-native ASR sanity-check candidate). Gemma's job is the step after STT: cleaning up and categorizing the transcript text.
What it is
Google on-device LLM, instruction-tuned, built on Gemini research. Mixed-precision (2/4/8-bit) mobile quantization.
Footprint: 2.5 GB full deployment, but ~0.8 GB text-only (vision/audio submodels load on demand). Our cleanup task is text-only → the ~0.8 GB path applies — lighter than Parakeet's 1.28 GB resident floor and well under the 2.5 GB Qwen3-4B-Q4 sweet spot.
iOS RAM: 607–1450 MB on iPhone 17 Pro. Comfortable on the 8 GB iPhone 15 Pro Max even co-resident with other state; no increased-memory-limit pressure.
Throughput: 56.5 decode tok/s on iPhone GPU. Usable for streaming a cleaned note into the view.
Context: 2k benchmarked, up to 32k. Plenty for single-note cleanup.
License: Apache 2.0 — matches the licensing constraint in notes.md ("Qwen / Gemma E2B are Apache 2.0"); clean if we ever move to a paid tier.
Runtime: LiteRT-LM (primary); MediaPipe LLM Inference in maintenance mode.
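The footprint and throughput figures above can be sanity-checked with a little arithmetic. The following is a minimal sketch in Python (the app itself is Swift) that uses only numbers quoted in this issue. The 400-token note length is an assumption chosen for illustration, not a measurement, and the function names are hypothetical.

```python
# Figures quoted in this issue; nothing here is a new measurement.
GEMMA_TEXT_ONLY_GB = 0.8    # text-only path (vision/audio submodels not loaded)
PARAKEET_FLOOR_GB = 1.28    # Parakeet resident floor
DECODE_TOK_PER_S = 56.5     # decode throughput on iPhone GPU

# Assumption for illustration only: a cleaned note of about 400 output tokens.
ASSUMED_NOTE_TOKENS = 400


def co_resident_gb(*footprints_gb: float) -> float:
    """Sum the resident footprints of engines loaded at the same time."""
    return sum(footprints_gb)


def decode_seconds(output_tokens: int, tok_per_s: float = DECODE_TOK_PER_S) -> float:
    """Time to decode a given number of output tokens, ignoring prefill."""
    return output_tokens / tok_per_s


if __name__ == "__main__":
    both = co_resident_gb(GEMMA_TEXT_ONLY_GB, PARAKEET_FLOOR_GB)
    print(f"Gemma text-only + Parakeet: {both:.2f} GB")
    print(f"Decode time for {ASSUMED_NOTE_TOKENS} tokens: "
          f"{decode_seconds(ASSUMED_NOTE_TOKENS):.1f} s")
```

Under those assumptions the two engines together sit at about 2.08 GB, and a 400-token note decodes in roughly 7 seconds, excluding prefill.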
Why it fits
The plan already reserves a slot. notes.md lists the L-stage engine order as "MLX primary, llama.cpp fallback, LiteRT-LM third, Apple Foundation optional fourth," and files Gemma E2B under L5 (Optional). This is the comparison/diversity engine, not the lead.
Right-sized for the task. L2/L3 are modest instruction-following (messy transcript → clean note; pin a fixed taxonomy, model picks from allowed). A 2B-effective model is plausibly enough, and its tiny text-only footprint is a real win on an 8 GB device.
Local-first, on-device, not-Apple-only — textbook match for the design axis. Apache 2.0, fully offline.
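The "pin a fixed taxonomy, model picks from allowed" idea can be enforced outside the model, so a small model that drifts off the list never writes an invalid category. A minimal sketch in Python follows. The category names, the prompt wording, and the `uncategorized` fallback are all hypothetical, and no real model is called.

```python
# Hypothetical taxonomy for illustration; the real list would live in the app.
ALLOWED_CATEGORIES = ("work", "personal", "idea", "todo")
FALLBACK_CATEGORY = "uncategorized"


def build_prompt(note: str, allowed=ALLOWED_CATEGORIES) -> str:
    """Build a classification prompt that lists the only valid answers."""
    options = ", ".join(allowed)
    return (
        "Classify the note into exactly one category.\n"
        f"Allowed categories: {options}\n"
        "Reply with the category name only.\n\n"
        f"Note: {note}"
    )


def pick_category(model_reply: str, allowed=ALLOWED_CATEGORIES) -> str:
    """Normalize the model's reply and reject anything off the list."""
    candidate = model_reply.strip().strip(".\"'").lower()
    return candidate if candidate in allowed else FALLBACK_CATEGORY
```

With this shape, a reply such as `" Todo.\n"` normalizes to `todo`, while an off-list reply such as `shopping` falls back to `uncategorized` instead of leaking into stored notes.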
Costs / risks
A second inference runtime. Our spine is mlx-swift (Whisper + Parakeet). LiteRT-LM is a different on-device runtime: adding it means linking a new C++ stack into the app and writing a LanguageModel conformer against its API, not just an HF download. That dilutes the "MLX primary" stance (the same shape of cost flagged for Moonshine in #9, "Evaluate Moonshine as a 4th on-device engine (streaming/live-partials play, not an accuracy rung)").
Out of phase. The LanguageModel protocol doesn't exist yet and L1 (the MLX inference spike) hasn't started. L1's whole point is proving any local LLM runs on the 15 Pro Max via the runtime we already have (MLX). Standing up a new runtime before that inverts "do the hard thing first on the smaller problem."
Not the default. The default local LLM is Qwen3 4B via MLX. Gemma E2B's value is being the second opinion we A/B against once MLX works — realized after the MLX path exists, not instead of it.
Where it slots
L5, optional. The one scenario that promotes it: if the L1 MLX-LLM spike hits a wall (jetsam / perf), Gemma E2B's ~0.8 GB text-only footprint + official iOS support make LiteRT-LM the natural fallback — the role the engine abstraction was built to absorb ("format availability lag — MLX / GGUF / LiteRT-LM" in notes.md). Also a ready-made AlteredCraft post ("MLX vs llama.cpp vs LiteRT-LM on a real iPhone").
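The fallback role described above could look roughly like the sketch below, written in Python for brevity. `LanguageModel` here is a hypothetical stand-in for the protocol that does not exist yet, and `StubEngine`, `EngineUnavailable`, and `generate_with_fallback` are illustrative names, not real MLX or LiteRT-LM APIs. One caveat: a jetsam kill terminates the process and cannot be caught as an exception, so a real implementation would have to decide availability ahead of time (for example from the L1 spike results); the exception below only stands in for that decision.

```python
from typing import Protocol, Sequence, Tuple


class LanguageModel(Protocol):
    """Hypothetical engine abstraction: text in, text out."""
    name: str

    def generate(self, prompt: str) -> str: ...


class EngineUnavailable(Exception):
    """Raised by an engine that cannot load or run on this device."""


class StubEngine:
    """Placeholder engine; a real conformer would wrap an actual runtime."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available

    def generate(self, prompt: str) -> str:
        if not self.available:
            raise EngineUnavailable(self.name)
        return f"[{self.name}] {prompt}"


def generate_with_fallback(
    engines: Sequence[LanguageModel], prompt: str
) -> Tuple[str, str]:
    """Try engines in priority order; return (engine name, output)."""
    for engine in engines:
        try:
            return engine.name, engine.generate(prompt)
        except EngineUnavailable:
            continue  # fall through to the next engine in the order
    raise EngineUnavailable("no engine available")
```

Passing the engines in the order quoted from notes.md (MLX, llama.cpp, LiteRT-LM) with the first two marked unavailable would route the request to the LiteRT-LM stub, which is the promotion scenario this section describes.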
Verify before committing (when L stages resume)
L1 MLX-LLM spike done first (prove the MLX path before adding a second runtime).
Planning-doc references
planning/notes.md: L5 row ("Optional — … LiteRT-LM (Gemma 4 E2B)"), the LanguageModel engine-order note, and Appendix A (LLM enrichment engine + model research).
From engine-landscape evaluation, 2026-06-14.