Evaluate Gemma 4 E2B (LiteRT-LM) as an L-stage LLM-cleanup engine — ~0.8 GB text-only, not a transcription model #10

Description

@samkeen

Finding (2026-06-14)

Evaluated litert-community/gemma-4-E2B-it-litert-lm as a candidate LLM-cleanup engine for the deferred enrichment stages. Verdict: a good candidate for the L-stages — text cleanup / categorization — but explicitly NOT a transcription engine. Keep it where the plan already has it: an optional L5 comparison engine behind LanguageModel, after the MLX LLM path is proven. Already referenced in planning/notes.md (the L5 row + Appendix A); this issue captures the concrete specs.

⚠️ Not a transcription model

It's a text-generation LLM (text in → text out). The Gemma 4 family has an optional audio encoder ("the Vision and Audio models are loaded on demand"), but that is general audio understanding bolted onto an LLM and geared to short clips, not a dedicated ASR engine, and we would expect it to lose on WER to our purpose-built engines. Transcription stays with Whisper / Parakeet (and Qwen3-ASR remains the MLX-native ASR sanity-check candidate). Gemma's job is the step after STT:

audio ──► [Whisper / Parakeet = STT] ──► raw transcript ──► [Gemma E2B = LLM cleanup] ──► clean, categorized note
            (T-stages, shipped)                               (L-stages, deferred — this issue)
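
The two-stage split above can be sketched as plain function composition. This is a minimal illustration in Python, not the app's code: `process_recording`, `fake_stt`, and `fake_cleanup` are hypothetical names, and the stubs only stand in for Whisper / Parakeet and for Gemma E2B so the sketch runs without any model.

```python
from typing import Callable

# Hypothetical stage signatures: STT maps audio bytes to raw text,
# the LLM stage maps raw text to a cleaned note.
Transcriber = Callable[[bytes], str]
Cleaner = Callable[[str], str]


def process_recording(audio: bytes, transcribe: Transcriber, cleanup: Cleaner) -> str:
    """Run STT first, then hand only the text to the LLM cleanup stage."""
    raw_transcript = transcribe(audio)  # T-stage (Whisper / Parakeet)
    return cleanup(raw_transcript)      # L-stage (text in, text out)


def fake_stt(audio: bytes) -> str:
    """Stand-in for the STT engine; ignores the audio and returns a fixed transcript."""
    return "um so buy milk tomorrow"


def fake_cleanup(text: str) -> str:
    """Stand-in for the LLM: drop two filler words, capitalize, add a period."""
    words = [w for w in text.split() if w not in {"um", "so"}]
    return " ".join(words).capitalize() + "."
```

The point of the shape is that the LLM stage never sees audio, so a text-only engine is sufficient for it.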

What it is

  • Google on-device LLM, instruction-tuned, built on Gemini research. Mixed-precision (2/4/8-bit) mobile quantization.
  • Footprint: 2.5 GB full deployment, but ~0.8 GB text-only (vision/audio submodels load on demand). Our cleanup task is text-only → the ~0.8 GB path applies — lighter than Parakeet's 1.28 GB resident floor and well under the 2.5 GB Qwen3-4B-Q4 sweet spot.
  • iOS RAM: 607–1450 MB as measured on iPhone 17 Pro. That should be comfortable on the 8 GB iPhone 15 Pro Max even co-resident with other state, with no pressure to request the increased-memory-limit entitlement.
  • Throughput: 56.5 decode tok/s on iPhone GPU. Usable for streaming a cleaned note into the view.
  • Context: 2k benchmarked, up to 32k. Plenty for single-note cleanup.
  • License: Apache 2.0 — matches the licensing constraint in notes.md ("Qwen / Gemma E2B are Apache 2.0"); clean if we ever move to a paid tier.
  • Runtime: LiteRT-LM (primary); MediaPipe LLM Inference in maintenance mode.
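
As a back-of-envelope check on the "usable for streaming" claim, the quoted decode rate can be turned into wall-clock estimates. This is a rough sketch: the throughput and footprint figures are the ones quoted in this issue, while the note lengths in the loop are hypothetical, and prefill and model-load time are ignored.

```python
DECODE_TOK_PER_S = 56.5  # decode throughput quoted in this issue (iPhone GPU)

# Footprints quoted in this issue, in GB.
FOOTPRINT_GB = {
    "Gemma 4 E2B (text-only)": 0.8,
    "Parakeet (resident floor)": 1.28,
    "Qwen3-4B-Q4": 2.5,
}


def decode_seconds(output_tokens: int, tok_per_s: float = DECODE_TOK_PER_S) -> float:
    """Seconds to generate `output_tokens`, ignoring prefill and model load."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return output_tokens / tok_per_s


def smallest_footprint(footprints: dict = FOOTPRINT_GB) -> str:
    """Name of the entry with the lowest quoted footprint."""
    return min(footprints, key=footprints.get)


# Hypothetical cleaned-note lengths, in output tokens.
for n in (128, 256, 512):
    print(f"{n} tokens: ~{decode_seconds(n):.1f} s decode")
```

Real latency on device will also depend on prompt length and thermal state, which this estimate does not model.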

Why it fits

  1. The plan already reserves a slot. notes.md lists the L-stage engine order as "MLX primary, llama.cpp fallback, LiteRT-LM third, Apple Foundation optional fourth," and files Gemma E2B under L5 (Optional). This is the comparison/diversity engine, not the lead.
  2. Right-sized for the task. L2/L3 are modest instruction-following tasks: turn a messy transcript into a clean note, and categorize it against a pinned, fixed taxonomy where the model picks only from the allowed labels. A 2B-effective model is plausibly enough, and its tiny text-only footprint is a real win on an 8 GB device.
  3. Local-first, on-device, not-Apple-only — textbook match for the design axis. Apache 2.0, fully offline.
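
The "pin a fixed taxonomy, model picks from allowed" idea in item 2 could look roughly like the following. It is a sketch under stated assumptions: the category names, the fallback label, the prompt wording, and both function names are hypothetical and do not come from notes.md.

```python
ALLOWED_CATEGORIES = ("idea", "todo", "journal", "reference")  # hypothetical taxonomy
FALLBACK_CATEGORY = "uncategorized"                            # hypothetical fallback


def build_category_prompt(transcript: str, allowed=ALLOWED_CATEGORIES) -> str:
    """Build a prompt that lists the only labels the model may answer with."""
    options = ", ".join(allowed)
    return (
        f"Classify the note into exactly one of these categories: {options}.\n"
        "Reply with the category name only.\n\n"
        f"Note:\n{transcript}"
    )


def parse_category(raw_output: str, allowed=ALLOWED_CATEGORIES) -> str:
    """Accept the model's answer only if it is an allowed label; otherwise fall back."""
    candidate = raw_output.strip().strip(".\"'").lower()
    return candidate if candidate in allowed else FALLBACK_CATEGORY
```

Validating the reply on the app side means a small model that drifts off the list degrades to the fallback label instead of inventing a new category.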

Costs / risks

  1. A second inference runtime. Our spine is mlx-swift (Whisper + Parakeet). LiteRT-LM is a different on-device runtime: adding it means linking a new C++ stack into the app and writing a LanguageModel conformer against its API, not just an HF download. That dilutes the "MLX primary" stance (the same shape of cost flagged for Moonshine in #9, "Evaluate Moonshine as a 4th on-device engine (streaming/live-partials play, not an accuracy rung)").
  2. Out of phase. The LanguageModel protocol doesn't exist yet and L1 (the MLX inference spike) hasn't started. L1's whole point is proving any local LLM runs on the 15 Pro Max via the runtime we already have (MLX). Standing up a new runtime before that inverts "do the hard thing first on the smaller problem."
  3. Not the default. The default local LLM is Qwen3 4B via MLX. Gemma E2B's value is being the second opinion we A/B against once MLX works — realized after the MLX path exists, not instead of it.

Where it slots

L5, optional. The one scenario that promotes it: if the L1 MLX-LLM spike hits a wall (jetsam / perf), Gemma E2B's ~0.8 GB text-only footprint plus official iOS support make LiteRT-LM the natural fallback, which is the role the engine abstraction was built to absorb ("format availability lag — MLX / GGUF / LiteRT-LM" in notes.md). It would also make a ready-made AlteredCraft post ("MLX vs llama.cpp vs LiteRT-LM on a real iPhone").
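
The fallback role described here amounts to walking the engine order from notes.md and taking the first engine that works. The sketch below shows that logic in Python for brevity; the `LanguageModel` protocol does not exist yet, so its shape here (`name`, `is_available`, `generate`) is an assumption, and `StubEngine` and `pick_engine` are hypothetical names.

```python
from typing import Protocol, Sequence


class LanguageModel(Protocol):
    """Assumed shape of the engine abstraction; the real protocol is not written yet."""

    name: str

    def is_available(self) -> bool: ...

    def generate(self, prompt: str) -> str: ...


class StubEngine:
    """Stand-in engine so the selection logic can run without a real runtime."""

    def __init__(self, name: str, available: bool) -> None:
        self.name = name
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def pick_engine(engines: Sequence[LanguageModel]) -> LanguageModel:
    """Return the first available engine, in the priority order given."""
    for engine in engines:
        if engine.is_available():
            return engine
    raise RuntimeError("no local LLM engine is available")
```

Passing the engines in the notes.md order (MLX, llama.cpp, LiteRT-LM, Apple Foundation) keeps MLX as the lead and lets LiteRT-LM take over only when the earlier engines report themselves unavailable.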

Verify before committing (when L stages resume)

  • L1 MLX-LLM spike done first (prove the MLX path before adding a second runtime).
  • LiteRT-LM iOS binary-size hit on the built .app.
  • Text-only path actually loads at ~0.8 GB on-device (vision/audio submodels stay unloaded).
  • Cleanup/categorization quality A/B vs Qwen3 4B (MLX) on real captured notes.
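
For the categorization half of the A/B item, one possible scoring helper is sketched below. It assumes hand-labeled gold categories keyed by a note id, which is a hypothetical data layout; `category_accuracy` and `ab_compare` are illustrative names, and cleanup quality would still need human review or a separate metric.

```python
from typing import Dict, Mapping


def category_accuracy(predictions: Mapping[str, str], gold: Mapping[str, str]) -> float:
    """Fraction of gold-labeled notes where the predicted category matches."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    hits = sum(1 for note_id, label in gold.items() if predictions.get(note_id) == label)
    return hits / len(gold)


def ab_compare(
    engine_a: Mapping[str, str],
    engine_b: Mapping[str, str],
    gold: Mapping[str, str],
) -> Dict[str, object]:
    """Score both engines against gold and list the notes where they disagree."""
    disagreements = sorted(
        note_id for note_id in gold if engine_a.get(note_id) != engine_b.get(note_id)
    )
    return {
        "accuracy_a": category_accuracy(engine_a, gold),
        "accuracy_b": category_accuracy(engine_b, gold),
        "disagreements": disagreements,
    }
```

The disagreement list is the useful part for a small sample: those are the notes worth reading by hand before deciding whether the second engine earns its slot.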

Planning-doc references

  • planning/notes.md — L5 row ("Optional — … LiteRT-LM (Gemma 4 E2B)"), the LanguageModel engine-order note, and Appendix A (LLM enrichment engine + model research).
  • Local-first / opt-in-cloud and "on-device ≠ Apple-only" intro callouts in the same doc. Gemma here is the local engine we compare against the default, never a cloud path.

From engine-landscape evaluation, 2026-06-14.


Labels: enhancement, question
