Evaluate Gemma 4 E2B (LiteRT-LM) as an L-stage LLM-cleanup engine — ~0.8 GB text-only, not a transcription model

## Finding (2026-06-14)

Evaluated [`litert-community/gemma-4-E2B-it-litert-lm`](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm) as a candidate **LLM-cleanup engine** for the deferred enrichment stages. **Verdict: a good candidate for the L-stages — text cleanup / categorization — but explicitly NOT a transcription engine.** Keep it where the plan already has it: an optional **L5** comparison engine behind `LanguageModel`, after the MLX LLM path is proven. Already referenced in `planning/notes.md` (the L5 row + Appendix A); this issue captures the concrete specs.

## ⚠️ Not a transcription model

It's a text-generation LLM (text in → text out). The Gemma 4 family has an *optional* audio encoder ("the Vision and Audio models are loaded on demand"), but that is general audio understanding bolted onto an LLM and geared to short clips. It is **not a dedicated ASR engine**, and it would likely lose on WER to our purpose-built engines. Transcription stays with Whisper / Parakeet (and Qwen3-ASR remains the MLX-native ASR sanity-check candidate). Gemma's job is the step *after* STT:

```
audio ──► [Whisper / Parakeet = STT] ──► raw transcript ──► [Gemma E2B = LLM cleanup] ──► clean, categorized note
            (T-stages, shipped)                               (L-stages, deferred — this issue)
```

## What it is

- **Google on-device LLM**, instruction-tuned, built on Gemini research. Mixed-precision (2/4/8-bit) mobile quantization.
- **Footprint: 2.5 GB full deployment, but ~0.8 GB text-only** (vision/audio submodels load on demand). Our cleanup task is text-only → the **~0.8 GB** path applies — lighter than Parakeet's 1.28 GB resident floor and well under the 2.5 GB Qwen3-4B-Q4 sweet spot.
- **iOS RAM: 607–1450 MB** on iPhone 17 Pro. Comfortable on the 8 GB iPhone 15 Pro Max even co-resident with other state; no `increased-memory-limit` pressure.
- **Throughput: 56.5 decode tok/s on iPhone GPU.** Usable for streaming a cleaned note into the view.
- **Context: 2k benchmarked, up to 32k.** Plenty for single-note cleanup.
- **License: Apache 2.0** — matches the licensing constraint in `notes.md` ("Qwen / Gemma E2B are Apache 2.0"); clean if we ever move to a paid tier.
- **Runtime: LiteRT-LM** (primary); MediaPipe LLM Inference is in maintenance mode.
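
To make the throughput figure concrete, here is a rough back-of-envelope sketch in Swift. It only converts the quoted 56.5 decode tok/s into seconds per note. The 300-token note length is an assumption for illustration, not a measured value, and the estimate ignores prefill and model-load time.

```swift
// Back-of-envelope decode-time estimate. Illustrative only.
// 56.5 tok/s is the decode figure quoted in the list above.
let decodeTokensPerSecond = 56.5

/// Seconds needed to decode `tokens` output tokens.
/// Ignores prefill and model-load time, so real latency will be higher.
func decodeSeconds(tokens: Int, tokensPerSecond: Double = decodeTokensPerSecond) -> Double {
    Double(tokens) / tokensPerSecond
}

// Assumed length of a cleaned note; not a measured value.
let assumedNoteTokens = 300
print("Estimated decode time (s):", decodeSeconds(tokens: assumedNoteTokens))
```

Under that assumption a single cleaned note decodes in a few seconds, which is consistent with streaming it into the view rather than blocking on the full result.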

## Why it fits

1. **The plan already reserves a slot.** `notes.md` lists the L-stage engine order as "MLX primary, llama.cpp fallback, **LiteRT-LM third**, Apple Foundation optional fourth," and files Gemma E2B under **L5 (Optional)**. This is the comparison/diversity engine, not the lead.
2. **Right-sized for the task.** L2/L3 are modest instruction-following tasks (turn a messy transcript into a clean note; pin a fixed taxonomy and have the model pick from `allowed`). A 2B-effective model is plausibly enough, and its tiny text-only footprint is a real win on an 8 GB device.
3. **Local-first, on-device, not-Apple-only** — textbook match for the design axis. Apache 2.0, fully offline.
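
The "model picks from `allowed`" idea in point 2 could look roughly like the sketch below. Everything here is hypothetical: the category labels, the prompt wording, and the function names are placeholders for illustration, not the project's actual taxonomy or API. The point is only that the model's free-text reply is mapped back onto the fixed list and anything off-list is rejected.

```swift
// Hypothetical sketch: labels, prompt wording, and names are placeholders.

/// Fixed taxonomy the model must choose from (placeholder labels).
let allowed = ["todo", "idea", "meeting", "journal"]

/// Build a categorization prompt that pins the answer to `allowed`.
func categorizationPrompt(note: String, allowed: [String]) -> String {
    let options = allowed.joined(separator: ", ")
    return """
    Classify the note into exactly one category.
    Allowed categories: \(options)
    Reply with the category name only.

    Note: \(note)
    """
}

/// Strip surrounding whitespace and punctuation, then lowercase.
func normalized(_ raw: String) -> String {
    var text = Substring(raw)
    while let first = text.first, first.isWhitespace || first.isPunctuation {
        text.removeFirst()
    }
    while let last = text.last, last.isWhitespace || last.isPunctuation {
        text.removeLast()
    }
    return text.lowercased()
}

/// Map raw model output back onto the taxonomy.
/// Returns nil when the reply is not one of the allowed categories.
func resolveCategory(_ raw: String, allowed: [String]) -> String? {
    let candidate = normalized(raw)
    return allowed.first { $0.lowercased() == candidate }
}
```

A caller would send `categorizationPrompt(note:allowed:)` to whichever engine is active and pass the reply through `resolveCategory`. A nil result is the signal to retry or leave the note uncategorized, so a small model that drifts off-list cannot invent a new category.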

## Costs / risks

1. **A second inference runtime.** Our spine is `mlx-swift` (Whisper + Parakeet). LiteRT-LM is a different on-device runtime — adding it means linking a new C++ stack into the app and writing a `LanguageModel` conformer against its API, not just an HF download. Dilutes the "MLX primary" stance (same shape of cost flagged for Moonshine in #9).
2. **Out of phase.** The `LanguageModel` protocol doesn't exist yet and **L1 (the MLX inference spike) hasn't started.** L1's whole point is proving *any* local LLM runs on the 15 Pro Max via the runtime we already have (MLX). Standing up a new runtime before that inverts "do the hard thing first on the smaller problem."
3. **Not the default.** The default local LLM is **Qwen3 4B via MLX**. Gemma E2B's value is being the second opinion we A/B against once MLX works — realized *after* the MLX path exists, not instead of it.
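
For orientation, here is one possible shape for the engine abstraction these points refer to. The `LanguageModel` protocol does not exist yet, so the protocol requirements, the `StubEngine` type, and the `cleanup` function are all assumptions made for illustration. The stubs stand in for real runtime bindings, and the calls are synchronous for brevity where real engines would likely be async.

```swift
// Hypothetical sketch: `LanguageModel` is not implemented yet, so this
// protocol shape and every name below are assumptions for illustration.

enum EngineError: Error {
    case unavailable(String)
}

protocol LanguageModel {
    var name: String { get }
    func complete(prompt: String) throws -> String
}

/// Stub standing in for a real runtime binding (MLX, llama.cpp, LiteRT-LM).
struct StubEngine: LanguageModel {
    let name: String
    let available: Bool

    func complete(prompt: String) throws -> String {
        guard available else { throw EngineError.unavailable(name) }
        return "[\(name)] cleaned: \(prompt)"
    }
}

/// Try engines in priority order and return the first successful result.
/// Rethrows the last failure if every engine fails.
func cleanup(_ transcript: String,
             engines: [any LanguageModel]) throws -> (engine: String, text: String) {
    var lastError: Error = EngineError.unavailable("no engines configured")
    for engine in engines {
        do {
            let text = try engine.complete(prompt: transcript)
            return (engine.name, text)
        } catch {
            lastError = error
        }
    }
    throw lastError
}

// Priority order as listed in notes.md: MLX first, LiteRT-LM third.
let engines: [any LanguageModel] = [
    StubEngine(name: "MLX", available: false),
    StubEngine(name: "llama.cpp", available: false),
    StubEngine(name: "LiteRT-LM", available: true),
]
```

With the stubs configured this way, `try cleanup("um hello", engines: engines)` falls through to the LiteRT-LM stub. The sketch also shows where the cost in point 1 lands: each new runtime needs its own conformer plus the native stack behind it, while the call site stays unchanged.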

## Where it slots

**L5, optional.** The one scenario that promotes it: if the L1 MLX-LLM spike hits a wall (jetsam / perf), Gemma E2B's ~0.8 GB text-only footprint and official iOS support make LiteRT-LM the natural fallback. That is the role the engine abstraction was built to absorb ("format availability lag — MLX / GGUF / LiteRT-LM" in `notes.md`). The comparison would also make a ready-made AlteredCraft post ("MLX vs llama.cpp vs LiteRT-LM on a real iPhone").

## Verify before committing (when L stages resume)

- [ ] L1 MLX-LLM spike done first (prove the MLX path before adding a second runtime).
- [ ] LiteRT-LM iOS binary-size hit on the built `.app`.
- [ ] Text-only path actually loads at ~0.8 GB on-device (vision/audio submodels stay unloaded).
- [ ] Cleanup/categorization quality A/B vs Qwen3 4B (MLX) on real captured notes.

## Planning-doc references

- [`planning/notes.md`](https://github.com/AlteredCraft/relay-notes/blob/main/planning/notes.md) — L5 row ("Optional — … LiteRT-LM (Gemma 4 E2B)"), the `LanguageModel` engine-order note, and **Appendix A** (LLM enrichment engine + model research).
- Local-first / opt-in-cloud and "on-device ≠ Apple-only" intro callouts in the same doc — Gemma here is the local default-comparison path, never cloud.

---
_From engine-landscape evaluation, 2026-06-14._

