diff --git a/CHANGE_LOG.md b/CHANGE_LOG.md
index a401c07..1e2aef5 100644
--- a/CHANGE_LOG.md
+++ b/CHANGE_LOG.md
@@ -140,3 +140,8 @@ Entry style: bold lead-in summarizing what shipped, then the *why* / non-obvious
 - **L2.0 device-validated (iPhone 15 Pro Max) — on-device LLM cleanup works; the thesis lands on the first real run.** Two findings. (1) **The QAT build doesn't load.** `mlx-community/gemma-4-E2B-it-qat-4bit` failed with `keyNotFound(language_model.model.layers.15.self_attn.k_proj.weight)`: Gemma 4 uses **KV-cache sharing** (layers 15–34 of 35 reuse earlier layers' K/V, so the checkpoint omits their `k_proj`/`v_proj`), but MLXLLM 3.31.3's `Gemma4Attention` declares a `kProj` Linear for *every* layer (runtime handles sharing via a `sharedKV` path; load-time doesn't), so weight-mapping throws at layer 15. Verified via the two repos' `model.safetensors.index.json`: the QAT build has `k_proj` on layers 0–14 only; the **non-QAT `mlx-community/gemma-4-e2b-it-4bit`** (the library's registered `LLMRegistry.gemma4_e2b_it_4bit` preset) materializes `k_proj` on all 35 layers. **Switched the smoke primary QAT → non-QAT** (one line) — the plan's "prefer `-qat-4bit`" is overridden by tooling (`plan.L2.md` §4). The download + bridge were never at fault (the QAT run downloaded 4.3 GB and mapped weights cleanly through layer 14). (2) **It runs, and runs well.** Non-QAT Gemma 4 E2B loaded in **3.4 s** (resident floor **2.67 GB**) and cleaned the inline sample to a genuinely good result — fillers (`um`/`uh`) removed, doubled words fixed (`the the`→`the`, `is is`→`is`), punctuation + sentence breaks added, `on boarding`→`onboarding`, `theres`→`there's`, **all content preserved, nothing invented** — an early but strong signal for the L2 bet (cleanup mitigates STT mess). Throughput **~23 tok/s** (approx, word-count-derived; precise streamed tok/s deferred to L2.3).
**Memory:** generation peak `phys_footprint` **3.02 GB** — *just* under the ~3 GB no-entitlement jetsam ceiling **on a tiny ~50-word output**; a real multi-minute note's larger KV cache will likely cross it, so the entitlement stays justified (re-measure on a long fixture at L2.2/L2.3). **Entitlement resolved (§10 Q3):** the **free Apple ID tier accepts `increased-memory-limit`** — the device build signed, installed, and ran with it; no V1.4 trigger. L2.0 (+ most of L2.1) done. Next: **L2.2** — capture real Apple-Speech transcripts into `cleanup_fixtures.json`, then **L2.3** the head-to-head verdict.
 - **L2 re-sequenced — in-app "Clean up" MVP pulled forward ahead of the fixtures/harness route (code-complete; device-pending).** Since L2.0 proved on-device cleanup works with strong quality, the "harness-first, UI-last" caution was retired and the real feature became the evaluation vehicle (dogfooding real notes beats curated fixtures and yields the transcripts for free). **L2.2 fixtures superseded; L2.3 model A/B deferred** (the `LLMCleanupSmoke` repo-id path still sweeps candidates when wanted). The feature, mirroring the re-transcribe workflow and decoupled from MLX behind the `LanguageModel` protocol: **centralized model management in the Tuning sheet** — `CleanupModelStore` (a `DownloadableModelStore` bound to the new `ModelDownloadSpec.gemmaCleanupE2B`, pinned to commit `2c3e5074…`: the complete `gemma-4-e2b-it-4bit` snapshot — `model.safetensors` 3.58 GB + tokenizer/config/chat-template, 8 files, each SHA-256 + size verified) into `Application Support/llm/gemma-4-e2b-it-4bit/`, registered as a sibling on `ModelStores` (outside the transcription-engine `store(for:)`/`readyEngines` machinery), surfaced via `CleanupModelSection` (download/progress/delete, ~3.4 GB copy, same shape as `WhisperModelSection`/`ParakeetModelSection`).
The app loads from the downloaded directory via `LLMModelFactory.loadContainer(from:using:)` — so **`MLXLanguageModel` gained a `Source` enum** (`.directory` for the app, `.repoId` for the DEBUG smoke; the smoke keeps the HubClient path, so `swift-huggingface` stays used and no dep was removed). **`Cleaner`** (`@MainActor @Observable`, the cleanup analogue of `ReTranscriber`) gates on `store.status == .ready`, runs `clean()` off a directory-loaded model, returns a non-destructive `Outcome`, and `evict()`s the ~2.7 GB model when the note is left. **`Note`** gained additive `cleanedTranscript`/`cleanupModel` (+ `isCleaned`/`applyCleanup`/`clearCleanup`) — raw `transcript` is never overwritten. **`NoteDetailView`**: a "Clean up" control (→ "Set up cleanup model" deep-link to the Tuning sheet when the model's absent, via an `onOpenSettings` closure threaded `ContentView`→`NotesListView`→detail), a before/after **Accept/Decline** sheet (`CleanupOutcomeSheet`), and a cleaned/raw display toggle + "Cleaned with …" provenance + "Remove". Tests (sim-safe): `CleanerTests` (gating + generic error message), `NoteTests` cleanup helpers + persistence, `DownloadableModelStoreTests` gemma spec pinning + `CleanupModelStore` readiness/subdirectory. Build + full simulator suite (20 suites) green. Plan: `plan.L2.md` §1 (re-sequencing note) + §6.
 - **L2.4 device-validated (iPhone 15 Pro Max) — in-app cleanup works end-to-end.** Downloaded the cleanup model via the Tuning `CleanupModelSection` (the SHA-pinned `CleanupModelStore` path → `Application Support/llm/gemma-4-e2b-it-4bit/`, separate from the smoke's HubClient cache), then ran "Clean up" on a real note: the before/after sheet showed the cleaned candidate, Accept persisted `cleanedTranscript`/`cleanupModel` with the raw `transcript` preserved. Confirms the directory-load path (`loadContainer(from:using:)`) and the whole gated flow on device.
**L2 has shipped a usable on-device cleanup feature** — the L2 thesis (cleanup mitigates STT errors) is now dogfoodable on real notes; remaining L2 polish (precise tok/s; optional model picker / formal A/B via the deferred L2.3 harness) is non-blocking.
+- **Onboarding doc — `docs/relay-notes-ios-guide.html`.** A self-contained, dependency-free HTML guide aimed at a seasoned programmer who is new to iOS/Swift: a "Rosetta Stone" mapping familiar concepts (interface→`protocol`, sum type→`enum` w/ associated values, ORM→SwiftData, UI-thread→`@MainActor`, Stream/Observable→`AsyncStream`, DI scope→`@Environment`) onto this codebase, then a guided tour of the provider-abstraction spine (`Transcriber`/`TranscriptionSession`/`TranscriptionEngine`/`TranscriberFactory` + the `TranscriptionOptions` sum type) and an end-to-end `tap → speak → saved` data-flow walkthrough traced through the real types (`RecorderViewModel` state machine, `LiveAudioEngine` double-duty tap, the three `AsyncStream`s, SwiftData `Note` save). Covers the Swift-6-strict-concurrency story (actors, `@MainActor`-by-default, the `nonisolated protocol` trap, `@unchecked Sendable` on `TapState`) and the iOS realities (permissions, `AVAudioSession`, background-audio `Info.plist`, simulator-can't-run-MLX, 7-day free-tier signing). **Four hand-authored inline-SVG architecture diagrams** orient the reader: a 5-layer system map, a concurrency isolation-domains map, a runtime swimlane for tap→speak→saved, and the recorder state-machine. Self-contained (no JS libs): a vanilla-JS Swift/bash highlighter + sidebar scroll-spy; teal→indigo app-icon palette. Reference/onboarding artifact, not a code change — no app behavior touched.
+
+## 2026-06-15
+
+- **Onboarding guide synced to T2 + L2 (`docs/relay-notes-ios-guide.html`).** Merged `main` into the doc branch and brought the guide current with the two features that landed since it was written: **Parakeet (third on-device transcription engine)** and **L2 on-device LLM cleanup (the `LanguageModel` spine)**. Substantive edits: framed the provider abstraction as **used twice** (transcription + cleanup) rather than "reserved for a future stage"; the tour now shows **three engines** (Apple Speech as the *permanent* default, Whisper + Parakeet opt-in) plus an optional "Clean up" pipeline step; §06 gained a **"The spine, proven twice: `LanguageModel`"** subsection (protocol + `MLXLanguageModel` actor + `Cleaner` + non-destructive `Note.cleanedTranscript`), and the factory snippet now shows the **single-live-MLX-engine eviction** + the `ModelStores` registry; §09 became a **three-engine** comparison and notes the cleanup LLM as a fourth MLX actor; §08 documents the additive/non-destructive cleanup fields; §10 fixed the now-stale signing note and added a card on the **`increased-memory-limit` entitlement** (accepted on the free tier); §11 updated to ~150 tests + the new `mlx-swift-lm` / `swift-huggingface` / `swift-transformers` deps. **Redrew the layered-map diagram** (6 columns, the two protocol spines in a teal/indigo seam, Parakeet/`MLXLanguageModel`/`ModelStores`/`Cleaner` added) and updated the isolation-domains actor box. Diagrams re-rendered + eyeballed (cairosvg). Docs-only; no app behavior touched.
diff --git a/docs/relay-notes-ios-guide.html b/docs/relay-notes-ios-guide.html
new file mode 100644
index 0000000..a3fa8c0
--- /dev/null
+++ b/docs/relay-notes-ios-guide.html
@@ -0,0 +1,1331 @@
+Relay Notes — An iOS & Swift Field Guide for Seasoned Programmers
+
+ + +
+ + +
+
+
A guide for people who already ship software
+

Relay Notes, decoded.

+

You know how to architect systems, reason about concurrency, and read a call graph. What you may not know is the iOS dialect: SwiftUI, actors, property wrappers, SwiftData, and the strict-concurrency rules Swift 6 enforces at compile time. This guide is a translation layer — it maps what you already know onto how a real, working iPhone app is built, using Relay Notes as the worked example.

+
+What: on-device voice → text → cleanup
+UI: SwiftUI
+Concurrency: Swift 6 strict
+Data: SwiftData
+ML: MLX · Whisper · Parakeet · LLM
+Target: iPhone 15 Pro Max
+
+ +

How to read this

+

The sections build on each other but each stands alone. If you want the aha fast, read §02 the Rosetta Stone (concept mapping), then jump straight to §07, which traces a single recording end-to-end through the actual types. The middle sections (§03–§05) are the language/framework primer; the later ones (§06–§11) are this app's architecture and the iOS-specific gotchas.

+
+
▹ The one idea that explains the whole codebase
+

Every external capability sits behind a protocol so the runtime provider is swappable without a rebuild. Transcription has three interchangeable engines (Apple Speech, on-device Whisper, on-device Parakeet) plugged into one socket — and the app now does the same trick a second time for an on-device LLM "Clean up" pass (a separate LanguageModel spine). Hold that thought — it's the pattern, and most of the architecture exists to serve it.

+
+
+ + +
+

01 The 60-second tour

+

Relay Notes is intentionally tiny in scope: tap a button, speak, and an on-device transcript is saved — then, optionally, a one-tap on-device LLM "Clean up" pass tidies it. No account, no server, works in airplane mode. That narrowness is a feature — it keeps the surface small enough that the architecture is legible. The pipeline is a straight line, with cleanup as an optional post-hoc step on a saved note:

+ +
+
🎙️ Capture: mic → PCM
+
💾 Persist audio: AAC/m4a
+
📝 Transcribe: on-device
+
🗂️ Store note: SwiftData
+
⋯›
+
Clean up: optional · on-device LLM
+
+ +

The "transcribe" box is where the interesting design lives. It's not one thing — it's a socket that accepts any of three on-device engines:

+
+
+

Apple Speech — permanent default

+

Built into iOS (SpeechAnalyzer + SpeechTranscriber). Streams a live transcript as you talk. No download. The "it just works" floor — and the settled default (iOS 27 strengthens it).

+
+
+

Whisper — opt-in upgrade

+

whisper-small.en (481 MB) via MLX. Downloaded on first use. Decodes once when you stop. Proves "on-device ≠ Apple-only."

+
+
+

Parakeet — opt-in upgrade

+

NVIDIA parakeet-tdt-0.6b-v2 (FastConformer + TDT, ~2.4 GB) via MLX. Also finalize-only. A second your-choice engine, hand-ported the same way.

+
+
+

And the "Clean up" feature is the same pattern again: a one-tap action that runs the saved transcript through an on-device LLM (Gemma 4 E2B) behind a separate LanguageModel protocol. Everything below explains how these pieces are wired so that swapping a provider is a runtime choice, not a code change — and how Swift's type system and concurrency model keep it safe.

+
+ + +
+

02 The Rosetta Stone

+

Almost every iOS concept has a name you already know under a different label. Here's the dictionary. Skim it now; the rest of the guide makes it concrete.

+ +
+
Interface / abstract base / trait (defines a contract, no implementation) → protocol, e.g. Transcriber, TranscriptionSession. Conformers can be classes, structs, or actors.
Discriminated union / sum type / sealed class ("one of these shapes, type-safely") → enum with associated values, e.g. TranscriptionOptions.apple(…) | .whisperMLX
async / await (same as JS, C#, Rust, Python) → Identical keywords. Task { … } ≈ spawning a coroutine (an unstructured task); async let and task groups give you structured child tasks.
UI thread / main thread affinity ("touch UI only from the main thread") → @MainActor, a compiler-enforced annotation, not a convention you have to remember.
Thread-safe / no data races (guaranteed at compile time) → Sendable + actor. Swift 6 rejects the build on a potential data race.
Observable / Channel / Stream / Subject (push values to a consumer over time) → AsyncStream<T>, consumed with for await x in stream.
React component + render() (UI is a pure function of state) → A struct conforming to View, with a body computed property.
useState / signals / observable state (state that re-renders the UI) → @State, @Observable. Mutating a tracked field re-runs the affected body.
React Context / DI scope / ambient injection ("reach a dependency without prop-drilling") → @Environment(\.modelContext) and friends.
ORM entity + repository/session (ActiveRecord, Hibernate, Prisma model) → @Model class Note (entity) + ModelContext (unit of work) + @Query (live result set).
Nullable<T> / Option<T> / T? ("might be absent") → T?, unwrapped with guard let / if let. No implicit null.
main() / app entry point (where the process starts) → @main struct …App: App with a body: some Scene.
package.json / pom.xml / Cargo.toml (project + dependency manifest) → The .xcodeproj (a project.pbxproj file) + Swift Package Manager (Package.resolved).
+
+
+
◆ The mental model that pays off most
+

Swift leans value types (struct/enum, copied) far harder than the OO languages you're used to. Reference types (class, actor) are the exception, reserved for identity and shared mutable state. When you see struct here, think "immutable-ish value, cheap to copy, no aliasing surprises." That single shift removes most "wait, why did that change?" confusion.

+
+
+ + +
+

03 Swift, the parts that bite

+

You can read Swift on sight — it's C-family with type inference. Four features show up constantly in this codebase and behave differently than their cousins elsewhere. Learn these four and the source stops surprising you.

+ +

1. struct vs class vs actor vs enum

+

Four ways to declare a type, chosen by semantics, not habit:

+ + + + + + +
Keyword | Semantics | Used here for
struct | Value — copied on assignment, no shared identity | Note options, RecordingOptions, every View
enum | Value + closed set of cases, optionally with payloads | TranscriptionOptions, the recorder's State
class | Reference — shared identity, mutable, ARC-managed | RecorderViewModel, Tunings, LiveAudioEngine
actor | Reference + serialized access (its own isolation domain) | WhisperMLXTranscriber (guards ~480 MB of model state)
+ +

2. Enums carry data — this is the tagged union you wanted

+

This is the single most important Swift idiom in the app. TranscriptionOptions isn't a struct with a pile of nullable fields — it's a closed set of shapes, each with its own type-safe payload:

+
enum TranscriptionOptions: Sendable {
+    case apple(AppleSpeechOptions)   // Apple-only: preset + contextual strings
+    case whisperMLX                  // Whisper: no decode dials in v1
+}
+
+struct AppleSpeechOptions: Sendable {
+    var preset: SpeechTranscriber.Preset = .transcription
+    var contextualStrings: [String] = []
+}
+

You destructure it with switch or pattern-matching guard. The compiler forces you to handle every case, and you literally cannot read Apple's preset out of a .whisperMLX value — it doesn't exist on that case. That's a whole class of "field is null for this provider" bugs deleted at the type level.
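
To make the destructuring concrete, here is a minimal, self-contained sketch. It assumes a stand-in `Preset` enum in place of the real `SpeechTranscriber.Preset` (which needs the Speech framework), and the two helper functions are hypothetical names used only for illustration.

```swift
// Hypothetical stand-in for SpeechTranscriber.Preset so the sketch builds anywhere.
enum Preset: Sendable { case transcription, dictation }

struct AppleSpeechOptions: Sendable {
    var preset: Preset = .transcription
    var contextualStrings: [String] = []
}

enum TranscriptionOptions: Sendable {
    case apple(AppleSpeechOptions)
    case whisperMLX
}

// Exhaustive switch: leaving out a case is a compile error.
func describe(_ options: TranscriptionOptions) -> String {
    switch options {
    case .apple(let apple):
        // The payload only exists inside this branch.
        return "apple preset=\(apple.preset) hints=\(apple.contextualStrings.count)"
    case .whisperMLX:
        return "whisperMLX (no options)"
    }
}

// Pattern-matching guard: extract one case's payload, bail out otherwise.
func contextualStrings(in options: TranscriptionOptions) -> [String] {
    guard case .apple(let apple) = options else { return [] }
    return apple.contextualStrings
}
```

There is no expression that reads `preset` from a `.whisperMLX` value; the only route to the payload goes through the `.apple` pattern.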

+ +

3. The recorder is a state machine, expressed as an enum

+

Instead of a tangle of isRecording / isPaused / errorMessage booleans that can contradict each other, the recorder's entire lifecycle is one value that's always exactly one valid state:

+
enum State: Equatable {
+    case idle
+    case recording(partial: String)   // live transcript so far
+    case paused(partial: String)      // interrupted by a call/Siri/alarm
+    case finalizing                   // stopped; transcribing
+    case finished(transcript: String)
+    case failed(message: String)
+}
+

Illegal states are unrepresentable — there is no "recording AND failed" because the value is one case at a time. Transitions are a pure switch, which makes them unit-testable without spinning up audio hardware (see RecorderViewModel.nextState(for:from:)). Drawn out, the whole lifecycle is small:

+ +
+[Figure: recorder state machine. States: idle, recording (partial: String), paused (partial: String), finalizing (transcribing…), finished (transcript), failed (message). Transitions: tap ● start; .began; .resumed; tap ◼ stop; .stopped; ok; error / no speech; reset() (finished / failed → idle).]
recording ↔ paused is driven entirely by AVAudioSession interruptions (a call/Siri/alarm); a non-resumable .stopped auto-finalizes so audio is never lost. A failure during start or transcription lands in failed with a user-friendly message.
+
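
The transitions in the diagram can be sketched as a pure function. This is an illustration only: the `Event` enum and the exact transition table are assumptions modelled on the arrows above, not the app's actual `RecorderViewModel.nextState(for:from:)`.

```swift
enum State: Equatable {
    case idle
    case recording(partial: String)
    case paused(partial: String)
    case finalizing
    case finished(transcript: String)
    case failed(message: String)
}

// Hypothetical event set, named after the arrows in the diagram.
enum Event {
    case startTapped, stopTapped
    case interruptionBegan, interruptionResumed, interruptionStopped
    case transcribed(String), errored(String), reset
}

// Pure function: no audio hardware, no async, trivially unit-testable.
func nextState(for event: Event, from state: State) -> State {
    switch (state, event) {
    case (.idle, .startTapped):
        return .recording(partial: "")
    case (.recording(let partial), .interruptionBegan):
        return .paused(partial: partial)
    case (.paused(let partial), .interruptionResumed):
        return .recording(partial: partial)
    case (.recording, .stopTapped), (.paused, .interruptionStopped):
        return .finalizing
    case (.finalizing, .transcribed(let text)):
        return .finished(transcript: text)
    case (.idle, .errored(let message)), (.finalizing, .errored(let message)):
        return .failed(message: message)
    case (.finished, .reset), (.failed, .reset):
        return .idle
    default:
        return state   // any other pairing is ignored in this sketch
    }
}
```

Because the function takes a value and returns a value, a test can walk an entire recording lifecycle with plain equality checks.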
+ +

4. Optionals and guard

+

Swift has no implicit null. "Might be absent" is encoded in the type as T?, whose empty value is nil, and you must unwrap before use. The idiomatic unwrap is guard let: an early return that also narrows the type for the rest of the scope:

+
guard let session, let url else {
+    state = .failed(message: "Recording could not be saved. Please try again.")
+    return
+}
+// past this line, `session` and `url` are non-optional
+
+
▹ Reading tip
+

some View / any Transcriber: some means "one specific concrete type the compiler knows but I won't name" (opaque return); any means "a boxed value of any conformer, decided at runtime" (existential). The app uses any Transcriber precisely because the concrete engine is a runtime choice.

+
+
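
A small sketch of the some/any distinction from the reading tip. The protocol and both conformers here are simplified stand-ins (hypothetical names), not the app's real types.

```swift
// Simplified stand-in protocol; the real Transcriber has async methods.
protocol Transcriber { var name: String { get } }

struct AppleStandIn: Transcriber { let name = "apple" }
struct WhisperStandIn: Transcriber { let name = "whisper" }

// `some`: one fixed concrete type, chosen here and hidden from callers.
func defaultTranscriber() -> some Transcriber {
    AppleStandIn()
}

// `any`: a boxed existential; the concrete type can differ per call at runtime.
func transcriber(named engine: String) -> any Transcriber {
    if engine == "whisper" {
        return WhisperStandIn()
    } else {
        return AppleStandIn()
    }
}
```

The second function could not be declared as returning `some Transcriber`, because its two branches return different concrete types.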
+ + +
+

04 SwiftUI: the UI is a function of state

+

If you've used React, SwiftUI will feel familiar: you describe the UI for a given state, and the framework diffs and re-renders. A view is a struct (cheap, disposable, recreated constantly) with a body that returns a description of the UI — never an imperative "now mutate this label."

+ +
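struct ContentView: View {
+    @Environment(\.modelContext) private var modelContext   // injected dependency
+    @State private var viewModel: RecorderViewModel?        // owned, re-renders on change
+    @State private var reTranscriber: ReTranscriber?        // built alongside the view model
+    @State private var searchText = ""                      // passed to NotesListView
+    @State private var showSettings = false
+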
+    var body: some View {
+        NavigationStack {
+            VStack(spacing: 0) {
+                NotesListView(searchText: searchText, reTranscriber: reTranscriber)
+                Divider()
+                if let viewModel {
+                    RecorderView(viewModel: viewModel)
+                }
+            }
+            .navigationTitle("Relay Notes")
+            .toolbar { /* settings button */ }
+            .sheet(isPresented: $showSettings) { /* settings sheet */ }
+        }
+    }
+}
+ +

Property wrappers = where state lives

+

Those @-prefixed declarations aren't decoration — each one tells SwiftUI a different thing about ownership and reactivity:

+ + + + + + + +
Wrapper | Closest analogue | Means
@State | useState | This view owns this value; mutating it re-renders.
@Observable | a signal / observable store | Macro on a class; reads in a body auto-subscribe. RecorderViewModel and Tunings use it.
@Environment | React Context | Pull an ambient dependency (the SwiftData context) without threading it through every initializer.
@Query | a live DB query / useQuery | A SwiftData fetch that re-runs and re-renders when matching rows change. The notes list is just @Query(sort: \.createdAt, order: .reverse).
$value | two-way binding | The $ prefix makes a Binding — a read/write handle a child view (e.g. a TextField) can mutate.
+ +

The composition root: where dependencies are wired

+

Notice viewModel starts as nil. The app has no DI container — instead there's a composition root, the spot where the real object graph is assembled once, lazily, the first time the view appears:

+
.task {                                  // runs each time the view appears; the nil check makes the wiring one-shot
+    if viewModel == nil {
+        let tunings = Tunings()
+        tunings.reconcileEngineAvailability(whisperReady: whisperStore.status == .ready)
+        let factory = TranscriberFactory(whisperModelStore: whisperStore)
+        viewModel = RecorderViewModel(
+            engine: LiveAudioEngine(),
+            transcriberFactory: factory,
+            modelContext: modelContext,
+            tunings: tunings
+        )
+        reTranscriber = ReTranscriber(factory: factory, whisperStore: whisperStore)
+    }
+}
+

This is hand-rolled constructor injection: the view model is handed its collaborators (LiveAudioEngine, the factory, the SwiftData context, the tunings) rather than reaching for globals. That's what makes the logic testable — tests construct a RecorderViewModel with fakes. .task is the lifecycle hook (≈ useEffect(() => …, [])), and it's async-aware so it can await without blocking the UI.
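
As a rough illustration of why constructor injection makes the logic testable, here is a stripped-down sketch. Every name in it (`AudioEngine`, `FakeEngine`, `RecorderModel`) is hypothetical and far simpler than the real `RecorderViewModel`; the point is only that a test hands in a fake instead of real audio hardware.

```swift
// Hypothetical seam: the model depends on a protocol, not a concrete engine.
protocol AudioEngine {
    func start() throws
}

// Test double that can be told to fail on demand.
struct FakeEngine: AudioEngine {
    struct Failure: Error {}
    var shouldFail = false
    func start() throws {
        if shouldFail { throw Failure() }
    }
}

final class RecorderModel {
    enum Status: Equatable { case idle, recording, failed }

    private(set) var status: Status = .idle
    private let engine: any AudioEngine

    // Collaborators arrive through the initializer; nothing is fetched from globals.
    init(engine: any AudioEngine) {
        self.engine = engine
    }

    func startTapped() {
        do {
            try engine.start()
            status = .recording
        } catch {
            status = .failed
        }
    }
}
```

A test can then cover the failure path deterministically by constructing `RecorderModel(engine: FakeEngine(shouldFail: true))`, with no microphone involved.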

+
+
◆ MVVM, lightly
+

Views stay dumb (layout + bindings). The RecorderViewModel holds the state machine and orchestration. This split is what lets the gnarly async audio logic be exercised by 80-odd unit tests while the SwiftUI layer stays a thin shell.

+
+
+ + +
+

05 Swift 6 concurrency (read this twice)

+

This is the section that trips up newcomers the hardest, because Swift 6 promotes data races from "runtime heisenbug" to compile error. The rules are strict, but once they click, the audio pipeline reads cleanly.

+ +

Actors = isolation domains

+

An actor is a reference type whose mutable state can only be touched by one task at a time: the compiler enforces the isolation, and the runtime serializes access for you. Calls from outside hop onto the actor and are therefore await-ed. The Whisper transcriber is an actor precisely because it caches ~480 MB of non-thread-safe MLX model state across calls:

+
actor WhisperMLXTranscriber: Transcriber {
+    private var cache: LoadedAssets?      // model + tokenizer + mel filters
+    // serialized by the actor → safe to mutate without a single lock
+}
+ +
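
The caching behaviour can be sketched with a toy actor. `ModelCache`, its fake weights, and `demoLoadCount()` are hypothetical; the sketch only shows that many concurrent callers trigger a single load because the actor method has no suspension point between the cache check and the cache write.

```swift
// Toy stand-in for an actor that caches expensive model state.
actor ModelCache {
    private var cache: [String: [Float]] = [:]
    private(set) var loadCount = 0

    func weights(for name: String) -> [Float] {
        if let hit = cache[name] { return hit }
        // No `await` between the check above and the store below,
        // so no other caller can interleave here.
        loadCount += 1
        let loaded = [Float](repeating: 0, count: 4)   // pretend this is the model
        cache[name] = loaded
        return loaded
    }
}

// Fires 100 concurrent requests at the actor and reports how many loads happened.
func demoLoadCount() async -> Int {
    let cache = ModelCache()
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<100 {
            group.addTask { _ = await cache.weights(for: "whisper") }
        }
    }
    return await cache.loadCount
}
```

One caveat worth keeping in mind: actors are reentrant at `await` points, so a load that itself awaits would need extra care (for example, caching an in-flight task) to keep the single-load guarantee.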

@MainActor = the UI isolation domain

+

There's a special global actor, @MainActor, that represents the main thread. Anything that touches UI state is annotated with it, and the compiler then guarantees those members run on the main thread. The view model and the audio engine both opt in:

+
@MainActor
+@Observable
+final class RecorderViewModel { /* state, tasks, orchestration */ }
+
+@MainActor
+final class LiveAudioEngine { /* setup/teardown on the main actor */ }
+

Notably, this project sets SWIFT_DEFAULT_ACTOR_ISOLATION = MainActor — so types are main-actor by default unless they opt out. That default is convenient for UI code but creates the single sharpest gotcha in the codebase 👇

+ +
+
⚠ The nonisolated protocol trap
+

With main-actor-by-default on, an unannotated protocol becomes implicitly @MainActor, and conformance inference then silently smears @MainActor onto your conformers — it once stamped @MainActor onto an actor's synchronous init. The fix is to mark isolation-neutral protocols explicitly:

+
// Both protocols are nonisolated *on purpose* so each conformer
+// picks its own isolation: AppleSpeechTranscriber is a plain class,
+// WhisperMLXTranscriber is an actor.
+nonisolated protocol Transcriber: Sendable {
+    func transcribe(_ audio: URL, options: TranscriptionOptions) async throws -> String
+    func makeStreamingSession(options: TranscriptionOptions) async throws -> any TranscriptionSession
+}
+
+ +

Sendable = "safe to cross an isolation boundary"

+

To pass a value between actors/tasks, the compiler must know it can't introduce a race. Value types (struct/enum of Sendable parts) are automatically Sendable. When you know something is safe but can't prove it to the compiler — like a helper that the audio thread touches single-threaded by construction — you assert it with @unchecked Sendable and take responsibility:

+
// Runs on the realtime audio thread; single-threaded access by construction.
+private final class TapState: @unchecked Sendable { … }
+ +
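
For contrast, here is a sketch of the other common reason to write `@unchecked Sendable`: a class made safe by a lock the compiler cannot see through. This is not how `TapState` works (the source says it relies on single-threaded access by construction); `SampleCounter` and `countConcurrently()` are hypothetical names.

```swift
import Foundation

// Safe to share because every access to `samples` takes the lock.
// The compiler cannot verify that, hence `@unchecked`.
final class SampleCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var samples = 0

    func add(_ count: Int) {
        lock.lock()
        defer { lock.unlock() }
        samples += count
    }

    var total: Int {
        lock.lock()
        defer { lock.unlock() }
        return samples
    }
}

// Hammers the counter from many threads at once.
func countConcurrently() -> Int {
    let counter = SampleCounter()
    DispatchQueue.concurrentPerform(iterations: 1_000) { _ in
        counter.add(1)
    }
    return counter.total
}
```

Either way the contract is the same: the annotation switches off the compiler's check, so the justification belongs in a comment next to it, as the original `TapState` declaration does.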

AsyncStream = the backbone of the pipeline

+

The whole capture→transcribe handoff is plumbed with async streams — a producer yields values, a consumer for awaits them. Three streams flow through a single recording:

+
struct LiveRecording: Sendable {
+    let url: URL                                       // where the audio file lands
+    let buffers: AsyncStream<AVAudioPCMBuffer>          // mic audio, chunk by chunk
+    let interruptions: AsyncStream<InterruptionEvent>  // call/Siri/alarm events
+}
+

And the transcript itself comes back as a stream of ever-growing strings. Each value is the whole transcript so far, which is how the live partial transcript updates the UI as you speak:

+
updatesTask = Task { [weak self, session] in
+    for await partial in session.updates {          // each value = transcript so far
+        guard let self else { return }
+        if case .recording = self.state {
+            self.state = .recording(partial: partial)   // re-renders the live card
+        }
+    }
+}
+ +
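
The producer/consumer shape can be tried in isolation with a minimal sketch. It uses the standard library's `AsyncStream.makeStream(of:)`; the function name and the word list are made up for the example, and the producer stands in for a transcription session emitting partials.

```swift
// Produces "transcript so far" strings on one task and consumes them on another.
func collectPartials() async -> [String] {
    let (stream, continuation) = AsyncStream.makeStream(of: String.self)

    // Producer: stands in for a session emitting growing partial transcripts.
    let producer = Task {
        var soFar = ""
        for word in ["hello", "from", "relay"] {
            soFar += soFar.isEmpty ? word : " " + word
            continuation.yield(soFar)
        }
        continuation.finish()   // ends the consumer's for-await loop
    }

    // Consumer: the same loop shape the recorder uses for session.updates.
    var seen: [String] = []
    for await partial in stream {
        seen.append(partial)
    }
    await producer.value
    return seen
}
```

The default buffering policy is unbounded, so values yielded before the consumer starts iterating are kept rather than dropped.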
+[Figure: isolation-domains map. @MainActor (UI & orchestration): RecorderViewModel (@Observable · state machine; feed / updates / interruption tasks), Tunings, TranscriberFactory (caches providers), LiveAudioEngine (setup / teardown on main actor; installs the input tap), SwiftUI Views (read state → re-render). Audio thread (realtime): input tap closure, TapState (@unchecked Sendable). actor (serialized): WhisperMLXTranscriber (cached weights · one at a time), plus Parakeet · cleanup LLM. nonisolated (own isolation): AppleSpeechTranscriber (final class), AppleSpeechSession (emits live partials), WhisperStreamingSession (accumulates PCM · Mutex<[Float]>). Arrows: installs tap; buffers ⟿; updates ⟿; feed() · finish(); await transcribePCM(). Legend: solid = await (cross-domain hop); dashed = AsyncStream (value pipe).]
Four isolation domains, and every arrow between them is a boundary the compiler checks. Values only cross as an await or through an AsyncStream — which is why the realtime audio thread, the GPU actor, and the UI never race.
+
+ +
+
▹ How to read the concurrency, fast
+

Mentally tag each type with its domain: @MainActor (UI + orchestration), actor (the MLX models: Whisper, Parakeet, the cleanup LLM), the audio thread (the tap closure + TapState), and nonisolated (the Apple Speech transcriber and the streaming sessions). Every arrow between domains is an await or an AsyncStream. The compiler already verified the crossings are race-free, so you can trust the boundaries and just follow the data.

+
+
+ + +
+

06 The architecture spine: provider abstraction

+

Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. It's used twice now — once for transcription (Transcriber, three engines) and once for on-device cleanup (LanguageModel, the "Clean up" feature). If you internalize this section, the file layout makes sense. Here's the whole system on one slide — five layers, each depending only on the one below; the two protocol spines sit in the teal/indigo band:

+ +
+[Figure: five-layer system map; each layer uses the one below. UI (SwiftUI): ContentView, NotesListView, RecorderView, NoteDetailView, SettingsView. Orchestration (@MainActor): RecorderViewModel (state machine + tasks), Tunings (@Observable), ReTranscriber (re-run another engine), Cleaner (drives "Clean up"). The spines (protocols + DI): Transcriber (protocol), TranscriptionSession, TranscriptionOptions, TranscriberFactory, ModelStores (registry), LanguageModel (cleanup spine). Providers & capability: AppleSpeechTranscriber (class), WhisperMLXTranscriber (actor), ParakeetMLXTranscriber (actor), MLXLanguageModel (actor), LiveAudioEngine (@MainActor), Note (@Model). Frameworks: Speech, AVFoundation, mlx-swift, mlx-swift-lm, SwiftData.]
Two protocol spines live in the band-3 seam: transcription (teal — Transcriber & friends) and cleanup (indigo — LanguageModel). Everything above talks only to those protocols; the provider row holds the swappable implementations (the three teal/indigo provider boxes are the MLX actors). Add a capability = add a box, not a refactor.
+
+ +

The contract has two methods, both on purpose

+
nonisolated protocol Transcriber: Sendable {
+    // File-based. UNUSED by the app today — kept for the future cloud-STT
+    // providers (which work on uploaded files) and a "re-transcribe" action.
+    func transcribe(_ audio: URL, options: TranscriptionOptions) async throws -> String
+
+    // Streaming. This is what the recorder actually uses: it returns a session
+    // that the audio engine feeds buffers into.
+    func makeStreamingSession(options: TranscriptionOptions) async throws -> any TranscriptionSession
+}
+
+
⚑ Don't "clean up" the unused method
+

The file-based transcribe(_:options:) looks like dead code — the app only calls the streaming path. It's deliberately retained for cloud STT (which operates on uploaded files) and a future re-transcribe action. This is the kind of intent that lives in comments and the planning docs, not in the call graph. Read before deleting.

+
+ +

A session is the live handle

+

The streaming method hands back a TranscriptionSession — the object the audio engine pushes buffers into and reads results out of. Crucially, the session is the authority on its own behavior, so the recorder doesn't branch on engine type:

+
nonisolated protocol TranscriptionSession: Sendable, AnyObject {
+    var audioFormat: AVAudioFormat? { get }     // the PCM format it wants
+    var updates: AsyncStream<String> { get }     // live partial transcripts
+    var emitsLivePartials: Bool { get }          // Apple: true · Whisper: false
+    var modelDescription: String { get }         // provenance, saved on the Note
+    func feed(_ buffer: AVAudioPCMBuffer)        // push mic audio in
+    func finish() async throws -> String         // stop, return final transcript
+    func cancel() async
+}
+

emitsLivePartials is a nice example of pushing a decision to the type that owns the knowledge: the recorder asks the session "do you stream?" rather than inferring it from an engine enum — so adding an engine doesn't mean editing the recorder.

+ +

Resolving the provider: the factory

+

A small factory maps the selected engine to a cached concrete instance — with a twist that matters on a memory-constrained phone: it keeps at most one MLX engine resident at a time. Whisper (~0.5 GB) and Parakeet (~1.2 GB) are never used simultaneously, so switching to a different MLX engine evicts the previous model's weights before loading the next. Apple Speech holds no such state, so it's cached independently:

+
@MainActor
final class TranscriberFactory {
    private var appleSpeech: AppleSpeechTranscriber?              // cheap, cached on its own
    private var liveMLX: (engine: TranscriptionEngine,           // at most ONE MLX engine live
                          transcriber: any Transcriber)?

    func transcriber(for engine: TranscriptionEngine) -> any Transcriber {
        switch engine {
        case .apple:       return appleSpeech ?? makeApple()
        case .whisperMLX:  return liveMLXTranscriber(for: engine) { WhisperMLXTranscriber(store: stores?.whisper) }
        case .parakeetMLX: return liveMLXTranscriber(for: engine) { ParakeetMLXTranscriber(store: stores?.parakeet) }
        }
    }

    private func liveMLXTranscriber(for engine: TranscriptionEngine, _ make: () -> any Transcriber) -> any Transcriber {
        if let liveMLX, liveMLX.engine == engine { return liveMLX.transcriber }
        liveMLX = nil               // drop the old model (the factory's only strong ref) → its ~GB of weights free
        let new = make(); liveMLX = (engine, new); return new
    }
}
+

Where the weights live on disk is the job of the ModelStores registry — one place that maps each engine to its DownloadableModelStore and answers "is this engine ready?" (Apple is always ready; an MLX engine only once its model is downloaded). That's what gates engine selection in Settings and reconciles a stale saved choice at launch.
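A minimal sketch of the readiness check and the launch-time reconciliation might look like the following. `Engine`, `ModelRegistry`, `downloaded`, and `reconcile(saved:)` are illustrative names, and falling back to Apple Speech for a stale choice is an assumption based on Apple being "always ready"; the real `ModelStores` API may differ.

```swift
// Hypothetical mirror of the engine choices described above.
enum Engine: String, CaseIterable {
    case apple, whisperMLX, parakeetMLX
}

struct ModelRegistry {
    /// Engines whose model files are present on disk (assumed representation).
    var downloaded: Set<Engine> = []

    /// Apple is always ready; an MLX engine only once its model is downloaded.
    func isReady(_ engine: Engine) -> Bool {
        engine == .apple || downloaded.contains(engine)
    }

    /// Assumed policy: a saved choice whose model is gone falls back to Apple.
    func reconcile(saved: Engine) -> Engine {
        isReady(saved) ? saved : .apple
    }
}
```

Settings could grey out any engine for which `isReady` is false, and the app could run `reconcile(saved:)` once at launch.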

+ +

Put together, the socket looks like this — three live providers (plus a cloud slot reserved), one interface, the choice deferred to runtime:

+
+
🍎 AppleSpeechTranscriber · default · shipping
🧠 WhisperMLXTranscriber · on-device · shipping
🦜 ParakeetMLXTranscriber · on-device · shipping
☁️ CloudTranscriber · opt-in · not built
🔌 any Transcriber · the socket
+
+ +

The spine, proven twice: LanguageModel

+

The strongest evidence that the abstraction earns its keep: when the team wanted on-device transcript cleanup (de-filler, punctuation, light structure via a local LLM), they didn't bolt it onto the transcriber — they stamped out a second, identical spine. Same isolation rules, same actor-for-GPU-state pattern, same swap-the-provider-at-runtime promise:

+
nonisolated protocol LanguageModel: Sendable {
    func clean(_ raw: String) async throws -> String
    // L3 adds `categorize(_:into:)` additively; designed in now, not a reshape.
}

// The MLX conformer is an actor, exactly like the transcribers: it caches a
// non-Sendable, GPU-bound ModelContainer (Gemma 4 E2B, 4-bit) across calls.
actor MLXLanguageModel: LanguageModel {
    func clean(_ raw: String) async throws -> String {
        let session = ChatSession(container, instructions: CleanupPrompt.system,
                                  generateParameters: .init(temperature: 0))  // greedy: must not invent
        return try await session.respond(to: raw)
    }
    func evict() { container = nil; MLX.GPU.clearCache() }   // same "one live model" discipline
}
+

The orchestrator mirrors ReTranscriber too: Cleaner (a @MainActor @Observable) gates the "Clean up" button on the model being downloaded, runs clean() off a saved note, and hands back a non-destructive candidate the user accepts or declines. Accepting writes a separate field — the raw transcript is never overwritten (more in §08). The prompt lives in one place (CleanupPrompt) so swapping the model never changes behavior.
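The accept/decline flow described above can be sketched as a small value-type state machine. This is not the real `Cleaner`; `NoteText`, `CleanupState`, and `CleanupFlow` are hypothetical names, and the only grounded detail is that accepting writes a separate field while the raw transcript stays untouched.

```swift
// Hypothetical slice of a note: the raw text plus the optional cleaned version.
struct NoteText {
    let transcript: String           // raw text, never overwritten
    var cleanedTranscript: String?   // nil = never cleaned
}

enum CleanupState: Equatable {
    case idle
    case candidate(String)   // model output awaiting the user's decision
}

struct CleanupFlow {
    var note: NoteText
    var state: CleanupState = .idle

    /// Store the model's output as a candidate; nothing is persisted yet.
    mutating func propose(_ cleaned: String) {
        state = .candidate(cleaned)
    }

    /// Accepting writes the separate field and leaves the raw transcript alone.
    mutating func accept() {
        guard case .candidate(let text) = state else { return }
        note.cleanedTranscript = text
        state = .idle
    }

    /// Declining simply discards the candidate.
    mutating func decline() {
        state = .idle
    }
}
```

Because `transcript` is a `let`, the compiler itself rules out overwriting the raw text in this sketch.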

+
+
◆ Why this is worth the ceremony — now demonstrably
+

The point was always to validate "on-device, your-choice model" without betting the app on one vendor. Because each seam is a protocol, a new engine is a new file, not a refactor — and that claim is no longer theoretical: the cleanup feature dropped an entire second capability into the same shape, reusing the same actor-isolation and single-live-model patterns. The abstraction pays for itself with three live transcription engines and an on-device LLM.

+
+
+ + +
+

07 Tap → speak → saved: the whole path

+

This is the payoff. Here's one recording, traced end-to-end through the real types. Every concept above shows up doing a job.

+ +
[Sequence diagram. Participants: You; RecorderView / VM (@MainActor); LiveAudioEngine (audio thread); TranscriptionSession; SwiftData @Query. Messages, top to bottom: tap ● record; makeStreamingSession(options); engine.start(analyzerFormat); ↩ LiveRecording { buffers, interruptions }; loop while recording: buffers ⟿ (each chunk), session.feed(buffer), updates ⟿ partial transcript, state = .recording(partial); tap ◼ stop; engine.stop(); session.finish() → transcript; insert + save Note; @Query re-renders the list automatically.]
Time flows downward. Solid = a direct call; dashed teal = an AsyncStream delivering values over time. The loop is the live phase; the bottom four messages are stop → decode → persist → the list updating itself. The six numbered steps below narrate this same picture.
+
+ +
+
+

The tap: build the session, start the engine

+

Recording/RecorderViewModel.swift → startRecording()

+

The view model reads a snapshot of the user's tunings (engine, preset, bitrate) at this instant — mid-recording setting changes intentionally don't take effect until next time. It asks the factory for the right Transcriber, makes a streaming session, then starts the audio engine, handing it the session's preferred audioFormat so capture and recognition agree on a PCM format.

+
let transcriber = transcriberFactory.transcriber(for: tunings.engine)
let session = try await transcriber.makeStreamingSession(options: tunings.transcriptionOptions)
let recording = try await engine.start(options: tunings.recordingOptions,
                                       analyzerFormat: session.audioFormat)
+
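The "snapshot at this instant" behavior falls out of Swift's value semantics. Here is a minimal sketch under assumed names (`Tunings`, `SettingsStore`, `RecordingRun`, and the field values are all hypothetical): copying a struct at start freezes it for the rest of the recording.

```swift
// Hypothetical tunings value; the real type carries engine, preset, bitrate, etc.
struct Tunings {
    var engine = "apple"
    var bitrate = 64_000   // illustrative value only
}

// Stand-in for wherever the live, user-editable settings are held.
final class SettingsStore {
    var tunings = Tunings()
}

// One recording holds its own copy of the tunings.
struct RecordingRun {
    let tunings: Tunings   // value copy, frozen when the run is created
}

let settings = SettingsStore()
let run = RecordingRun(tunings: settings.tunings)   // snapshot taken here
settings.tunings.engine = "whisperMLX"              // user changes a setting mid-recording
```

After the last line, `run.tunings.engine` is still `"apple"`; the change only affects the next recording.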
+ +
+

Capture does double duty on every buffer

+

Audio/LiveAudioEngine.swift → installTap + TapState.handle

+

An input tap on AVAudioEngine fires on the realtime audio thread for each chunk of mic audio. For every buffer it (a) writes AAC/m4a to disk for later playback, and (b) converts the PCM to the analyzer's format and yields it into an AsyncStream. That's why the saved audio and the transcript come from one capture, not two.

+
func handle(buffer: AVAudioPCMBuffer) {
    try? audioFile.write(from: buffer)              // (a) persist for playback
    // (b) convert to the analyzer's format, then:
    continuation.yield(outBuffer)                   // → into LiveRecording.buffers
}
+
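To make the double-duty shape concrete without AVFoundation, here is a stripped-down sketch. `Chunk` and `TapSketch` are hypothetical; an array stands in for the audio file, and format conversion is omitted. What it does show accurately is that `AsyncStream.Continuation.yield` is a synchronous call that returns a result immediately instead of suspending, which is what makes it usable from a callback.

```swift
// Stand-in for AVAudioPCMBuffer: a chunk of samples (hypothetical).
typealias Chunk = [Float]

final class TapSketch {
    private(set) var written: [Chunk] = []   // stand-in for the audio file on disk
    let buffers: AsyncStream<Chunk>          // what a consumer would iterate with `for await`
    private let continuation: AsyncStream<Chunk>.Continuation

    init() {
        let (stream, continuation) = AsyncStream.makeStream(of: Chunk.self)
        self.buffers = stream
        self.continuation = continuation
    }

    /// Called once per chunk. Returns true if the chunk was enqueued for the consumer.
    @discardableResult
    func handle(_ chunk: Chunk) -> Bool {
        written.append(chunk)                              // (a) persist
        if case .enqueued = continuation.yield(chunk) {    // (b) hand to the stream
            return true
        }
        return false                                       // stream already finished
    }

    /// Ends the stream; later yields are rejected.
    func stop() {
        continuation.finish()
    }
}
```

A consumer would read `buffers` in a `for await` loop, much like the feed task in the next step.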
+
⚠ Audio-thread reality
+

The tap closure runs on a realtime thread that must never block. It holds a @unchecked Sendable TapState (single-threaded by construction) — this is exactly the case where you assert thread-safety to the compiler because the runtime contract guarantees it but the type system can't see it.

+
+
+ +
+

Three concurrent tasks consume the streams

+

RecorderViewModel.startRecording()

+

The view model spins up three unstructured tasks, one per stream, and keeps a handle to each so it can cancel them later: feed (push buffers into the session + compute a mic level), updates (live partial transcript → state), and interruptions (call/Siri/alarm → pause/resume/finalize). The state flips to .recording(partial: "") and the UI comes alive.

+
feedTask = Task { [session] in
    for await buffer in recording.buffers { session.feed(buffer); … }
}
updatesTask = Task { for await partial in session.updates { … } }      // live text
interruptionTask = Task { for await e in recording.interruptions { … } } // pause/resume
state = .recording(partial: "")
+
+ +
+

Live transcript streams to the screen (Apple) — or a meter does (Whisper)

+

AppleSpeechTranscriber.swift / WhisperStreamingSession.swift

+

With Apple Speech, each recognizer result (volatile or final) is folded into a growing string and yielded on updates — you watch the words appear. With Whisper there are no partials by design (emitsLivePartials = false); it just accumulates PCM in memory, so the UI shows a live audio-level meter + elapsed timer placeholder instead of a blank card.
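The "folded into a growing string" step can be sketched as below. `RecognizerResult` and `TranscriptFolder` are hypothetical, and the folding rule (a final result is appended permanently, a volatile result replaces the previous volatile guess) is an assumption about how such folding typically works, not a transcription of the real `AppleSpeechTranscriber`.

```swift
// Hypothetical shape of one recognizer result.
struct RecognizerResult {
    let text: String
    let isFinal: Bool
}

struct TranscriptFolder {
    var finalized = ""   // text the recognizer has committed to
    var volatile = ""    // the current best guess for the in-progress phrase

    /// Folds one result in and returns the string to publish on `updates`.
    mutating func fold(_ result: RecognizerResult) -> String {
        if result.isFinal {
            finalized += result.text
            volatile = ""
        } else {
            volatile = result.text   // a newer guess replaces the older one
        }
        return finalized + volatile
    }
}
```

Each returned string is a complete snapshot of the transcript so far, so the UI can simply display the latest value.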

+
+ +
+

Stop: finalize and get the transcript

+

RecorderViewModel.stopAndTranscribe()

+

State → .finalizing. The engine stops (closing the audio file), the feed/interruption tasks cancel, then session.finish() returns the final transcript. For Apple that drains the last results; for Whisper that's where the entire decode happens — a 5-minute note sits ~80 s on the spinner. Same interface, very different cost profile, hidden behind finish().

+
state = .finalizing
let url = await engine.stop()
let transcript = try await session.finish()        // Apple: drain · Whisper: decode now
+
+ +
+

Persist the note

+

Models/Note.swift + ModelContext

+

A Note is created — storing the audio filename (not a URL), the transcript, and the engine's modelDescription as provenance — inserted into the SwiftData context and saved. Because the notes list is a live @Query, the new row appears in the UI automatically. State → .finished. Done.

+
let note = Note(audioFilename: url.lastPathComponent,
                transcript: transcript,
                transcriptionModel: session.modelDescription)
modelContext.insert(note)
try modelContext.save()
state = .finished(transcript: transcript)
+
+
+
+
▹ Step back and notice
+

The recorder never names "Apple" or "Whisper." It talks to a Transcriber and a TranscriptionSession, asks them what they can do (emitsLivePartials, audioFormat, modelDescription), and lets the streams carry the data. That's the spine doing its job — orchestration with zero engine-specific branches.

+
+
+ + +
+

08 Persistence: SwiftData in one page

+

SwiftData is Apple's modern ORM (a type-safe wrapper over Core Data). If you've used Prisma, Room, or ActiveRecord, you already know the shape — three pieces:

+
+

@Model = entity

A macro on a class that makes its stored properties persistent columns. Note is the only model.

+

ModelContext = session

The unit of work: insert, delete, save. Injected via @Environment.

+

@Query = live fetch

A reactive result set. The list view re-renders when rows change — no manual refresh.

+
+ +
@Model
final class Note {
    var id: UUID
    var createdAt: Date
    var audioFilename: String      // ← filename, NOT a URL (see below)
    var transcript: String         // the canonical RAW text, never overwritten
    var title: String?
    var transcriptionModel: String?    // provenance: "Apple Speech" / "Whisper (small.en)"
    var originalTranscript: String?    // pre-edit baseline, enables revert
    var cleanedTranscript: String?     // LLM-cleaned version (non-destructive); nil = never cleaned
    var cleanupModel: String?          // provenance: "Gemma 4 E2B (MLX 4-bit)"
}
+ +

The container is wired up once, at the app's entry point — that's all it takes to get a working store:

+
@main
struct Relay_NotesApp: App {
    var body: some Scene {
        WindowGroup { ContentView() }
            .modelContainer(for: Note.self)     // creates/opens the SQLite store
    }
}
+ +
+
⚠ Store the filename, never the URL
+

The app's container path can change between launches, so a persisted absolute URL goes stale. The note stores audioFilename and resolves it against the documents directory at access time:

+
var audioURL: URL { URL.documentsDirectory.appending(path: audioFilename) }
+
+
+
⚑ One canonical delete
+

A note has two artifacts: the SwiftData row and an audio file on disk. Deleting the row alone orphans the file. So there's exactly one approved delete, used everywhere:

+
func deleteWithAudio(in context: ModelContext) {
    try? FileManager.default.removeItem(at: audioURL)   // file
    context.delete(self); try? context.save()           // row
}
+
+
+
◆ Additive & non-destructive by design
+

Notice the optional-by-default fields. Both edit (originalTranscript) and cleanup (cleanedTranscript/cleanupModel) were added after notes already existed — as nil-defaulting optionals, which SwiftData migrates for free (no migration plan, no backfill). And both are non-destructive: the raw transcript is always the source of truth, so an LLM cleanup can be accepted, toggled, or removed without ever losing what was actually said. That's a deliberate stance — an LLM "improving" a note by inventing detail is worse than a messy-but-true one.

+
+
+ + +
+

09 Three engines, one socket

+

The same TranscriptionSession interface backs three genuinely different implementations. Comparing them is the clearest illustration of why the abstraction earns its keep:

+ + + + + + + + + +
| | Apple Speech | Whisper (MLX) | Parakeet (MLX) |
|---|---|---|---|
| Role | Permanent default | Opt-in upgrade | Opt-in upgrade |
| Framework | SpeechAnalyzer | mlx-swift, hand-ported | mlx-swift, hand-ported |
| Model | Apple's, no choice | whisper-small.en · 481 MB | parakeet-tdt-0.6b-v2 · ~2.4 GB |
| Install | Nothing to download | Download on first use | Download on first use |
| Live partials | Yes: stream as you talk | No: decode at finish() | No: decode at finish() |
| Concurrency type | nonisolated class | actor | actor |
| While recording, UI shows… | live transcript card | meter + timer placeholder | meter + timer placeholder |
+

And there's a third MLX actor that isn't a transcriber at all: MLXLanguageModel (Gemma 4 E2B, the cleanup model from §06). The factory's "one live MLX model at a time" rule spans all of them: Whisper, Parakeet, and the cleanup LLM are never co-resident on the 8 GB device.

+ +

Why the MLX engines are actors and Apple is a class

+

This is the concurrency model paying rent. Each MLX engine caches hundreds of MB to GBs of non-Sendable model state and is GPU-bound — serializing access through an actor is both correct (no races on the cache) and desirable (decodes shouldn't overlap on one GPU). Apple's transcriber holds no such shared state, so a plain nonisolated class is enough. Same protocol, different isolation, chosen per-conformer — exactly the freedom the nonisolated protocol from §05 preserves, and the reason adding Parakeet (and later the cleanup LLM) didn't perturb anything above the provider layer.

+ +
+
◆ The product stance behind the engineering
+

Two axes are kept independent on purpose: where it runs (on-device vs cloud) and whose model (Apple's vs your choice). Apple Speech is on-device-Apple and the settled default — it works the instant the app installs, with no download. Whisper and Parakeet are on-device-yours, opt-in upgrades. The current bet is notable: rather than chase a more accurate transcriber, mitigate transcription errors with the on-device cleanup LLM — which is exactly why the second LanguageModel spine exists. Everything stays local-first, cloud opt-in only: the app is 100% functional in airplane mode. The protocols are what keep every corner reachable without forking the app.

+
+
+ + +
+

10 iOS realities you won't hit elsewhere

+

Platform friction that has nothing to do with the architecture but everything to do with shipping an iPhone app as a newcomer:

+ +
+
+

Permissions are async dialogs

+

Mic and speech access each trigger a one-time system prompt you must await, and the human-readable reason strings live in Info.plist. The recorder handles denial as a first-class failure state with an actionable message.

+
+
+

The audio session is global, shared state

+

AVAudioSession is a singleton you configure (.playAndRecord) and that the OS can yank away mid-recording. Calls, Siri, and alarms arrive as interruption notifications — handled here as a .began → .paused, .resumed → .recording, .stopped → auto-finalize flow.
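That interruption flow is small enough to write as a pure function. The sketch below is an assumption-laden simplification: `RecorderState` here has no associated values, `Interruption` and `reduce` are hypothetical names, and ignoring events that don't apply to the current state is a choice made for the example.

```swift
// Simplified recorder states (the real enum carries associated values).
enum RecorderState: Equatable {
    case recording
    case paused
    case finalizing
}

// The three interruption events named in the text.
enum Interruption {
    case began, resumed, stopped
}

/// .began → .paused, .resumed → .recording, .stopped → auto-finalize.
func reduce(_ state: RecorderState, _ event: Interruption) -> RecorderState {
    switch (state, event) {
    case (.recording, .began):
        return .paused
    case (.paused, .resumed):
        return .recording
    case (.recording, .stopped), (.paused, .stopped):
        return .finalizing
    default:
        return state   // assumed: events that don't apply are ignored
    }
}
```

Keeping the transition table in one function makes it easy to unit-test without touching AVAudioSession.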

+
+
+

Background recording needs a declared background mode

+

Locked-screen capture requires UIBackgroundModes = [audio], which Xcode can't auto-generate — hence a hand-maintained partial Info.plist. Don't delete it (it merges on top of the generated one).

+
+
+

The simulator can't run MLX

+

The simulator's GPU lacks the family MLX needs, so any MLX test would crash the whole suite. Those tests are compiled out of simulator builds via #if !targetEnvironment(simulator); on-device math is validated by a DEBUG "smoke" button instead.

+
+
+

Free-tier signing expires every 7 days

+

Sideloading on a free Apple ID means the app stops launching after a week — re-build from Xcode to re-sign. Data (SwiftData rows + audio files) survives because the bundle ID is stable. The paid Developer Program is still deferred — sideload covers the personal-device use case.

+
+
+

The 4-bit LLM needs the memory entitlement

+

The cleanup model's generation peak (measured at 3.02 GB on a tiny ~50-word output) already sits at iOS's ~3 GB no-entitlement jetsam ceiling, and a real multi-minute note's larger KV cache will likely cross it, so the app ships com.apple.developer.kernel.increased-memory-limit (in Relay Notes.entitlements). The pleasant surprise: the free Apple ID tier accepts it, so no paid program is needed. The ASR engines (Whisper/Parakeet) stay under the ceiling and don't require it.

+
+
+

The project file is generated; edit it with care

+

project.pbxproj is a fussy machine file. The project uses Xcode's file-system-synchronized groups; adding files/targets is done via the xcodeproj Ruby gem, not by hand, and validated on a /tmp copy first.

+
+
+
+ + +
+

11 Build & tooling, the short version

+

Day-to-day is a terminal loop (Xcode stays open in the background for signing, previews, and Instruments). The two commands you'll run most:

+
# Run the full test suite in the simulator (the command you'll use most)
xcodebuild test -project "Relay Notes.xcodeproj" -scheme "Relay Notes" \
  -destination 'platform=iOS Simulator,name=iPhone 17 Pro' 2>&1 | xcbeautify

# Build only: a faster sanity check
xcodebuild build -project "Relay Notes.xcodeproj" -scheme "Relay Notes" \
  -destination 'platform=iOS Simulator,name=iPhone 17 Pro' 2>&1 | xcbeautify
+
  • The quotes matter: both the project name and the scheme contain a space.
  • Pipe through xcbeautify; raw xcodebuild output buries the actual errors.
  • Tests use the Swift Testing framework (import Testing, @Test, #expect), which is closer to modern JS/Rust test ergonomics than XCTest. ~150 tests; the MLX numerics (Whisper/Parakeet/LLM) are device-only, validated by DEBUG "smoke" buttons.
  • New test files must be registered in the project: ruby scripts/add_test_file.rb MyNewTests.swift.
  • Dependencies resolve via Swift Package Manager on first build: mlx-swift (the ASR engines), plus mlx-swift-lm + swift-huggingface + swift-transformers (the cleanup LLM: model loading, HF download, tokenizers/chat templates). Nothing to install by hand.
+
+ + +
+

12 Where to read next

+

A reading order into the actual repo, now that the map is in your head:

+
+
+
The spine, in code · start here
+
+
Transcription/Transcriber.swift — the protocols + the TranscriptionOptions sum type, with the nonisolated rationale in a long comment.
+
+
+
The spine, again · the pattern, reused
+
+
Enrichment/LanguageModel.swift + MLXLanguageModel.swift + Cleaner.swift — the cleanup feature, built as a carbon copy of the transcription spine.
+
+
+
The orchestrator · the state machine in motion
+
+
Recording/RecorderViewModel.swift — follow startRecording() then stopAndTranscribe().
+
+
+
The trickiest file · realtime audio + Sendable boundaries
+
+
Audio/LiveAudioEngine.swift — the double-duty tap and TapState.
+
+
+
The why behind everything · prose, not code
+
+
CLAUDE.md (architecture + conventions), planning/notes.md (roadmap + stance), CHANGE_LOG.md (what shipped & why).
+
+
+ +
+
▹ The takeaway, in one sentence
+

Relay Notes is a small app that uses Swift's type system (sum-type enums, protocols, optionals) to make illegal states unrepresentable, and Swift's concurrency model (actors, @MainActor, AsyncStream) to make a realtime audio→text→storage pipeline race-free at compile time — all so that providers can be swapped at runtime without touching the code that orchestrates them. The proof it works: the same protocol-spine pattern now backs three transcription engines and a second, independent on-device LLM cleanup capability.

+
+ +
+

Generated as an onboarding companion to the Relay Notes codebase · iOS 26 · SwiftUI · Swift 6 strict concurrency · SwiftData · MLX (mlx-swift + mlx-swift-lm). Code excerpts are lightly trimmed from the real source; read the files for the full, commented versions.

+
+
+ +
+
+ + + +