From d98ff9f626bce129bdc248c48276be30b5ec65f5 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 14 Jun 2026 04:19:36 +0000 Subject: [PATCH 1/3] docs: add iOS/Swift onboarding guide for seasoned programmers Self-contained HTML guide (docs/relay-notes-ios-guide.html) that maps familiar programming concepts onto this codebase's iOS/Swift idioms, then walks the provider-abstraction spine and the tap-to-saved data flow through the real types. Covers Swift 6 strict concurrency and the iOS-specific gotchas. No app behavior changed. https://claude.ai/code/session_01W47XYXAX6wnMNT3KGJbzQh --- CHANGE_LOG.md | 4 + docs/relay-notes-ios-guide.html | 935 ++++++++++++++++++++++++++++++++ 2 files changed, 939 insertions(+) create mode 100644 docs/relay-notes-ios-guide.html diff --git a/CHANGE_LOG.md b/CHANGE_LOG.md index 78686bb..78881b2 100644 --- a/CHANGE_LOG.md +++ b/CHANGE_LOG.md @@ -120,3 +120,7 @@ Entry style: bold lead-in summarizing what shipped, then the *why* / non-obvious - **T2.1b shipped (code-complete; device mel-smoke pending) — Parakeet mel front-end `ParakeetAudio.swift`.** Ports `senstella/parakeet-mlx`'s `get_logmel` to mlx-swift: preemph → periodic Hann (win 400, zero-padded to n_fft 512) → STFT → L2 magnitude → power² → Slaney 128-mel → `log(x+1e-5)` → per-feature (per-mel-bin) z-score → `[1, t, 128]`. The genuinely generic, already-device-validated STFT primitives (`WhisperAudio.stft`/`hanning`) are reused; only the window padding, mel, and normalization are Parakeet-specific. **Resolved the plan's three §5.2 risks toward NeMo's training featurizer:** (1) **preemph 0.97** — `dacite.from_dict` (senstella's loader) applies the `PreprocessArgs.preemph=0.97` dataclass default for the *absent* key, confirming the model was trained with it; flipped `ParakeetConfig` to decode absent→0.97 while still honoring a present value (incl. explicit `null`→`nil`, distinguished via `container.contains`), and corrected the T2.1a "preemph off" comments. 
(2) **periodic Hann** via `WhisperAudio.hanning` (denominator N), not FluidInference's symmetric `/(N−1)`. (3) **L2 magnitude** (`MLX.abs` on the complex rfft), matching NeMo's `torch.stft`, not the Python's `|re|+|im|` L1. **Found + fixed a fourth latent risk the plan didn't enumerate:** the FluidInference filterbank uses the **HTK** mel *scale* (`2595·log10(1+f/700)`), but the oracle calls `librosa.filters.mel(htk=False, norm="slaney")` = the **Slaney** mel scale; `ParakeetAudio.melFilterbank` reimplements the Slaney scale + Slaney area-norm in host `Double` to match librosa exactly (also dodging the reference's per-element `.item()` GPU syncs). Also used the `nFFT`-based STFT frame count (= NeMo's `torch.stft(center=True)`), not the Python's `win_length`-based count (a latent quirk that can read past the buffer on some lengths). **Smoke:** new weight-free `ParakeetSmoke.runFeaturizer()` (T2.1b) computes the mel on `ls_test.flac` and logs shape + range; runs *before* the heavy T2.1a cast so the result lands fast. Side-fix: marked `ParakeetTDTConfig.load(from:)` `nonisolated` (an extension method doesn't inherit the struct's `nonisolated`, so the default-MainActor isolation made it uncallable from the nonisolated smoke — a Swift 6 hard-error-in-waiting). Build + 5 `ParakeetConfigTests` green (added an explicit-`null`-preemph test). **Device-validated 2026-06-13 (iPhone 15 Pro Max):** `runFeaturizer` on `ls_test.flac` (106640 samples / 6.665 s) → mel **`[1, 667, 128]`** (= NeMo's `1 + len//hop` frame count, confirming the `nFFT`-based STFT), **per-feature mean ≈ 0** (8.3e-08 — the z-score working), range [−1.77, 6.79]; `preemph 0.97` logged live. The smoke validates shape + normalization; **full numerical correctness of the four fidelity choices (preemph / Hann / L2 / Slaney scale) still rides on the T2.1d substring gate** — they're the suspects if it fails. 
- **T2.1c shipped (code-complete; device forward + peak-footprint pending) — FastConformer encoder.** `ParakeetEncoder.swift` (`ParakeetConformerEncoder` + `ParakeetFeedForward` / `ParakeetConvModule` / `ParakeetConformerBlock` / `ParakeetSubsampling`) and `ParakeetAttention.swift` (`ParakeetRelPosAttention` + `parakeetRelPositionalEncoding`) port `senstella/parakeet-mlx`'s `conformer.py`/`attention.py` to mlx-swift: dw-striding subsampling (factor 8, 128→16 freq, `C·16→1024` projection) → 24 macaron conformer blocks (½-residual SiLU FF → rel-pos MHA → pointwise/GLU/depthwise-k9/BatchNorm/SiLU conv → ½-residual FF → LayerNorm). **Two load-bearing port decisions, both verified against the references + mlx-swift source:** (1) **no weight remapper and no conv transpose** — the `@ModuleInfo`/`@ParameterInfo` keys are the snake_case safetensors keys verbatim (the Whisper-port convention), and the mlx-community safetensors already store conv weights in MLX channel-last layout (the reference loads with `let transformedWeights = weights`, i.e. no transpose); so `loadArrays` → strip `encoder.` → `unflattened` → `update` is the whole mapping. (2) **rel-pos attention via the SDPA-additive-mask trick** — `matrix_bd = relShift((q+bias_v)·pᵀ)·scale` passed as the additive `mask` to `MLXFast.scaledDotProductAttention(q+bias_u, k, v)`, which is the Transformer-XL `AC+BD` sum (only the `rel_pos` variant is ported; the `rel_pos_local_attn` Metal-kernel path is skipped — this checkpoint is `rel_pos`). The positional encoding is a free function (no learned params → stays out of the module tree) computing exactly the centered `2·T−1` window the Python slices from its 5000-row buffer. **Caught a fourth latent issue beyond §5.2's three:** the Python's `win_length`-based STFT frame count can read past the buffer on some lengths; used the `nFFT`-based count (= NeMo's `torch.stft(center=True)`), already in T2.1b. 
**Loaded by the incremental bf16 cast-release path (§3.1):** `ParakeetConformerEncoder.load` casts each F32 tensor and drops its source before the next with `Memory.cacheLimit = 0`, keeping only `encoder.*` (decoder/joint F32 released); MLX laziness means the modules' random init weights are never materialized before `update` replaces them. **Smoke:** new `ParakeetSmoke.runEncoder()` (T2.1c) loads the encoder, featurizes `ls_test.flac` → bf16, runs the forward, and logs output shape (expect `[1, ~84, 1024]`) + timing + the **peak `phys_footprint`** sampled off-actor during the forward — the figure that decides the `increased-memory-limit` entitlement (§3.1); `run()` now does featurizer→encoder (T2.1a `runLoadFootprint` retained, no longer auto-run, to avoid a redundant second cast). Side: used the non-deprecated `Memory.cacheLimit` setter in new code. Build + full simulator suite green (encoder is MLX-touching ⇒ device-only; no new sim tests). **Device-validated 2026-06-13 (iPhone 15 Pro Max):** encoder output **`[1, 84, 1024]`** (667 mel frames ÷8; exact), output a clean LayerNorm-shaped distribution (range [−0.58, 0.49], mean ≈ 0 — strong evidence the 42 key templates all loaded; the final proof is the T2.1d substring gate), forward **130 ms** (~50× realtime), load 1.85 s. **Entitlement resolved (open question #2): NOT needed** — forward-pass peak `phys_footprint` **1.31 GB** (MLX peak-active 1.22 GB), only ~60 MB of activations over the ~1.27 GB bf16 weight floor (weight-dominated), ~1.7 GB under the ~3 GB no-entitlement ceiling. Parakeet runs without `increased-memory-limit`, like Whisper, with room for the tiny decoder/joint in T2.1d. 
- **T2.1d shipped (code-complete; device substring gate pending) — TDT greedy decoder + vocab decode + full-model wiring.** `ParakeetDecoder.swift` (`ParakeetPredictNetwork` = embed + 2-layer LSTM stack; `ParakeetJointNetwork` = enc/pred projections → ReLU → final Linear; `ParakeetTDTModel` = encoder + decoder + joint) and `ParakeetTokenizer.swift` (`parakeetDecodeTokens`, `▁`→space) port `senstella/parakeet-mlx`'s `rnnt.py` + `parakeet.py::decode_greedy` + `tokenizer.py`. **The TDT greedy loop:** per encoder frame, run the prediction net (embed last emitted token or zero; advances state **only on a non-blank emission**) + joint → split the joint logits into the **vocab head** (`argmax`; index == vocab size ⇒ blank ⇒ don't emit) and the **duration head** (`argmax` over the 5 `durations` ⇒ frames to advance `step`), with the `max_symbols=10` stuck-guard forcing progress on a run of duration-0 steps. **Verified against mlx-swift source so the weights load with no remapper:** mlx-swift's `LSTM` already exposes the `Wx`/`Wh`/`bias` keys and the i/f/g/o gate order the mlx-community checkpoint was converted with (per-layer LSTMs load directly into `dec_rnn.lstm.{N}.*`); the `joint_net` array keeps `[ReLU, Identity, Linear]` so the final Linear lands at index 2 (`joint_net.2.*`); the `prediction` sub-container nests `embed`/`dec_rnn` to match `decoder.prediction.*`. Single-step decode (batch 1, seq 1) sidesteps the reference LSTM wrapper's seq/batch-axis ambiguity. The lazy graph is bounded by `eval`-ing the committed LSTM `(h, c)` each emission. `ParakeetTDTModel.load` loads the **full** safetensors (encoder + decoder + joint) via the §3.1 incremental bf16 cast-release, keys verbatim. 
**Smoke:** `ParakeetSmoke.runDecode()` (T2.1d) loads the full model, featurizes `ls_test.flac`, transcribes end-to-end, and asserts the transcript contains *"openly shouldered the burden"* (case-insensitive) — logs PASS/FAIL + transcript + the full-model peak footprint; `run()` now does featurizer→decode (subsuming the encoder load). 3 sim-safe `ParakeetTokenizerTests` (▁→space, out-of-range guard, empty) wired via `add_test_file.rb`. Build + full simulator suite green. **Device-validated 2026-06-13 (iPhone 15 Pro Max): substring PASS** — `ls_test.flac` → *"Then the good soul openly shouldered the burden she had borne so long in secret, and bravely trudged on alone."*, word-matching the Python reference; transcribe 287 ms (~23× realtime), full-model peak `phys_footprint` 1.34 GB (still no entitlement). **This confirms the entire T2.1 model port (featurizer + encoder + decoder) end-to-end.** Getting there took fixing two bugs that only surfaced past a byte-correct encoder, both isolated **against the Python oracle on the Mac (no device round-trips)**: (1) **BatchNorm ran in training mode** — MLXNN Modules default to `training = true`, so the conv BatchNorm used batch stats instead of the loaded running_mean/var; the loaders now call `model.train(false)` (the Python `model.eval()`). Insidious because training-mode BatchNorm still produces *normalized-looking* output, so the T2.1c shape/range smoke passed while the encoder was numerically wrong. (2) **the joint's final Linear silently never loaded** — `let jointNet: [Module]` keyed off the property *name* (`jointNet`), but the safetensors key is `joint_net` (MLXNN has no `@ModuleInfo` key-override for unwrapped arrays — the encoder's `layers`/`conv`/`lstm` only match because they're single words); `update(verify: .none)` skipped the unmatched `joint_net.2.*`, leaving a **random final projection** → real-but-wrong tokens. 
Renamed the property to `joint_net`, and **switched both loaders to `update(verify: .noUnusedKeys)`** so any future key mismatch throws at load (naming the unused weight) instead of silently degrading. The decisive diagnostic: reimplementing our *exact* Swift decode loop in Python on the reference weights produced the *correct* transcript, proving the decode logic was right and pinning the bug to mlx-swift loading. The earlier L2→L1 magnitude fix (T2.1b CHANGE_LOG / §5.2 RISK 3) was found the same way — diffing the featurizer against the oracle. Device confirms the featurizer (`mel[0,0,:5]`) and full encoder (`enc[0,0,:5]`, range) now reproduce the reference to bf16 precision. + +## 2026-06-14 + +- **Onboarding doc — `docs/relay-notes-ios-guide.html`.** A self-contained, dependency-free HTML guide aimed at a seasoned programmer who is new to iOS/Swift: a "Rosetta Stone" mapping familiar concepts (interface→`protocol`, sum type→`enum` w/ associated values, ORM→SwiftData, UI-thread→`@MainActor`, Stream/Observable→`AsyncStream`, DI scope→`@Environment`) onto this codebase, then a guided tour of the provider-abstraction spine (`Transcriber`/`TranscriptionSession`/`TranscriptionEngine`/`TranscriberFactory` + the `TranscriptionOptions` sum type) and an end-to-end `tap → speak → saved` data-flow walkthrough traced through the real types (`RecorderViewModel` state machine, `LiveAudioEngine` double-duty tap, the three `AsyncStream`s, SwiftData `Note` save). Covers the Swift-6-strict-concurrency story (actors, `@MainActor`-by-default, the `nonisolated protocol` trap, `@unchecked Sendable` on `TapState`) and the iOS realities (permissions, `AVAudioSession`, background-audio `Info.plist`, simulator-can't-run-MLX, 7-day free-tier signing). Includes a vanilla-JS Swift/bash syntax highlighter + sidebar scroll-spy; teal→indigo app-icon palette. Reference/onboarding artifact, not a code change — no app behavior touched. 
diff --git a/docs/relay-notes-ios-guide.html b/docs/relay-notes-ios-guide.html
new file mode 100644
index 0000000..91d2317
--- /dev/null
+++ b/docs/relay-notes-ios-guide.html
@@ -0,0 +1,935 @@
+Relay Notes — An iOS & Swift Field Guide for Seasoned Programmers
+ + +
+ + +
+
+
A guide for people who already ship software
+

Relay Notes, decoded.

+

You know how to architect systems, reason about concurrency, and read a call graph. What you may not know is the iOS dialect: SwiftUI, actors, property wrappers, SwiftData, and the strict-concurrency rules Swift 6 enforces at compile time. This guide is a translation layer — it maps what you already know onto how a real, working iPhone app is built, using Relay Notes as the worked example.

+
+ What: on-device voice → text · UI: SwiftUI · Concurrency: Swift 6 strict · Data: SwiftData · ML: MLX (Whisper) · Target: iPhone 15 Pro Max
+
+ +

How to read this

+

The sections build on each other but each stands alone. If you want the aha fast, read §02 the Rosetta Stone (concept mapping), then jump straight to §07, which traces a single recording end-to-end through the actual types. The middle sections (§03–§05) are the language/framework primer; the later ones (§06–§11) are this app's architecture and the iOS-specific gotchas.

+
+
▹ The one idea that explains the whole codebase
+

Every external capability sits behind a protocol so the runtime provider is swappable without a rebuild. Transcription has two interchangeable engines (Apple Speech, on-device Whisper) plugged into one socket. Hold that thought — it's the spine, and most of the architecture exists to serve it.

+
+
+ + +
+

01 The 60-second tour

+

Relay Notes is intentionally tiny in scope: tap a button, speak, and an on-device transcript is saved. No account, no server, works in airplane mode. That narrowness is a feature — it keeps the surface small enough that the architecture is legible. The pipeline is a straight line:

+ +
+
🎙️
Capture: mic → PCM
+
+
💾
Persist audio: AAC/m4a
+
+
📝
Transcribe: on-device
+
+
🗂️
Store note: SwiftData
+
+ +

The "transcribe" box is where the interesting design lives. It's not one thing — it's a socket that accepts either engine:

+
+
+

Apple Speech — the default

+

Built into iOS (SpeechAnalyzer + SpeechTranscriber). Streams a live transcript as you talk. No download, no model management. The "it just works" floor.

+
+
+

On-device Whisper — the upgrade

+

whisper-small.en (481 MB) running through MLX. Downloaded on first use. Decodes once when you stop (no live partials). Proves "on-device ≠ Apple-only."

+
+
+

Everything below explains how those pieces are wired so that swapping engines is a runtime choice, not a code change — and how Swift's type system and concurrency model are leaned on to keep it safe.

+
+ + +
+

02 The Rosetta Stone

+

Almost every iOS concept has a name you already know under a different label. Here's the dictionary. Skim it now; the rest of the guide makes it concrete.

+ +
+
+
Interface / abstract base / trait: defines a contract, no implementation
+
+
protocol — e.g. Transcriber, TranscriptionSession. Conformers can be classes, structs, or actors.
+
+
+
Discriminated union / sum type / sealed class: "one of these shapes, type-safely"
+
+
enum with associated values — e.g. TranscriptionOptions.apple(…) | .whisperMLX
+
+
+
async / await: same as JS, C#, Rust, Python
+
+
Identical keywords. Task { … } ≈ spawning a coroutine (an unstructured task); async let and task groups provide structured concurrency, where child tasks cannot outlive their parent scope.
+
+
+
UI thread / main thread affinity: "touch UI only from the main thread"
+
+
@MainActor — a compiler-enforced annotation, not a convention you have to remember.
+
+
+
Thread-safe / no data races: guaranteed at compile time
+
+
Sendable + actor. Swift 6 rejects the build on a potential data race.
+
+
+
Observable / Channel / Stream / Subject: push values to a consumer over time
+
+
AsyncStream<T> — consumed with for await x in stream.
+
+
+
React component + render(): UI is a pure function of state
+
+
A struct conforming to View, with a body computed property.
+
+
+
useState / signals / observable state: state that re-renders the UI
+
+
@State, @Observable. Mutating a tracked field re-runs the affected body.
+
+
+
React Context / DI scope / ambient injection: "reach a dependency without prop-drilling"
+
+
@Environment(\.modelContext) and friends.
+
+
+
ORM entity + repository/session: ActiveRecord, Hibernate, Prisma model
+
+
@Model class Note (entity) + ModelContext (unit of work) + @Query (live result set).
+
+
+
Nullable<T> / Option<T> / T?: "might be absent"
+
+
T?, unwrapped with guard let / if let. No implicit null.
+
+
+
main() / app entry point: where the process starts
+
+
@main struct …App: App with a body: some Scene.
+
+
+
package.json / pom.xml / Cargo.toml: project + dependency manifest
+
+
The .xcodeproj (a project.pbxproj file) + Swift Package Manager (Package.resolved).
+
+
+
+
◆ The mental model that pays off most
+

Swift leans value types (struct/enum, copied) far harder than the OO languages you're used to. Reference types (class, actor) are the exception, reserved for identity and shared mutable state. When you see struct here, think "immutable-ish value, cheap to copy, no aliasing surprises." That single shift removes most "wait, why did that change?" confusion.

+
+
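The value-versus-reference distinction is easiest to see side by side. The sketch below uses two hypothetical types (`OptionsValue` and `OptionsReference` are illustrative names, not from the Relay Notes source) to show that assigning a struct copies it while assigning a class instance shares it.

```swift
// Hypothetical types for illustration only; not part of the app.
struct OptionsValue {
    var bitrate: Int
}

final class OptionsReference {
    var bitrate: Int
    init(bitrate: Int) { self.bitrate = bitrate }
}

// Assigning a struct copies it: later edits to the copy leave the original alone.
let originalValue = OptionsValue(bitrate: 64_000)
var copiedValue = originalValue
copiedValue.bitrate = 128_000

// Assigning a class instance shares it: both names refer to one object.
let originalReference = OptionsReference(bitrate: 64_000)
let aliasedReference = originalReference
aliasedReference.bitrate = 128_000

print(originalValue.bitrate)      // unchanged by the edit to the copy
print(originalReference.bitrate)  // changed through the alias
```

This is why a snapshot of settings passed around as a struct cannot be mutated behind the receiver's back, whereas a shared class instance can.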
+ + +
+

03 Swift, the parts that bite

+

You can read Swift on sight — it's C-family with type inference. Four features show up constantly in this codebase and behave differently from their cousins elsewhere. Learn these four and the source stops surprising you.

+ +

1. struct vs class vs actor vs enum

+

Four ways to declare a type, chosen by semantics, not habit:

+ + + + + + +
Keyword | Semantics | Used here for
struct | Value — copied on assignment, no shared identity | Note options, RecordingOptions, every View
enum | Value + closed set of cases, optionally with payloads | TranscriptionOptions, the recorder's State
class | Reference — shared identity, mutable, ARC-managed | RecorderViewModel, Tunings, LiveAudioEngine
actor | Reference + serialized access (its own isolation domain) | WhisperMLXTranscriber (guards ~480 MB of model state)
+ +

2. Enums carry data — this is the tagged union you wanted

+

This is the single most important Swift idiom in the app. TranscriptionOptions isn't a struct with a pile of nullable fields — it's a closed set of shapes, each with its own type-safe payload:

+
enum TranscriptionOptions: Sendable {
+    case apple(AppleSpeechOptions)   // Apple-only: preset + contextual strings
+    case whisperMLX                  // Whisper: no decode dials in v1
+}
+
+struct AppleSpeechOptions: Sendable {
+    var preset: SpeechTranscriber.Preset = .transcription
+    var contextualStrings: [String] = []
+}
+

You destructure it with switch or pattern-matching guard. The compiler forces you to handle every case, and you literally cannot read Apple's preset out of a .whisperMLX value — it doesn't exist on that case. That's a whole class of "field is null for this provider" bugs deleted at the type level.
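A minimal sketch of both destructuring styles follows. To stay framework-free it uses stand-in types: `TranscriptionOptionsSketch` and `AppleSpeechOptionsSketch` are hypothetical, and the preset is modelled as a plain `String` where the real type holds a `SpeechTranscriber.Preset`.

```swift
// Stand-ins for illustration; the real AppleSpeechOptions stores a
// SpeechTranscriber.Preset rather than a String.
struct AppleSpeechOptionsSketch: Sendable {
    var presetName: String = "transcription"
    var contextualStrings: [String] = []
}

enum TranscriptionOptionsSketch: Sendable {
    case apple(AppleSpeechOptionsSketch)
    case whisperMLX
}

// Exhaustive switch: removing either branch is a compile error.
func summary(of options: TranscriptionOptionsSketch) -> String {
    switch options {
    case .apple(let apple):
        // The payload is only in scope inside this branch.
        return "apple preset=\(apple.presetName) hints=\(apple.contextualStrings.count)"
    case .whisperMLX:
        return "whisperMLX (no dials)"
    }
}

// Pattern-matching guard: extract the payload or take the early exit.
func contextualStrings(in options: TranscriptionOptionsSketch) -> [String] {
    guard case .apple(let apple) = options else { return [] }
    return apple.contextualStrings
}
```

There is no expression that reads `presetName` from a `.whisperMLX` value; the only way to reach the payload is through a pattern that first proves the case is `.apple`.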

+ +

3. The recorder is a state machine, expressed as an enum

+

Instead of a tangle of isRecording / isPaused / errorMessage booleans that can contradict each other, the recorder's entire lifecycle is one value that's always exactly one valid state:

+
enum State: Equatable {
+    case idle
+    case recording(partial: String)   // live transcript so far
+    case paused(partial: String)      // interrupted by a call/Siri/alarm
+    case finalizing                   // stopped; transcribing
+    case finished(transcript: String)
+    case failed(message: String)
+}
+

Illegal states are unrepresentable — there is no "recording AND failed" because the value is one case at a time. Transitions are a pure switch, which makes them unit-testable without spinning up audio hardware (see RecorderViewModel.nextState(for:from:)).
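To make "transitions are a pure switch" concrete, here is a sketch of what such a function could look like. The state cases mirror the enum above; the `RecorderEvent` type, the exact transitions, and the function's signature are assumptions for illustration, since the guide names `nextState(for:from:)` but does not show its body.

```swift
enum RecorderState: Equatable {
    case idle
    case recording(partial: String)
    case paused(partial: String)
    case finalizing
    case finished(transcript: String)
    case failed(message: String)
}

// Hypothetical event type; the app's real events are not shown in this guide.
enum RecorderEvent {
    case startTapped
    case partialArrived(String)
    case interrupted
    case resumed
    case stopTapped
    case transcriptReady(String)
    case errorOccurred(String)
}

// Pure function: no audio hardware, no side effects, trivially testable.
func nextState(for event: RecorderEvent, from state: RecorderState) -> RecorderState {
    switch (state, event) {
    case (.idle, .startTapped), (.finished, .startTapped), (.failed, .startTapped):
        return .recording(partial: "")
    case (.recording, .partialArrived(let text)):
        return .recording(partial: text)
    case (.recording(let partial), .interrupted):
        return .paused(partial: partial)
    case (.paused(let partial), .resumed):
        return .recording(partial: partial)
    case (.recording, .stopTapped), (.paused, .stopTapped):
        return .finalizing
    case (.finalizing, .transcriptReady(let text)):
        return .finished(transcript: text)
    case (_, .errorOccurred(let message)):
        return .failed(message: message)
    default:
        return state   // ignore events that make no sense in the current state
    }
}
```

Because the function takes a state and an event and returns a state, a test is a one-line equality check per transition.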

+ +

4. Optionals and guard

+

There is no null. "Might be absent" is encoded in the type as T?, and you must unwrap before use. The idiomatic unwrap is guard let — an early-return that also narrows the type for the rest of the scope:

+
guard let session, let url else {
+    state = .failed(message: "Recording could not be saved. Please try again.")
+    return
+}
+// past this line, `session` and `url` are non-optional
+
+
▹ Reading tip
+

some View / any Transcriber: some means "one specific concrete type the compiler knows but I won't name" (opaque return); any means "a boxed value of any conformer, decided at runtime" (existential). The app uses any Transcriber precisely because the concrete engine is a runtime choice.

+
+
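The some/any distinction can be sketched with a toy protocol. Everything here (`Greeter`, the two conformers, both factory functions) is hypothetical and exists only to show the two return styles.

```swift
// Hypothetical protocol and conformers, for illustration only.
protocol Greeter {
    func greet() -> String
}

struct EnglishGreeter: Greeter {
    func greet() -> String { "hello" }
}

struct FrenchGreeter: Greeter {
    func greet() -> String { "bonjour" }
}

// `some`: the function always returns one fixed concrete type,
// which the compiler knows but the caller cannot name.
func makeDefaultGreeter() -> some Greeter {
    EnglishGreeter()
}

// `any`: a box whose concrete contents are chosen at runtime.
func makeGreeter(languageCode: String) -> any Greeter {
    if languageCode == "fr" {
        return FrenchGreeter()
    }
    return EnglishGreeter()
}
```

The second function could not be written with `some Greeter`, because its branches return different concrete types; that is the same reason the app's factory returns `any Transcriber`.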
+ + +
+

04 SwiftUI: the UI is a function of state

+

If you've used React, SwiftUI will feel familiar: you describe the UI for a given state, and the framework diffs and re-renders. A view is a struct (cheap, disposable, recreated constantly) with a body that returns a description of the UI — never an imperative "now mutate this label."

+ +
struct ContentView: View {
+    @Environment(\.modelContext) private var modelContext   // injected dependency
+    @State private var viewModel: RecorderViewModel?        // owned, re-renders on change
+    @State private var showSettings = false
+    // (declarations of searchText and reTranscriber are elided in this excerpt)
+
+    var body: some View {
+        NavigationStack {
+            VStack(spacing: 0) {
+                NotesListView(searchText: searchText, reTranscriber: reTranscriber)
+                Divider()
+                if let viewModel {
+                    RecorderView(viewModel: viewModel)
+                }
+            }
+            .navigationTitle("Relay Notes")
+            .toolbar { /* settings button */ }
+            .sheet(isPresented: $showSettings) { /* settings sheet */ }
+        }
+    }
+}
+ +

Property wrappers = where state lives

+

Those @-prefixed declarations aren't decoration — each one tells SwiftUI a different thing about ownership and reactivity:

+ + + + + + + +
Wrapper | Closest analogue | Means
@State | useState | This view owns this value; mutating it re-renders.
@Observable | a signal / observable store | Macro on a class; reads in a body auto-subscribe. RecorderViewModel and Tunings use it.
@Environment | React Context | Pull an ambient dependency (the SwiftData context) without threading it through every initializer.
@Query | a live DB query / useQuery | A SwiftData fetch that re-runs and re-renders when matching rows change. The notes list is just @Query(sort: \.createdAt, order: .reverse).
$value | two-way binding | The $ prefix makes a Binding — a read/write handle a child view (e.g. a TextField) can mutate.
+ +

The composition root: where dependencies are wired

+

Notice viewModel starts as nil. The app has no DI container — instead there's a composition root, the spot where the real object graph is assembled once, lazily, the first time the view appears:

+
.task {                                  // runs once when the view appears
+    if viewModel == nil {
+        let tunings = Tunings()
+        tunings.reconcileEngineAvailability(whisperReady: whisperStore.status == .ready)
+        let factory = TranscriberFactory(whisperModelStore: whisperStore)
+        viewModel = RecorderViewModel(
+            engine: LiveAudioEngine(),
+            transcriberFactory: factory,
+            modelContext: modelContext,
+            tunings: tunings
+        )
+        reTranscriber = ReTranscriber(factory: factory, whisperStore: whisperStore)
+    }
+}
+

This is hand-rolled constructor injection: the view model is handed its collaborators (LiveAudioEngine, the factory, the SwiftData context, the tunings) rather than reaching for globals. That's what makes the logic testable — tests construct a RecorderViewModel with fakes. .task is the lifecycle hook (≈ useEffect(() => …, [])), and it's async-aware so it can await without blocking the UI.
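The testing payoff of constructor injection can be sketched without any iOS frameworks. The names below (`TextTranscribing`, `FakeTranscriber`, `NoteMaker`) are hypothetical and much smaller than the app's real `Transcriber` and `RecorderViewModel`; the sketch also uses an `actor` for the owner simply to keep it self-contained, whereas the real view model is a `@MainActor` class.

```swift
// Hypothetical, trimmed-down contract; the real Transcriber has more surface.
protocol TextTranscribing: Sendable {
    func transcribe(_ audioName: String) async throws -> String
}

// A fake that returns a canned transcript: no audio, no model, no delay.
struct FakeTranscriber: TextTranscribing {
    let canned: String
    func transcribe(_ audioName: String) async throws -> String { canned }
}

// The collaborator arrives through the initializer instead of a global.
actor NoteMaker {
    private let transcriber: any TextTranscribing
    private(set) var savedNotes: [String] = []

    init(transcriber: any TextTranscribing) {
        self.transcriber = transcriber
    }

    func record(_ audioName: String) async throws {
        let text = try await transcriber.transcribe(audioName)
        savedNotes.append(text)
    }
}
```

A test would build `NoteMaker(transcriber: FakeTranscriber(canned: "hello world"))`, call `record`, and check `savedNotes`, all without touching a microphone.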

+
+
◆ MVVM, lightly
+

Views stay dumb (layout + bindings). The RecorderViewModel holds the state machine and orchestration. This split is what lets the gnarly async audio logic be exercised by 80-odd unit tests while the SwiftUI layer stays a thin shell.

+
+
+ + +
+

05 Swift 6 concurrency (read this twice)

+

This is the section that trips up newcomers the hardest, because Swift 6 promotes data races from "runtime heisenbug" to compile error. The rules are strict, but once they click, the audio pipeline reads cleanly.

+ +

Actors = isolation domains

+

An actor is a reference type whose mutable state can only be touched by one task at a time: the runtime serializes access, and the compiler rejects code that tries to bypass it. Calls from outside hop onto the actor and must therefore be await-ed. The Whisper transcriber is an actor precisely because it caches ~480 MB of non-thread-safe MLX model state across calls:

+
actor WhisperMLXTranscriber: Transcriber {
+    private var cache: LoadedAssets?      // model + tokenizer + mel filters
+    // serialized by the actor → safe to mutate without a single lock
+}
+ +
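The load-once caching that an actor makes safe can be sketched as follows. `ModelCache` and its string array are hypothetical stand-ins for the real `LoadedAssets`; the point is only that concurrent callers cannot race on the cache.

```swift
// Hypothetical stand-in for an actor guarding expensive, non-thread-safe state.
actor ModelCache {
    private var loaded: [String]?       // placeholder for model + tokenizer + filters
    private(set) var loadCount = 0      // how many times the "expensive" load ran

    func assets() -> [String] {
        if let loaded { return loaded } // cache hit: no reload
        loadCount += 1
        let fresh = ["model", "tokenizer", "mel filters"]
        loaded = fresh
        return fresh
    }
}

// Fire several concurrent requests at the same actor instance.
func hammer(_ cache: ModelCache, times: Int) async {
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<times {
            group.addTask { _ = await cache.assets() }
        }
    }
}
```

Because `assets()` contains no suspension point, each call runs to completion before the next starts, so the load happens once regardless of how many tasks ask. An actor method that awaits mid-body would need extra care, since actors are reentrant at suspension points.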

@MainActor = the UI isolation domain

+

There's a special global actor, @MainActor, that represents the main thread. Anything that touches UI state is annotated with it, and the compiler then guarantees those members run on the main thread. The view model and the audio engine both opt in:

+
@MainActor
+@Observable
+final class RecorderViewModel { /* state, tasks, orchestration */ }
+
+@MainActor
+final class LiveAudioEngine { /* setup/teardown on the main actor */ }
+

Notably, this project sets SWIFT_DEFAULT_ACTOR_ISOLATION = MainActor — so types are main-actor by default unless they opt out. That default is convenient for UI code but creates the single sharpest gotcha in the codebase 👇

+ +
+
⚠ The nonisolated protocol trap
+

With main-actor-by-default on, an unannotated protocol becomes implicitly @MainActor, and conformance inference then silently smears @MainActor onto your conformers — it once stamped @MainActor onto an actor's synchronous init. The fix is to mark isolation-neutral protocols explicitly:

+
// Both protocols are nonisolated *on purpose* so each conformer
+// picks its own isolation: AppleSpeechTranscriber is a plain class,
+// WhisperMLXTranscriber is an actor.
+nonisolated protocol Transcriber: Sendable {
+    func transcribe(_ audio: URL, options: TranscriptionOptions) async throws -> String
+    func makeStreamingSession(options: TranscriptionOptions) async throws -> any TranscriptionSession
+}
+
+ +

Sendable = "safe to cross an isolation boundary"

+

To pass a value between actors/tasks, the compiler must know it can't introduce a race. Value types (struct/enum of Sendable parts) are automatically Sendable. When you know something is safe but can't prove it to the compiler — like a helper that the audio thread touches single-threaded by construction — you assert it with @unchecked Sendable and take responsibility:

+
// Runs on the realtime audio thread; single-threaded access by construction.
+private final class TapState: @unchecked Sendable { … }
+ +
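For contrast, here is a sketch of `@unchecked Sendable` backed by a different justification than the one `TapState` uses. `TapState` is safe because only one thread ever touches it; the hypothetical `LockedCounter` below is safe because every access goes through a lock. In both cases the compiler cannot see the reasoning, so the annotation is a promise the author has to keep.

```swift
import Foundation

// Hypothetical example: safety comes from the lock, which the compiler
// cannot verify, hence @unchecked.
final class LockedCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var value = 0

    func increment() {
        lock.lock()
        defer { lock.unlock() }
        value += 1
    }

    var current: Int {
        lock.lock()
        defer { lock.unlock() }
        return value
    }
}

// Many concurrent tasks may share one instance because it is Sendable.
func incrementConcurrently(_ counter: LockedCounter, times: Int) async {
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<times {
            group.addTask { counter.increment() }
        }
    }
}
```

If the lock were removed, the code would still compile and the annotation would silently become a lie, which is why `@unchecked Sendable` deserves a comment stating the invariant, as the `TapState` declaration does.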

AsyncStream = the backbone of the pipeline

+

The whole capture→transcribe handoff is plumbed with async streams — a producer yields values, a consumer for awaits them. Three streams flow through a single recording:

+
struct LiveRecording: Sendable {
+    let url: URL                                       // where the audio file lands
+    let buffers: AsyncStream<AVAudioPCMBuffer>          // mic audio, chunk by chunk
+    let interruptions: AsyncStream<InterruptionEvent>  // call/Siri/alarm events
+}
+

And the transcript itself comes back as a stream of ever-growing strings, each one the full transcript so far — that's how the live partial transcript updates the UI as you speak:

+
updatesTask = Task { [weak self, session] in
+    for await partial in session.updates {          // each value = transcript so far
+        guard let self else { return }
+        if case .recording = self.state {
+            self.state = .recording(partial: partial)   // re-renders the live card
+        }
+    }
+}
+
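The producer/consumer shape can be tried in isolation. This sketch uses only the standard library's `AsyncStream`; the helper names `makePartials` and `collect` are hypothetical, and the producer yields everything up front rather than over time as a real recognizer would.

```swift
// Producer: yields an ever-growing transcript, then finishes the stream.
func makePartials(from words: [String]) -> AsyncStream<String> {
    AsyncStream { continuation in
        var soFar: [String] = []
        for word in words {
            soFar.append(word)
            continuation.yield(soFar.joined(separator: " "))
        }
        continuation.finish()   // ends the consumer's `for await` loop
    }
}

// Consumer: `for await` suspends until the next value or the end of the stream.
func collect(_ stream: AsyncStream<String>) async -> [String] {
    var seen: [String] = []
    for await partial in stream {
        seen.append(partial)
    }
    return seen
}
```

Calling `await collect(makePartials(from: ["then", "the", "good"]))` returns the three partials in order, each a full transcript so far. The default buffering policy is unbounded, so values yielded before the consumer starts are not lost.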
+
▹ How to read the concurrency, fast
+

Mentally tag each type with its domain: @MainActor (UI + orchestration), actor (Whisper model), the audio thread (the tap closure + TapState). Every arrow between domains is an await or an AsyncStream. The compiler already verified the crossings are race-free — so you can trust the boundaries and just follow the data.

+
+
+ + +
+

06 The architecture spine: provider abstraction

+

Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. Today it's transcription; the same shape is reserved for a future LLM-cleanup stage. If you internalize this section, the file layout makes sense.

+ +

The contract has two methods, both on purpose

+
nonisolated protocol Transcriber: Sendable {
+    // File-based. UNUSED by the app today — kept for the future cloud-STT
+    // providers (which work on uploaded files) and a "re-transcribe" action.
+    func transcribe(_ audio: URL, options: TranscriptionOptions) async throws -> String
+
+    // Streaming. This is what the recorder actually uses: it returns a session
+    // that the audio engine feeds buffers into.
+    func makeStreamingSession(options: TranscriptionOptions) async throws -> any TranscriptionSession
+}
+
+
⚑ Don't "clean up" the unused method
+

The file-based transcribe(_:options:) looks like dead code — the app only calls the streaming path. It's deliberately retained for cloud STT (which operates on uploaded files) and a future re-transcribe action. This is the kind of intent that lives in comments and the planning docs, not in the call graph. Read before deleting.

+
+ +

A session is the live handle

+

The streaming method hands back a TranscriptionSession — the object the audio engine pushes buffers into and reads results out of. Crucially, the session is the authority on its own behavior, so the recorder doesn't branch on engine type:

+
nonisolated protocol TranscriptionSession: Sendable, AnyObject {
+    var audioFormat: AVAudioFormat? { get }     // the PCM format it wants
+    var updates: AsyncStream<String> { get }     // live partial transcripts
+    var emitsLivePartials: Bool { get }          // Apple: true · Whisper: false
+    var modelDescription: String { get }         // provenance, saved on the Note
+    func feed(_ buffer: AVAudioPCMBuffer)        // push mic audio in
+    func finish() async throws -> String         // stop, return final transcript
+    func cancel() async
+}
+

emitsLivePartials is a nice example of pushing a decision to the type that owns the knowledge: the recorder asks the session "do you stream?" rather than inferring it from an engine enum — so adding an engine doesn't mean editing the recorder.
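A small sketch of that idea, with hypothetical stand-in types: `SessionSketch` keeps only two of the real protocol's properties, and the `modelDescription` strings and status text are invented for illustration.

```swift
// Hypothetical, reduced version of the session contract.
protocol SessionSketch {
    var emitsLivePartials: Bool { get }
    var modelDescription: String { get }
}

struct StreamingSessionSketch: SessionSketch {
    let emitsLivePartials = true
    let modelDescription = "apple-speech"       // illustrative value
}

struct BatchSessionSketch: SessionSketch {
    let emitsLivePartials = false
    let modelDescription = "whisper-small.en"   // illustrative value
}

// The caller asks the session what it can do; it never switches on engine type.
func statusLine(for session: any SessionSketch) -> String {
    if session.emitsLivePartials {
        return "Listening (live transcript)"
    }
    return "Listening (transcript appears when you stop)"
}
```

Adding a third session type means writing one new conformer; `statusLine` stays untouched because it depends on the capability, not on the list of engines.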

+ +

Resolving the provider: the factory

+

A small factory maps the user's selected engine to a (cached) concrete instance. Caching is load-bearing — recreating the Whisper transcriber would reload 480 MB of weights every recording:

+
@MainActor
+final class TranscriberFactory {
+    private var appleSpeech: AppleSpeechTranscriber?
+    private var whisperMLX: WhisperMLXTranscriber?
+    // (locale and whisperModelStore declarations are elided in this excerpt)
+
+    func transcriber(for engine: TranscriptionEngine) -> any Transcriber {
+        switch engine {
+        case .apple:      return appleSpeech ?? { let t = AppleSpeechTranscriber(locale: locale); appleSpeech = t; return t }()
+        case .whisperMLX: return whisperMLX  ?? { let t = WhisperMLXTranscriber(store: whisperModelStore); whisperMLX = t; return t }()
+        }
+    }
+}
+ +
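The `??`-plus-closure idiom above is compact but dense. A possible longhand spelling of the same cache-on-first-use pattern is sketched here with hypothetical names (`ExpensiveProvider`, `DemoFactory`, `constructions`); it is an illustration under those assumptions, not the real factory.

```swift
// Hypothetical stand-in for a provider that is costly to construct
// (for example, one that loads model weights).
final class ExpensiveProvider {}

final class DemoFactory {
    private var cached: ExpensiveProvider?
    private(set) var constructions = 0   // illustration only: counts real constructions

    func provider() -> ExpensiveProvider {
        if let cached { return cached }  // fast path: reuse the existing instance
        constructions += 1
        let fresh = ExpensiveProvider()
        cached = fresh                   // remember it for the next call
        return fresh
    }
}

let factory = DemoFactory()
let first = factory.provider()
let second = factory.provider()          // same object as `first`, no second construction
```

The identity check (`===`) is what matters here: both calls hand back one shared instance, so the expensive setup happens once.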

Put together, the socket looks like this: two shipping providers plus one planned, one interface, the choice deferred to runtime:

+
+
🍎 AppleSpeechTranscriber · on-device · shipping
+
🧠 WhisperMLXTranscriber · on-device · shipping
+
☁️ CloudTranscriber · opt-in · not built
+
+
🔌 any Transcriber · the socket
+
+
+
◆ Why this is worth the ceremony
+

The whole point of v1 is to validate "on-device, your-choice model" without betting the app on one vendor. Because the seam is a protocol, a new engine is a new file, not a refactor — and the next planned capability (local LLM cleanup) drops a LanguageModel protocol into the exact same shape. The abstraction isn't speculative; it already pays for itself with two live engines.

+
+
+ + +
+

07 Tap → speak → saved: the whole path

+

This is the payoff. Here's one recording, traced end-to-end through the real types. Every concept above shows up doing a job.

+ +
+
+

The tap: build the session, start the engine

+

Recording/RecorderViewModel.swift → startRecording()

+

The view model reads a snapshot of the user's tunings (engine, preset, bitrate) at this instant — mid-recording setting changes intentionally don't take effect until next time. It asks the factory for the right Transcriber, makes a streaming session, then starts the audio engine, handing it the session's preferred audioFormat so capture and recognition agree on a PCM format.

+
let transcriber = transcriberFactory.transcriber(for: tunings.engine)
+let session = try await transcriber.makeStreamingSession(options: tunings.transcriptionOptions)
+let recording = try await engine.start(options: tunings.recordingOptions,
+                                       analyzerFormat: session.audioFormat)
+
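The "snapshot at this instant" behavior can be pictured with a value type. This is a minimal sketch with a hypothetical `DemoTunings` struct; how the real code takes its snapshot is not shown in this guide, so treat the mechanism here as an assumption.

```swift
// Hypothetical settings bundle. Being a struct, assignment copies it.
struct DemoTunings {
    var engine: String
    var bitrate: Int
}

var liveSettings = DemoTunings(engine: "apple", bitrate: 64_000)

let snapshot = liveSettings          // copy taken when recording starts
liveSettings.engine = "whisper"      // user changes a setting mid-recording

// `snapshot` still describes the recording in progress;
// `liveSettings` will apply to the next one.
```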
+ +
+

Capture does double duty on every buffer

+

Audio/LiveAudioEngine.swift → installTap + TapState.handle

+

An input tap on AVAudioEngine fires on the realtime audio thread for each chunk of mic audio. For every buffer it (a) writes AAC/m4a to disk for later playback, and (b) converts the PCM to the analyzer's format and yields it into an AsyncStream. That's why the saved audio and the transcript come from one capture, not two.

+
func handle(buffer: AVAudioPCMBuffer) {
+    try? audioFile.write(from: buffer)              // (a) persist for playback
+    // (b) convert to the analyzer's format, then:
+    continuation.yield(outBuffer)                   // → into LiveRecording.buffers
+}
+
+
⚠ Audio-thread reality
+

The tap closure runs on a realtime thread that must never block. It holds an @unchecked Sendable TapState (single-threaded by construction). This is exactly the case where you assert thread-safety to the compiler because the runtime contract guarantees it but the type system can't see it.

+
+
+ +
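The bridge from a synchronous callback into an `AsyncStream` can be sketched in isolation. The example below uses plain `[Float]` chunks in place of `AVAudioPCMBuffer` so it runs without audio hardware; the variable names are hypothetical.

```swift
// makeStream returns the stream plus the continuation used to push values into it.
let (chunks, continuation) = AsyncStream.makeStream(of: [Float].self)

// Producer side: yield is a synchronous call and does not suspend the caller,
// which is why a callback can use it. Values are buffered until consumed.
continuation.yield([0.1, 0.2])
continuation.yield([0.3])
continuation.finish()                // capture stopped: end the stream

// Consumer side: an ordinary async loop, running wherever the consumer lives.
var received = 0
for await chunk in chunks {
    received += chunk.count
}
```

The producer and consumer never share mutable state directly; the stream is the only thing that crosses between them.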
+

Three concurrent tasks consume the streams

+

RecorderViewModel.startRecording()

+

The view model spins up three tasks, one per stream: feed (push buffers into the session + compute a mic level), updates (live partial transcript → state), and interruptions (call/Siri/alarm → pause/resume/finalize). These are unstructured Task { } handles that the view model stores so it can cancel them at stop. The state flips to .recording(partial: "") and the UI comes alive.

+
feedTask = Task { [session] in
+    for await buffer in recording.buffers { session.feed(buffer); … }
+}
+updatesTask = Task { for await partial in session.updates { … } }      // live text
+interruptionTask = Task { for await e in recording.interruptions { … } } // pause/resume
+state = .recording(partial: "")
+
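To see why the task handles are stored, here is a small self-contained sketch of one consumer task. The names `partials`, `partialsContinuation`, and `lastPartial` are hypothetical, and a `String` stream stands in for `session.updates`.

```swift
let (partials, partialsContinuation) = AsyncStream.makeStream(of: String.self)

// An unstructured Task: keeping the handle lets the owner await or cancel it later.
let updatesTask: Task<String, Never> = Task {
    var latest = ""
    for await partial in partials {
        latest = partial             // in the app this would update observable state
    }
    return latest                    // reached once the stream finishes
}

partialsContinuation.yield("hello")
partialsContinuation.yield("hello world")
partialsContinuation.finish()        // ending the stream ends the loop

let lastPartial = await updatesTask.value
```

Calling `updatesTask.cancel()` instead of finishing the stream would also end the loop, since `AsyncStream` iteration stops on cancellation.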
+ +
+

Live transcript streams to the screen (Apple) — or a meter does (Whisper)

+

AppleSpeechTranscriber.swift / WhisperStreamingSession.swift

+

With Apple Speech, each recognizer result (volatile or final) is folded into a growing string and yielded on updates — you watch the words appear. With Whisper there are no partials by design (emitsLivePartials = false); it just accumulates PCM in memory, so the UI shows a live audio-level meter + elapsed timer placeholder instead of a blank card.

+
+ +
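One plausible reading of "folded into a growing string" is sketched below. `TranscriptFolder` and its `fold(text:isFinal:)` method are hypothetical, and the assumption is that a volatile result replaces the previous volatile tail while a final result is appended permanently; the real session may differ.

```swift
// Hypothetical helper: keeps finalized text and the current volatile guess separately.
struct TranscriptFolder {
    var finalized = ""
    var volatile = ""

    // Returns the string the UI would display after this result.
    mutating func fold(text: String, isFinal: Bool) -> String {
        if isFinal {
            finalized += text        // committed: never revised again
            volatile = ""
        } else {
            volatile = text          // tentative: replaces the previous guess
        }
        return finalized + volatile
    }
}

var folder = TranscriptFolder()
let afterGuess = folder.fold(text: "hel", isFinal: false)
let afterFinal = folder.fold(text: "hello ", isFinal: true)
let afterNext = folder.fold(text: "wor", isFinal: false)
```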
+

Stop: finalize and get the transcript

+

RecorderViewModel.stopAndTranscribe()

+

State → .finalizing. The engine stops (closing the audio file), the feed/interruption tasks cancel, then session.finish() returns the final transcript. For Apple that drains the last results; for Whisper that's where the entire decode happens — a 5-minute note sits ~80 s on the spinner. Same interface, very different cost profile, hidden behind finish().

+
state = .finalizing
+let url = await engine.stop()
+let transcript = try await session.finish()        // Apple: drain · Whisper: decode now
+
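The "accumulate now, decode at finish()" shape can be sketched without any ML. `AccumulatingSession` below is hypothetical: it buffers `[Float]` chunks behind an `NSLock` (an assumption chosen for portability) and returns a placeholder string where a real decode would run.

```swift
import Foundation

// Hypothetical batch-style session: feed() only buffers, finish() does the work.
// @unchecked Sendable is justified here because every access to `samples` holds the lock.
final class AccumulatingSession: @unchecked Sendable {
    private let lock = NSLock()
    private var samples: [Float] = []

    // Cheap and synchronous, so it can be called for every incoming chunk.
    func feed(_ chunk: [Float]) {
        lock.lock(); defer { lock.unlock() }
        samples.append(contentsOf: chunk)
    }

    private func snapshot() -> [Float] {
        lock.lock(); defer { lock.unlock() }
        return samples
    }

    // All of the cost lands here; the caller just awaits the result.
    func finish() async -> String {
        let all = snapshot()
        return "decoded \(all.count) samples"   // placeholder for a real decode
    }
}

let batch = AccumulatingSession()
batch.feed([0.1, 0.2])
batch.feed([0.3])
let finalText = await batch.finish()
```

A streaming-style session would expose the same `feed`/`finish` pair but do its work incrementally, which is why the caller cannot tell the two cost profiles apart from the interface alone.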
+ +
+

Persist the note

+

Models/Note.swift + ModelContext

+

A Note is created — storing the audio filename (not a URL), the transcript, and the engine's modelDescription as provenance — inserted into the SwiftData context and saved. Because the notes list is a live @Query, the new row appears in the UI automatically. State → .finished. Done.

+
let note = Note(audioFilename: url.lastPathComponent,
+                transcript: transcript,
+                transcriptionModel: session.modelDescription)
+modelContext.insert(note)
+try modelContext.save()
+state = .finished(transcript: transcript)
+
+
+
+
▹ Step back and notice
+

The recorder never names "Apple" or "Whisper." It talks to a Transcriber and a TranscriptionSession, asks them what they can do (emitsLivePartials, audioFormat, modelDescription), and lets the streams carry the data. That's the spine doing its job — orchestration with zero engine-specific branches.

+
+
+ + +
+

08 Persistence: SwiftData in one page

+

SwiftData is Apple's modern ORM (a type-safe wrapper over Core Data). If you've used Prisma, Room, or ActiveRecord, you already know the shape — three pieces:

+
+

@Model = entity

A macro on a class that makes its stored properties persistent columns. Note is the only model.

+

ModelContext = session

The unit of work: insert, delete, save. Injected via @Environment.

+

@Query = live fetch

A reactive result set. The list view re-renders when rows change — no manual refresh.

+
+ +
@Model
+final class Note {
+    var id: UUID
+    var createdAt: Date
+    var audioFilename: String      // ← filename, NOT a URL (see below)
+    var transcript: String
+    var title: String?
+    var transcriptionModel: String?    // provenance: "Apple Speech" / "Whisper (small.en)"
+    var originalTranscript: String?    // pre-edit baseline, enables revert
+}
+ +

The container is wired up once, at the app's entry point — that's all it takes to get a working store:

+
@main
+struct Relay_NotesApp: App {
+    var body: some Scene {
+        WindowGroup { ContentView() }
+            .modelContainer(for: Note.self)     // creates/opens the SQLite store
+    }
+}
+ +
+
⚠ Store the filename, never the URL
+

The app's container path can change between launches, so a persisted absolute URL goes stale. The note stores audioFilename and resolves it against the documents directory at access time:

+
var audioURL: URL { URL.documentsDirectory.appending(path: audioFilename) }
+
+
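A small sketch of why the filename survives a container move. `DemoNote` and the two directory paths are made up for illustration, and the base directory is passed in as a parameter so the example does not depend on a real app sandbox.

```swift
import Foundation

// Hypothetical note that persists only the filename.
struct DemoNote {
    var audioFilename: String

    // Resolve against whatever the documents directory is right now.
    func audioURL(in directory: URL) -> URL {
        directory.appendingPathComponent(audioFilename)
    }
}

let demoNote = DemoNote(audioFilename: "note-1.m4a")

// Two launches, two different container paths (illustrative values).
let lastLaunch = URL(fileURLWithPath: "/containers/AAAA/Documents", isDirectory: true)
let thisLaunch = URL(fileURLWithPath: "/containers/BBBB/Documents", isDirectory: true)

let staleURL = demoNote.audioURL(in: lastLaunch)   // what a persisted URL would have held
let resolved = demoNote.audioURL(in: thisLaunch)   // correct for the current launch
```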
+
⚑ One canonical delete
+

A note has two artifacts: the SwiftData row and an audio file on disk. Deleting the row alone orphans the file. So there's exactly one approved delete, used everywhere:

+
func deleteWithAudio(in context: ModelContext) {
+    try? FileManager.default.removeItem(at: audioURL)   // file
+    context.delete(self); try? context.save()           // row
+}
+
+
+ + +
+

09 Two engines, one socket

+

The same TranscriptionSession interface backs two genuinely different implementations. Comparing them is the clearest illustration of why the abstraction earns its keep:

+ + + + + + + + + +
| | Apple Speech | On-device Whisper (MLX) |
| Framework | SpeechAnalyzer + SpeechTranscriber | mlx-swift 0.31.4, hand-ported pipeline |
| Model | Apple's, bundled with iOS, no choice | whisper-small.en · 481 MB · your choice of repo |
| Install | Nothing to download | Downloaded on first use → Application Support |
| Live partials | Yes: words stream as you talk | No: accumulate PCM, decode at finish() |
| Concurrency type | nonisolated final class | actor (guards cached weights) |
| Cost | Real-time, free | ~4× real-time decode; ~2.8 GB peak footprint |
| While recording, the UI… | renders the live transcript card | renders a mic-level meter + timer placeholder |
+ +

Why Whisper is an actor and Apple is a class

+

This is the concurrency model paying rent. Whisper caches hundreds of MB of non-Sendable MLX state and is GPU-bound — serializing access through an actor is both correct (no races on the cache) and desirable (decodes shouldn't overlap on one GPU). Apple's transcriber holds no such shared state, so a plain nonisolated class is enough. Same protocol, different isolation, chosen per-conformer — exactly the freedom the nonisolated protocol from §05 preserves.

+ +
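The "actor guards a cache" argument can be made concrete with a toy. `DemoModel`, its four-element "weights", and `runConcurrentDecodes()` are hypothetical; the sketch assumes the load is synchronous inside the actor, so there is no suspension point at which a second caller could sneak in.

```swift
// Hypothetical stand-in for an engine that caches expensive state.
actor DemoModel {
    private var weights: [Float]?
    private(set) var loadCount = 0

    // Loads once, then reuses. Only actor-isolated code can touch `weights`.
    private func loadedWeights() -> [Float] {
        if let weights { return weights }
        loadCount += 1                               // stands in for an expensive load
        let fresh = [Float](repeating: 0.5, count: 4)
        weights = fresh
        return fresh
    }

    // Calls from different tasks are serialized by the actor.
    func decode(_ samples: [Float]) -> Float {
        let w = loadedWeights()
        return samples.reduce(0, +) * w[0]
    }
}

// Fire eight concurrent decodes at one model and report how many loads happened.
func runConcurrentDecodes() async -> Int {
    let model = DemoModel()
    await withTaskGroup(of: Float.self) { group in
        for i in 0..<8 {
            group.addTask { await model.decode([Float(i)]) }
        }
        await group.waitForAll()
    }
    return await model.loadCount
}

let loads = await runConcurrentDecodes()
```

If the load itself awaited something, actor reentrancy would allow overlapping loads, so a real implementation would need to guard against that separately.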
+
◆ The product stance behind the engineering
+

Two axes are kept independent on purpose: where it runs (on-device vs cloud) and whose model (Apple's vs your choice). Apple Speech is on-device-Apple; Whisper is on-device-yours; a future cloud provider would be the third corner. v1 is deliberately local-first, cloud opt-in only — the app must be 100% functional in airplane mode. The protocol is what keeps all three corners reachable without forking the app.

+
+
+ + +
+

10 iOS realities you won't hit elsewhere

+

Platform friction that has nothing to do with the architecture but everything to do with shipping an iPhone app as a newcomer:

+ +
+
+

Permissions are async dialogs

+

Mic and speech access each trigger a one-time system prompt you must await, and the human-readable reason strings live in Info.plist. The recorder handles denial as a first-class failure state with an actionable message.

+
+
+

The audio session is global, shared state

+

AVAudioSession is a singleton you configure (.playAndRecord) and that the OS can yank away mid-recording. Calls, Siri, and alarms arrive as interruption notifications — handled here as a .began → .paused, .resumed → .recording, .stopped → auto-finalize flow.

+
+
+

Background recording needs a declared background mode

+

Locked-screen capture requires UIBackgroundModes = [audio], which Xcode can't auto-generate — hence a hand-maintained partial Info.plist. Don't delete it (it merges on top of the generated one).

+
+
+

The simulator can't run MLX

+

The simulator's GPU lacks the family MLX needs, so any MLX test would crash the whole suite. Those tests are compiled but skipped via #if !targetEnvironment(simulator); on-device math is validated by a DEBUG "smoke" button instead.

+
+
+

Free-tier signing expires every 7 days

+

Sideloading on a free Apple ID means the app stops launching after a week — re-build from Xcode to re-sign. Data (SwiftData rows + audio files) survives because the bundle ID is stable. The paid Developer Program is deferred until on-device LLM inference is validated.

+
+
+

The project file is generated; edit it with care

+

project.pbxproj is a fussy machine file. The project uses Xcode's file-system-synchronized groups; adding files/targets is done via the xcodeproj Ruby gem, not by hand, and validated on a /tmp copy first.

+
+
+
+ + +
+

11 Build & tooling, the short version

+

Day-to-day is a terminal loop (Xcode stays open in the background for signing, previews, and Instruments). The two commands you'll run most:

+
# Run the full test suite in the simulator (the command you'll use most)
+xcodebuild test -project "Relay Notes.xcodeproj" -scheme "Relay Notes" \
+  -destination 'platform=iOS Simulator,name=iPhone 17 Pro' 2>&1 | xcbeautify
+
+# Build only — faster sanity check
+xcodebuild build -project "Relay Notes.xcodeproj" -scheme "Relay Notes" \
+  -destination 'platform=iOS Simulator,name=iPhone 17 Pro' 2>&1 | xcbeautify
+
+
  • The quotes matter — both the project name and scheme contain a space.
  • Pipe through xcbeautify; raw xcodebuild output buries the actual errors.
  • Tests use Swift's Testing framework (import Testing, @Test, #expect) — closer to modern JS/Rust test ergonomics than XCTest. ~80 tests, ~12 s warm.
  • New test files must be registered in the project: ruby scripts/add_test_file.rb MyNewTests.swift.
  • Dependencies (mlx-swift, swift-numerics) resolve automatically via Swift Package Manager on first build — nothing to install by hand.
+
+ + +
+

12 Where to read next

+

A reading order into the actual repo, now that the map is in your head:

+
+
+
The spine, in code · start here
+
+
Transcription/Transcriber.swift — the protocols + the TranscriptionOptions sum type, with the nonisolated rationale in a long comment.
+
+
+
The orchestrator · the state machine in motion
+
+
Recording/RecorderViewModel.swift — follow startRecording() then stopAndTranscribe().
+
+
+
The trickiest file · realtime audio + Sendable boundaries
+
+
Audio/LiveAudioEngine.swift — the double-duty tap and TapState.
+
+
+
The why behind everything · prose, not code
+
+
CLAUDE.md (architecture + conventions), planning/notes.md (roadmap + stance), CHANGE_LOG.md (what shipped & why).
+
+
+ +
+
▹ The takeaway, in one sentence
+

Relay Notes is a small app that uses Swift's type system (sum-type enums, protocols, optionals) to make illegal states unrepresentable, and Swift's concurrency model (actors, @MainActor, AsyncStream) to make a realtime audio→text→storage pipeline race-free at compile time — all so that the transcription engine can be swapped at runtime without touching the code that orchestrates it.

+
+ +
+

Generated as an onboarding companion to the Relay Notes codebase · iOS 26 · SwiftUI · Swift 6 strict concurrency · SwiftData · MLX. Code excerpts are lightly trimmed from the real source; read the files for the full, commented versions.

+
+
+ +
+
+ + + + From ca42eaff2e2d880b3a951ed75dd72a8ecefd68c9 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 14 Jun 2026 04:43:36 +0000 Subject: [PATCH 2/3] docs: add four architecture diagrams to the iOS/Swift guide Inline SVG (no dependencies): a layered system map, a concurrency isolation-domains map, a runtime swimlane for tap-to-saved, and the recorder state machine. Themed to match; rendered for layout review. https://claude.ai/code/session_01W47XYXAX6wnMNT3KGJbzQh --- CHANGE_LOG.md | 2 +- docs/relay-notes-ios-guide.html | 343 +++++++++++++++++++++++++++++++- 2 files changed, 342 insertions(+), 3 deletions(-) diff --git a/CHANGE_LOG.md b/CHANGE_LOG.md index 78881b2..7c555cb 100644 --- a/CHANGE_LOG.md +++ b/CHANGE_LOG.md @@ -123,4 +123,4 @@ Entry style: bold lead-in summarizing what shipped, then the *why* / non-obvious ## 2026-06-14 -- **Onboarding doc — `docs/relay-notes-ios-guide.html`.** A self-contained, dependency-free HTML guide aimed at a seasoned programmer who is new to iOS/Swift: a "Rosetta Stone" mapping familiar concepts (interface→`protocol`, sum type→`enum` w/ associated values, ORM→SwiftData, UI-thread→`@MainActor`, Stream/Observable→`AsyncStream`, DI scope→`@Environment`) onto this codebase, then a guided tour of the provider-abstraction spine (`Transcriber`/`TranscriptionSession`/`TranscriptionEngine`/`TranscriberFactory` + the `TranscriptionOptions` sum type) and an end-to-end `tap → speak → saved` data-flow walkthrough traced through the real types (`RecorderViewModel` state machine, `LiveAudioEngine` double-duty tap, the three `AsyncStream`s, SwiftData `Note` save). Covers the Swift-6-strict-concurrency story (actors, `@MainActor`-by-default, the `nonisolated protocol` trap, `@unchecked Sendable` on `TapState`) and the iOS realities (permissions, `AVAudioSession`, background-audio `Info.plist`, simulator-can't-run-MLX, 7-day free-tier signing). 
Includes a vanilla-JS Swift/bash syntax highlighter + sidebar scroll-spy; teal→indigo app-icon palette. Reference/onboarding artifact, not a code change — no app behavior touched. +- **Onboarding doc — `docs/relay-notes-ios-guide.html`.** A self-contained, dependency-free HTML guide aimed at a seasoned programmer who is new to iOS/Swift: a "Rosetta Stone" mapping familiar concepts (interface→`protocol`, sum type→`enum` w/ associated values, ORM→SwiftData, UI-thread→`@MainActor`, Stream/Observable→`AsyncStream`, DI scope→`@Environment`) onto this codebase, then a guided tour of the provider-abstraction spine (`Transcriber`/`TranscriptionSession`/`TranscriptionEngine`/`TranscriberFactory` + the `TranscriptionOptions` sum type) and an end-to-end `tap → speak → saved` data-flow walkthrough traced through the real types (`RecorderViewModel` state machine, `LiveAudioEngine` double-duty tap, the three `AsyncStream`s, SwiftData `Note` save). Covers the Swift-6-strict-concurrency story (actors, `@MainActor`-by-default, the `nonisolated protocol` trap, `@unchecked Sendable` on `TapState`) and the iOS realities (permissions, `AVAudioSession`, background-audio `Info.plist`, simulator-can't-run-MLX, 7-day free-tier signing). **Four hand-authored inline-SVG architecture diagrams** orient the reader: a 5-layer system map (UI → orchestration → the spine → providers → frameworks, spine highlighted), a concurrency isolation-domains map (`@MainActor` / `actor` / audio-thread / `nonisolated`, every boundary an `await` or `AsyncStream`), a runtime swimlane for tap→speak→saved, and the recorder state-machine. Self-contained (no JS libs): a vanilla-JS Swift/bash highlighter + sidebar scroll-spy; SVGs rendered/eyeballed via a throwaway cairosvg pass for layout. Teal→indigo app-icon palette. Reference/onboarding artifact, not a code change — no app behavior touched. 
diff --git a/docs/relay-notes-ios-guide.html b/docs/relay-notes-ios-guide.html index 91d2317..42c049e 100644 --- a/docs/relay-notes-ios-guide.html +++ b/docs/relay-notes-ios-guide.html @@ -219,6 +219,31 @@ color:#fff; cursor: pointer; box-shadow: 0 6px 18px -4px rgba(0,0,0,.6); } main { padding-top: 60px; } } + + /* ---------- Architecture diagrams (inline SVG, dependency-free) ---------- */ + figure.diagram { margin: 26px 0; border: 1px solid var(--line); border-radius: 14px; + background: linear-gradient(180deg, #121925, var(--bg-card)); padding: 18px 16px 10px; } + figure.diagram svg { width: 100%; height: auto; display: block; } + figure.diagram figcaption { font-size: 12.8px; color: var(--text-dim); text-align: center; + margin-top: 10px; line-height: 1.5; max-width: 66ch; margin-left: auto; margin-right: auto; } + .diagram text { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif; } + .dg-box { fill: #1a2230; stroke: #2b3950; stroke-width: 1.5; } + .dg-accent { fill: rgba(27,181,164,.13); stroke: var(--teal-bright); stroke-width: 1.7; } + .dg-indigo { fill: rgba(79,70,229,.18); stroke: var(--indigo-bright); stroke-width: 1.5; } + .dg-amber { fill: rgba(245,185,113,.13); stroke: #f5b971; stroke-width: 1.5; } + .dg-pink { fill: rgba(255,123,156,.13); stroke: var(--kw); stroke-width: 1.5; } + .dg-region { fill: rgba(255,255,255,.022); stroke: #33425a; stroke-width: 1.4; stroke-dasharray: 5 4; } + .dg-band { fill: #0e131c; stroke: #1c2533; } + .dg-t { fill: #f2f5f9; font-weight: 700; } + .dg-s { fill: #93a0b1; } + .dg-mono { font-family: "SF Mono", ui-monospace, Menlo, monospace; fill: #9fe6d8; } + .dg-edge { stroke: #6b7890; stroke-width: 1.7; fill: none; } + .dg-edge-async { stroke: var(--teal-bright); stroke-width: 1.7; fill: none; stroke-dasharray: 6 4; } + .dg-elbl { fill: #cdd6e3; } + .dg-elbl-a { fill: #84e3d5; } + .dg-dom { fill: var(--indigo-bright); font-weight: 700; letter-spacing: .4px; } + .dg-dom-t { fill: 
var(--teal-bright); font-weight: 700; letter-spacing: .4px; } + .dg-dom-a { fill: #f5b971; font-weight: 700; letter-spacing: .4px; } @@ -418,7 +443,68 @@

3. The recorder is a state machine, expressed as an enum

case finished(transcript: String) case failed(message: String) } -

Illegal states are unrepresentable — there is no "recording AND failed" because the value is one case at a time. Transitions are a pure switch, which makes them unit-testable without spinning up audio hardware (see RecorderViewModel.nextState(for:from:)).

+

Illegal states are unrepresentable — there is no "recording AND failed" because the value is one case at a time. Transitions are a pure switch, which makes them unit-testable without spinning up audio hardware (see RecorderViewModel.nextState(for:from:)). Drawn out, the whole lifecycle is small:

+ +
+ [Diagram: recorder state machine. States: idle, recording (partial: String), paused (partial: String), finalizing (transcribing…), finished (transcript), failed (message). Transitions: tap ● start, .began, .resumed, tap ◼ stop, .stopped, ok, error / no speech, reset() (finished / failed → idle).]
recording ↔ paused is driven entirely by AVAudioSession interruptions (a call/Siri/alarm); a non-resumable .stopped auto-finalizes so audio is never lost. A failure during start or transcription lands in failed with a user-friendly message.
+
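A pure transition function of this kind might look like the sketch below. The state cases mirror the ones named in this section; `DemoState`, the `DemoEvent` enum, and its case names are hypothetical, and the real `RecorderViewModel.nextState(for:from:)` may accept different inputs.

```swift
// States as described in this section (renamed to avoid implying this is the real type).
enum DemoState: Equatable {
    case idle
    case recording(partial: String)
    case paused(partial: String)
    case finalizing
    case finished(transcript: String)
    case failed(message: String)
}

// Hypothetical event names, chosen for illustration only.
enum DemoEvent {
    case start, interruptionBegan, interruptionResumed, stop, reset
    case transcribed(String)
    case error(String)
}

// Pure function: no audio hardware, no side effects, trivially unit-testable.
func nextState(for event: DemoEvent, from state: DemoState) -> DemoState {
    switch (state, event) {
    case (.idle, .start):                          return .recording(partial: "")
    case (.recording(let p), .interruptionBegan):  return .paused(partial: p)
    case (.paused(let p), .interruptionResumed):   return .recording(partial: p)
    case (.recording, .stop), (.paused, .stop):    return .finalizing
    case (.finalizing, .transcribed(let text)):    return .finished(transcript: text)
    case (_, .error(let message)):                 return .failed(message: message)
    case (.finished, .reset), (.failed, .reset):   return .idle
    default:                                       return state  // event does not apply here
    }
}
```

Because the function takes a value and returns a value, a test can walk the whole lifecycle with plain equality checks.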

4. Optionals and guard

There is no null. "Might be absent" is encoded in the type as T?, and you must unwrap before use. The idiomatic unwrap is guard let — an early-return that also narrows the type for the rest of the scope:

@@ -548,6 +634,104 @@

AsyncStream = the backbone of the pipeline

} } } + +
+ [Diagram: isolation domains. @MainActor (UI & orchestration): RecorderViewModel, Tunings, TranscriberFactory, LiveAudioEngine, SwiftUI Views. Audio thread (realtime): input tap closure, TapState (@unchecked Sendable). actor (serialized): WhisperMLXTranscriber (one decode at a time, cached weights ≈ 480 MB). nonisolated: AppleSpeechTranscriber, AppleSpeechSession, WhisperStreamingSession (Mutex<[Float]>). Edges: installs tap, buffers ⟿, updates ⟿, feed() · finish(), await transcribePCM(). Legend: solid = await (cross-domain hop), dashed = AsyncStream (value pipe).]
Four isolation domains, and every arrow between them is a boundary the compiler checks. Values only cross as an await or through an AsyncStream — which is why the realtime audio thread, the GPU actor, and the UI never race.
+
+
▹ How to read the concurrency, fast

Mentally tag each type with its domain: @MainActor (UI + orchestration), actor (Whisper model), the audio thread (the tap closure + TapState). Every arrow between domains is an await or an AsyncStream. The compiler already verified the crossings are race-free — so you can trust the boundaries and just follow the data.

@@ -557,7 +741,79 @@

AsyncStream = the backbone of the pipeline

06 The architecture spine: provider abstraction

-

Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. Today it's transcription; the same shape is reserved for a future LLM-cleanup stage. If you internalize this section, the file layout makes sense.

+

Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. Today it's transcription; the same shape is reserved for a future LLM-cleanup stage. If you internalize this section, the file layout makes sense. Here's the whole system on one slide — five layers, each depending only on the one below:

+ +
+ [Diagram: five-layer system map, each layer using the one below. UI (SwiftUI): ContentView, NotesListView, RecorderView, NoteDetailView, SettingsView. Orchestration (@MainActor): RecorderViewModel, Tunings (@Observable), ReTranscriber. The spine (protocols + DI): Transcriber, TranscriptionSession, TranscriptionOptions (enum), TranscriptionEngine (enum), TranscriberFactory. Providers & capability: AppleSpeechTranscriber (class), WhisperMLXTranscriber (actor), LiveAudioEngine (@MainActor), WhisperModelStore (downloads), Note (@Model). Frameworks: Speech, AVFoundation, MLX · mlx-swift, SwiftData.]
The teal layer is the seam: everything above it talks only to protocols; everything at the provider layer is a swappable implementation (WhisperMLXTranscriber, also teal, is the one that plugs in). Add an engine = add a box on the provider row.
+

The contract has two methods, both on purpose

nonisolated protocol Transcriber: Sendable {
@@ -621,6 +877,89 @@ 

Resolving the provider: the factory

07 Tap → speak → saved: the whole path

This is the payoff. Here's one recording, traced end-to-end through the real types. Every concept above shows up doing a job.

+
+ [Diagram: tap → speak → saved swimlane. Lanes: You, RecorderView / VM (@MainActor), LiveAudioEngine, audio thread, TranscriptionSession, SwiftData + @Query. Messages in order: tap ● record; makeStreamingSession(options); engine.start(analyzerFormat); ↩ LiveRecording { buffers, interruptions }; loop while recording: buffers ⟿ (each chunk), session.feed(buffer), updates ⟿ partial transcript, state = .recording(partial); tap ◼ stop; engine.stop(); session.finish() → transcript; insert + save Note; @Query re-renders the list automatically.]
Time flows downward. Solid = a direct call; dashed teal = an AsyncStream delivering values over time. The loop is the live phase; the bottom four messages are stop → decode → persist → the list updating itself. The six numbered steps below narrate this same picture.
+
+

The tap: build the session, start the engine

From a8157a82160b7ee2dc2c6138bb804136eff382d9 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 15 Jun 2026 03:17:34 +0000 Subject: [PATCH 3/3] docs: sync iOS/Swift guide to T2 (Parakeet) + L2 (LLM cleanup) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Brings the guide current with the two features that landed on main since it was written: a third on-device transcription engine (Parakeet) and the on-device LLM "Clean up" feature behind a second LanguageModel spine. - Frame the provider abstraction as used twice (transcription + cleanup) - Tour: three engines (Apple permanent default) + optional cleanup step - §06: new "spine, proven twice" subsection; factory eviction + ModelStores - §09: three-engine comparison + cleanup LLM as a fourth MLX actor - §08: additive/non-destructive Note cleanup fields - §10: increased-memory-limit entitlement (free-tier accepts it) - Redrew the layered-map diagram with both protocol spines https://claude.ai/code/session_01W47XYXAX6wnMNT3KGJbzQh --- CHANGE_LOG.md | 4 + docs/relay-notes-ios-guide.html | 211 ++++++++++++++++++++------------ 2 files changed, 138 insertions(+), 77 deletions(-) diff --git a/CHANGE_LOG.md b/CHANGE_LOG.md index db84138..1e2aef5 100644 --- a/CHANGE_LOG.md +++ b/CHANGE_LOG.md @@ -141,3 +141,7 @@ Entry style: bold lead-in summarizing what shipped, then the *why* / non-obvious - **L2 re-sequenced — in-app "Clean up" MVP pulled forward ahead of the fixtures/harness route (code-complete; device-pending).** Since L2.0 proved on-device cleanup works with strong quality, the "harness-first, UI-last" caution was retired and the real feature became the evaluation vehicle (dogfooding real notes beats curated fixtures and yields the transcripts for free). **L2.2 fixtures superseded; L2.3 model A/B deferred** (the `LLMCleanupSmoke` repo-id path still sweeps candidates when wanted). 
The feature, mirroring the re-transcribe workflow and decoupled from MLX behind the `LanguageModel` protocol: **centralized model management in the Tuning sheet** — `CleanupModelStore` (a `DownloadableModelStore` bound to the new `ModelDownloadSpec.gemmaCleanupE2B`, pinned to commit `2c3e5074…`: the complete `gemma-4-e2b-it-4bit` snapshot — `model.safetensors` 3.58 GB + tokenizer/config/chat-template, 8 files, each SHA-256 + size verified) into `Application Support/llm/gemma-4-e2b-it-4bit/`, registered as a sibling on `ModelStores` (outside the transcription-engine `store(for:)`/`readyEngines` machinery), surfaced via `CleanupModelSection` (download/progress/delete, ~3.4 GB copy, same shape as `WhisperModelSection`/`ParakeetModelSection`). The app loads from the downloaded directory via `LLMModelFactory.loadContainer(from:using:)` — so **`MLXLanguageModel` gained a `Source` enum** (`.directory` for the app, `.repoId` for the DEBUG smoke; the smoke keeps the HubClient path, so `swift-huggingface` stays used and no dep was removed). **`Cleaner`** (`@MainActor @Observable`, the cleanup analogue of `ReTranscriber`) gates on `store.status == .ready`, runs `clean()` off a directory-loaded model, returns a non-destructive `Outcome`, and `evict()`s the ~2.7 GB model when the note is left. **`Note`** gained additive `cleanedTranscript`/`cleanupModel` (+ `isCleaned`/`applyCleanup`/`clearCleanup`) — raw `transcript` is never overwritten. **`NoteDetailView`**: a "Clean up" control (→ "Set up cleanup model" deep-link to the Tuning sheet when the model's absent, via an `onOpenSettings` closure threaded `ContentView`→`NotesListView`→detail), a before/after **Accept/Decline** sheet (`CleanupOutcomeSheet`), and a cleaned/raw display toggle + "Cleaned with …" provenance + "Remove". Tests (sim-safe): `CleanerTests` (gating + generic error message), `NoteTests` cleanup helpers + persistence, `DownloadableModelStoreTests` gemma spec pinning + `CleanupModelStore` readiness/subdirectory. 
Build + full simulator suite (20 suites) green. Plan: `plan.L2.md` §1 (re-sequencing note) + §6. - **L2.4 device-validated (iPhone 15 Pro Max) — in-app cleanup works end-to-end.** Downloaded the cleanup model via the Tuning `CleanupModelSection` (the SHA-pinned `CleanupModelStore` path → `Application Support/llm/gemma-4-e2b-it-4bit/`, separate from the smoke's HubClient cache), then ran "Clean up" on a real note: the before/after sheet showed the cleaned candidate, Accept persisted `cleanedTranscript`/`cleanupModel` with the raw `transcript` preserved. Confirms the directory-load path (`loadContainer(from:using:)`) and the whole gated flow on device. **L2 has shipped a usable on-device cleanup feature** — the L2 thesis (cleanup mitigates STT errors) is now dogfoodable on real notes; remaining L2 polish (precise tok/s; optional model picker / formal A/B via the deferred L2.3 harness) is non-blocking. - **Onboarding doc — `docs/relay-notes-ios-guide.html`.** A self-contained, dependency-free HTML guide aimed at a seasoned programmer who is new to iOS/Swift: a "Rosetta Stone" mapping familiar concepts (interface→`protocol`, sum type→`enum` w/ associated values, ORM→SwiftData, UI-thread→`@MainActor`, Stream/Observable→`AsyncStream`, DI scope→`@Environment`) onto this codebase, then a guided tour of the provider-abstraction spine (`Transcriber`/`TranscriptionSession`/`TranscriptionEngine`/`TranscriberFactory` + the `TranscriptionOptions` sum type) and an end-to-end `tap → speak → saved` data-flow walkthrough traced through the real types (`RecorderViewModel` state machine, `LiveAudioEngine` double-duty tap, the three `AsyncStream`s, SwiftData `Note` save). Covers the Swift-6-strict-concurrency story (actors, `@MainActor`-by-default, the `nonisolated protocol` trap, `@unchecked Sendable` on `TapState`) and the iOS realities (permissions, `AVAudioSession`, background-audio `Info.plist`, simulator-can't-run-MLX, 7-day free-tier signing). 
**Four hand-authored inline-SVG architecture diagrams** orient the reader: a 5-layer system map, a concurrency isolation-domains map, a runtime swimlane for tap→speak→saved, and the recorder state-machine. Self-contained (no JS libs): a vanilla-JS Swift/bash highlighter + sidebar scroll-spy; teal→indigo app-icon palette. Reference/onboarding artifact, not a code change — no app behavior touched. + +## 2026-06-15 + +- **Onboarding guide synced to T2 + L2 (`docs/relay-notes-ios-guide.html`).** Merged `main` into the doc branch and brought the guide current with the two features that landed since it was written: **Parakeet (third on-device transcription engine)** and **L2 on-device LLM cleanup (the `LanguageModel` spine)**. Substantive edits: framed the provider abstraction as **used twice** (transcription + cleanup) rather than "reserved for a future stage"; the tour now shows **three engines** (Apple Speech as the *permanent* default, Whisper + Parakeet opt-in) plus an optional "Clean up" pipeline step; §06 gained a **"The spine, proven twice: `LanguageModel`"** subsection (protocol + `MLXLanguageModel` actor + `Cleaner` + non-destructive `Note.cleanedTranscript`), and the factory snippet now shows the **single-live-MLX-engine eviction** + the `ModelStores` registry; §09 became a **three-engine** comparison and notes the cleanup LLM as a fourth MLX actor; §08 documents the additive/non-destructive cleanup fields; §10 fixed the now-stale signing note and added a card on the **`increased-memory-limit` entitlement** (accepted on the free tier); §11 updated to ~150 tests + the new `mlx-swift-lm` / `swift-huggingface` / `swift-transformers` deps. **Redrew the layered-map diagram** (6 columns, the two protocol spines in a teal/indigo seam, Parakeet/`MLXLanguageModel`/`ModelStores`/`Cleaner` added) and updated the isolation-domains actor box. Diagrams re-rendered + eyeballed (cairosvg). Docs-only; no app behavior touched. 
diff --git a/docs/relay-notes-ios-guide.html b/docs/relay-notes-ios-guide.html
index 42c049e..a3fa8c0 100644
--- a/docs/relay-notes-ios-guide.html
+++ b/docs/relay-notes-ios-guide.html
@@ -266,7 +266,7 @@
 06 The architecture spine
 07 Tap → speak → saved
 08 Persistence: SwiftData
-09 Two engines, one socket
+09 Three engines, one socket
 10 iOS realities
 11 Build & tooling
 12 Where to read next
@@ -281,11 +281,11 @@

Relay Notes, decoded.

You know how to architect systems, reason about concurrency, and read a call graph. What you may not know is the iOS dialect: SwiftUI, actors, property wrappers, SwiftData, and the strict-concurrency rules Swift 6 enforces at compile time. This guide is a translation layer — it maps what you already know onto how a real, working iPhone app is built, using Relay Notes as the worked example.

- What  on-device voice → text
+ What  on-device voice → text → cleanup
  UI  SwiftUI
  Concurrency  Swift 6 strict
  Data  SwiftData
- ML  MLX (Whisper)
+ ML  MLX · Whisper · Parakeet · LLM
  Target  iPhone 15 Pro Max
@@ -294,14 +294,14 @@

How to read this

The sections build on each other but each stands alone. If you want the aha fast, read §02 the Rosetta Stone (concept mapping), then jump straight to §07, which traces a single recording end-to-end through the actual types. The middle sections (§03–§05) are the language/framework primer; the later ones (§06–§11) are this app's architecture and the iOS-specific gotchas.

▹ The one idea that explains the whole codebase
-

Every external capability sits behind a protocol so the runtime provider is swappable without a rebuild. Transcription has two interchangeable engines (Apple Speech, on-device Whisper) plugged into one socket. Hold that thought — it's the spine, and most of the architecture exists to serve it.

+

Every external capability sits behind a protocol so the runtime provider is swappable without a rebuild. Transcription has three interchangeable engines (Apple Speech, on-device Whisper, on-device Parakeet) plugged into one socket — and the app now does the same trick a second time for an on-device LLM "Clean up" pass (a separate LanguageModel spine). Hold that thought — it's the pattern, and most of the architecture exists to serve it.

01 The 60-second tour

-

Relay Notes is intentionally tiny in scope: tap a button, speak, and an on-device transcript is saved. No account, no server, works in airplane mode. That narrowness is a feature — it keeps the surface small enough that the architecture is legible. The pipeline is a straight line:

+

Relay Notes is intentionally tiny in scope: tap a button, speak, and an on-device transcript is saved — then, optionally, a one-tap on-device LLM "Clean up" pass tidies it. No account, no server, works in airplane mode. That narrowness is a feature — it keeps the surface small enough that the architecture is legible. The pipeline is a straight line, with cleanup as an optional post-hoc step on a saved note:

🎙️
Capture · mic → PCM
@@ -311,20 +311,26 @@

01 The 60-second tour

📝
Transcribe · on-device
🗂️
Store note · SwiftData
+
⋯›
+
Clean up · opt · on-device LLM
-

The "transcribe" box is where the interesting design lives. It's not one thing — it's a socket that accepts either engine:

-
+

The "transcribe" box is where the interesting design lives. It's not one thing — it's a socket that accepts any of three on-device engines:

+
-

Apple Speech — the default

-

Built into iOS (SpeechAnalyzer + SpeechTranscriber). Streams a live transcript as you talk. No download, no model management. The "it just works" floor.

+

Apple Speech — permanent default

+

Built into iOS (SpeechAnalyzer + SpeechTranscriber). Streams a live transcript as you talk. No download. The "it just works" floor — and the settled default (iOS 27 strengthens it).

-

On-device Whisper — the upgrade

-

whisper-small.en (481 MB) running through MLX. Downloaded on first use. Decodes once when you stop (no live partials). Proves "on-device ≠ Apple-only."

+

Whisper — opt-in upgrade

+

whisper-small.en (481 MB) via MLX. Downloaded on first use. Decodes once when you stop. Proves "on-device ≠ Apple-only."

+
+
+

Parakeet — opt-in upgrade

+

NVIDIA parakeet-tdt-0.6b-v2 (FastConformer + TDT, ~2.4 GB) via MLX. Also finalize-only. A second your-choice engine, hand-ported the same way.

-

Everything below explains how those pieces are wired so that swapping engines is a runtime choice, not a code change — and how Swift's type system and concurrency model are leaned on to keep it safe.

+

And the "Clean up" feature is the same pattern again: a one-tap action that runs the saved transcript through an on-device LLM (Gemma 4 E2B) behind a separate LanguageModel protocol. Everything below explains how these pieces are wired so that swapping a provider is a runtime choice, not a code change — and how Swift's type system and concurrency model keep it safe.

@@ -688,9 +694,9 @@

AsyncStream = the backbone of the pipeline

- WhisperMLXTranscriber
- one decode at a time
- cached weights ≈ 480 MB
+ WhisperMLXTranscriber
+ cached weights · one at a time
+ + Parakeet · cleanup LLM
@@ -741,7 +747,7 @@

AsyncStream = the backbone of the pipeline

06 The architecture spine: provider abstraction

-

Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. Today it's transcription; the same shape is reserved for a future LLM-cleanup stage. If you internalize this section, the file layout makes sense. Here's the whole system on one slide — five layers, each depending only on the one below:

+

Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. It's used twice now — once for transcription (Transcriber, three engines) and once for on-device cleanup (LanguageModel, the "Clean up" feature). If you internalize this section, the file layout makes sense. Here's the whole system on one slide — five layers, each depending only on the one below; the two protocol spines sit in the teal/indigo band:

@@ -766,53 +772,59 @@

06 The architecture spine: provider abstraction

[Figure: layered system map (inline SVG, redrawn in this hunk). Context: SettingsView (unchanged). Orchestration (@MainActor): RecorderViewModel (state machine + tasks), Tunings (@Observable), ReTranscriber (re-run another engine), Cleaner (drives "Clean up"; new). The spines (protocols + DI): Transcriber (protocol), TranscriptionSession, TranscriptionOptions, TranscriberFactory, ModelStores (registry; new), LanguageModel (cleanup spine; new). Providers & capability: AppleSpeechTranscriber (class), WhisperMLXTranscriber (actor), ParakeetMLXTranscriber (actor; new), MLXLanguageModel (actor; new), LiveAudioEngine (@MainActor), Note (@Model). Frameworks: Speech, AVFoundation, mlx-swift, mlx-swift-lm (new), SwiftData. Dropped from the old drawing: the TranscriptionEngine label and the WhisperModelStore box.]
The teal layer is the seam: everything above it talks only to protocols; everything at the provider layer is a swappable implementation (WhisperMLXTranscriber, also teal, is the one that plugs in). Add an engine = add a box on the provider row.
+
Two protocol spines live in the band-3 seam: transcription (teal — Transcriber & friends) and cleanup (indigo — LanguageModel). Everything above talks only to those protocols; the provider row holds the swappable implementations (the three teal/indigo provider boxes are the MLX actors). Add a capability = add a box, not a refactor.

The contract has two methods, both on purpose

@@ -844,31 +856,60 @@

A session is the live handle

emitsLivePartials is a nice example of pushing a decision to the type that owns the knowledge: the recorder asks the session "do you stream?" rather than inferring it from an engine enum — so adding an engine doesn't mean editing the recorder.
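To make that concrete, here is a minimal sketch of the "ask the session" idea. The type names (`SessionSketch`, `StreamingSession`, `FinalizeOnlySession`, `RecordingUI`) are hypothetical stand-ins, not the app's real declarations; only the `emitsLivePartials` property name comes from the guide.

```swift
// Hypothetical, trimmed-down shapes; the real protocols live in
// Transcription/Transcriber.swift and carry more requirements.
protocol SessionSketch {
    /// The session itself knows whether it streams words while recording.
    var emitsLivePartials: Bool { get }
}

struct StreamingSession: SessionSketch { let emitsLivePartials = true }     // Apple Speech style
struct FinalizeOnlySession: SessionSketch { let emitsLivePartials = false } // Whisper / Parakeet style

enum RecordingUI: Equatable { case liveTranscriptCard, meterAndTimer }

/// The recorder asks the session; it never switches over an engine enum,
/// so a new engine needs no change here.
func recordingUI(for session: any SessionSketch) -> RecordingUI {
    session.emitsLivePartials ? .liveTranscriptCard : .meterAndTimer
}

print(recordingUI(for: StreamingSession()))
print(recordingUI(for: FinalizeOnlySession()))
```

Adding a third or fourth session type in this sketch means adding one struct; `recordingUI(for:)` stays untouched, which is the property the paragraph above is describing.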

Resolving the provider: the factory

-

A small factory maps the user's selected engine to a (cached) concrete instance. Caching is load-bearing — recreating the Whisper transcriber would reload 480 MB of weights every recording:

+

A small factory maps the selected engine to a cached concrete instance — with a twist that matters on a memory-constrained phone: it keeps at most one MLX engine resident at a time. Whisper (~0.5 GB) and Parakeet (~1.2 GB) are never used simultaneously, so switching to a different MLX engine evicts the previous model's weights before loading the next. Apple Speech holds no such state, so it's cached independently:

@MainActor
 final class TranscriberFactory {
-    private var appleSpeech: AppleSpeechTranscriber?
-    private var whisperMLX: WhisperMLXTranscriber?
+    private var appleSpeech: AppleSpeechTranscriber?              // cheap, cached on its own
+    private var liveMLX: (engine: TranscriptionEngine,           // at most ONE MLX engine live
+                          transcriber: any Transcriber)?
 
     func transcriber(for engine: TranscriptionEngine) -> any Transcriber {
         switch engine {
-        case .apple:      return appleSpeech ?? { let t = AppleSpeechTranscriber(locale: locale); appleSpeech = t; return t }()
-        case .whisperMLX: return whisperMLX  ?? { let t = WhisperMLXTranscriber(store: whisperModelStore); whisperMLX = t; return t }()
+        case .apple:       return appleSpeech ?? makeApple()
+        case .whisperMLX:  return liveMLXTranscriber(for: engine) { WhisperMLXTranscriber(store: stores?.whisper) }
+        case .parakeetMLX: return liveMLXTranscriber(for: engine) { ParakeetMLXTranscriber(store: stores?.parakeet) }
         }
     }
+
+    private func liveMLXTranscriber(for engine: TranscriptionEngine, _ make: () -> any Transcriber) -> any Transcriber {
+        if let liveMLX, liveMLX.engine == engine { return liveMLX.transcriber }
+        liveMLX = nil               // drop the old model (the factory's only strong ref) → its ~GB of weights free
+        let new = make(); liveMLX = (engine, new); return new
+    }
 }
+

Where the weights live on disk is the job of the ModelStores registry — one place that maps each engine to its DownloadableModelStore and answers "is this engine ready?" (Apple is always ready; an MLX engine only once its model is downloaded). That's what gates engine selection in Settings and reconciles a stale saved choice at launch.
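The readiness rule can be sketched in a few lines. Everything here is illustrative: `Engine`, `ModelStoreSketch`, `FakeStore`, and `ModelStoresSketch` are hypothetical names, and falling back to Apple Speech for a stale choice is an assumption about what "reconciles" means, not something the guide spells out.

```swift
// Hypothetical sketch of an engine-readiness registry; not the app's real API.
enum Engine: String, CaseIterable { case apple, whisperMLX, parakeetMLX }

protocol ModelStoreSketch {
    var isDownloaded: Bool { get }
}

struct FakeStore: ModelStoreSketch { var isDownloaded: Bool }

struct ModelStoresSketch {
    /// Only MLX engines have a store; Apple Speech needs no download.
    var stores: [Engine: any ModelStoreSketch]

    func isReady(_ engine: Engine) -> Bool {
        guard engine != .apple else { return true }    // Apple is always ready
        return stores[engine]?.isDownloaded ?? false   // MLX: ready once downloaded
    }

    /// Assumed reconciliation policy: keep a ready choice, otherwise use Apple Speech.
    func reconcile(saved: Engine) -> Engine {
        isReady(saved) ? saved : .apple
    }
}

let registry = ModelStoresSketch(stores: [
    .whisperMLX: FakeStore(isDownloaded: true),
    .parakeetMLX: FakeStore(isDownloaded: false),
])
print(registry.reconcile(saved: .parakeetMLX))
```

Keeping the "is it ready?" question in one place is what lets both the Settings picker and the launch-time check share a single answer.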

-

Put together, the socket looks like this — three concrete providers, one interface, the choice deferred to runtime:

+

Put together, the socket looks like this — three live providers (plus a cloud slot reserved), one interface, the choice deferred to runtime:

-
🍎
AppleSpeech
Transcriber
on-device · shipping
+
🍎
AppleSpeech
Transcriber
default · shipping
🧠
WhisperMLX
Transcriber
on-device · shipping
-
☁️
Cloud
Transcriber
opt-in · not built
+
🦜
ParakeetMLX
Transcriber
on-device · shipping
+
☁️
Cloud
Transcriber
opt-in · not built
🔌
any Transcriber · the socket
+ +

The spine, proven twice: LanguageModel

+

The strongest evidence that the abstraction earns its keep: when the team wanted on-device transcript cleanup (de-filler, punctuation, light structure via a local LLM), they didn't bolt it onto the transcriber — they stamped out a second, identical spine. Same isolation rules, same actor-for-GPU-state pattern, same swap-the-provider-at-runtime promise:

+
nonisolated protocol LanguageModel: Sendable {
+    func clean(_ raw: String) async throws -> String
+    // L3 adds `categorize(_:into:)` additively — designed in now, not a reshape.
+}
+
+// The MLX conformer is an actor, exactly like the transcribers — it caches a
+// non-Sendable, GPU-bound ModelContainer (Gemma 4 E2B, 4-bit) across calls.
+actor MLXLanguageModel: LanguageModel {
+    func clean(_ raw: String) async throws -> String {
+        let session = ChatSession(container, instructions: CleanupPrompt.system,
+                                  generateParameters: .init(temperature: 0))  // greedy: must not invent
+        return try await session.respond(to: raw)
+    }
+    func evict() { container = nil; MLX.GPU.clearCache() }   // same "one live model" discipline
+}
+

The orchestrator mirrors ReTranscriber too: Cleaner (a @MainActor @Observable) gates the "Clean up" button on the model being downloaded, runs clean() off a saved note, and hands back a non-destructive candidate the user accepts or declines. Accepting writes a separate field — the raw transcript is never overwritten (more in §08). The prompt lives in one place (CleanupPrompt) so swapping the model never changes behavior.
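The gate-then-candidate flow might look roughly like the sketch below. It swaps the MLX model for a trivial string-replacing stand-in so it runs anywhere; `LanguageModelSketch`, `FillerStrippingModel`, `NoteSketch`, `CleanerSketch`, and `demo(accepting:)` are all hypothetical names, and the real `Cleaner` is also `@Observable`, which is omitted here.

```swift
import Foundation

// Hypothetical mirror of the cleanup protocol described in the guide.
protocol LanguageModelSketch: Sendable {
    func clean(_ raw: String) async throws -> String
}

/// Stand-in for the MLX-backed model: strips one filler phrase.
struct FillerStrippingModel: LanguageModelSketch {
    func clean(_ raw: String) async throws -> String {
        raw.replacingOccurrences(of: "um, ", with: "")
    }
}

struct NoteSketch: Sendable, Equatable {
    let transcript: String                  // raw text, never overwritten
    var cleanedTranscript: String? = nil    // filled only on accept
}

@MainActor
final class CleanerSketch {
    private let model: any LanguageModelSketch
    let modelIsDownloaded: Bool             // gates the "Clean up" action
    private(set) var candidate: String?

    init(model: any LanguageModelSketch, modelIsDownloaded: Bool) {
        self.model = model
        self.modelIsDownloaded = modelIsDownloaded
    }

    /// Produces a candidate without touching the note.
    func proposeCleanup(for note: NoteSketch) async throws {
        guard modelIsDownloaded else { return }
        candidate = try await model.clean(note.transcript)
    }

    /// Accepting writes the separate field and clears the candidate.
    func accept(_ note: NoteSketch) -> NoteSketch {
        var updated = note
        updated.cleanedTranscript = candidate
        candidate = nil
        return updated
    }
}

@MainActor
func demo(accepting: Bool) async throws -> NoteSketch {
    let cleaner = CleanerSketch(model: FillerStrippingModel(), modelIsDownloaded: true)
    let note = NoteSketch(transcript: "um, buy milk")
    try await cleaner.proposeCleanup(for: note)
    return accepting ? cleaner.accept(note) : note
}

let saved = try await demo(accepting: true)
print(saved.transcript, "|", saved.cleanedTranscript ?? "nil")
```

Declining is simply not calling `accept`, so the note comes back unchanged; that is the non-destructive property in miniature.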

-
◆ Why this is worth the ceremony
-

The whole point of v1 is to validate "on-device, your-choice model" without betting the app on one vendor. Because the seam is a protocol, a new engine is a new file, not a refactor — and the next planned capability (local LLM cleanup) drops a LanguageModel protocol into the exact same shape. The abstraction isn't speculative; it already pays for itself with two live engines.

+
◆ Why this is worth the ceremony — now demonstrably
+

The point was always to validate "on-device, your-choice model" without betting the app on one vendor. Because each seam is a protocol, a new engine is a new file, not a refactor — and that claim is no longer theoretical: the cleanup feature dropped an entire second capability into the same shape, reusing the same actor-isolation and single-live-model patterns. The abstraction pays for itself with three live transcription engines and an on-device LLM.

@@ -1046,10 +1087,12 @@

08 Persistence: SwiftData in one page

     var id: UUID
     var createdAt: Date
     var audioFilename: String          // ← filename, NOT a URL (see below)
-    var transcript: String
+    var transcript: String             // the canonical RAW text; never overwritten
     var title: String?
     var transcriptionModel: String?    // provenance: "Apple Speech" / "Whisper (small.en)"
     var originalTranscript: String?    // pre-edit baseline, enables revert
+    var cleanedTranscript: String?     // LLM-cleaned version (non-destructive); nil = never cleaned
+    var cleanupModel: String?          // provenance: "Gemma 4 E2B (MLX 4-bit)"
 }

The container is wired up once, at the app's entry point — that's all it takes to get a working store:

@@ -1074,29 +1117,34 @@

08 Persistence: SwiftData in one page

context.delete(self); try? context.save() // row }
+
+
◆ Additive & non-destructive by design
+

Notice the optional-by-default fields. Both edit (originalTranscript) and cleanup (cleanedTranscript/cleanupModel) were added after notes already existed — as nil-defaulting optionals, which SwiftData migrates for free (no migration plan, no backfill). And both are non-destructive: the raw transcript is always the source of truth, so an LLM cleanup can be accepted, toggled, or removed without ever losing what was actually said. That's a deliberate stance — an LLM "improving" a note by inventing detail is worse than a messy-but-true one.
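A small, framework-free sketch shows why optional side fields make cleanup reversible. `NoteFieldsSketch`, `displayText`, `acceptCleanup(_:model:)`, and `removeCleanup()` are hypothetical helpers for illustration; only the three field names mirror the model above, and the real type is a SwiftData `@Model` class rather than a struct.

```swift
// Hypothetical mirror of the cleanup-related fields on the note model.
struct NoteFieldsSketch {
    var transcript: String             // canonical raw text
    var cleanedTranscript: String?     // nil = never cleaned
    var cleanupModel: String?          // provenance of the cleaned text

    /// Prefer the cleaned text when present; otherwise show the raw transcript.
    var displayText: String { cleanedTranscript ?? transcript }

    /// Accepting a cleanup only fills the optional fields.
    mutating func acceptCleanup(_ text: String, model: String) {
        cleanedTranscript = text
        cleanupModel = model
    }

    /// Removing a cleanup is lossless because the raw text was never touched.
    mutating func removeCleanup() {
        cleanedTranscript = nil
        cleanupModel = nil
    }
}

var note = NoteFieldsSketch(transcript: "um so buy milk")
note.acceptCleanup("Buy milk.", model: "Gemma 4 E2B (MLX 4-bit)")
print(note.displayText)
note.removeCleanup()
print(note.displayText)
```

Because every new field defaults to `nil`, a note created before the feature existed behaves exactly like one that was never cleaned.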

+
-

09 Two engines, one socket

-

The same TranscriptionSession interface backs two genuinely different implementations. Comparing them is the clearest illustration of why the abstraction earns its keep:

+

09 Three engines, one socket

+

The same TranscriptionSession interface backs three genuinely different implementations. Comparing them is the clearest illustration of why the abstraction earns its keep:

- | | Apple Speech | On-device Whisper (MLX) |
- | Framework | SpeechAnalyzer + SpeechTranscriber | mlx-swift 0.31.4, hand-ported pipeline |
- | Model | Apple's, bundled with iOS, no choice | whisper-small.en · 481 MB · your choice of repo |
- | Install | Nothing to download | Downloaded on first use → Application Support |
- | Live partials | Yes: words stream as you talk | No: accumulate PCM, decode at finish() |
- | Concurrency type | nonisolated final class | actor (guards cached weights) |
- | Cost | Real-time, free | ~4× real-time decode; ~2.8 GB peak footprint |
- | While recording, the UI… | renders the live transcript card | renders a mic-level meter + timer placeholder |
+ | | Apple Speech | Whisper (MLX) | Parakeet (MLX) |
+ | Role | Permanent default | opt-in upgrade | opt-in upgrade |
+ | Framework | SpeechAnalyzer | mlx-swift, hand-ported | mlx-swift, hand-ported |
+ | Model | Apple's, no choice | whisper-small.en · 481 MB | parakeet-tdt-0.6b-v2 · ~2.4 GB |
+ | Install | Nothing to download | Download on first use | Download on first use |
+ | Live partials | Yes: stream as you talk | No: decode at finish() | No: decode at finish() |
+ | Concurrency type | nonisolated class | actor | actor |
+ | While recording, UI… | live transcript card | meter + timer placeholder | meter + timer placeholder |
+

And there's a fourth provider (the third MLX actor) that isn't a transcriber at all: MLXLanguageModel (Gemma 4 E2B, the cleanup model from §06). The factory's "one live MLX model at a time" rule spans all of them: Whisper, Parakeet, and the cleanup LLM are never co-resident on the 8 GB device.

-

Why Whisper is an actor and Apple is a class

-

This is the concurrency model paying rent. Whisper caches hundreds of MB of non-Sendable MLX state and is GPU-bound — serializing access through an actor is both correct (no races on the cache) and desirable (decodes shouldn't overlap on one GPU). Apple's transcriber holds no such shared state, so a plain nonisolated class is enough. Same protocol, different isolation, chosen per-conformer — exactly the freedom the nonisolated protocol from §05 preserves.

+

Why the MLX engines are actors and Apple is a class

+

This is the concurrency model paying rent. Each MLX engine caches hundreds of MB to GBs of non-Sendable model state and is GPU-bound — serializing access through an actor is both correct (no races on the cache) and desirable (decodes shouldn't overlap on one GPU). Apple's transcriber holds no such shared state, so a plain nonisolated class is enough. Same protocol, different isolation, chosen per-conformer — exactly the freedom the nonisolated protocol from §05 preserves, and the reason adding Parakeet (and later the cleanup LLM) didn't perturb anything above the provider layer.
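The isolation choice is easy to see in a toy version. This sketch assumes a simplified protocol (`TranscriberSketch`) and two made-up conformers; a `[Float]` array stands in for the non-Sendable GPU state, and a load counter shows that the actor serializes access to the cache even under concurrent calls.

```swift
// Hypothetical simplified protocol; the real one is declared `nonisolated`
// and has a different surface.
protocol TranscriberSketch: Sendable {
    func transcribe(_ samples: [Float]) async -> String
}

/// Stateless provider: no shared mutable state, so a plain final class suffices.
final class StatelessTranscriber: TranscriberSketch {
    func transcribe(_ samples: [Float]) async -> String {
        "stateless: \(samples.count) samples"
    }
}

/// Provider with an expensive cache: the actor serializes every access to `weights`.
actor CachingTranscriber: TranscriberSketch {
    private var weights: [Float]?          // stand-in for cached model state
    private(set) var loadCount = 0

    /// Synchronous inside the actor, so check-then-load cannot interleave.
    private func loadedWeights() -> [Float] {
        if let weights { return weights }
        loadCount += 1
        let fresh = [Float](repeating: 0.5, count: 4)
        weights = fresh
        return fresh
    }

    func transcribe(_ samples: [Float]) async -> String {
        let w = loadedWeights()
        return "cached: \(samples.count) samples, \(w.count) weights"
    }
}

/// Fires `n` concurrent calls at one actor and reports how often it loaded.
func loadsAfterConcurrentCalls(_ n: Int) async -> Int {
    let engine = CachingTranscriber()
    await withTaskGroup(of: Void.self) { group in
        for _ in 0..<n {
            group.addTask { _ = await engine.transcribe([0, 1, 2]) }
        }
    }
    return await engine.loadCount
}

let loads = await loadsAfterConcurrentCalls(8)
print(loads)
```

Both conformers satisfy the same protocol, yet only the one holding shared state pays for isolation, which is the per-conformer freedom the paragraph above describes.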

◆ The product stance behind the engineering
-

Two axes are kept independent on purpose: where it runs (on-device vs cloud) and whose model (Apple's vs your choice). Apple Speech is on-device-Apple; Whisper is on-device-yours; a future cloud provider would be the third corner. v1 is deliberately local-first, cloud opt-in only — the app must be 100% functional in airplane mode. The protocol is what keeps all three corners reachable without forking the app.

+

Two axes are kept independent on purpose: where it runs (on-device vs cloud) and whose model (Apple's vs your choice). Apple Speech is on-device-Apple and the settled default: it works the instant the app installs, with no download. Whisper and Parakeet are on-device-yours, opt-in upgrades. The current bet is notable: rather than chasing a more accurate transcriber, the app mitigates transcription errors with the on-device cleanup LLM, which is exactly why the second LanguageModel spine exists. Everything stays local-first, cloud opt-in only: the app is 100% functional in airplane mode. The protocols are what keep every corner reachable without forking the app.

@@ -1124,7 +1172,11 @@

The simulator can't run MLX

Free-tier signing expires every 7 days

-

Sideloading on a free Apple ID means the app stops launching after a week — re-build from Xcode to re-sign. Data (SwiftData rows + audio files) survives because the bundle ID is stable. The paid Developer Program is deferred until on-device LLM inference is validated.

+

Sideloading on a free Apple ID means the app stops launching after a week — re-build from Xcode to re-sign. Data (SwiftData rows + audio files) survives because the bundle ID is stable. The paid Developer Program is still deferred — sideload covers the personal-device use case.

+
+
+

The 4-bit LLM needs the memory entitlement

+

The cleanup model's generation peak crosses iOS's ~3 GB jetsam ceiling, so the app ships com.apple.developer.kernel.increased-memory-limit (in Relay Notes.entitlements). The pleasant surprise: the free Apple ID tier accepts it — no paid program needed. The ASR engines (Whisper/Parakeet) stay under the ceiling and don't require it.

The project file is generated, edit it with care

@@ -1147,9 +1199,9 @@

11 Build & tooling, the short version

  • The quotes matter — both the project name and scheme contain a space.
  • Pipe through xcbeautify; raw xcodebuild output buries the actual errors.
- • Tests use Swift's Testing framework (import Testing, @Test, #expect), closer to modern JS/Rust test ergonomics than XCTest. ~80 tests, ~12 s warm.
+ • Tests use Swift's Testing framework (import Testing, @Test, #expect), closer to modern JS/Rust test ergonomics than XCTest. ~150 tests; the MLX numerics (Whisper/Parakeet/LLM) are device-only, validated by DEBUG "smoke" buttons.
  • New test files must be registered in the project: ruby scripts/add_test_file.rb MyNewTests.swift.
- • Dependencies (mlx-swift, swift-numerics) resolve automatically via Swift Package Manager on first build; nothing to install by hand.
+ • Dependencies resolve via Swift Package Manager on first build: mlx-swift (the ASR engines), plus mlx-swift-lm + swift-huggingface + swift-transformers (the cleanup LLM: model loading, HF download, tokenizers/chat templates). Nothing to install by hand.
@@ -1163,6 +1215,11 @@

12 Where to read next

Transcription/Transcriber.swift — the protocols + the TranscriptionOptions sum type, with the nonisolated rationale in a long comment.
+
+
The spine, again · the pattern, reused
+
+
Enrichment/LanguageModel.swift + MLXLanguageModel.swift + Cleaner.swift — the cleanup feature, built as a carbon copy of the transcription spine.
+
The orchestrator · the state machine in motion
@@ -1182,11 +1239,11 @@

12 Where to read next

▹ The takeaway, in one sentence
-

Relay Notes is a small app that uses Swift's type system (sum-type enums, protocols, optionals) to make illegal states unrepresentable, and Swift's concurrency model (actors, @MainActor, AsyncStream) to make a realtime audio→text→storage pipeline race-free at compile time — all so that the transcription engine can be swapped at runtime without touching the code that orchestrates it.

+

Relay Notes is a small app that uses Swift's type system (sum-type enums, protocols, optionals) to make illegal states unrepresentable, and Swift's concurrency model (actors, @MainActor, AsyncStream) to make a realtime audio→text→storage pipeline race-free at compile time — all so that providers can be swapped at runtime without touching the code that orchestrates them. The proof it works: the same protocol-spine pattern now backs three transcription engines and a second, independent on-device LLM cleanup capability.

-

Generated as an onboarding companion to the Relay Notes codebase · iOS 26 · SwiftUI · Swift 6 strict concurrency · SwiftData · MLX. Code excerpts are lightly trimmed from the real source; read the files for the full, commented versions.

+

Generated as an onboarding companion to the Relay Notes codebase · iOS 26 · SwiftUI · Swift 6 strict concurrency · SwiftData · MLX (mlx-swift + mlx-swift-lm). Code excerpts are lightly trimmed from the real source; read the files for the full, commented versions.