Relay Notes, decoded.
+You know how to architect systems, reason about concurrency, and read a call graph. What you may not know is the iOS dialect: SwiftUI, actors, property wrappers, SwiftData, and the strict-concurrency rules Swift 6 enforces at compile time. This guide is a translation layer — it maps what you already know onto how a real, working iPhone app is built, using Relay Notes as the worked example.
+How to read this
+The sections build on each other but each stands alone. If you want the aha fast, read §02 the Rosetta Stone (concept mapping), then jump straight to §07, which traces a single recording end-to-end through the actual types. The middle sections (§03–§05) are the language/framework primer; the later ones (§06–§11) are this app's architecture and the iOS-specific gotchas.
+Every external capability sits behind a protocol so the runtime provider is swappable without a rebuild. Transcription has three interchangeable engines (Apple Speech, on-device Whisper, on-device Parakeet) plugged into one socket — and the app now does the same trick a second time for an on-device LLM "Clean up" pass (a separate LanguageModel spine). Hold that thought — it's the pattern, and most of the architecture exists to serve it.
01 The 60-second tour
+Relay Notes is intentionally tiny in scope: tap a button, speak, and an on-device transcript is saved — then, optionally, a one-tap on-device LLM "Clean up" pass tidies it. No account, no server, works in airplane mode. That narrowness is a feature — it keeps the surface small enough that the architecture is legible. The pipeline is a straight line, with cleanup as an optional post-hoc step on a saved note:
+ +The "transcribe" box is where the interesting design lives. It's not one thing — it's a socket that accepts any of three on-device engines:
+Apple Speech — permanent default
+Built into iOS (SpeechAnalyzer + SpeechTranscriber). Streams a live transcript as you talk. No download. The "it just works" floor — and the settled default (iOS 27 strengthens it).
Whisper — opt-in upgrade
+whisper-small.en (481 MB) via MLX. Downloaded on first use. Decodes once when you stop. Proves "on-device ≠ Apple-only."
Parakeet — opt-in upgrade
+NVIDIA parakeet-tdt-0.6b-v2 (FastConformer + TDT, ~2.4 GB) via MLX. Also finalize-only. A second your-choice engine, hand-ported the same way.
And the "Clean up" feature is the same pattern again: a one-tap action that runs the saved transcript through an on-device LLM (Gemma 4 E2B) behind a separate LanguageModel protocol. Everything below explains how these pieces are wired so that swapping a provider is a runtime choice, not a code change — and how Swift's type system and concurrency model keep it safe.
02 The Rosetta Stone
+Almost every iOS concept has a name you already know under a different label. Here's the dictionary. Skim it now; the rest of the guide makes it concrete.
- protocol: e.g. Transcriber, TranscriptionSession. Conformers can be classes, structs, or actors.
- enum with associated values: e.g. TranscriptionOptions.apple(…) | .whisperMLX
- Task { … } ≈ spawning a coroutine; structured concurrency is the default.
- @MainActor: a compiler-enforced annotation, not a convention you have to remember.
- Sendable + actor. Swift 6 rejects the build on a potential data race.
- AsyncStream<T>: consumed with for await x in stream.
- struct conforming to View, with a body computed property.
- @State, @Observable. Mutating a tracked field re-runs the affected body.
- @Environment(\.modelContext) and friends.
- @Model class Note (entity) + ModelContext (unit of work) + @Query (live result set).
- T?, unwrapped with guard let / if let. No implicit null.
- @main struct …App: App with a body: some Scene.
- .xcodeproj (a project.pbxproj file) + Swift Package Manager (Package.resolved).

Swift leans value types (struct/enum, copied) far harder than the OO languages you're used to. Reference types (class, actor) are the exception, reserved for identity and shared mutable state. When you see struct here, think "immutable-ish value, cheap to copy, no aliasing surprises." That single shift removes most "wait, why did that change?" confusion.
03 Swift, the parts that bite
+You can read Swift on sight — it's C-family with type inference. Four features show up constantly in this codebase and behave differently than their cousins elsewhere. Learn these four and the source stops surprising you.
+ +1. struct vs class vs actor vs enum
+ Four ways to declare a type, chosen by semantics, not habit:
| Keyword | Semantics | Used here for |
|---|---|---|
| struct | Value: copied on assignment, no shared identity | Note options, RecordingOptions, every View |
| enum | Value + closed set of cases, optionally with payloads | TranscriptionOptions, the recorder's State |
| class | Reference: shared identity, mutable, ARC-managed | RecorderViewModel, Tunings, LiveAudioEngine |
| actor | Reference + serialized access (its own isolation domain) | WhisperMLXTranscriber (guards ~480 MB of model state) |
2. Enums carry data — this is the tagged union you wanted
+This is the single most important Swift idiom in the app. TranscriptionOptions isn't a struct with a pile of nullable fields — it's a closed set of shapes, each with its own type-safe payload:
enum TranscriptionOptions: Sendable {
    case apple(AppleSpeechOptions)   // Apple-only: preset + contextual strings
    case whisperMLX                  // Whisper: no decode dials in v1
}

struct AppleSpeechOptions: Sendable {
    var preset: SpeechTranscriber.Preset = .transcription
    var contextualStrings: [String] = []
}
+ You destructure it with switch or pattern-matching guard. The compiler forces you to handle every case, and you literally cannot read Apple's preset out of a .whisperMLX value — it doesn't exist on that case. That's a whole class of "field is null for this provider" bugs deleted at the type level.
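A minimal sketch of that destructuring. It assumes a stand-in Preset enum in place of SpeechTranscriber.Preset (so it compiles without the Speech framework), and describe(_:) is a hypothetical helper, not something from the app:

```swift
// Stand-in for SpeechTranscriber.Preset (assumption) so the sketch has no framework dependency.
enum Preset: Sendable { case transcription, dictation }

struct AppleSpeechOptions: Sendable {
    var preset: Preset = .transcription
    var contextualStrings: [String] = []
}

enum TranscriptionOptions: Sendable {
    case apple(AppleSpeechOptions)
    case whisperMLX
}

// Hypothetical helper: the switch must cover every case, and the Apple
// payload is only in scope inside the .apple branch.
func describe(_ options: TranscriptionOptions) -> String {
    switch options {
    case .apple(let apple):
        return "apple, \(apple.contextualStrings.count) contextual strings"
    case .whisperMLX:
        return "whisperMLX, no options"
    }
}

let appleSummary = describe(.apple(AppleSpeechOptions(contextualStrings: ["Relay"])))
let whisperSummary = describe(.whisperMLX)
```

Removing a case from the switch turns into a compile error rather than a silent fallthrough, which is the property the paragraph above relies on.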
3. The recorder is a state machine, expressed as an enum
+Instead of a tangle of isRecording / isPaused / errorMessage booleans that can contradict each other, the recorder's entire lifecycle is one value that's always exactly one valid state:
enum State: Equatable {
    case idle
    case recording(partial: String)   // live transcript so far
    case paused(partial: String)      // interrupted by a call/Siri/alarm
    case finalizing                   // stopped; transcribing
    case finished(transcript: String)
    case failed(message: String)
}
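Because the whole lifecycle is one value, a transition can be written as a pure function over (state, event). The following is a minimal sketch under stated assumptions: the Event enum and the transition table are hypothetical, and the app's own RecorderViewModel.nextState(for:from:) may well differ.

```swift
enum State: Equatable {
    case idle
    case recording(partial: String)
    case paused(partial: String)
    case finalizing
    case finished(transcript: String)
    case failed(message: String)
}

// Hypothetical events; the app's real inputs are not shown in this guide.
enum Event {
    case start, interrupted, resumed, stop
    case transcribed(String)
    case error(String)
}

// Pure function: no audio hardware involved. Unhandled pairs leave the state unchanged.
func nextState(for event: Event, from state: State) -> State {
    switch (state, event) {
    case (.idle, .start):                        return .recording(partial: "")
    case (.recording(let text), .interrupted):   return .paused(partial: text)
    case (.paused(let text), .resumed):          return .recording(partial: text)
    case (.recording, .stop), (.paused, .stop):  return .finalizing
    case (.finalizing, .transcribed(let text)):  return .finished(transcript: text)
    case (_, .error(let message)):               return .failed(message: message)
    default:                                     return state
    }
}

let paused = nextState(for: .interrupted, from: .recording(partial: "hello"))
let ignored = nextState(for: .resumed, from: .idle)
```

A test can feed it any (state, event) pair and compare the result with ==, since State is Equatable.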
+ Illegal states are unrepresentable — there is no "recording AND failed" because the value is one case at a time. Transitions are a pure switch, which makes them unit-testable without spinning up audio hardware (see RecorderViewModel.nextState(for:from:)). Drawn out, the whole lifecycle is small:
[Figure: recorder state diagram. Pauses come from AVAudioSession interruptions (a call/Siri/alarm); a non-resumable .stopped auto-finalizes so audio is never lost. A failure during start or transcription lands in failed with a user-friendly message.]
4. Optionals and guard
There is no null. "Might be absent" is encoded in the type as T?, and you must unwrap before use. The idiomatic unwrap is guard let, an early return that also narrows the type for the rest of the scope:
guard let session, let url else {
    state = .failed(message: "Recording could not be saved. Please try again.")
    return
}
// past this line, `session` and `url` are non-optional
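The same shape as a self-contained sketch. saveSummary(session:url:) is a hypothetical helper, and the String session is a placeholder for the real session type:

```swift
import Foundation

// Hypothetical helper mirroring the guard above: both inputs must be present.
func saveSummary(session: String?, url: URL?) -> String {
    guard let session, let url else {
        return "Recording could not be saved. Please try again."
    }
    // Past the guard, `session` is a String and `url` is a URL (no longer optional).
    return "\(session) -> \(url.lastPathComponent)"
}

let saved = saveSummary(session: "apple", url: URL(fileURLWithPath: "/tmp/note.m4a"))
let missing = saveSummary(session: nil, url: nil)
```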
+ some View / any Transcriber: some means "one specific concrete type the compiler knows but I won't name" (opaque return); any means "a boxed value of any conformer, decided at runtime" (existential). The app uses any Transcriber precisely because the concrete engine is a runtime choice.
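A small sketch of the some/any distinction, using a trimmed-down hypothetical Transcriber protocol and two fake conformers (none of these are the app's real types):

```swift
// Hypothetical, minimal protocol and conformers for illustration only.
protocol Transcriber { var name: String { get } }
struct AppleFake: Transcriber { let name = "apple" }
struct WhisperFake: Transcriber { let name = "whisper" }

// Opaque: always the same concrete type, known to the compiler but unnamed to callers.
func defaultTranscriber() -> some Transcriber {
    AppleFake()
}

// Existential: the concrete type is picked at runtime, so the result is boxed.
func transcriber(useWhisper: Bool) -> any Transcriber {
    if useWhisper {
        return WhisperFake()
    } else {
        return AppleFake()
    }
}

let chosen = transcriber(useWhisper: true)
let fixed = defaultTranscriber()
```

The second function could not return some Transcriber, because its two branches produce different concrete types.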
04 SwiftUI: the UI is a function of state
+If you've used React, SwiftUI will feel familiar: you describe the UI for a given state, and the framework diffs and re-renders. A view is a struct (cheap, disposable, recreated constantly) with a body that returns a description of the UI — never an imperative "now mutate this label."
struct ContentView: View {
    @Environment(\.modelContext) private var modelContext   // injected dependency
    @State private var viewModel: RecorderViewModel?        // owned, re-renders on change
    @State private var showSettings = false

    var body: some View {
        NavigationStack {
            VStack(spacing: 0) {
                NotesListView(searchText: searchText, reTranscriber: reTranscriber)
                Divider()
                if let viewModel {
                    RecorderView(viewModel: viewModel)
                }
            }
            .navigationTitle("Relay Notes")
            .toolbar { /* settings button */ }
            .sheet(isPresented: $showSettings) { /* settings sheet */ }
        }
    }
}
+
+ Property wrappers = where state lives
+Those @-prefixed declarations aren't decoration — each one tells SwiftUI a different thing about ownership and reactivity:
| Wrapper | Closest analogue | Means |
|---|---|---|
| @State | useState | This view owns this value; mutating it re-renders. |
| @Observable | a signal / observable store | Macro on a class; reads in a body auto-subscribe. RecorderViewModel and Tunings use it. |
| @Environment | React Context | Pull an ambient dependency (the SwiftData context) without threading it through every initializer. |
| @Query | a live DB query / useQuery | A SwiftData fetch that re-runs and re-renders when matching rows change. The notes list is just @Query(sort: \.createdAt, order: .reverse). |
| $value | two-way binding | The $ prefix makes a Binding: a read/write handle a child view (e.g. a TextField) can mutate. |
The composition root: where dependencies are wired
+Notice viewModel starts as nil. The app has no DI container — instead there's a composition root, the spot where the real object graph is assembled once, lazily, the first time the view appears:
.task { // runs once when the view appears
    if viewModel == nil {
        let tunings = Tunings()
        tunings.reconcileEngineAvailability(whisperReady: whisperStore.status == .ready)
        let factory = TranscriberFactory(whisperModelStore: whisperStore)
        viewModel = RecorderViewModel(
            engine: LiveAudioEngine(),
            transcriberFactory: factory,
            modelContext: modelContext,
            tunings: tunings
        )
        reTranscriber = ReTranscriber(factory: factory, whisperStore: whisperStore)
    }
}
+ This is hand-rolled constructor injection: the view model is handed its collaborators (LiveAudioEngine, the factory, the SwiftData context, the tunings) rather than reaching for globals. That's what makes the logic testable — tests construct a RecorderViewModel with fakes. .task is the lifecycle hook (≈ useEffect(() => …, [])), and it's async-aware so it can await without blocking the UI.
Views stay dumb (layout + bindings). The RecorderViewModel holds the state machine and orchestration. This split is what lets the gnarly async audio logic be exercised by 80-odd unit tests while the SwiftUI layer stays a thin shell.
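To make the "tests construct a view model with fakes" point concrete, here is a stripped-down sketch of constructor injection. AudioEngine, NoteStore, RecorderModel, and the fakes are all hypothetical stand-ins; the real RecorderViewModel takes more collaborators and is async.

```swift
// Hypothetical, trimmed-down collaborator protocols.
protocol AudioEngine { func start() -> String }
protocol NoteStore { func save(_ transcript: String) }

// Fakes a test would pass in instead of real hardware and a real database.
struct FakeEngine: AudioEngine {
    func start() -> String { "fake audio" }
}

final class FakeStore: NoteStore {
    private(set) var saved: [String] = []
    func save(_ transcript: String) { saved.append(transcript) }
}

// The model only sees the protocols it was handed; it reaches for no globals.
final class RecorderModel {
    private let engine: any AudioEngine
    private let store: any NoteStore

    init(engine: any AudioEngine, store: any NoteStore) {
        self.engine = engine
        self.store = store
    }

    func recordOnce() {
        store.save("transcript of \(engine.start())")
    }
}

let fakeStore = FakeStore()
RecorderModel(engine: FakeEngine(), store: fakeStore).recordOnce()
```

The assertion target is the fake store's contents, so the orchestration can be checked without a microphone or a SwiftData container.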
05 Swift 6 concurrency (read this twice)
+This is the section that trips up newcomers the hardest, because Swift 6 promotes data races from "runtime heisenbug" to compile error. The rules are strict, but once they click, the audio pipeline reads cleanly.
+ +Actors = isolation domains
+An actor is a reference type whose mutable state can only be touched one task at a time — the compiler serializes access for you. Calls from outside hop onto the actor and are therefore await-ed. The Whisper transcriber is an actor precisely because it caches ~480 MB of non-thread-safe MLX model state across calls:
actor WhisperMLXTranscriber: Transcriber {
    private var cache: LoadedAssets?   // model + tokenizer + mel filters
    // serialized by the actor → safe to mutate without a single lock
}
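A runnable miniature of the same idea, assuming a hypothetical ModelCache actor that stands in for the transcriber's cached assets. Eight tasks ask for the weights concurrently; because the actor serializes access and weights() never suspends, the load runs once.

```swift
// Hypothetical stand-in for an actor that caches expensive model state.
actor ModelCache {
    private var cached: [Float]?
    private(set) var loadCount = 0

    func weights() -> [Float] {
        if let cached { return cached }
        loadCount += 1                                   // only reached on a cache miss
        let loaded = [Float](repeating: 0.5, count: 4)   // pretend this is a slow load
        cached = loaded
        return loaded
    }
}

// Calls from outside the actor are awaited; the actor runs them one at a time.
func loadConcurrently(from cache: ModelCache) async -> Int {
    await withTaskGroup(of: Int.self, returning: Int.self) { group in
        for _ in 0..<8 {
            group.addTask { await cache.weights().count }
        }
        var total = 0
        for await count in group { total += count }
        return total
    }
}

let modelCache = ModelCache()
let totalElements = await loadConcurrently(from: modelCache)
let loads = await modelCache.loadCount
```

No lock appears anywhere in the sketch; the isolation is entirely in the actor keyword.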
+
+ @MainActor = the UI isolation domain
+ There's a special global actor, @MainActor, that represents the main thread. Anything that touches UI state is annotated with it, and the compiler then guarantees those members run on the main thread. The view model and the audio engine both opt in:
@MainActor
@Observable
final class RecorderViewModel { /* state, tasks, orchestration */ }

@MainActor
final class LiveAudioEngine { /* setup/teardown on the main actor */ }
+ Notably, this project sets SWIFT_DEFAULT_ACTOR_ISOLATION = MainActor — so types are main-actor by default unless they opt out. That default is convenient for UI code but creates the single sharpest gotcha in the codebase 👇
nonisolated protocol trap
With main-actor-by-default on, an unannotated protocol becomes implicitly @MainActor, and conformance inference then silently smears @MainActor onto your conformers; it once stamped @MainActor onto an actor's synchronous init. The fix is to mark isolation-neutral protocols explicitly:
// Both protocols are nonisolated *on purpose* so each conformer
// picks its own isolation: AppleSpeechTranscriber is a plain class,
// WhisperMLXTranscriber is an actor.
nonisolated protocol Transcriber: Sendable {
    func transcribe(_ audio: URL, options: TranscriptionOptions) async throws -> String
    func makeStreamingSession(options: TranscriptionOptions) async throws -> any TranscriptionSession
}
+ Sendable = "safe to cross an isolation boundary"
+ To pass a value between actors/tasks, the compiler must know it can't introduce a race. Value types (struct/enum of Sendable parts) are automatically Sendable. When you know something is safe but can't prove it to the compiler — like a helper that the audio thread touches single-threaded by construction — you assert it with @unchecked Sendable and take responsibility:
// Runs on the realtime audio thread; single-threaded access by construction.
private final class TapState: @unchecked Sendable { … }
+
+ AsyncStream = the backbone of the pipeline
+ The whole capture→transcribe handoff is plumbed with async streams — a producer yields values, a consumer for awaits them. Three streams flow through a single recording:
struct LiveRecording: Sendable {
    let url: URL                                        // where the audio file lands
    let buffers: AsyncStream<AVAudioPCMBuffer>          // mic audio, chunk by chunk
    let interruptions: AsyncStream<InterruptionEvent>   // call/Siri/alarm events
}
+ And the transcript itself comes back as a stream of ever-growing strings — that's how the live partial transcript updates the UI character-by-character:
updatesTask = Task { [weak self, session] in
    for await partial in session.updates {   // each value = transcript so far
        guard let self else { return }
        if case .recording = self.state {
            self.state = .recording(partial: partial)   // re-renders the live card
        }
    }
}
+
[Figure: isolation domains. Every crossing between domains goes through an await or through an AsyncStream, which is why the realtime audio thread, the GPU actor, and the UI never race.]
Mentally tag each type with its domain: @MainActor (UI + orchestration), actor (Whisper model), the audio thread (the tap closure + TapState). Every arrow between domains is an await or an AsyncStream. The compiler already verified the crossings are race-free, so you can trust the boundaries and just follow the data.
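The producer/consumer shape of an AsyncStream, reduced to a sketch. The word list and the variable names are made up for illustration; the producer yields the transcript so far, the same way the updates stream is described above.

```swift
// Producer: yields an ever-growing transcript, then finishes the stream.
// The build closure runs when the stream is created; yielded values are buffered.
let updates = AsyncStream<String> { continuation in
    var soFar = ""
    for word in ["tap", "speak", "saved"] {
        soFar += (soFar.isEmpty ? "" : " ") + word
        continuation.yield(soFar)
    }
    continuation.finish()
}

// Consumer: `for await` suspends until the next value arrives and
// exits the loop once the producer calls finish().
var received: [String] = []
for await partial in updates {
    received.append(partial)
}
```

In the app the producer and consumer live in different isolation domains; here both run in one place purely to show the mechanics.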
06 The architecture spine: provider abstraction
+Here's the load-bearing pattern, stated plainly: capabilities hide behind protocols; concrete providers are resolved at runtime. It's used twice now — once for transcription (Transcriber, three engines) and once for on-device cleanup (LanguageModel, the "Clean up" feature). If you internalize this section, the file layout makes sense. Here's the whole system on one slide — five layers, each depending only on the one below; the two protocol spines sit in the teal/indigo band:
[Figure: the five-layer architecture. The two protocol spines are transcription (teal: Transcriber & friends) and cleanup (indigo: LanguageModel). Everything above talks only to those protocols; the provider row holds the swappable implementations (the three teal/indigo provider boxes are the MLX actors). Add a capability = add a box, not a refactor.]
The contract has two methods, both on purpose
nonisolated protocol Transcriber: Sendable {
    // File-based. UNUSED by the app today; kept for the future cloud-STT
    // providers (which work on uploaded files) and a "re-transcribe" action.
    func transcribe(_ audio: URL, options: TranscriptionOptions) async throws -> String

    // Streaming. This is what the recorder actually uses: it returns a session
    // that the audio engine feeds buffers into.
    func makeStreamingSession(options: TranscriptionOptions) async throws -> any TranscriptionSession
}
+ The file-based transcribe(_:options:) looks like dead code — the app only calls the streaming path. It's deliberately retained for cloud STT (which operates on uploaded files) and a future re-transcribe action. This is the kind of intent that lives in comments and the planning docs, not in the call graph. Read before deleting.
A session is the live handle
+The streaming method hands back a TranscriptionSession — the object the audio engine pushes buffers into and reads results out of. Crucially, the session is the authority on its own behavior, so the recorder doesn't branch on engine type:
nonisolated protocol TranscriptionSession: Sendable, AnyObject {
    var audioFormat: AVAudioFormat? { get }     // the PCM format it wants
    var updates: AsyncStream<String> { get }    // live partial transcripts
    var emitsLivePartials: Bool { get }         // Apple: true · Whisper: false
    var modelDescription: String { get }        // provenance, saved on the Note
    func feed(_ buffer: AVAudioPCMBuffer)       // push mic audio in
    func finish() async throws -> String        // stop, return final transcript
    func cancel() async
}
+ emitsLivePartials is a nice example of pushing a decision to the type that owns the knowledge: the recorder asks the session "do you stream?" rather than inferring it from an engine enum — so adding an engine doesn't mean editing the recorder.
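A sketch of what "ask the session" buys. SessionCapabilities, the two fakes, and LiveCard are hypothetical names for illustration; only emitsLivePartials comes from the protocol above.

```swift
// Hypothetical slice of the session protocol: just the capability flag.
protocol SessionCapabilities { var emitsLivePartials: Bool { get } }

struct StreamingFake: SessionCapabilities { let emitsLivePartials = true }
struct FinalizeOnlyFake: SessionCapabilities { let emitsLivePartials = false }

// Hypothetical UI choice made while recording.
enum LiveCard: Equatable { case transcript, meterAndTimer }

// No engine enum in sight: the decision comes from the session itself.
func liveCard(for session: any SessionCapabilities) -> LiveCard {
    session.emitsLivePartials ? .transcript : .meterAndTimer
}

let appleLike = liveCard(for: StreamingFake())
let whisperLike = liveCard(for: FinalizeOnlyFake())
```

A new engine only has to answer the flag truthfully; liveCard(for:) stays untouched.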
Resolving the provider: the factory
+A small factory maps the selected engine to a cached concrete instance — with a twist that matters on a memory-constrained phone: it keeps at most one MLX engine resident at a time. Whisper (~0.5 GB) and Parakeet (~1.2 GB) are never used simultaneously, so switching to a different MLX engine evicts the previous model's weights before loading the next. Apple Speech holds no such state, so it's cached independently:
@MainActor
final class TranscriberFactory {
    private var appleSpeech: AppleSpeechTranscriber?       // cheap, cached on its own
    private var liveMLX: (engine: TranscriptionEngine,     // at most ONE MLX engine live
                          transcriber: any Transcriber)?

    func transcriber(for engine: TranscriptionEngine) -> any Transcriber {
        switch engine {
        case .apple:       return appleSpeech ?? makeApple()
        case .whisperMLX:  return liveMLXTranscriber(for: engine) { WhisperMLXTranscriber(store: stores?.whisper) }
        case .parakeetMLX: return liveMLXTranscriber(for: engine) { ParakeetMLXTranscriber(store: stores?.parakeet) }
        }
    }

    private func liveMLXTranscriber(for engine: TranscriptionEngine, _ make: () -> any Transcriber) -> any Transcriber {
        if let liveMLX, liveMLX.engine == engine { return liveMLX.transcriber }
        liveMLX = nil   // drop the old model (the factory's only strong ref) → its ~GB of weights free
        let new = make(); liveMLX = (engine, new); return new
    }
}
+ Where the weights live on disk is the job of the ModelStores registry — one place that maps each engine to its DownloadableModelStore and answers "is this engine ready?" (Apple is always ready; an MLX engine only once its model is downloaded). That's what gates engine selection in Settings and reconciles a stale saved choice at launch.
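The factory's single-slot eviction can be isolated into a few lines. This is a sketch with hypothetical names (Engine, FakeModel, SingleSlotCache) and a build counter added only so the behaviour is observable; it is not the app's factory.

```swift
enum Engine: Equatable { case whisperMLX, parakeetMLX }

// Hypothetical stand-in for a transcriber holding large model weights.
final class FakeModel {
    let engine: Engine
    init(engine: Engine) { self.engine = engine }
}

final class SingleSlotCache {
    private var live: (engine: Engine, model: FakeModel)?
    private(set) var buildCount = 0   // added for illustration only

    func model(for engine: Engine, make: () -> FakeModel) -> FakeModel {
        if let live, live.engine == engine { return live.model }   // same engine: reuse
        live = nil                 // drop the cache's reference to the old model first
        buildCount += 1
        let new = make()
        live = (engine, new)
        return new
    }
}

let slot = SingleSlotCache()
let first = slot.model(for: .whisperMLX) { FakeModel(engine: .whisperMLX) }
let again = slot.model(for: .whisperMLX) { FakeModel(engine: .whisperMLX) }
_ = slot.model(for: .parakeetMLX) { FakeModel(engine: .parakeetMLX) }
let reloaded = slot.model(for: .whisperMLX) { FakeModel(engine: .whisperMLX) }
```

Switching engines and back costs a rebuild, which is the trade the factory accepts in exchange for never holding two models at once.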
Put together, the socket looks like this — three live providers (plus a cloud slot reserved), one interface, the choice deferred to runtime:
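[Figure: the socket. AppleSpeechTranscriber (default · shipping), WhisperMLXTranscriber (on-device · shipping), ParakeetMLXTranscriber (on-device · shipping), and a reserved cloud transcriber (opt-in · not built) all plug into any Transcriber, the socket.]
The spine, proven twice: LanguageModel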
+ The strongest evidence that the abstraction earns its keep: when the team wanted on-device transcript cleanup (de-filler, punctuation, light structure via a local LLM), they didn't bolt it onto the transcriber — they stamped out a second, identical spine. Same isolation rules, same actor-for-GPU-state pattern, same swap-the-provider-at-runtime promise:
nonisolated protocol LanguageModel: Sendable {
    func clean(_ raw: String) async throws -> String
    // L3 adds `categorize(_:into:)` additively; designed in now, not a reshape.
}

// The MLX conformer is an actor, exactly like the transcribers: it caches a
// non-Sendable, GPU-bound ModelContainer (Gemma 4 E2B, 4-bit) across calls.
actor MLXLanguageModel: LanguageModel {
    func clean(_ raw: String) async throws -> String {
        let session = ChatSession(container, instructions: CleanupPrompt.system,
                                  generateParameters: .init(temperature: 0))   // greedy: must not invent
        return try await session.respond(to: raw)
    }
    func evict() { container = nil; MLX.GPU.clearCache() }   // same "one live model" discipline
}
+ The orchestrator mirrors ReTranscriber too: Cleaner (a @MainActor @Observable) gates the "Clean up" button on the model being downloaded, runs clean() off a saved note, and hands back a non-destructive candidate the user accepts or declines. Accepting writes a separate field — the raw transcript is never overwritten (more in §08). The prompt lives in one place (CleanupPrompt) so swapping the model never changes behavior.
The point was always to validate "on-device, your-choice model" without betting the app on one vendor. Because each seam is a protocol, a new engine is a new file, not a refactor — and that claim is no longer theoretical: the cleanup feature dropped an entire second capability into the same shape, reusing the same actor-isolation and single-live-model patterns. The abstraction pays for itself with three live transcription engines and an on-device LLM.
+07 Tap → speak → saved: the whole path
+This is the payoff. Here's one recording, traced end-to-end through the real types. Every concept above shows up doing a job.
[Figure: sequence diagram of one recording, with AsyncStream arrows delivering values over time. The loop is the live phase; the bottom four messages are stop → decode → persist → the list updating itself. The six numbered steps below narrate this same picture.]
The tap: build the session, start the engine
+Recording/RecorderViewModel.swift → startRecording()
+The view model reads a snapshot of the user's tunings (engine, preset, bitrate) at this instant — mid-recording setting changes intentionally don't take effect until next time. It asks the factory for the right Transcriber, makes a streaming session, then starts the audio engine, handing it the session's preferred audioFormat so capture and recognition agree on a PCM format.
let transcriber = transcriberFactory.transcriber(for: tunings.engine)
let session = try await transcriber.makeStreamingSession(options: tunings.transcriptionOptions)
let recording = try await engine.start(options: tunings.recordingOptions,
                                       analyzerFormat: session.audioFormat)
+ Capture does double duty on every buffer
+Audio/LiveAudioEngine.swift → installTap + TapState.handle
+An input tap on AVAudioEngine fires on the realtime audio thread for each chunk of mic audio. For every buffer it (a) writes AAC/m4a to disk for later playback, and (b) converts the PCM to the analyzer's format and yields it into an AsyncStream. That's why the saved audio and the transcript come from one capture, not two.
func handle(buffer: AVAudioPCMBuffer) {
    try? audioFile.write(from: buffer)   // (a) persist for playback
    // (b) convert to the analyzer's format, then:
    continuation.yield(outBuffer)        // → into LiveRecording.buffers
}
+ The tap closure runs on a realtime thread that must never block. It holds a @unchecked Sendable TapState (single-threaded by construction) — this is exactly the case where you assert thread-safety to the compiler because the runtime contract guarantees it but the type system can't see it.
Three concurrent tasks consume the streams
+RecorderViewModel.startRecording()
The view model spins up three unstructured tasks (Task { … }), one per stream, and keeps a handle to each so it can cancel them later: feed (push buffers into the session + compute a mic level), updates (live partial transcript → state), and interruptions (call/Siri/alarm → pause/resume/finalize). The state flips to .recording(partial: "") and the UI comes alive.
feedTask = Task { [session] in
    for await buffer in recording.buffers { session.feed(buffer); … }
}
updatesTask = Task { for await partial in session.updates { … } }               // live text
interruptionTask = Task { for await e in recording.interruptions { … } }       // pause/resume
state = .recording(partial: "")
+ Live transcript streams to the screen (Apple) — or a meter does (Whisper)
+AppleSpeechTranscriber.swift / WhisperStreamingSession.swift
+With Apple Speech, each recognizer result (volatile or final) is folded into a growing string and yielded on updates — you watch the words appear. With Whisper there are no partials by design (emitsLivePartials = false); it just accumulates PCM in memory, so the UI shows a live audio-level meter + elapsed timer placeholder instead of a blank card.
Stop: finalize and get the transcript
+RecorderViewModel.stopAndTranscribe()
+State → .finalizing. The engine stops (closing the audio file), the feed/interruption tasks cancel, then session.finish() returns the final transcript. For Apple that drains the last results; for Whisper that's where the entire decode happens — a 5-minute note sits ~80 s on the spinner. Same interface, very different cost profile, hidden behind finish().
state = .finalizing
let url = await engine.stop()
let transcript = try await session.finish()   // Apple: drain · Whisper: decode now
+ Persist the note
+Models/Note.swift + ModelContext
+A Note is created — storing the audio filename (not a URL), the transcript, and the engine's modelDescription as provenance — inserted into the SwiftData context and saved. Because the notes list is a live @Query, the new row appears in the UI automatically. State → .finished. Done.
let note = Note(audioFilename: url.lastPathComponent,
                transcript: transcript,
                transcriptionModel: session.modelDescription)
modelContext.insert(note)
try modelContext.save()
state = .finished(transcript: transcript)
+ The recorder never names "Apple" or "Whisper." It talks to a Transcriber and a TranscriptionSession, asks them what they can do (emitsLivePartials, audioFormat, modelDescription), and lets the streams carry the data. That's the spine doing its job — orchestration with zero engine-specific branches.
08 Persistence: SwiftData in one page
+SwiftData is Apple's modern ORM (a type-safe wrapper over Core Data). If you've used Prisma, Room, or ActiveRecord, you already know the shape — three pieces:
+@Model = entity
A macro on a class that makes its stored properties persistent columns. Note is the only model.
ModelContext = session
The unit of work: insert, delete, save. Injected via @Environment.
@Query = live fetch
A reactive result set. The list view re-renders when rows change — no manual refresh.
@Model
final class Note {
    var id: UUID
    var createdAt: Date
    var audioFilename: String        // ← filename, NOT a URL (see below)
    var transcript: String           // the canonical RAW text; never overwritten
    var title: String?
    var transcriptionModel: String?  // provenance: "Apple Speech" / "Whisper (small.en)"
    var originalTranscript: String?  // pre-edit baseline, enables revert
    var cleanedTranscript: String?   // LLM-cleaned version (non-destructive); nil = never cleaned
    var cleanupModel: String?        // provenance: "Gemma 4 E2B (MLX 4-bit)"
}
+
+ The container is wired up once, at the app's entry point — that's all it takes to get a working store:
@main
struct Relay_NotesApp: App {
    var body: some Scene {
        WindowGroup { ContentView() }
            .modelContainer(for: Note.self)   // creates/opens the SQLite store
    }
}
+
+ The app's container path can change between launches, so a persisted absolute URL goes stale. The note stores audioFilename and resolves it against the documents directory at access time:
var audioURL: URL { URL.documentsDirectory.appending(path: audioFilename) }
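Why the filename survives a container move, as a sketch. StoredNote and the two container paths are invented for illustration, and the documents directory is passed in explicitly so the example does not depend on where it runs:

```swift
import Foundation

// Hypothetical stand-in for Note: it persists only the filename.
struct StoredNote {
    let audioFilename: String

    // Resolve against whatever the documents directory is at access time.
    func audioURL(in documents: URL) -> URL {
        documents.appendingPathComponent(audioFilename)
    }
}

let stored = StoredNote(audioFilename: "note-1.m4a")

// Two made-up container paths, as if the app container changed between launches.
let firstLaunch = URL(fileURLWithPath: "/containers/AAAA/Documents", isDirectory: true)
let laterLaunch = URL(fileURLWithPath: "/containers/BBBB/Documents", isDirectory: true)

let urlBefore = stored.audioURL(in: firstLaunch)
let urlAfter = stored.audioURL(in: laterLaunch)
```

Had the note persisted urlBefore as an absolute URL, it would point into the old container after the move; the filename resolves correctly both times.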
+ A note has two artifacts: the SwiftData row and an audio file on disk. Deleting the row alone orphans the file. So there's exactly one approved delete, used everywhere:
func deleteWithAudio(in context: ModelContext) {
    try? FileManager.default.removeItem(at: audioURL)   // file
    context.delete(self); try? context.save()            // row
}
+ Notice the optional-by-default fields. Both edit (originalTranscript) and cleanup (cleanedTranscript/cleanupModel) were added after notes already existed — as nil-defaulting optionals, which SwiftData migrates for free (no migration plan, no backfill). And both are non-destructive: the raw transcript is always the source of truth, so an LLM cleanup can be accepted, toggled, or removed without ever losing what was actually said. That's a deliberate stance — an LLM "improving" a note by inventing detail is worse than a messy-but-true one.
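The non-destructive rule in miniature. NoteText, showCleaned, and displayText are hypothetical names (the guide does not show how the app picks which text to display); only transcript and cleanedTranscript mirror fields on Note.

```swift
// Hypothetical value type mirroring two of Note's fields.
struct NoteText {
    let transcript: String            // raw text, never overwritten
    var cleanedTranscript: String?    // nil = never cleaned
    var showCleaned = true            // hypothetical toggle

    // Prefer the cleaned text when it exists and is switched on; otherwise the raw text.
    var displayText: String {
        if showCleaned, let cleanedTranscript { return cleanedTranscript }
        return transcript
    }
}

var text = NoteText(transcript: "um so buy milk")
let beforeCleanup = text.displayText

text.cleanedTranscript = "Buy milk."       // user accepts a cleanup candidate
let afterAccept = text.displayText

text.showCleaned = false                   // toggled off: raw text is still there
let toggledOff = text.displayText

text.cleanedTranscript = nil               // cleanup removed entirely
text.showCleaned = true
let afterRemoval = text.displayText
```

Because transcript is a let, nothing in this sketch can overwrite what was actually said.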
09 Three engines, one socket
+The same TranscriptionSession interface backs three genuinely different implementations. Comparing them is the clearest illustration of why the abstraction earns its keep:
| | Apple Speech | Whisper (MLX) | Parakeet (MLX) |
|---|---|---|---|
| Role | Permanent default | opt-in upgrade | opt-in upgrade |
| Framework | SpeechAnalyzer | mlx-swift, hand-ported | mlx-swift, hand-ported |
| Model | Apple's, no choice | whisper-small.en · 481 MB | parakeet-tdt-0.6b-v2 · ~2.4 GB |
| Install | Nothing to download | Download on first use | Download on first use |
| Live partials | Yes — stream as you talk | No — decode at finish() | No — decode at finish() |
| Concurrency type | nonisolated class | actor | actor |
| While recording, UI… | live transcript card | meter + timer placeholder | meter + timer placeholder |
And there's a third MLX actor that isn't a transcriber at all: MLXLanguageModel (Gemma 4 E2B, the cleanup model from §06). The factory's "one live MLX model at a time" rule spans all of them: Whisper, Parakeet, and the cleanup LLM are never co-resident on the 8 GB device.
Why the MLX engines are actors and Apple is a class
+ This is the concurrency model paying rent. Each MLX engine caches hundreds of MB to GBs of non-Sendable model state and is GPU-bound — serializing access through an actor is both correct (no races on the cache) and desirable (decodes shouldn't overlap on one GPU). Apple's transcriber holds no such shared state, so a plain nonisolated class is enough. Same protocol, different isolation, chosen per-conformer — exactly the freedom the nonisolated protocol from §05 preserves, and the reason adding Parakeet (and later the cleanup LLM) didn't perturb anything above the provider layer.
Two axes are kept independent on purpose: where it runs (on-device vs cloud) and whose model (Apple's vs your choice). Apple Speech is on-device-Apple and the settled default — it works the instant the app installs, with no download. Whisper and Parakeet are on-device-yours, opt-in upgrades. The current bet is notable: rather than chase a more accurate transcriber, mitigate transcription errors with the on-device cleanup LLM — which is exactly why the second LanguageModel spine exists. Everything stays local-first, cloud opt-in only: the app is 100% functional in airplane mode. The protocols are what keep every corner reachable without forking the app.
10 iOS realities you won't hit elsewhere
+Platform friction that has nothing to do with the architecture but everything to do with shipping an iPhone app as a newcomer:
+ +Permissions are async dialogs
+Mic and speech access each trigger a one-time system prompt you must await, and the human-readable reason strings live in Info.plist. The recorder handles denial as a first-class failure state with an actionable message.
The audio session is global, shared state
+AVAudioSession is a singleton you configure (.playAndRecord) and that the OS can yank away mid-recording. Calls, Siri, and alarms arrive as interruption notifications — handled here as a .began → .paused, .resumed → .recording, .stopped → auto-finalize flow.
Background recording needs a background mode
+Locked-screen capture requires UIBackgroundModes = [audio], which Xcode can't auto-generate — hence a hand-maintained partial Info.plist. Don't delete it (it merges on top of the generated one).
The simulator can't run MLX
+The simulator's GPU lacks the family MLX needs, so any MLX test would crash the whole suite. Those tests are compiled but skipped via #if !targetEnvironment(simulator); on-device math is validated by a DEBUG "smoke" button instead.
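As a rough illustration of the compile-time gate, the sketch below wraps a stand-in numeric check in the same condition. The function name and the dot-product arithmetic are hypothetical placeholders for real MLX work; only the #if condition itself comes from the text above.

```swift
// Hypothetical stand-in for an MLX-backed numerics check.
func runNumericsCheck() -> String {
    #if !targetEnvironment(simulator)
    // Non-simulator builds: this branch is the one that gets compiled in.
    let a: [Double] = [1, 2, 3]
    let b: [Double] = [4, 5, 6]
    var dot = 0.0
    for (x, y) in zip(a, b) {
        dot += x * y               // 1*4 + 2*5 + 3*6 = 32
    }
    return dot == 32.0 ? "ok" : "mismatch"
    #else
    // Simulator builds: the GPU-dependent code above is never built or run.
    return "skipped on simulator"
    #endif
}
```

The choice is made when the target is built, so the simulator binary never contains the GPU-dependent path at all.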
Free-tier signing expires every 7 days
+Sideloading on a free Apple ID means the app stops launching after a week — re-build from Xcode to re-sign. Data (SwiftData rows + audio files) survives because the bundle ID is stable. The paid Developer Program is still deferred — sideload covers the personal-device use case.
+The 4-bit LLM needs the memory entitlement
+The cleanup model's generation peak crosses iOS's ~3 GB jetsam ceiling, so the app ships com.apple.developer.kernel.increased-memory-limit (in Relay Notes.entitlements). The pleasant surprise: the free Apple ID tier accepts it — no paid program needed. The ASR engines (Whisper/Parakeet) stay under the ceiling and don't require it.
The project file is generated; edit it with care
+project.pbxproj is a fussy machine file. The project uses Xcode's file-system-synchronized groups; adding files/targets is done via the xcodeproj Ruby gem, not by hand, and validated on a /tmp copy first.
11 Build & tooling, the short version
+Day-to-day is a terminal loop (Xcode stays open in the background for signing, previews, and Instruments). The two commands you'll run most:
+# Run the full test suite in the simulator (the command you'll use most)
+xcodebuild test -project "Relay Notes.xcodeproj" -scheme "Relay Notes" \
+ -destination 'platform=iOS Simulator,name=iPhone 17 Pro' 2>&1 | xcbeautify
+
+# Build only — faster sanity check
+xcodebuild build -project "Relay Notes.xcodeproj" -scheme "Relay Notes" \
+ -destination 'platform=iOS Simulator,name=iPhone 17 Pro' 2>&1 | xcbeautify
+ - The quotes matter: both the project name and the scheme contain a space.
+ - Pipe through xcbeautify; raw xcodebuild output buries the actual errors.
+ - Tests use Swift's Testing framework (import Testing, @Test, #expect), which is closer to modern JS/Rust test ergonomics than XCTest. There are ~150 tests; the MLX numerics (Whisper/Parakeet/LLM) are device-only, validated by DEBUG "smoke" buttons.
+ - New test files must be registered in the project: ruby scripts/add_test_file.rb MyNewTests.swift
+ - Dependencies resolve via Swift Package Manager on first build: mlx-swift (the ASR engines), plus mlx-swift-lm + swift-huggingface + swift-transformers (the cleanup LLM: model loading, HF download, tokenizers/chat templates). Nothing to install by hand.
+
12 Where to read next
+A reading order into the actual repo, now that the map is in your head:
+ - Transcription/Transcriber.swift: the protocols + the TranscriptionOptions sum type, with the nonisolated rationale in a long comment.
+ - Enrichment/LanguageModel.swift + MLXLanguageModel.swift + Cleaner.swift: the cleanup feature, built as a carbon copy of the transcription spine.
+ - Recording/RecorderViewModel.swift: follow startRecording() then stopAndTranscribe().
+ - Audio/LiveAudioEngine.swift: the double-duty tap and TapState.
+ - CLAUDE.md (architecture + conventions), planning/notes.md (roadmap + stance), CHANGE_LOG.md (what shipped & why).
+Relay Notes is a small app that uses Swift's type system (sum-type enums, protocols, optionals) to make illegal states unrepresentable, and Swift's concurrency model (actors, @MainActor, AsyncStream) to make a realtime audio→text→storage pipeline race-free at compile time, all so that providers can be swapped at runtime without touching the code that orchestrates them. The proof it works: the same protocol-spine pattern now backs three transcription engines and a second, independent on-device LLM cleanup capability.
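For readers new to sum-type enums, here is a small sketch of how they make illegal states unrepresentable. The enum, its cases, and the model names are hypothetical; the real TranscriptionOptions in Transcription/Transcriber.swift may be shaped differently.

```swift
// Hypothetical cases; each engine carries only the settings that apply to it.
enum EngineOptions: Equatable {
    case appleSpeech(localeIdentifier: String)
    case whisper(modelName: String, languageHint: String?)
    case parakeet(modelName: String)
}

// The switch must be exhaustive: adding a fourth engine is a compile error
// here until the new case is handled.
func describe(_ options: EngineOptions) -> String {
    switch options {
    case .appleSpeech(let locale):
        return "Apple Speech (\(locale))"
    case .whisper(let model, let hint):
        return "Whisper \(model), hint: \(hint ?? "auto")"
    case .parakeet(let model):
        return "Parakeet \(model)"
    }
}
```

With a struct of optional fields, "Apple Speech with a Whisper model name" would be a valid value that every caller has to guard against; with the enum, that combination cannot be constructed.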
Generated as an onboarding companion to the Relay Notes codebase · iOS 26 · SwiftUI · Swift 6 strict concurrency · SwiftData · MLX (mlx-swift + mlx-swift-lm). Code excerpts are lightly trimmed from the real source; read the files for the full, commented versions.
+