wastedcode · isingh · Jun 18, 2026 · Jun 18, 2026 · Jun 18, 2026
diff --git a/docs/decisions/0009-tolerate-late-first-transcript-flush.md b/docs/decisions/0009-tolerate-late-first-transcript-flush.md
@@ -0,0 +1,137 @@
+# 0009. Tolerate a late first transcript flush — a bounded settle/anchor, one shared budget
+
+**Status:** accepted
+**Date:** 2026-06-18
+
+## Context
+
+claude writes a session's conversation to `<id>.jsonl` *append-only*, but the
+**first** write for a fresh session is deferred until after boot + first-turn
+processing. Under load — and especially when MCP servers and tool schemas are
+loading — that first flush slips **3–9 s** behind the moment the turn is
+otherwise "done." We measured it directly: under 20 concurrent heavy tool-MCP
+sessions, ~1/20 turns flushed their first record 5–9 s after `wait()→completed`.
+
+This collides with two seams that read the transcript shortly after a turn:
+
+1. **The write-side anchor.** `send()` does not return a positional cursor — it
+   anchors on the *id of the user record this send produced* (ADR-adjacent: "a
+   real id, never a count"). `anchorOwnTurn` polls the transcript for that record
+   for a fixed window (12 × 250 ms = 3 s). If the first flush is late, the record
+   is not on disk inside that window and `send()` falls back to the
+   `DELIVERY_UNCONFIRMED` sentinel — telling the consumer "re-send me" for a turn
+   that actually landed.
+
+2. **The read-side mirror.** `messagesSince(cursor)` / `turnComplete(cursor)` are
+   called *after* `wait()→completed`. But `wait()`'s cross-process flush gate
+   (`io/wait.ts`, `replyOnDisk`) falls back to the **pane** when the transcript is
+   blind (`transcriptCount === 0`), so `completed` can legitimately fire before
+   the first record exists. A separate CLI process — which rebuilds its
+   `SessionObserver` cold on every invocation — then reads the still-absent file
+   and returns `[]` / `false`, even though the contract owes records.
+
+Both are the *same* phenomenon (a late first flush), observed from opposite
+sides of the send/read split. The naïve "wait longer everywhere" risks two
+regressions: stalling the **healthy** fast-flush path (the overwhelming common
+case), and stalling the **unhealthy** never-flushing path (a crashed/slow
+resume) on every call.
+
+## Decision
+
+Tolerate the late first flush at both seams, gated so neither regression occurs,
+against **one shared budget**.
+
+- **`FIRST_FLUSH_BUDGET_MS = 6_000`** (`src/observe/session-observer.ts`) is the
+  single source of truth. The read-side settle uses it directly; the write-side
+  anchor derives its extended poll count from it
+  (`Math.ceil(FIRST_FLUSH_BUDGET_MS / ANCHOR_POLL_MS) = 24`), so the two seams
+  cannot drift apart.
+
+- **Write-side (`anchorOwnTurn`).** Past the base 3 s window, keep polling out to
+  the shared budget **only while `SessionObserver.transcriptAddressableButAbsent()`
+  is true** — the path is known but the file does not yet exist. A session whose
+  file is already present never crosses the base window, so its behaviour is
+  byte-for-byte unchanged. The extension applies only to the full first-turn
+  anchor, never the shorter lost-Enter retry (which must report promptly).
+
+- **Read-side (`settleFirstRecords`).** Before the first `thread()`-backed read,
+  if no record has been seen, poll up to the shared budget for the first records
+  to land. It is bounded **twice**: `#sawRecords` (success — records seen, the
+  common case, returns on the first poll with zero added latency) and
+  `#firstFlushSettled` (exhaustion — the budget was spent once and the file never
+  appeared). The exhaustion latch is load-bearing: without it a
+  locatable-but-never-flushing session would re-pay the full budget on *every*
+  `messagesSince`/`turnComplete`, stalling exactly the daemon `recover()→
+  turnComplete` loop that walks crashed/slow resumes.
+
+**Why 6 s is sufficient despite a 9 s tail.** The budget is measured from the
+*read*, not from spawn. The read-side settle only runs after `wait()→completed`,
+and `wait()` itself burns wall-clock (pane stabilization, the stop-edge settle)
+before returning. So the effective deadline is `read-time + 6 s`, which lands
+well past the 9-s-from-spawn tail. The write-side anchor runs earlier (from
+`send`), but a flush past its 6 s is *expected* to fall back to the sentinel —
+the read-side settle then restores the mirror on the consumer's follow-up read.
+The two seams overlap on purpose; neither has to cover the tail alone.
+
+## Consequences
+
+- **The healthy path is untouched.** A present/fast-flush transcript finds its
+  records on the first poll of either seam — no added latency, no extra polls.
+  The present-from-t0 controls assert this directly (the anchor keeps its exact
+  3 s window; the settle returns sub-poll).
+
+- **The unhealthy path is bounded once.** A never-flushing session waits the
+  budget a single time, then every later read is a no-op that still re-reads the
+  file (so records that land later are surfaced) but never blocks again.
+
+- **The settle holds the per-handle mutex for up to the budget, once.** A
+  concurrent `kill()`/`state()` on the same handle queues behind a first-ever
+  settle. Bounded by the budget and the once-only latch; not moved outside the
+  mutex (that would reopen the byte-interleaving the mutex exists to prevent).
+
+- **The stateless CLI pays the budget per-invocation, not per-process-lifetime.**
+  `claudemux messages` / `turn-complete` rebuild the observer cold each call, so
+  the in-process latches reset between invocations. This is inherent to
+  statelessness — there is no shared state to latch across processes — and is
+  acceptable: a script that polls the CLI against a never-flushing session should
+  bound its own retries, exactly as it must for any other "not ready yet" read.
+
+- **The write anchor keys on file *existence*, not record presence.** The
+  extension engages while `existsSync(path)` is false. claude is observed to
+  create `<id>.jsonl` only at its first content (the live repro's 0-sentinel
+  result requires a genuinely absent file), so this matches reality. A
+  hypothetical future claude that pre-created a zero-byte file would not engage
+  the extension — but the read-side `settleFirstRecords` (which keys on records
+  being *readable*, not the file existing) is the backstop: it would still wait
+  the records out on the consumer's follow-up read. The two seams deliberately
+  use different "is it flushed?" signals; the read side is the stronger one.
+
+- **A send + read pair on one never-flushing session can serialize both budgets.**
+  `send()`'s extended anchor (≤6 s) and a follow-up `messagesSince`/`turnComplete`'s
+  settle (≤6 s) hold *separate* latches (the anchor does not trip the observer's
+  `#firstFlushSettled`), so a single unhealthy session can pay up to ~12 s of
+  serialized blocking across the pair before both give up. Bounded and one-time
+  per observer; accepted rather than coupling the write-anchor's "found my
+  record?" question to the read-settle's "any records at all?" question, which
+  are genuinely different.
+
+- **The sentinel semantics are unchanged.** A flush past the write-side budget
+  still returns `DELIVERY_UNCONFIRMED`, which still reads empty in
+  `messagesSince` (re-send semantics preserved); recovery rides the read-side
+  settle plus a positional `"0"`/prior-id cursor, never the sentinel.
+
+## Evidence
+
+Verified against the code at HEAD, not assumed: `wait()` declares `completed`
+with `replyOnDisk = transcriptCount === 0 || lastMessageRole === "assistant"`
+(`src/io/wait.ts`), so completion can precede the first flush — the read-side
+hole this closes. The daemon consumer `recover()` calls `turnComplete(lastCursor)`
+on a resumed (possibly never-flushed) session (`src/compose.ts`) — the
+unhealthy-path loop the exhaustion latch bounds. Live: the unmodified build
+reproduced 1/20 `DELIVERY_UNCONFIRMED` + empty `messagesSince("0")` under 20
+concurrent heavy tool-MCP sessions; the fixed build returned `[user, assistant]`
+on 20/20 with 0 sentinels (claude 2.1.181). Regression coverage:
+`settleFirstRecords` (late-file-appears, present-from-t0 zero-cost, budget
+exhaustion, and the never-re-pays exhausted latch) and the `anchorOwnTurn`
+extension (absent-then-present anchors a real cursor; present-from-t0 keeps the
+exact 3 s window).
diff --git a/docs/decisions/README.md b/docs/decisions/README.md
@@ -20,3 +20,4 @@ record, it cites one of these repo-relative paths, which resolve on GitHub.
 | [0006](0006-adopt-single-writer-invariant-no-lock.md) | `adopt()` documents a single-writer invariant, adds no cross-process lock, and is a pure attach (no dialog dismissal); `state()`-after-adopt carries the safety. |
 | [0007](0007-pane-dead-detection-and-signal-representation.md) | _(superseded)_ `PaneDead` detection + signal representation. `PaneDead` was removed (`remain-on-exit off` reaps the pane → `SessionGone`). |
 | [0008](0008-scrub-parent-agent-env-for-transcript-persistence.md) | The spawned `claude` runs under an `env -u` scrub of the parent agent's env vars (`CLAUDECODE`, `CLAUDE_CODE_*`, `AI_AGENT`) so its transcript persists; threaded via a new optional `unsetEnv` on `buildArgv`; consumer `env` re-set wins. |
+| [0009](0009-tolerate-late-first-transcript-flush.md) | A late first `<id>.jsonl` flush (3–9 s under MCP/tool load) is tolerated at both seams against one shared `FIRST_FLUSH_BUDGET_MS`: the write-side anchor extends only while the file is addressable-but-absent; the read-side `settleFirstRecords` waits once, latched on success **and** exhaustion so a never-flushing session never re-pays. |
diff --git a/src/io/wait.ts b/src/io/wait.ts
@@ -127,7 +127,10 @@ export async function waitForOutcome(
     // if a readable transcript never yields the reply the CONSUMER's patience
     // bounds it (budget-exceeded), never a library-invented timeout — "time is the
     // policy's." In-process the same observer already cached the record, so the
-    // gate is a no-op; it only pays on the cross-process race.
+    // gate is a no-op; it only pays on the cross-process race. Its read-side
+    // complement (for a separate process that rebuilds the observer cold and reads
+    // before the first flush) is `SessionObserver.settleFirstRecords` — ADR 0009;
+    // keep the two in step if you touch either.
     const paneDone = belief.state === "idle" && (hookDone || armed);
     const replyOnDisk = belief.transcriptCount === 0 || belief.lastMessageRole === "assistant";
     if (paneDone && replyOnDisk) {

diff --git a/src/observe/session-observer.test.ts b/src/observe/session-observer.test.ts
@@ -83,4 +83,104 @@ describe("SessionObserver — incremental belief + thread, hook-resolved transcr
     expect(b.hookChannelHealthy).toBe(false);
     expect(b.state).toBe("working"); // the pane is the only signal
   });
+
+  // Fix #1 — the late-first-flush settle. The transcript path is KNOWN (the hook
+  // reported it) but the file is not on disk yet; a writer creates it after N
+  // polls. The settle must wait the records out (within budget), while a
+  // present-from-t0 transcript must return immediately and pay no extra polls.
+  describe("settleFirstRecords — bounded one-time wait for a delayed first flush", () => {
+    it("returns the records once the absent file appears after a few polls (within budget)", async () => {
+      // Hook names the path, but the file does NOT exist yet — the empty-mirror
+      // signature (tp_exists=false at read time).
+      writeFileSync(
+        rv,
+        `1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
+      );
+      const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });
+      // Confirm the early read IS empty (the bug today) and the file is seen absent.
+      expect(obs.thread().messages.length).toBe(0);
+      expect(obs.transcriptAddressableButAbsent()).toBe(true);
+
+      // The first <id>.jsonl bytes land ~750ms later (3 poll cadences) — a late flush.
+      const writer = setTimeout(() => {
+        writeFileSync(tx, `${[userRec("u1", "hi"), asstRec("a1", "u1", "PONG")].join("\n")}\n`);
+      }, 750);
+
+      const t0 = Date.now();
+      await obs.settleFirstRecords(5_000);
+      const elapsed = Date.now() - t0;
+      clearTimeout(writer);
+
+      const roles = obs.thread().messages.map((m) => m.role);
+      expect(roles).toEqual(["user", "assistant"]); // the contract's records, recovered
+      expect(elapsed).toBeGreaterThanOrEqual(700); // it genuinely waited for the flush
+      expect(elapsed).toBeLessThan(2_000); // and returned promptly once it appeared, well inside budget
+      expect(obs.transcriptAddressableButAbsent()).toBe(false); // file present now
+    });
+
+    it("control — file present from t=0 returns immediately and pays no extra wait", async () => {
+      writeFileSync(
+        rv,
+        `1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
+      );
+      writeFileSync(tx, `${[userRec("u1", "hi"), asstRec("a1", "u1", "PONG")].join("\n")}\n`);
+      const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });
+
+      const t0 = Date.now();
+      await obs.settleFirstRecords(5_000);
+      const elapsed = Date.now() - t0;
+
+      // First iteration finds the records → no sleep → near-zero latency (the
+      // zero-blast-radius half: a fast-flush session is untouched).
+      expect(elapsed).toBeLessThan(SETTLE_POLL_MS_FLOOR);
+      expect(obs.thread().messages.map((m) => m.role)).toEqual(["user", "assistant"]);
+
+      // And the latch holds: a second settle, even on a path now made absent, is a no-op.
+      rmSync(tx);
+      const t1 = Date.now();
+      await obs.settleFirstRecords(5_000);
+      expect(Date.now() - t1).toBeLessThan(SETTLE_POLL_MS_FLOOR); // never re-enters the poll
+    });
+
+    it("budget-exhausted absent file → returns empty after the budget, never hangs", async () => {
+      writeFileSync(
+        rv,
+        `1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
+      );
+      const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });
+      const t0 = Date.now();
+      await obs.settleFirstRecords(500); // tiny budget; file never appears
+      const elapsed = Date.now() - t0;
+      expect(elapsed).toBeGreaterThanOrEqual(450); // it waited the budget
+      expect(elapsed).toBeLessThan(1_500); // then gave up — bounded, no hang
+      expect(obs.thread().messages.length).toBe(0); // honest empty, not invented records
+    });
+
+    // The exhausted-latch (ADR 0009). A locatable-but-NEVER-flushing session — a
+    // crashed/slow resume, an adopt-miss locate, a read before the first turn —
+    // must spend the budget at most ONCE. Without the exhausted latch every later
+    // settle re-pays the full budget, so a daemon polling recover()→turnComplete
+    // would stall the budget on EVERY call against exactly the unhealthy session.
+    it("never re-pays the budget: a never-flushing session settles once, then no-ops", async () => {
+      writeFileSync(
+        rv,
+        `1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
+      );
+      const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });
+
+      const t0 = Date.now();
+      await obs.settleFirstRecords(500); // first call spends the budget (file never appears)
+      expect(Date.now() - t0).toBeGreaterThanOrEqual(450);
+
+      // The latch now holds even though NO record was ever seen: the SECOND call
+      // returns immediately instead of re-polling the budget (the flag-A fix).
+      const t1 = Date.now();
+      await obs.settleFirstRecords(500);
+      expect(Date.now() - t1).toBeLessThan(SETTLE_POLL_MS_FLOOR);
+      expect(obs.thread().messages.length).toBe(0); // still honest empty
+    });
+  });
 });
+
+// One poll cadence is 250ms; an "immediate" return must be well under it.
+const SETTLE_POLL_MS_FLOOR = 200;