Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
137 changes: 137 additions & 0 deletions docs/decisions/0009-tolerate-late-first-transcript-flush.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# 0009. Tolerate a late first transcript flush — a bounded settle/anchor, one shared budget

**Status:** accepted
**Date:** 2026-06-18

## Context

claude writes a session's conversation to `<id>.jsonl` *append-only*, but the
**first** write for a fresh session is deferred until after boot + first-turn
processing. Under load — and especially when MCP servers and tool schemas are
loading — that first flush slips **3–9 s** behind the moment the turn is
otherwise "done." We measured it directly: under 20 concurrent heavy tool-MCP
sessions, ~1/20 turns flushed their first record 5–9 s after `wait()→completed`.

This collides with two seams that read the transcript shortly after a turn:

1. **The write-side anchor.** `send()` does not return a positional cursor — it
anchors on the *id of the user record this send produced* (ADR-adjacent: "a
real id, never a count"). `anchorOwnTurn` polls the transcript for that record
for a fixed window (12 × 250 ms = 3 s). If the first flush is late, the record
is not on disk inside that window and `send()` falls back to the
`DELIVERY_UNCONFIRMED` sentinel — telling the consumer "re-send me" for a turn
that actually landed.

2. **The read-side mirror.** `messagesSince(cursor)` / `turnComplete(cursor)` are
called *after* `wait()→completed`. But `wait()`'s cross-process flush gate
(`io/wait.ts`, `replyOnDisk`) falls back to the **pane** when the transcript is
blind (`transcriptCount === 0`), so `completed` can legitimately fire before
the first record exists. A separate CLI process — which rebuilds its
`SessionObserver` cold on every invocation — then reads the still-absent file
and returns `[]` / `false`, even though the contract owes records.

Both are the *same* phenomenon (a late first flush), observed from opposite
sides of the send/read split. The naïve "wait longer everywhere" risks two
regressions: stalling the **healthy** fast-flush path (the overwhelming common
case), and stalling the **unhealthy** never-flushing path (a crashed/slow
resume) on every call.

## Decision

Tolerate the late first flush at both seams, gated so neither regression occurs,
against **one shared budget**.

- **`FIRST_FLUSH_BUDGET_MS = 6_000`** (`src/observe/session-observer.ts`) is the
single source of truth. The read-side settle uses it directly; the write-side
anchor derives its extended poll count from it
(`Math.ceil(FIRST_FLUSH_BUDGET_MS / ANCHOR_POLL_MS) = 24`), so the two seams
cannot drift apart.

- **Write-side (`anchorOwnTurn`).** Past the base 3 s window, keep polling out to
the shared budget **only while `SessionObserver.transcriptAddressableButAbsent()`
is true** — the path is known but the file does not yet exist. A session whose
file is already present never crosses the base window, so its behaviour is
byte-for-byte unchanged. The extension applies only to the full first-turn
anchor, never the shorter lost-Enter retry (which must report promptly).

- **Read-side (`settleFirstRecords`).** Before the first `thread()`-backed read,
if no record has been seen, poll up to the shared budget for the first records
to land. It is bounded **twice**: `#sawRecords` (success — records seen, the
common case, returns on the first poll with zero added latency) and
`#firstFlushSettled` (exhaustion — the budget was spent once and the file never
appeared). The exhaustion latch is load-bearing: without it a
locatable-but-never-flushing session would re-pay the full budget on *every*
`messagesSince`/`turnComplete`, stalling exactly the daemon `recover()→
turnComplete` loop that walks crashed/slow resumes.

**Why 6 s is sufficient despite a 9 s tail.** The budget is measured from the
*read*, not from spawn. The read-side settle only runs after `wait()→completed`,
and `wait()` itself burns wall-clock (pane stabilization, the stop-edge settle)
before returning. So the effective deadline is `read-time + 6 s`, which lands
well past the 9-s-from-spawn tail. The write-side anchor runs earlier (from
`send`), but a flush past its 6 s is *expected* to fall back to the sentinel —
the read-side settle then restores the mirror on the consumer's follow-up read.
The two seams overlap on purpose; neither has to cover the tail alone.

## Consequences

- **The healthy path is untouched.** A present/fast-flush transcript finds its
records on the first poll of either seam — no added latency, no extra polls.
The present-from-t0 controls assert this directly (the anchor keeps its exact
3 s window; the settle returns sub-poll).

- **The unhealthy path is bounded once.** A never-flushing session waits the
budget a single time, then every later read is a no-op that still re-reads the
file (so records that land later are surfaced) but never blocks again.

- **The settle holds the per-handle mutex for up to the budget, once.** A
concurrent `kill()`/`state()` on the same handle queues behind a first-ever
settle. Bounded by the budget and the once-only latch; not moved outside the
mutex (that would reopen the byte-interleaving the mutex exists to prevent).

- **The stateless CLI pays the budget per-invocation, not per-process-lifetime.**
`claudemux messages` / `turn-complete` rebuild the observer cold each call, so
the in-process latches reset between invocations. This is inherent to
statelessness — there is no shared state to latch across processes — and is
acceptable: a script that polls the CLI against a never-flushing session should
bound its own retries, exactly as it must for any other "not ready yet" read.

- **The write anchor keys on file *existence*, not record presence.** The
extension engages while `existsSync(path)` is false. claude is observed to
create `<id>.jsonl` only at its first content (the live repro's 0-sentinel
result requires a genuinely absent file), so this matches reality. A
hypothetical future claude that pre-created a zero-byte file would not engage
the extension — but the read-side `settleFirstRecords` (which keys on records
being *readable*, not the file existing) is the backstop: it would still wait
the records out on the consumer's follow-up read. The two seams deliberately
use different "is it flushed?" signals; the read side is the stronger one.

- **A send + read pair on one never-flushing session can serialize both budgets.**
`send()`'s extended anchor (≤6 s) and a follow-up `messagesSince`/`turnComplete`'s
settle (≤6 s) hold *separate* latches (the anchor does not trip the observer's
`#firstFlushSettled`), so a single unhealthy session can pay up to ~12 s of
serialized blocking across the pair before both give up. Bounded and one-time
per observer; accepted rather than coupling the write-anchor's "found my
record?" question to the read-settle's "any records at all?" question, which
are genuinely different.

- **The sentinel semantics are unchanged.** A flush past the write-side budget
still returns `DELIVERY_UNCONFIRMED`, which still reads empty in
`messagesSince` (re-send semantics preserved); recovery rides the read-side
settle plus a positional `"0"`/prior-id cursor, never the sentinel.

## Evidence

Verified against the code at HEAD, not assumed: `wait()` declares `completed`
with `replyOnDisk = transcriptCount === 0 || lastMessageRole === "assistant"`
(`src/io/wait.ts`), so completion can precede the first flush — the read-side
hole this closes. The daemon consumer `recover()` calls `turnComplete(lastCursor)`
on a resumed (possibly never-flushed) session (`src/compose.ts`) — the
unhealthy-path loop the exhaustion latch bounds. Live: the unmodified build
reproduced 1/20 `DELIVERY_UNCONFIRMED` + empty `messagesSince("0")` under 20
concurrent heavy tool-MCP sessions; the fixed build returned `[user, assistant]`
on 20/20 with 0 sentinels (claude 2.1.181). Regression coverage:
`settleFirstRecords` (late-file-appears, present-from-t0 zero-cost, budget
exhaustion, and the never-re-pays exhausted latch) and the `anchorOwnTurn`
extension (absent-then-present anchors a real cursor; present-from-t0 keeps the
exact 3 s window).
1 change: 1 addition & 0 deletions docs/decisions/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,4 @@ record, it cites one of these repo-relative paths, which resolve on GitHub.
| [0006](0006-adopt-single-writer-invariant-no-lock.md) | `adopt()` documents a single-writer invariant, adds no cross-process lock, and is a pure attach (no dialog dismissal); `state()`-after-adopt carries the safety. |
| [0007](0007-pane-dead-detection-and-signal-representation.md) | _(superseded)_ `PaneDead` detection + signal representation. `PaneDead` was removed (`remain-on-exit off` reaps the pane → `SessionGone`). |
| [0008](0008-scrub-parent-agent-env-for-transcript-persistence.md) | The spawned `claude` runs under an `env -u` scrub of the parent agent's env vars (`CLAUDECODE`, `CLAUDE_CODE_*`, `AI_AGENT`) so its transcript persists; threaded via a new optional `unsetEnv` on `buildArgv`; consumer `env` re-set wins. |
| [0009](0009-tolerate-late-first-transcript-flush.md) | A late first `<id>.jsonl` flush (3–9 s under MCP/tool load) is tolerated at both seams against one shared `FIRST_FLUSH_BUDGET_MS`: the write-side anchor extends only while the file is addressable-but-absent; the read-side `settleFirstRecords` waits once, latched on success **and** exhaustion so a never-flushing session never re-pays. |
5 changes: 4 additions & 1 deletion src/io/wait.ts
Original file line number Diff line number Diff line change
Expand Up @@ -127,7 +127,10 @@ export async function waitForOutcome(
// if a readable transcript never yields the reply the CONSUMER's patience
// bounds it (budget-exceeded), never a library-invented timeout — "time is the
// policy's." In-process the same observer already cached the record, so the
// gate is a no-op; it only pays on the cross-process race.
// gate is a no-op; it only pays on the cross-process race. Its read-side
// complement (for a separate process that rebuilds the observer cold and reads
// before the first flush) is `SessionObserver.settleFirstRecords` — ADR 0009;
// keep the two in step if you touch either.
const paneDone = belief.state === "idle" && (hookDone || armed);
const replyOnDisk = belief.transcriptCount === 0 || belief.lastMessageRole === "assistant";
if (paneDone && replyOnDisk) {
Expand Down
100 changes: 100 additions & 0 deletions src/observe/session-observer.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -83,4 +83,104 @@ describe("SessionObserver — incremental belief + thread, hook-resolved transcr
expect(b.hookChannelHealthy).toBe(false);
expect(b.state).toBe("working"); // the pane is the only signal
});

// Fix #1 — the late-first-flush settle. The transcript path is KNOWN (the hook
// reported it) but the file is not on disk yet; a writer creates it after N
// polls. The settle must wait the records out (within budget), while a
// present-from-t0 transcript must return immediately and pay no extra polls.
describe("settleFirstRecords — bounded one-time wait for a delayed first flush", () => {
it("returns the records once the absent file appears after a few polls (within budget)", async () => {
// Hook names the path, but the file does NOT exist yet — the empty-mirror
// signature (tp_exists=false at read time).
writeFileSync(
rv,
`1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
);
const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });
// Confirm the early read IS empty (the bug today) and the file is seen absent.
expect(obs.thread().messages.length).toBe(0);
expect(obs.transcriptAddressableButAbsent()).toBe(true);

// The first <id>.jsonl bytes land ~750ms later (3 poll cadences) — a late flush.
const writer = setTimeout(() => {
writeFileSync(tx, `${[userRec("u1", "hi"), asstRec("a1", "u1", "PONG")].join("\n")}\n`);
}, 750);

const t0 = Date.now();
await obs.settleFirstRecords(5_000);
const elapsed = Date.now() - t0;
clearTimeout(writer);

const roles = obs.thread().messages.map((m) => m.role);
expect(roles).toEqual(["user", "assistant"]); // the contract's records, recovered
expect(elapsed).toBeGreaterThanOrEqual(700); // it genuinely waited for the flush
expect(elapsed).toBeLessThan(2_000); // and returned promptly once it appeared, well inside budget
expect(obs.transcriptAddressableButAbsent()).toBe(false); // file present now
});

it("control — file present from t=0 returns immediately and pays no extra wait", async () => {
writeFileSync(
rv,
`1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
);
writeFileSync(tx, `${[userRec("u1", "hi"), asstRec("a1", "u1", "PONG")].join("\n")}\n`);
const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });

const t0 = Date.now();
await obs.settleFirstRecords(5_000);
const elapsed = Date.now() - t0;

// First iteration finds the records → no sleep → near-zero latency (the
// zero-blast-radius half: a fast-flush session is untouched).
expect(elapsed).toBeLessThan(SETTLE_POLL_MS_FLOOR);
expect(obs.thread().messages.map((m) => m.role)).toEqual(["user", "assistant"]);

// And the latch holds: a second settle, even on a path now made absent, is a no-op.
rmSync(tx);
const t1 = Date.now();
await obs.settleFirstRecords(5_000);
expect(Date.now() - t1).toBeLessThan(SETTLE_POLL_MS_FLOOR); // never re-enters the poll
});

it("budget-exhausted absent file → returns empty after the budget, never hangs", async () => {
writeFileSync(
rv,
`1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
);
const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });
const t0 = Date.now();
await obs.settleFirstRecords(500); // tiny budget; file never appears
const elapsed = Date.now() - t0;
expect(elapsed).toBeGreaterThanOrEqual(450); // it waited the budget
expect(elapsed).toBeLessThan(1_500); // then gave up — bounded, no hang
expect(obs.thread().messages.length).toBe(0); // honest empty, not invented records
});

// The exhausted-latch (ADR 0009). A locatable-but-NEVER-flushing session — a
// crashed/slow resume, an adopt-miss locate, a read before the first turn —
// must spend the budget at most ONCE. Without the exhausted latch every later
// settle re-pays the full budget, so a daemon polling recover()→turnComplete
// would stall the budget on EVERY call against exactly the unhealthy session.
it("never re-pays the budget: a never-flushing session settles once, then no-ops", async () => {
writeFileSync(
rv,
`1700000001.0 {"hook_event_name":"SessionStart","transcript_path":"${tx}"}\n`,
);
const obs = new SessionObserver({ agent: claude, rendezvousPath: rv });

const t0 = Date.now();
await obs.settleFirstRecords(500); // first call spends the budget (file never appears)
expect(Date.now() - t0).toBeGreaterThanOrEqual(450);

// The latch now holds even though NO record was ever seen: the SECOND call
// returns immediately instead of re-polling the budget (the flag-A fix).
const t1 = Date.now();
await obs.settleFirstRecords(500);
expect(Date.now() - t1).toBeLessThan(SETTLE_POLL_MS_FLOOR);
expect(obs.thread().messages.length).toBe(0); // still honest empty
});
});
});

// One poll cadence is 250ms; an "immediate" return must be well under it.
const SETTLE_POLL_MS_FLOOR = 200;
Loading
Loading