fix(observe,session): tolerate a late first transcript flush, bounded at both seams by isingh · Pull Request #13 · wastedcode/claudemux

isingh · 2026-06-18T22:21:56Z

Problem

Under load, MCP/tool startup defers claude's first <id>.jsonl write past the fixed ~3 s anchor window (measured 3–9 s under contention). The result: send() returns the DELIVERY_UNCONFIRMED sentinel, and a separate reader sees wait()→completed yet messagesSince("0") returns [] / turnComplete=false — even though the turn landed. It's a late flush, not a no-flush.

Confirmed in code, not just symptom: io/wait.ts declares completed with replyOnDisk = transcriptCount === 0 || lastMessageRole === "assistant", so completion can legitimately precede the first flush; a cold cross-process reader then reads the still-absent file.
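To make the gate concrete, here is a minimal sketch of the predicate quoted above. Only the boolean expression comes from io/wait.ts as cited in this PR; the `TranscriptSnapshot` shape and the sample values are assumptions for illustration.

```typescript
type Role = "user" | "assistant";

// Hypothetical snapshot of what a reader can currently see on disk.
interface TranscriptSnapshot {
  transcriptCount: number;      // records readable right now
  lastMessageRole: Role | null; // null when nothing has been flushed yet
}

// The expression quoted from io/wait.ts: an empty transcript passes the gate.
function replyOnDisk(s: TranscriptSnapshot): boolean {
  return s.transcriptCount === 0 || s.lastMessageRole === "assistant";
}

// Before the first flush the gate is already open, so wait() may report
// completed while a cold reader still finds no file.
const beforeFirstFlush: TranscriptSnapshot = { transcriptCount: 0, lastMessageRole: null };
// Mid-turn (only the user record is on disk) the gate stays closed.
const midTurn: TranscriptSnapshot = { transcriptCount: 1, lastMessageRole: "user" };
// Once the assistant record is flushed the gate opens for the usual reason.
const afterReply: TranscriptSnapshot = { transcriptCount: 2, lastMessageRole: "assistant" };

console.log(replyOnDisk(beforeFirstFlush), replyOnDisk(midTurn), replyOnDisk(afterReply));
```

The first case is the one this PR targets: the predicate holds with zero records, which is why the read side needs its own bounded settle.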

Fix — tolerate it at both seams, one shared budget

Write-side (anchorOwnTurn): past the base 3 s window, keep polling out to the shared budget only while the transcript file is addressable-but-absent. A present-file session never crosses the base window — byte-for-byte unchanged.
Read-side (settleFirstRecords): before the first read, poll up to the shared budget for the first records. The read-side complement of wait()'s cross-process flush gate.
One budget: FIRST_FLUSH_BUDGET_MS = 6000 is the single source of truth; the anchor derives its poll count from it so the two seams can't drift.
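The write-side rule above can be sketched as a pure decision function plus a small simulation. The constant names `FIRST_FLUSH_BUDGET_MS` and `FIRST_FLUSH_ANCHOR_POLLS` and the `Math.ceil` derivation appear in the commit messages; the 250 ms poll interval is inferred from "12 polls / 3 s", and `shouldKeepPolling`, `runAnchor`, and `PollState` are hypothetical names, not the project's API. Sleeping between polls is elided so the sketch stays synchronous.

```typescript
const FIRST_FLUSH_BUDGET_MS = 6000;
const ANCHOR_POLL_MS = 250;   // inferred: 12 polls cover the 3 s base window
const BASE_ANCHOR_POLLS = 12; // the original 3 s window
// Derived from the shared budget so the two seams cannot drift apart.
const FIRST_FLUSH_ANCHOR_POLLS = Math.ceil(FIRST_FLUSH_BUDGET_MS / ANCHOR_POLL_MS); // 24

// Keep polling inside the base window unconditionally; past it, only while
// the transcript file is addressable but still absent, up to the shared cap.
function shouldKeepPolling(pollsDone: number, addressableButAbsent: boolean): boolean {
  if (pollsDone < BASE_ANCHOR_POLLS) return true;
  return addressableButAbsent && pollsDone < FIRST_FLUSH_ANCHOR_POLLS;
}

// Hypothetical per-poll observation (the path is assumed addressable throughout).
interface PollState {
  anchored: boolean;    // a real cursor was found on this poll
  filePresent: boolean; // the transcript file exists on this poll
}

// Drives the decision function against a scripted sequence of observations.
function runAnchor(stateAt: (poll: number) => PollState): { anchored: boolean; polls: number } {
  let polls = 0;
  for (;;) {
    polls += 1; // real code would wait ANCHOR_POLL_MS before each poll
    const s = stateAt(polls);
    if (s.anchored) return { anchored: true, polls };
    if (!shouldKeepPolling(polls, !s.filePresent)) return { anchored: false, polls };
  }
}

// Late flush: the file (and cursor) show up on poll 20, inside the extension.
const lateFlush = runAnchor((p) => ({ anchored: p >= 20, filePresent: p >= 20 }));
// Absent forever: gives up at the 24-poll cap instead of hanging.
const absentForever = runAnchor(() => ({ anchored: false, filePresent: false }));
// File present from the start but never anchoring: stops at the base window.
const presentNoAnchor = runAnchor(() => ({ anchored: false, filePresent: true }));

console.log(lateFlush, absentForever, presentNoAnchor);
```

The third case illustrates the "byte-for-byte unchanged" claim: a session whose file is already present never enters the extended window.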

Hardening from an architecture + code-quality review (second commit)

Exhausted latch (must-fix). The settle's latch was success-only, so a locatable-but-never-flushing session (crashed/slow resume, adopt-miss locate, pre-first-turn read) re-paid the full budget on every call — stalling exactly the daemon recover()→turnComplete loop. Added #firstFlushSettled: the budget is spent at most once; later calls still re-read the file (late records are surfaced) but never block again.
ADR 0009 records the phenomenon, the 3–9 s measurement, why two seams, why 6 s is sufficient (the read trails wait(), so the effective deadline is read-time + 6 s — past the 9 s tail), the existsSync-vs-records-readable boundary, and the additive send+read worst case. wait()'s gate and settleFirstRecords now cross-reference each other.
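A minimal sketch of the two-latch behaviour described above, under stated assumptions: `FirstFlushSettler`, its `read()` method, and the poll-count budget are hypothetical stand-ins for `SessionObserver.settleFirstRecords()`, and the fields `sawRecords` and `firstFlushSettled` mirror the `#sawRecords` and `#firstFlushSettled` latches named in the commits. Polls are counted rather than timed so the example stays synchronous.

```typescript
class FirstFlushSettler {
  private sawRecords = false;        // success latch: records were seen once
  private firstFlushSettled = false; // exhausted latch: budget was spent once
  private readRecords: () => string[];
  private maxPolls: number;
  pollsSpent = 0;                    // exposed only so the sketch can be checked

  constructor(readRecords: () => string[], maxPolls: number) {
    this.readRecords = readRecords;
    this.maxPolls = maxPolls;
  }

  read(): string[] {
    // Always re-read, so records that land late are still surfaced.
    let records = this.readRecords();
    if (records.length > 0) this.sawRecords = true;
    // Either latch makes this call non-blocking.
    if (this.sawRecords || this.firstFlushSettled) return records;

    // First call with nothing on disk: spend the budget, at most once.
    for (let i = 0; i < this.maxPolls && records.length === 0; i++) {
      this.pollsSpent += 1; // real code would sleep one poll interval here
      records = this.readRecords();
    }
    if (records.length > 0) this.sawRecords = true;
    this.firstFlushSettled = true;
    return records;
  }
}

// A session that never flushes during the settle, then flushes afterwards.
const disk: string[] = [];
const settler = new FirstFlushSettler(() => disk.slice(), 24);

const first = settler.read();            // pays the full budget once
const pollsAfterFirst = settler.pollsSpent;
const second = settler.read();           // exhausted latch: no further polling
const pollsAfterSecond = settler.pollsSpent;
disk.push("user", "assistant");          // records land late
const third = settler.read();            // surfaced without blocking

console.log(first, pollsAfterFirst, second, pollsAfterSecond, third);
```

With a success-only latch the second call would poll another 24 times; the exhausted latch is what keeps a loop such as recover()→turnComplete from stalling on every pass.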

Tests

never re-pays the budget — proven to fail without the exhausted latch.
absent-forever → full extended window, then bounded give-up — the write-side analogue of the read-side exhaustion test; confirms the 24-poll cap terminates (no hang).
Plus the original regression suite (late-file-appears, present-from-t0 zero-cost controls).

Full suite green: 358 passed, 3 skipped. Typecheck + biome clean.

Live (claude 2.1.181): the unmodified build reproduced 1/20 DELIVERY_UNCONFIRMED + empty messagesSince("0") under 20 concurrent heavy tool-MCP sessions; the fixed build returned [user, assistant] on 20/20 with 0 sentinels. The fast-flush path is unchanged.

🤖 Generated with Claude Code

…irror race)

Under load, MCP/tool startup defers claude's first <id>.jsonl write past the fixed ~3s anchor window: send() returns DELIVERY_UNCONFIRMED and the observer reads the still-absent transcript and never re-resolves, so messagesSince("0") returns [] and turnComplete=false even though wait()->completed. It is a LATE flush, not a no-flush (file appears 5-9s later under contention).

Two fixes at the read/anchor seam, preserving all core promises (wait() unchanged; messages still race-free after completed, now strengthened; no pane-scraping; the sentinel still reads empty):

Fix #1 (settle-on-read): SessionObserver.settleFirstRecords() is a one-time, latched (#sawRecords) bounded poll (<=5s) for the first records when the transcript path is known but no records have been seen. messagesSince/turnComplete await it (they only run after wait()->completed, so the records the contract owes are en route). A present/already-flushed session finds records on the first iteration and pays nothing; the latch makes every later call a no-op. Zero blast radius on non-MCP / fast-flush sessions.

Fix #2 (anchor-on-addressable-but-absent): anchorOwnTurn extends its poll budget to FIRST_FLUSH_ANCHOR_POLLS (24 -> 6s) ONLY while the transcript file is addressable-but-absent (new SessionObserver.transcriptAddressableButAbsent()), so tool/heavy sessions anchor a real cursor instead of the sentinel. A present-file session never crosses the original 12-poll/3s window (byte-for-byte unchanged), and the shorter lost-Enter retry is not extended.

Regression tests: session-observer settle (late file appears after N polls returns the records within budget; present-from-t0 returns immediately with no extra polls + latch holds; budget-exhausted returns empty, no hang) and handle anchor (absent-then-present file anchors a real cursor; present-from-t0 keeps the original 3s window). Full suite green (356 passed, 3 skipped).

Verified live (claude 2.1.181): under 20 concurrent heavy tool-MCP sessions the unmodified build reproduced 1/20 DELIVERY_UNCONFIRMED + empty messagesSince("0"); the fixed build returned [user, assistant] on 20/20 with 0 sentinels. The fast-flush control path is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PeMGuKstkYF9dCgcsxWHWa

…0009

Follow-up hardening on the late-first-flush fix (6ce9849), from an architecture + code-quality review.

The settle's latch was success-only: SessionObserver.settleFirstRecords() returned early only once records had been SEEN (#sawRecords). A locatable-but-never-flushing session (a crashed/slow resume, an adopt-miss locate, a read before the first turn) therefore re-paid the FULL budget on every messagesSince/turnComplete, stalling exactly the daemon recover()->turnComplete loop that walks unhealthy sessions. Add a second latch (#firstFlushSettled) set when the budget is spent once: every later call is then a no-op that still re-reads the file (so records landing later are surfaced) but never BLOCKS for the full budget again.

Also from the review:
- Unify the two seams onto one shared budget. Export FIRST_FLUSH_BUDGET_MS (6s) from the observer; anchorOwnTurn derives its extended poll count (Math.ceil(BUDGET / ANCHOR_POLL_MS) = 24) from it, so the read-settle and write-anchor windows cannot drift apart.
- Record the design as ADR 0009 (the phenomenon, the 3-9s measurement, why two seams, why the 6s budget is sufficient given wait() precedes the read, the existsSync-vs-records-readable boundary, and the additive send+read worst case). Cross-reference wait()'s replyOnDisk gate and settleFirstRecords so a future editor cannot defeat one flush-gate without seeing the other.

Tests: new "never re-pays the budget" (fails without the exhausted latch) and the write-side "absent-forever -> full extended window, then bounded give-up" (the analogue of the read-side exhaustion test). Full suite green (358 passed, 3 skipped); typecheck + biome clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UgUf4t5AZkFKtgGWu8qxWL

isingh and others added 2 commits June 18, 2026 20:44

isingh merged commit bcfd17d into main Jun 18, 2026
11 checks passed

isingh mentioned this pull request Jun 18, 2026

chore(release): 0.2.4 #14

Merged

isingh merged 2 commits into main from fix/first-flush-settle-and-anchor
