fix(observe,session): tolerate a late first transcript flush, bounded at both seams #13
Merged
…irror race)
Under load, MCP/tool startup defers claude's first <id>.jsonl write past the
fixed ~3s anchor window: send() returns DELIVERY_UNCONFIRMED and the observer
reads the still-absent transcript and never re-resolves, so messagesSince("0")
returns [] and turnComplete=false even though wait()->completed. It is a LATE
flush, not a no-flush (file appears 5-9s later under contention).
Two fixes at the read/anchor seam, preserving all core promises (wait()
unchanged; messages still race-free after completed — strengthened; no
pane-scraping; the sentinel still reads empty):
Fix #1 (settle-on-read) — SessionObserver.settleFirstRecords(): a one-time,
latched (#sawRecords) bounded poll (<=5s) for the first records when the
transcript path is known but no records have been seen. messagesSince/
turnComplete await it (they only run after wait()->completed, so the records
the contract owes are en route). A present/already-flushed session finds records
on the first iteration and pays nothing; the latch makes every later call a
no-op. Zero blast radius on non-MCP / fast-flush sessions.
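
A minimal sketch of the latched, bounded poll described in Fix #1, under stated assumptions. It is not the code from `SessionObserver`: the factory name, the `readRecords` and `sleep` parameters, and the 250 ms poll interval are hypothetical stand-ins chosen for illustration. Only the shape (poll until the first records appear, stop at the budget, latch on success) comes from the description above.

```typescript
// Sketch only: every name here is an assumption except the idea of a
// one-time, latched, bounded poll for the first records.
type RecordReader = () => Promise<string[]>; // assumed to return [] while the file is absent
type Sleeper = (ms: number) => Promise<void>;

function createFirstRecordSettler(
  readRecords: RecordReader,
  sleep: Sleeper,
  budgetMs = 5000, // the "<=5s" bound from the description
  pollMs = 250, // assumed poll interval
) {
  let sawRecords = false; // latch: once true, later calls never wait

  return async function settleFirstRecords(): Promise<string[]> {
    if (sawRecords) return readRecords(); // latched: plain read, no polling
    let waited = 0;
    for (;;) {
      const records = await readRecords();
      if (records.length > 0) {
        sawRecords = true; // already-flushed sessions latch on the first iteration
        return records;
      }
      if (waited >= budgetMs) return records; // budget spent: return empty, never hang
      await sleep(pollMs);
      waited += pollMs;
    }
  };
}
```

Injecting `sleep` keeps the sketch testable without real timers; a caller could pass a `setTimeout`-based delay in its place.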
Fix #2 (anchor-on-addressable-but-absent) — anchorOwnTurn extends its poll
budget to FIRST_FLUSH_ANCHOR_POLLS (24 -> 6s) ONLY while the transcript file is
addressable-but-absent (new SessionObserver.transcriptAddressableButAbsent()),
so tool/heavy sessions anchor a real cursor instead of the sentinel. A
present-file session never crosses the original 12-poll/3s window — byte-for-
byte unchanged — and the shorter lost-Enter retry is not extended.
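
The conditional extension in Fix #2 can be sketched as below. This is an illustration under assumptions, not the real `anchorOwnTurn`: the `AnchorProbe` shape, `tryAnchor`, and the sleep-then-probe ordering are hypothetical, and the 250 ms interval is inferred from the 12-poll/3s figure. The point it shows is that the loop only runs past the base window while the file is still addressable-but-absent.

```typescript
// Sketch only: the probe interface and function name are hypothetical.
const ANCHOR_POLL_MS = 250; // inferred from "12-poll/3s"
const BASE_ANCHOR_POLLS = 12; // original 3s window
const FIRST_FLUSH_ANCHOR_POLLS = 24; // extended 6s window

type AnchorProbe = {
  tryAnchor: () => Promise<string | null>; // cursor once the own turn is found, else null
  addressableButAbsent: () => boolean; // stand-in for transcriptAddressableButAbsent()
  sleep: (ms: number) => Promise<void>;
};

async function anchorWithExtension(probe: AnchorProbe): Promise<string | null> {
  for (let poll = 1; poll <= FIRST_FLUSH_ANCHOR_POLLS; poll++) {
    // Past the base window, keep going only while the file is still absent.
    if (poll > BASE_ANCHOR_POLLS && !probe.addressableButAbsent()) return null;
    await probe.sleep(ANCHOR_POLL_MS);
    const cursor = await probe.tryAnchor();
    if (cursor !== null) return cursor; // anchored a real cursor
  }
  return null; // bounded give-up; the caller would fall back to the sentinel
}
```

With this shape, a session whose file is present from the start can never take more than the 12 base polls, because the extension check fails on poll 13.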
Regression tests: session-observer settle (late file appears after N polls
returns the records within budget; present-from-t0 returns immediately with no
extra polls + latch holds; budget-exhausted returns empty, no hang) and handle
anchor (absent-then-present file anchors a real cursor; present-from-t0 keeps the
original 3s window). Full suite green (356 passed, 3 skipped).
Verified live (claude 2.1.181): under 20 concurrent heavy tool-MCP sessions the
unmodified build reproduced 1/20 DELIVERY_UNCONFIRMED + empty messagesSince("0");
the fixed build returned [user, assistant] on 20/20 with 0 sentinels. The
fast-flush control path is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PeMGuKstkYF9dCgcsxWHWa
…0009
Follow-up hardening on the late-first-flush fix (6ce9849), from an architecture
+ code-quality review.
The settle's latch was success-only: SessionObserver.settleFirstRecords()
returned early only once records had been SEEN (#sawRecords). A
locatable-but-never-flushing session (a crashed/slow resume, an adopt-miss
locate, a read before the first turn) therefore re-paid the FULL budget on
every messagesSince/turnComplete, stalling exactly the daemon
recover()->turnComplete loop that walks unhealthy sessions. Add a second latch
(#firstFlushSettled) set when the budget is spent once: every later call is
then a no-op that still re-reads the file (so records landing later are
surfaced) but never BLOCKS for the full budget again.
Also from the review:
- Unify the two seams onto one shared budget. Export FIRST_FLUSH_BUDGET_MS (6s)
  from the observer; anchorOwnTurn derives its extended poll count
  (Math.ceil(BUDGET / ANCHOR_POLL_MS) = 24) from it, so the read-settle and
  write-anchor windows cannot drift apart.
- Record the design as ADR 0009 (the phenomenon, the 3-9s measurement, why two
  seams, why the 6s budget is sufficient given wait() precedes the read, the
  existsSync-vs-records-readable boundary, and the additive send+read worst
  case). Cross-reference wait()'s replyOnDisk gate and settleFirstRecords so a
  future editor cannot defeat one flush-gate without seeing the other.
Tests: new "never re-pays the budget" (fails without the exhausted latch) and
the write-side "absent-forever -> full extended window, then bounded give-up"
(the analogue of the read-side exhaustion test). Full suite green (358 passed,
3 skipped); typecheck + biome clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UgUf4t5AZkFKtgGWu8qxWL
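
The two ideas in the follow-up commit (one shared budget, and a second latch so the budget is paid at most once) can be sketched together as follows. This is a hedged illustration, not the observer's code: `createTwoLatchSettler`, `readRecords`, and `sleep` are hypothetical names, and the 250 ms value for `ANCHOR_POLL_MS` is an assumption consistent with the stated result of 24 polls.

```typescript
// Sketch only: names other than FIRST_FLUSH_BUDGET_MS and ANCHOR_POLL_MS are hypothetical.
const FIRST_FLUSH_BUDGET_MS = 6000; // shared budget for both seams
const ANCHOR_POLL_MS = 250; // assumed interval; yields the 24 polls quoted above
const EXTENDED_ANCHOR_POLLS = Math.ceil(FIRST_FLUSH_BUDGET_MS / ANCHOR_POLL_MS);

function createTwoLatchSettler(
  readRecords: () => Promise<string[]>, // assumed to return [] until the first flush
  sleep: (ms: number) => Promise<void>,
) {
  let sawRecords = false; // success latch
  let firstFlushSettled = false; // exhausted latch: the budget has been spent once

  return async function settle(): Promise<string[]> {
    let records = await readRecords(); // always re-read, so late records still surface
    let waited = 0;
    while (
      records.length === 0 &&
      !sawRecords &&
      !firstFlushSettled &&
      waited < FIRST_FLUSH_BUDGET_MS
    ) {
      await sleep(ANCHOR_POLL_MS);
      waited += ANCHOR_POLL_MS;
      records = await readRecords();
    }
    if (records.length > 0) sawRecords = true;
    else firstFlushSettled = true; // never block for the full budget again
    return records;
  };
}
```

Deriving the poll count from the budget, rather than hard-coding 24, is what keeps the two windows from drifting if either constant changes.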
Problem

Under load, MCP/tool startup defers claude's first `<id>.jsonl` write past the fixed ~3 s anchor window (measured 3–9 s under contention). The result: `send()` returns the `DELIVERY_UNCONFIRMED` sentinel, and a separate reader sees `wait()`→`completed` yet `messagesSince("0")` returns `[]` / `turnComplete=false`, even though the turn landed. It's a late flush, not a no-flush.

Confirmed in code, not just symptom: `io/wait.ts` declares `completed` with `replyOnDisk = transcriptCount === 0 || lastMessageRole === "assistant"`, so completion can legitimately precede the first flush; a cold cross-process reader then reads the still-absent file.

Fix: tolerate it at both seams, one shared budget

- Write seam (`anchorOwnTurn`): past the base 3 s window, keep polling out to the shared budget only while the transcript file is addressable-but-absent. A present-file session never crosses the base window, so it is byte-for-byte unchanged.
- Read seam (`settleFirstRecords`): before the first read, poll up to the shared budget for the first records. This is the read-side complement of `wait()`'s cross-process flush gate.
- `FIRST_FLUSH_BUDGET_MS = 6000` is the single source of truth; the anchor derives its poll count from it so the two seams can't drift.

Hardening from an architecture + code-quality review (second commit)

- The settle latch was success-only, so a locatable-but-never-flushing session re-paid the full budget on every `messagesSince`/`turnComplete`, stalling the daemon's `recover()`→`turnComplete` loop. Added `#firstFlushSettled`: the budget is spent at most once; later calls still re-read the file (late records are surfaced) but never block again.
- ADR 0009 records the design: the phenomenon, the 3–9 s measurement, why two seams, why the 6 s budget is sufficient (the read follows `wait()`, so the effective deadline is read-time + 6 s, past the 9 s tail), the `existsSync`-vs-records-readable boundary, and the additive send+read worst case.
- `wait()`'s gate and `settleFirstRecords` now cross-reference each other.

Tests

- `never re-pays the budget`: proven to fail without the exhausted latch.
- `absent-forever → full extended window, then bounded give-up`: the write-side analogue of the read-side exhaustion test; confirms the 24-poll cap terminates (no hang).

Full suite green: 358 passed, 3 skipped. Typecheck + biome clean.

Live (claude 2.1.181): the unmodified build reproduced 1/20 `DELIVERY_UNCONFIRMED` + empty `messagesSince("0")` under 20 concurrent heavy tool-MCP sessions; the fixed build returned `[user, assistant]` on 20/20 with 0 sentinels. The fast-flush path is unchanged.

🤖 Generated with Claude Code