fix(observe,session): tolerate a late first transcript flush, bounded at both seams#13

Merged: isingh merged 2 commits into main from fix/first-flush-settle-and-anchor on Jun 18, 2026

@isingh isingh commented Jun 18, 2026


Problem

Under load, MCP/tool startup defers claude's first <id>.jsonl write past the fixed ~3 s anchor window (measured 3–9 s under contention). The result: send() returns the DELIVERY_UNCONFIRMED sentinel, and a separate reader sees wait()→completed yet messagesSince("0") returns [] / turnComplete=false — even though the turn landed. It's a late flush, not a no-flush.

Confirmed in code, not just from the symptom: io/wait.ts declares completed with replyOnDisk = transcriptCount === 0 || lastMessageRole === "assistant", so completion can legitimately precede the first flush; a cold cross-process reader then reads the still-absent file.
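A minimal sketch of that gate, to show why completion can precede the first flush. Only the boolean expression is taken from the description above; the `WaitSnapshot` type and the function wrapper are hypothetical and are not the actual shape of io/wait.ts.

```typescript
// Hypothetical snapshot of what wait() knows about the transcript.
type WaitSnapshot = {
  transcriptCount: number; // records read from <id>.jsonl so far
  lastMessageRole: "user" | "assistant" | null; // null when nothing has been read
};

// The expression quoted in the PR description: an empty transcript counts as
// "reply on disk", so wait() may report completed before the first flush.
function replyOnDisk(s: WaitSnapshot): boolean {
  return s.transcriptCount === 0 || s.lastMessageRole === "assistant";
}

// Before the first flush: nothing read yet, the gate still passes.
const beforeFirstFlush: WaitSnapshot = { transcriptCount: 0, lastMessageRole: null };
// Mid-turn: the user record is on disk, the assistant reply is not.
const midTurn: WaitSnapshot = { transcriptCount: 1, lastMessageRole: "user" };
// After the reply landed.
const afterReply: WaitSnapshot = { transcriptCount: 2, lastMessageRole: "assistant" };

console.log(replyOnDisk(beforeFirstFlush), replyOnDisk(midTurn), replyOnDisk(afterReply));
```

The first case is the one that matters here: with zero records read, the gate passes, and a reader in another process can then open a file that does not exist yet.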

Fix — tolerate it at both seams, one shared budget

  • Write-side (anchorOwnTurn): past the base 3 s window, keep polling out to the shared budget only while the transcript file is addressable-but-absent. A present-file session never crosses the base window — byte-for-byte unchanged.
  • Read-side (settleFirstRecords): before the first read, poll up to the shared budget for the first records. The read-side complement of wait()'s cross-process flush gate.
  • One budget: FIRST_FLUSH_BUDGET_MS = 6000 is the single source of truth; the anchor derives its poll count from it so the two seams can't drift.
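The write-side rule and the derived poll count can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the 250 ms poll interval is inferred from the 12-poll/3 s base window, `BASE_ANCHOR_POLLS`, `TranscriptState`, and `anchorOwnTurnSketch` are hypothetical names, and `anchored: false` stands in for returning the DELIVERY_UNCONFIRMED sentinel. The lost-Enter retry is omitted.

```typescript
// Single source of truth for both seams (value from the PR; the real
// constant is exported from the observer).
const FIRST_FLUSH_BUDGET_MS = 6000;
// Assumed interval: the base window is described as 12 polls over 3 s.
const ANCHOR_POLL_MS = 250;
// Hypothetical name for the original 3 s window.
const BASE_ANCHOR_POLLS = 12;
// Derived, so the anchor window cannot drift from the read-side budget.
const FIRST_FLUSH_ANCHOR_POLLS = Math.ceil(FIRST_FLUSH_BUDGET_MS / ANCHOR_POLL_MS);

// "absent":   the transcript path is addressable but the file is not written yet
// "present":  the file exists but our own turn has not been found in it
// "anchored": our own turn was found, so a real cursor can be returned
type TranscriptState = "absent" | "present" | "anchored";

interface AnchorResult {
  anchored: boolean; // false models the DELIVERY_UNCONFIRMED sentinel
  polls: number; // how many probes were made
}

async function anchorOwnTurnSketch(
  probe: () => TranscriptState,
  sleep: (ms: number) => Promise<void>,
): Promise<AnchorResult> {
  for (let polls = 1; polls <= FIRST_FLUSH_ANCHOR_POLLS; polls++) {
    const state = probe();
    if (state === "anchored") return { anchored: true, polls };
    // Past the base window, only an addressable-but-absent file earns more polls.
    if (polls >= BASE_ANCHOR_POLLS && state !== "absent") {
      return { anchored: false, polls };
    }
    if (polls < FIRST_FLUSH_ANCHOR_POLLS) await sleep(ANCHOR_POLL_MS);
  }
  // Bounded give-up: the extended window is exhausted.
  return { anchored: false, polls: FIRST_FLUSH_ANCHOR_POLLS };
}
```

In this sketch a present-file session stops at poll 12 exactly as before, and an absent-forever session stops at poll 24, so the extension only ever applies to the late-flush case.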

Hardening from an architecture + code-quality review (second commit)

  • Exhausted latch (must-fix). The settle's latch was success-only, so a locatable-but-never-flushing session (crashed/slow resume, adopt-miss locate, pre-first-turn read) re-paid the full budget on every call — stalling exactly the daemon recover()→turnComplete loop. Added #firstFlushSettled: the budget is spent at most once; later calls still re-read the file (late records are surfaced) but never block again.
  • ADR 0009 records the phenomenon, the 3–9 s measurement, why two seams, why 6 s is sufficient (the read trails wait(), so the effective deadline is read-time + 6 s — past the 9 s tail), the existsSync-vs-records-readable boundary, and the additive send+read worst case. wait()'s gate and settleFirstRecords now cross-reference each other.
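The two-latch behaviour on the read side might look roughly like this. It is a sketch under assumptions: the class name, the injected `read` and `sleep` functions, and the 250 ms settle interval are hypothetical, and TypeScript `private` fields stand in for the `#sawRecords` and `#firstFlushSettled` fields named in the PR.

```typescript
const FIRST_FLUSH_BUDGET_MS = 6000; // shared budget (value from the PR)
const SETTLE_POLL_MS = 250; // hypothetical poll interval for this sketch

type ReadRecords = () => string[]; // re-reads the transcript file
type Sleep = (ms: number) => Promise<void>;

class FirstFlushSettlerSketch {
  private sawRecords = false; // success latch: records were seen at least once
  private firstFlushSettled = false; // exhausted latch: the budget was spent once

  constructor(
    private readonly read: ReadRecords,
    private readonly sleep: Sleep,
  ) {}

  async settleFirstRecords(): Promise<string[]> {
    // Always re-read, so records that land late are still surfaced.
    let records = this.read();
    if (records.length > 0) this.sawRecords = true;
    // Either latch makes this call non-blocking.
    if (this.sawRecords || this.firstFlushSettled) return records;

    // First and only blocking settle: poll until records appear or the
    // shared budget runs out.
    const maxPolls = Math.ceil(FIRST_FLUSH_BUDGET_MS / SETTLE_POLL_MS);
    for (let i = 0; i < maxPolls && records.length === 0; i++) {
      await this.sleep(SETTLE_POLL_MS);
      records = this.read();
    }
    this.sawRecords = records.length > 0;
    this.firstFlushSettled = true; // set on success and on exhaustion alike
    return records;
  }
}
```

With only the success latch, a session that never flushes would run the full loop on every call. Setting the second latch on exhaustion is what caps the cost at one budget per observer while leaving the re-read in place.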

Tests

  • never re-pays the budget — proven to fail without the exhausted latch.
  • absent-forever → full extended window, then bounded give-up — the write-side analogue of the read-side exhaustion test; confirms the 24-poll cap terminates (no hang).
  • Plus the original regression suite (late-file-appears, present-from-t0 zero-cost controls).

Full suite green: 358 passed, 3 skipped. Typecheck + biome clean.

Live (claude 2.1.181): the unmodified build reproduced 1/20 DELIVERY_UNCONFIRMED + empty messagesSince("0") under 20 concurrent heavy tool-MCP sessions; the fixed build returned [user, assistant] on 20/20 with 0 sentinels. The fast-flush path is unchanged.

🤖 Generated with Claude Code

isingh and others added 2 commits June 18, 2026 20:44
…irror race)

Under load, MCP/tool startup defers claude's first <id>.jsonl write past the
fixed ~3s anchor window: send() returns DELIVERY_UNCONFIRMED and the observer
reads the still-absent transcript and never re-resolves, so messagesSince("0")
returns [] and turnComplete=false even though wait()->completed. It is a LATE
flush, not a no-flush (file appears 5-9s later under contention).

Two fixes at the read/anchor seam, preserving all core promises (wait()
unchanged; messages still race-free after completed — strengthened; no
pane-scraping; the sentinel still reads empty):

Fix #1 (settle-on-read) — SessionObserver.settleFirstRecords(): a one-time,
latched (#sawRecords) bounded poll (<=5s) for the first records when the
transcript path is known but no records have been seen. messagesSince/
turnComplete await it (they only run after wait()->completed, so the records
the contract owes are en route). A present/already-flushed session finds records
on the first iteration and pays nothing; the latch makes every later call a
no-op. Zero blast radius on non-MCP / fast-flush sessions.

Fix #2 (anchor-on-addressable-but-absent) — anchorOwnTurn extends its poll
budget to FIRST_FLUSH_ANCHOR_POLLS (24 -> 6s) ONLY while the transcript file is
addressable-but-absent (new SessionObserver.transcriptAddressableButAbsent()),
so tool/heavy sessions anchor a real cursor instead of the sentinel. A
present-file session never crosses the original 12-poll/3s window — byte-for-
byte unchanged — and the shorter lost-Enter retry is not extended.

Regression tests: session-observer settle (late file appears after N polls
returns the records within budget; present-from-t0 returns immediately with no
extra polls + latch holds; budget-exhausted returns empty, no hang) and handle
anchor (absent-then-present file anchors a real cursor; present-from-t0 keeps the
original 3s window). Full suite green (356 passed, 3 skipped).

Verified live (claude 2.1.181): under 20 concurrent heavy tool-MCP sessions the
unmodified build reproduced 1/20 DELIVERY_UNCONFIRMED + empty messagesSince("0");
the fixed build returned [user, assistant] on 20/20 with 0 sentinels. The
fast-flush control path is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PeMGuKstkYF9dCgcsxWHWa
…0009

Follow-up hardening on the late-first-flush fix (6ce9849), from an
architecture + code-quality review.

The settle's latch was success-only: SessionObserver.settleFirstRecords()
returned early only once records had been SEEN (#sawRecords). A locatable-
but-never-flushing session — a crashed/slow resume, an adopt-miss locate, a
read before the first turn — therefore re-paid the FULL budget on every
messagesSince/turnComplete, stalling exactly the daemon recover()->turnComplete
loop that walks unhealthy sessions. Add a second latch (#firstFlushSettled) set
when the budget is spent once: every later call is then a no-op that still
re-reads the file (so records landing later are surfaced) but never BLOCKS for
the full budget again.

Also from the review:
- Unify the two seams onto one shared budget. Export FIRST_FLUSH_BUDGET_MS
  (6s) from the observer; anchorOwnTurn derives its extended poll count
  (Math.ceil(BUDGET / ANCHOR_POLL_MS) = 24) from it, so the read-settle and
  write-anchor windows cannot drift apart.
- Record the design as ADR 0009 (the phenomenon, the 3-9s measurement, why two
  seams, why the 6s budget is sufficient given wait() precedes the read, the
  existsSync-vs-records-readable boundary, and the additive send+read worst
  case). Cross-reference wait()'s replyOnDisk gate and settleFirstRecords so a
  future editor cannot defeat one flush-gate without seeing the other.

Tests: new "never re-pays the budget" (fails without the exhausted latch) and
the write-side "absent-forever -> full extended window, then bounded give-up"
(the analogue of the read-side exhaustion test). Full suite green (358 passed,
3 skipped); typecheck + biome clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UgUf4t5AZkFKtgGWu8qxWL
@isingh isingh merged commit bcfd17d into main Jun 18, 2026
11 checks passed
@isingh isingh mentioned this pull request Jun 18, 2026