Make generation safe under concurrent invocation against a shared outDir#121
Conversation
Parallel task graphs (lint/typecheck/build each depending on docs generation) race on the shared output directory, causing intermittent partial reads, ENOENT on just-replaced files, and half-written outputs. - Write every generated artifact to a temp sibling and atomically rename into place (new internal writeFileAtomic/copyFileAtomic), so concurrent readers see old or new content, never a truncated file. - Close delete-then-recreate windows: agent-skills and mounted markdown mirrors now write new files first and prune stale entries after, instead of rm -rf'ing a live directory before rebuilding it. - Serialize `leadtype generate` runs per outDir with a cross-process mkdir lock under os.tmpdir() (mtime keepalive, 10min stale reclaim, 5min wait timeout that fails loudly). LEADTYPE_NO_LOCK=1 opts out. - Document the concurrency semantics on the generate pipeline page. Fixes #117 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 51 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (17)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 540f95293d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| while (!(await tryAcquire(lockPath))) { | ||
| if (await isStale(lockPath, staleMs)) { | ||
| await rm(lockPath, { force: true, recursive: true }); |
There was a problem hiding this comment.
Guard stale lock removal against new owners
When two waiters contend while a stale lock is being reclaimed, one waiter can stat the old stale directory, another process can remove it and acquire a fresh lock, and then this unconditional rm deletes that fresh lock. The first waiter can then create its own lock, so two leadtype generate runs enter the critical section for the same outDir, defeating serialization in the crashed/stale-lock recovery path; make reclaim compare-and-swap with an owner token/mtime recheck before removing the directory.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 6b4dcc3. Reclaim now renames the lock to a unique trash path first (atomic — exactly one reclaimer wins), then re-verifies the renamed directory really was abandoned: if a fresh lock slipped into the check→rename window, it is renamed back. A plain rm of a fresh lock is no longer possible. The one remaining window (a third process acquiring between our rename-away and rename-back) is microseconds wide and commented at the site.
| // the copy (instead of rm -rf on the whole mirror before it) keeps the | ||
| // mirror readable throughout — a concurrent reader never sees the | ||
| // directory disappear mid-generation. | ||
| const currentFiles = new Set(files); |
There was a problem hiding this comment.
Prevent mirrors from reading their own output
When a mount's urlPrefix resolves inside its own source subtree, such as pathPrefix: "guides" with urlPrefix: "/docs/guides/public", leaving the mirror in place means the earlier glob over sourceDir also sees files from the previous mirror. This keep-set then treats those mirror files as current and the copy step writes them under public/public/... on every run; the old rm(targetDir) avoided that, so exclude targetDir from the source glob or reject nested targets before pruning.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 6b4dcc3. When the mirror target resolves inside its own source subtree, the target is now excluded from the source glob (ignore: ["<target>/**"]), so a previous run's mirror output is never picked up as source content and re-mirrored into public/public/....
There was a problem hiding this comment.
ℹ️ No critical issues — a couple of lock-lifecycle observations worth a look before merge, otherwise mergeable.
Reviewed changes — makes leadtype generate / convertAllMdx safe under concurrent invocation against a shared outDir via atomic writes, write-then-prune, and a cross-process lock.
- Atomic per-file writes — new
internal/atomic-fs.ts(writeFileAtomic/copyFileAtomic: temp sibling + same-dirrename, no fsync) applied to everyoutDirwrite site (convert, llms/agent artifacts, skills, search, feed, MCP card, NLWeb, sync manifest, i18n manifest, mounted mirrors). - Write-then-prune —
skills.tsskillsRoot andcopyMountedMarkdownMirrorsnow write new files first and prune stale entries after, replacingrm -rf-then-rebuild so the surface never disappears mid-run. - Single-flight lock — new
internal/generate-lock.ts:mkdir-based cross-process lock underos.tmpdir()keyed by a hash of the resolved outDir, with mtime keepalive, 10-min stale reclaim, and a 5-min wait timeout; wired intorunGenerateCommandand released infinally.LEADTYPE_NO_LOCK=1opts out. - Docs + changeset — new "Concurrent invocation" section on the generate page and a patch changeset.
ℹ️ Lock recovery relies solely on the 10-minute stale window
release() only runs in the finally of a normal completion or thrown error. If the process is interrupted (Ctrl-C, SIGTERM, OOM kill), the lock directory survives, and nothing reclaims it until the 10-minute stale window elapses. Combined with the 5-minute wait timeout, a developer who cancels a run and immediately re-runs will wait 5 minutes and then hit a hard failure, even though no run is actually active.
The owner.json pid is recorded but never consulted, so there is no liveness check to short-circuit a lock held by a dead process. Hard crashes (SIGKILL) also leave atomic-fs temp files (.{name}.{pid}-{rand}.tmp) in the output directory, which can then be deployed verbatim from public/.
Technical details
# Lock recovery relies solely on the 10-minute stale window
## Affected sites
- `packages/leadtype/src/internal/generate-lock.ts:142-149` — `release` runs only via the caller's `finally`; no `process.on('SIGINT'|'SIGTERM')` cleanup.
- `packages/leadtype/src/internal/generate-lock.ts:84-96` — `owner.json` records `pid` but nothing reads it back for a liveness check.
- `packages/leadtype/src/internal/atomic-fs.ts:36-43` — temp-file `rm` cleanup runs only on a caught error, so a killed process leaks `.{name}.{pid}-{rand}.tmp` siblings into `outDir`.
## Required outcome
- Interrupting a run should not force the next run to wait out the full stale window.
- Leaked temp files should not end up in a deployed output tree.
## Suggested approach (optional)
- Register a best-effort `SIGINT`/`SIGTERM` handler that calls `release()` (synchronous `rmSync` is acceptable in a signal path).
- Use the recorded `pid` for a liveness probe (`process.kill(pid, 0)`) so a lock owned by a dead process is reclaimed immediately rather than after `staleMs`.
- Consider a prune of stale `.*.tmp` siblings at generate start, or document that hard-crash temp files may linger.
## Open questions for the human
- Is cross-platform pid liveness (Windows) in scope, or is the stale window an acceptable floor for crash recovery?Claude Opus | 𝕏
There was a problem hiding this comment.
ℹ️ No new issues — the only change since my last review is a docs-only changelog entry.
Reviewed changes — incremental review of the one new commit since 540f952.
- Changelog entry —
153e845adds a "Concurrent generation safety" section todocs/changelog/0-4.mdxsummarizing the atomic-write, write-then-prune, and single-flight-lock work. Docs-only, no code change, and the description matches the implementation.
The lock-lifecycle observations from my initial review still apply (5-min wait timeout vs 10-min stale window, and recovery after abrupt termination) — nothing in this commit affects them.
Claude Opus | 𝕏
- Reclaim abandoned locks via atomic rename-then-verify instead of a plain rm, so a competing waiter can never delete a fresh lock acquired between its staleness check and the removal (Codex review). - Release the lock on SIGINT/SIGTERM and re-raise, so an interrupted run never stalls the next one; probe the owner.json pid so a hard-killed holder is reclaimed immediately instead of after the stale window (Pullfrog review). - Raise the default wait timeout to 15 minutes (above the 10-minute stale window) so waiters outlive crashed holders and never fail a healthy long run; LEADTYPE_LOCK_TIMEOUT_MS overrides (Pullfrog review). - Sweep temp files leaked by hard-killed runs at the start of the next locked run so they never linger in a deployed public/ (Pullfrog review). - Exclude a nested mirror target from its own source glob so a mount whose urlPrefix resolves inside its source subtree never re-mirrors its previous output (Codex review). - Prune directories left empty after removing stale mirror files (Pullfrog review). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
Addressed all review feedback in 6b4dcc3: Lock lifecycle (Pullfrog review body + inline):
Reclaim race (Codex P2): reclaim is now rename-to-trash + verify + hand-back instead of a plain Nested mirror self-copy (Codex P2): mirror targets nested inside their own source subtree are excluded from the source glob. Empty dirs after prune (Pullfrog ℹ️): pruned files' now-empty parent chains are removed with best-effort New tests cover dead-pid reclaim, signal-handler registration/cleanup, and the temp-file sweep (560 total, all passing). Docs/changeset/changelog updated to match the new recovery semantics. |
There was a problem hiding this comment.
✅ No new issues found.
Reviewed changes — the single new commit 6b4dcc33 hardens lock recovery and mirror pruning in response to my prior review; all three earlier findings are now addressed.
- Fast abandoned-lock recovery —
SIGINT/SIGTERMhandlers release the lock on the way out (then re-raise to preserve exit code), waiters probe the recorded holderpidand reclaim a dead holder immediately, and the mtime stale window is now only the fallback for unidentifiable locks. - Reclaim without racing acquirers —
reclaimAbandonedLockrenames the lock to a unique trash path (atomic, one winner), re-checks abandonment, and hands a live lock back if it grabbed one created in the check→rename window. - Wait timeout raised above stale window — default is now 15m (> the 10m stale window) so a healthy long-running holder is waited on rather than failed, with
LEADTYPE_LOCK_TIMEOUT_MSto extend for very large sites. - Leaked temp-file sweep —
sweepLeakedTempFilesremoves.name.pid-hex.tmpdroppings left by a hard-killed run at the start of the next locked run, regex-gated so user dotfiles like.env.tmpare never touched. - Empty mirror-dir pruning + self-mirror exclusion —
removeEmptyMirrorDirswalks pruned files' parent chains andrmdirs newly-empty dirs;copyMountedMarkdownMirrorsnow excludes a mirror nested inside its own source from the source glob. - Docs/changeset text — updated to describe the new recovery behavior and env var.
The reclaim-via-rename path documents a microseconds-wide window where a displaced live lock is dropped; that is a deliberate, acknowledged tradeoff (atomic per-file writes keep artifacts intact), so no action needed.
Claude Opus | 𝕏

Fixes #117
Problem
Running the generators (or
convertAllMdx) concurrently against the sameoutDirwas racy. In a typical CI setup, lint, typecheck, and build run in parallel and each invokes docs generation, so several processes write/read/delete the same files at once — intermittent partial reads, ENOENT on files another run just replaced, and half-written outputs.Four mechanisms, all confirmed in the code:
writeFile(finalPath, …), which truncates first, so concurrent readers (another generate run, or a siblingtsc/next buildreadingpublic/) could observe empty or half-written files.outDir/docs/*.mdback off disk, so two runs interleaved their read-back phases.rm -rf'd before being rebuilt, exposing a transient window where the whole surface was missing.Fix (option 1 + option 2 from the issue)
internal/atomic-fs.ts(writeFileAtomic/copyFileAtomic— temp sibling + rename, same-directory so the rename never crosses filesystems). Applied to every outDir write site: converted docs, llms.txt/llms-full.txt, search index, sitemaps/robots, feeds, MCP server card, NLWeb artifacts, skills, agents.md, i18n manifest, and the collections sync manifest. Readers now see old content or new content, never a torn file.leadtype generatetakes a cross-processmkdirlock keyed by a hash of the resolved out path, stored underos.tmpdir()(deliberately not insidepublic/, which gets deployed verbatim). Concurrent runs wait for the in-flight run. mtime keepalive, 10-minute stale reclaim for crashed runs, 5-minute wait timeout that fails loudly instead of hanging CI.LEADTYPE_NO_LOCK=1opts out (documented escape hatch for e.g. NFS).Verification
convertAllMdxrace with a concurrent reader asserting no torn/vanished files.leadtype generateprocesses against a shared outDir: all exit 0, all artifacts complete and parseable, zero temp-file leaks.Notes
🤖 Generated with Claude Code