Skip to content

Make generation safe under concurrent invocation against a shared outDir#121

Merged
KayleeWilliams merged 3 commits into
mainfrom
KayleeWilliams/generation-is-racy-when-invoked-concurrently-aga
Jul 2, 2026
Merged

Make generation safe under concurrent invocation against a shared outDir#121
KayleeWilliams merged 3 commits into
mainfrom
KayleeWilliams/generation-is-racy-when-invoked-concurrently-aga

Conversation

@KayleeWilliams

Copy link
Copy Markdown
Collaborator

Fixes #117

Problem

Running the generators (or convertAllMdx) concurrently against the same outDir was racy. In a typical CI setup, lint, typecheck, and build run in parallel and each invokes docs generation, so several processes write/read/delete the same files at once — intermittent partial reads, ENOENT on files another run just replaced, and half-written outputs.

Four mechanisms, all confirmed in the code:

  1. No write was atomic — every artifact was a plain writeFile(finalPath, …), which truncates first, so concurrent readers (another generate run, or a sibling tsc/next build reading public/) could observe empty or half-written files.
  2. The pipeline reads its own output mid-run — llms.txt, llms-full.txt, the search index, and agent-readability artifacts are all built by reading outDir/docs/*.md back off disk, so two runs interleaved their read-back phases.
  3. Delete-then-recreate windows — the agent-skills surface and mounted markdown mirrors were rm -rf'd before being rebuilt, exposing a transient window where the whole surface was missing.
  4. No serialization — nothing prevented N runs from racing at all.

Fix (option 1 + option 2 from the issue)

  • Atomic per-file writes: new internal/atomic-fs.ts (writeFileAtomic / copyFileAtomic — temp sibling + rename, same-directory so the rename never crosses filesystems). Applied to every outDir write site: converted docs, llms.txt/llms-full.txt, search index, sitemaps/robots, feeds, MCP server card, NLWeb artifacts, skills, agents.md, i18n manifest, and the collections sync manifest. Readers now see old content or new content, never a torn file.
  • Write-then-prune instead of delete-then-recreate: agent-skills and mounted markdown mirrors write the new files first and prune stale entries after, so the surface never disappears mid-run.
  • Single-flight lock per outDir: leadtype generate takes a cross-process mkdir lock keyed by a hash of the resolved out path, stored under os.tmpdir() (deliberately not inside public/, which gets deployed verbatim). Concurrent runs wait for the in-flight run. mtime keepalive, 10-minute stale reclaim for crashed runs, 5-minute wait timeout that fails loudly instead of hanging CI. LEADTYPE_NO_LOCK=1 opts out (documented escape hatch for e.g. NFS).
  • Docs: new "Concurrent invocation" section on the generate-static-artifacts page, including the recommendation to still prefer a single upstream generation task where the task graph allows it.

Verification

  • 555 tests pass, lint + typecheck clean. New tests: atomic-fs (incl. a 4MB concurrent reader/writer stress), lock semantics (wait-for-holder, stale reclaim, timeout, 5-way serialization), and a shared-outDir convertAllMdx race with a concurrent reader asserting no torn/vanished files.
  • Raced three real leadtype generate processes against a shared outDir: all exit 0, all artifacts complete and parseable, zero temp-file leaks.
  • Baseline diff: output from this branch is byte-identical to unmodified main (modulo run timestamps that differ between any two runs).
  • Performance (interleaved A/B, 300-doc fixture): +7ms on min times (~1.7%); the atomic write is one extra rename (~0.1–0.2ms/file) and the lock is a fixed ~8ms per run.

Notes

  • The collections git-clone cache has its own cross-outDir race (two projects sharing a cache dir); this lock is keyed per outDir and doesn't cover it — worth a separate issue if it bites.

🤖 Generated with Claude Code

Parallel task graphs (lint/typecheck/build each depending on docs
generation) race on the shared output directory, causing intermittent
partial reads, ENOENT on just-replaced files, and half-written outputs.

- Write every generated artifact to a temp sibling and atomically rename
  into place (new internal writeFileAtomic/copyFileAtomic), so concurrent
  readers see old or new content, never a truncated file.
- Close delete-then-recreate windows: agent-skills and mounted markdown
  mirrors now write new files first and prune stale entries after,
  instead of rm -rf'ing a live directory before rebuilding it.
- Serialize `leadtype generate` runs per outDir with a cross-process
  mkdir lock under os.tmpdir() (mtime keepalive, 10min stale reclaim,
  5min wait timeout that fails loudly). LEADTYPE_NO_LOCK=1 opts out.
- Document the concurrency semantics on the generate pipeline page.

Fixes #117

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 51 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d02ae87-22ab-473c-a010-df0593a28e49

📥 Commits

Reviewing files that changed from the base of the PR and between 07ccef9 and 6b4dcc3.

📒 Files selected for processing (17)
  • .changeset/concurrent-generate-safety.md
  • docs/changelog/0-4.mdx
  • docs/pipeline/generate-static-artifacts.mdx
  • packages/leadtype/src/cli/generate.ts
  • packages/leadtype/src/convert/convert-concurrency.test.ts
  • packages/leadtype/src/convert/convert.ts
  • packages/leadtype/src/feed/index.ts
  • packages/leadtype/src/internal/atomic-fs.test.ts
  • packages/leadtype/src/internal/atomic-fs.ts
  • packages/leadtype/src/internal/generate-lock.test.ts
  • packages/leadtype/src/internal/generate-lock.ts
  • packages/leadtype/src/llm/llm.ts
  • packages/leadtype/src/llm/skills.ts
  • packages/leadtype/src/mcp/card.ts
  • packages/leadtype/src/nlweb/artifacts.ts
  • packages/leadtype/src/search/node.ts
  • packages/leadtype/src/sync/sync.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch KayleeWilliams/generation-is-racy-when-invoked-concurrently-aga

Comment @coderabbitai help to get the list of available commands.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 540f95293d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


while (!(await tryAcquire(lockPath))) {
if (await isStale(lockPath, staleMs)) {
await rm(lockPath, { force: true, recursive: true });

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard stale lock removal against new owners

When two waiters contend while a stale lock is being reclaimed, one waiter can stat the old stale directory, another process can remove it and acquire a fresh lock, and then this unconditional rm deletes that fresh lock. The first waiter can then create its own lock, so two leadtype generate runs enter the critical section for the same outDir, defeating serialization in the crashed/stale-lock recovery path; make reclaim compare-and-swap with an owner token/mtime recheck before removing the directory.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6b4dcc3. Reclaim now renames the lock to a unique trash path first (atomic — exactly one reclaimer wins), then re-verifies the renamed directory really was abandoned: if a fresh lock slipped into the check→rename window, it is renamed back. A plain rm of a fresh lock is no longer possible. The one remaining window (a third process acquiring between our rename-away and rename-back) is microseconds wide and commented at the site.

// the copy (instead of rm -rf on the whole mirror before it) keeps the
// mirror readable throughout — a concurrent reader never sees the
// directory disappear mid-generation.
const currentFiles = new Set(files);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent mirrors from reading their own output

When a mount's urlPrefix resolves inside its own source subtree, such as pathPrefix: "guides" with urlPrefix: "/docs/guides/public", leaving the mirror in place means the earlier glob over sourceDir also sees files from the previous mirror. This keep-set then treats those mirror files as current and the copy step writes them under public/public/... on every run; the old rm(targetDir) avoided that, so exclude targetDir from the source glob or reject nested targets before pruning.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6b4dcc3. When the mirror target resolves inside its own source subtree, the target is now excluded from the source glob (ignore: ["<target>/**"]), so a previous run's mirror output is never picked up as source content and re-mirrored into public/public/....

@pullfrog pullfrog Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ No critical issues — a couple of lock-lifecycle observations worth a look before merge, otherwise mergeable.

Reviewed changes — makes leadtype generate / convertAllMdx safe under concurrent invocation against a shared outDir via atomic writes, write-then-prune, and a cross-process lock.

  • Atomic per-file writes — new internal/atomic-fs.ts (writeFileAtomic / copyFileAtomic: temp sibling + same-dir rename, no fsync) applied to every outDir write site (convert, llms/agent artifacts, skills, search, feed, MCP card, NLWeb, sync manifest, i18n manifest, mounted mirrors).
  • Write-then-pruneskills.ts skillsRoot and copyMountedMarkdownMirrors now write new files first and prune stale entries after, replacing rm -rf-then-rebuild so the surface never disappears mid-run.
  • Single-flight lock — new internal/generate-lock.ts: mkdir-based cross-process lock under os.tmpdir() keyed by a hash of the resolved outDir, with mtime keepalive, 10-min stale reclaim, and a 5-min wait timeout; wired into runGenerateCommand and released in finally. LEADTYPE_NO_LOCK=1 opts out.
  • Docs + changeset — new "Concurrent invocation" section on the generate page and a patch changeset.

ℹ️ Lock recovery relies solely on the 10-minute stale window

release() only runs in the finally of a normal completion or thrown error. If the process is interrupted (Ctrl-C, SIGTERM, OOM kill), the lock directory survives, and nothing reclaims it until the 10-minute stale window elapses. Combined with the 5-minute wait timeout, a developer who cancels a run and immediately re-runs will wait 5 minutes and then hit a hard failure, even though no run is actually active.

The owner.json pid is recorded but never consulted, so there is no liveness check to short-circuit a lock held by a dead process. Hard crashes (SIGKILL) also leave atomic-fs temp files (.{name}.{pid}-{rand}.tmp) in the output directory, which can then be deployed verbatim from public/.

Technical details
# Lock recovery relies solely on the 10-minute stale window

## Affected sites
- `packages/leadtype/src/internal/generate-lock.ts:142-149``release` runs only via the caller's `finally`; no `process.on('SIGINT'|'SIGTERM')` cleanup.
- `packages/leadtype/src/internal/generate-lock.ts:84-96``owner.json` records `pid` but nothing reads it back for a liveness check.
- `packages/leadtype/src/internal/atomic-fs.ts:36-43` — temp-file `rm` cleanup runs only on a caught error, so a killed process leaks `.{name}.{pid}-{rand}.tmp` siblings into `outDir`.

## Required outcome
- Interrupting a run should not force the next run to wait out the full stale window.
- Leaked temp files should not end up in a deployed output tree.

## Suggested approach (optional)
- Register a best-effort `SIGINT`/`SIGTERM` handler that calls `release()` (synchronous `rmSync` is acceptable in a signal path).
- Use the recorded `pid` for a liveness probe (`process.kill(pid, 0)`) so a lock owned by a dead process is reclaimed immediately rather than after `staleMs`.
- Consider a prune of stale `.*.tmp` siblings at generate start, or document that hard-crash temp files may linger.

## Open questions for the human
- Is cross-platform pid liveness (Windows) in scope, or is the stale window an acceptable floor for crash recovery?

Pullfrog  | Fix all ➔Fix 👍s ➔View workflow run | Using Claude Opus𝕏

Comment thread packages/leadtype/src/internal/generate-lock.ts Outdated
Comment thread packages/leadtype/src/cli/generate.ts Outdated

@pullfrog pullfrog Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ No new issues — the only change since my last review is a docs-only changelog entry.

Reviewed changes — incremental review of the one new commit since 540f952.

  • Changelog entry153e845 adds a "Concurrent generation safety" section to docs/changelog/0-4.mdx summarizing the atomic-write, write-then-prune, and single-flight-lock work. Docs-only, no code change, and the description matches the implementation.

The lock-lifecycle observations from my initial review still apply (5-min wait timeout vs 10-min stale window, and recovery after abrupt termination) — nothing in this commit affects them.

Pullfrog  | Fix it ➔View workflow run | Using Claude Opus𝕏

- Reclaim abandoned locks via atomic rename-then-verify instead of a plain
  rm, so a competing waiter can never delete a fresh lock acquired between
  its staleness check and the removal (Codex review).
- Release the lock on SIGINT/SIGTERM and re-raise, so an interrupted run
  never stalls the next one; probe the owner.json pid so a hard-killed
  holder is reclaimed immediately instead of after the stale window
  (Pullfrog review).
- Raise the default wait timeout to 15 minutes (above the 10-minute stale
  window) so waiters outlive crashed holders and never fail a healthy
  long run; LEADTYPE_LOCK_TIMEOUT_MS overrides (Pullfrog review).
- Sweep temp files leaked by hard-killed runs at the start of the next
  locked run so they never linger in a deployed public/ (Pullfrog review).
- Exclude a nested mirror target from its own source glob so a mount whose
  urlPrefix resolves inside its source subtree never re-mirrors its
  previous output (Codex review).
- Prune directories left empty after removing stale mirror files
  (Pullfrog review).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@KayleeWilliams

Copy link
Copy Markdown
Collaborator Author

Addressed all review feedback in 6b4dcc3:

Lock lifecycle (Pullfrog review body + inline):

  • SIGINT/SIGTERM: the holder now releases its lock on the way out (sync rmSync in a once-handler, then re-raises to preserve default termination and exit code). An interrupted run no longer stalls the next one at all.
  • Dead-holder detection: waiters probe the pid recorded in owner.json (process.kill(pid, 0); the lock lives in the local tmpdir, so the holder is always a same-machine process — this also answers the Windows question: kill(pid, 0) works on Windows Node, and the mtime stale window remains the floor if the probe is ever inconclusive, e.g. EPERM or pid reuse). A SIGKILLed/OOM-killed holder is reclaimed immediately instead of after 10 minutes — verified end-to-end: kill -9 mid-run, next run completed in 0.7s.
  • Wait timeout vs stale window: default wait raised to 15 min (> the 10-min stale window), so waiters always outlive crashed holders and never fail a healthy long run; LEADTYPE_LOCK_TIMEOUT_MS overrides.
  • Leaked temp files: temp siblings left by a hard-killed run are swept at the start of the next locked run (strict .name.pid-12hex.tmp pattern match, so user dotfiles are never touched), so they can't linger in a deployed public/.

Reclaim race (Codex P2): reclaim is now rename-to-trash + verify + hand-back instead of a plain rm — a competing waiter can no longer delete a fresh lock.

Nested mirror self-copy (Codex P2): mirror targets nested inside their own source subtree are excluded from the source glob.

Empty dirs after prune (Pullfrog ℹ️): pruned files' now-empty parent chains are removed with best-effort rmdir.

New tests cover dead-pid reclaim, signal-handler registration/cleanup, and the temp-file sweep (560 total, all passing). Docs/changeset/changelog updated to match the new recovery semantics.

@pullfrog pullfrog Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ No new issues found.

Reviewed changes — the single new commit 6b4dcc33 hardens lock recovery and mirror pruning in response to my prior review; all three earlier findings are now addressed.

  • Fast abandoned-lock recoverySIGINT/SIGTERM handlers release the lock on the way out (then re-raise to preserve exit code), waiters probe the recorded holder pid and reclaim a dead holder immediately, and the mtime stale window is now only the fallback for unidentifiable locks.
  • Reclaim without racing acquirersreclaimAbandonedLock renames the lock to a unique trash path (atomic, one winner), re-checks abandonment, and hands a live lock back if it grabbed one created in the check→rename window.
  • Wait timeout raised above stale window — default is now 15m (> the 10m stale window) so a healthy long-running holder is waited on rather than failed, with LEADTYPE_LOCK_TIMEOUT_MS to extend for very large sites.
  • Leaked temp-file sweepsweepLeakedTempFiles removes .name.pid-hex.tmp droppings left by a hard-killed run at the start of the next locked run, regex-gated so user dotfiles like .env.tmp are never touched.
  • Empty mirror-dir pruning + self-mirror exclusionremoveEmptyMirrorDirs walks pruned files' parent chains and rmdirs newly-empty dirs; copyMountedMarkdownMirrors now excludes a mirror nested inside its own source from the source glob.
  • Docs/changeset text — updated to describe the new recovery behavior and env var.

The reclaim-via-rename path documents a microseconds-wide window where a displaced live lock is dropped; that is a deliberate, acknowledged tradeoff (atomic per-file writes keep artifacts intact), so no action needed.

Pullfrog  | View workflow run | Using Claude Opus𝕏

@KayleeWilliams KayleeWilliams merged commit d569171 into main Jul 2, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generation is racy when invoked concurrently against a shared outDir

1 participant