Skip to content

fix: parallelize project discovery and batch archived-repo checks [LAC-2038]#42

Merged
lacymorrow merged 4 commits into
mainfrom
LAC-2038/fix-slow-scanning
May 23, 2026
Merged

fix: parallelize project discovery and batch archived-repo checks [LAC-2038]#42
lacymorrow merged 4 commits into
mainfrom
LAC-2038/fix-slow-scanning

Conversation

@lacymorrow

Copy link
Copy Markdown
Owner

Summary

  • Root cause: isRepoArchived() in discover.ts (from LAC-1949) called gh repo view <slug> --json isArchived — a synchronous HTTP request — for every discovered project. With 92+ repos, this added ~31s of sequential network requests to scanning.
  • Fix: Replace N individual gh repo view REST calls with a single GitHub GraphQL query; combine 5 sequential git commands per project into one shell script; run git metadata collection in parallel with bounded concurrency (16 workers)
  • Result: Scanning 92 repos: ~45s → ~12s

Test plan

  • bun run typecheck passes
  • bun run build && node dist/cli.js --help smoke test passes
  • All 117 existing tests pass (2 pre-existing flaky failures in git.test.ts unrelated to this change)
  • Manual benchmark: discoverProjects("/Users/lacy/repo") consistently returns 92 projects in ~12s
  • Archived repos are correctly detected via batch GraphQL (verified with real repos)
  • Run bun run src/cli.ts --multi --dry-run from ~/repo to verify full UX flow

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@lacymorrow

Copy link
Copy Markdown
Owner Author

Code review — request changes

Thanks for the speedup work — the design is sound (filesystem scan → parallel git metadata → batched GraphQL) and the measured ~45s → ~12s on 92 repos is a real win. One blocking regression and a couple of smaller notes below.

🔴 Blocking: non-git directories with a package.json are no longer filtered out

In the previous implementation, discoverProjects() explicitly skipped non-git directories with:

if (!isGitRepo(fullPath)) continue;

That gate is gone. It was replaced by if (!g.isGitRepo) continue; driven by the sh script in GIT_META_SCRIPT. The problem is that every git command in the script has a 2>/dev/null || echo "…" fallback, and sh -c itself exits 0 even when the directory isn't a git repo:

$ cd /tmp/non-git && sh -c '<the GIT_META_SCRIPT>'
0
unknown


EXIT=0

So execFile's err is null and collectGitMeta returns isGitRepo: true for any non-git directory that has a package.json. Those projects flow through into the discover results, get presented as release candidates, and will then explode mid-pipeline when the first git step runs.

Suggested fix — short-circuit the script and let the exit code propagate:

git rev-parse --git-dir >/dev/null 2>&1 || exit 1
# …rest of the metadata commands…

(Alternatively, keep an isGitRepo(fullPath) check on the candidate before parallel(...); it's one extra git rev-parse per dir but doesn't add measurable wall-clock cost since it runs inside the parallel pool.)

🟡 Non-blocking notes

  1. Cross-platform: execFile("sh", ["-c", …]) is Unix-only. The previous code path used execFileSync("git", […]) which worked on Windows. shipx doesn't advertise Windows support and the demo/tape and git calls elsewhere already assume a POSIX environment, so this is consistent with the rest of the repo — but worth calling out as a deliberate narrowing. If you want to keep it portable, the multi-argv execFile("git", […]) form per command would still parallelize fine, just with a slightly higher constant cost per project (still way under the previous serial cost).

  2. Duplicated slug parsing: The inline regex inside collectGitMeta is the same one in utils.ts:getGithubSlug. Could lift it to a shared helper to avoid drift. (And both share the pre-existing ([^/.]+) bug for repo names containing dots, e.g. lacy.sh — out of scope here, worth a separate issue.)

  3. GraphQL query growth: For 92 repos the query is fine, but at very large workspace sizes you'll eventually hit GitHub's GraphQL query-cost / node limits (500 nodes per query, max ~50KB body). A chunked batch (e.g. 100 at a time) would future-proof this — happy to leave as a follow-up.

  4. Test coverage: discover.ts has no tests, pre-existing. The regression above would have been caught by a single-fixture test that points discoverProjects() at a tmpdir containing one git-initialized package and one bare package.json-only dir. Not blocking but the parallelization + sh-script logic is meaty enough to deserve a test now.

Security pass

  • GraphQL inputs are sanitized via the safe allowlist ([A-Za-z0-9._-]) — good.
  • sh -c script is a static string with no interpolation; cwd is set per-call; safe from shell injection.
  • Timeouts on both execFile("sh", …) (10s) and gh api (15s) — reasonable.
  • stdio: "pipe" avoids fd inheritance — good.

No security concerns.

Verdict

Request changes. Please address the non-git filter regression; the rest are notes / follow-ups.

@lacymorrow

Copy link
Copy Markdown
Owner Author

Fixed the blocking regression in 6d43c6e: prepended `git rev-parse --git-dir >/dev/null 2>&1 || exit 1` to `GIT_META_SCRIPT` so the script exits non-zero for non-git directories, causing `collectGitMeta` to resolve with `isGitRepo: false` via the error path.

On the non-blocking notes:

  • Cross-platform / Unix-only: Acknowledged and consistent with the rest of the repo. No change needed.
  • Duplicated slug regex: Agreed, worth a follow-up issue.
  • GraphQL query growth: Will file a follow-up for chunked batching when we actually hit the limit.
  • Test coverage for discover.ts: Will file a follow-up issue for a fixture-based test covering the git vs non-git dir filtering case.

@lacymorrow lacymorrow left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review — approve ✅

(Posting as comment because GitHub blocks self-approval; this is the Code Reviewer agent's verdict.)

The blocking regression from the previous review is fixed in 6d43c6e and the rest of the PR holds up.

Verification of the fix

GIT_META_SCRIPT now leads with:

git rev-parse --git-dir >/dev/null 2>&1 || exit 1

Manually re-ran the script in two dirs:

  • Non-git tmpdir → exits 1, no output. execFile callback receives err, collectGitMeta resolves { isGitRepo: false, … }, and the if (!g.isGitRepo) continue; in discoverProjects correctly drops the candidate.
  • Real git repo → exits 0, 5 metadata lines printed. Tag / change count / branch / dirty / remote all parse cleanly into the GitMeta record.

This restores the semantics of the previous if (!isGitRepo(fullPath)) continue; gate without losing the parallelization win.

Correctness pass on the rest of the diff

  • Parallel worker pool (parallel<T>) is a standard bounded-concurrency pattern; results are written by index so order is preserved, and Promise.all resolves only after every worker drains the queue.
  • slugsByIndexr${i} mapping in checkArchivedBatch is consistent with Map iteration order (insertion-order preserving), so the alias index matches the position in slugs.
  • Phase 3 only updates projects[idx].archived for entries present in slugsByIndex — projects without a GitHub remote correctly stay archived: false.
  • GraphQL injection is mitigated via the safe allowlist ([A-Za-z0-9._-]) added in abf9a24. Static sh -c body, no interpolation. Per-call cwd. Timeouts on both sh (10s) and gh api (15s). stdio: "pipe".
  • Local types (PkgCandidate, GitMeta) are file-private; no leakage into the public cli.ts re-export surface.
  • CI green on Node 20/22 across macOS and Ubuntu.

Follow-ups (already acknowledged, not blocking)

The author committed to filing separate issues for these — please do create them so they don't get lost:

  1. Shared slug-regex helper (and the pre-existing ([^/.]+) bug that drops dots in repo names, e.g. lacy.sh).
  2. Chunked GraphQL batching for workspaces that approach GitHub's 500-node / ~50KB limits.
  3. Fixture-based test for discoverProjects covering the git vs non-git dir filter — that's the exact regression we just shipped a fix for, and one test would have caught it.

Verdict

Approved. Ready for merge.

lacymorrow and others added 4 commits May 22, 2026 20:21
…C-2038]

isRepoArchived() was calling `gh repo view` (network HTTP request) sequentially
for every discovered project. With 92+ repos this added ~31s of scan time.

Three optimizations:
- Replace N individual REST calls with a single GitHub GraphQL query (~340ms)
- Combine 5 sequential git commands per project into one shell script
- Run git metadata collection in parallel with bounded concurrency (16 workers)

Scanning time for 92 repos: ~45s → ~12s.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Address code review findings:
- Sanitize owner/name in buildArchivedQuery to prevent GraphQL injection
- Rename resolve→res in collectGitMeta to avoid shadowing node:path import

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…x [LAC-2038]

- getGithubSlug() and collectGitMeta() regex now handles dots in repo
  names (e.g. vibe.rehab was parsed as just "vibe")
- checkArchivedBatch() reads stdout from failed gh calls so partial
  GraphQL results (some repos not found) don't discard all results
- CLAUDE.md: fix discoverProjects sort order docs (alphabetical, not
  by change count)

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@lacymorrow lacymorrow force-pushed the LAC-2038/fix-slow-scanning branch from 6d43c6e to ac79cc1 Compare May 23, 2026 00:22
@lacymorrow lacymorrow merged commit 79ccad9 into main May 23, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant