Skip to content

fix: remaining adversarial-review findings (M4/M5/M6/M7, S2/S3, A4, F2, P2, AU6/AU7, R8, U6)#69

Open
russellromney wants to merge 15 commits into
mainfrom
fix/adversarial-remaining
Open

fix: remaining adversarial-review findings (M4/M5/M6/M7, S2/S3, A4, F2, P2, AU6/AU7, R8, U6)#69
russellromney wants to merge 15 commits into
mainfrom
fix/adversarial-remaining

Conversation

@russellromney

@russellromney russellromney commented Jun 26, 2026

Copy link
Copy Markdown
Owner

Works through the remaining tractable findings from ADVERSARIAL_REVIEW_2026-06-25.md, after #62#65 merged. Audited each against current main first (the report's line refs were two main-versions stale); all below were still open. Themed commits; full cargo test + clippy -D warnings + fmt green on macOS, with the Linux-only io_uring changes typechecked + clippy'd in a rust:1.96.0 container. Regression tests for the key fixes.

Fixed

  • M4 — order "newer never loses" by the commit's history depth (generation, from git rev-list --count) instead of the builder's wall clock, so cross-host clock skew can't reorder. Falls back to synced_at for refs without it. (Known, documented limit: a force-push rewind to an older commit lags until the next forward push.) (tests)
  • M5 — read cache served a ref that lost the durable ordering check for the TTL window; invalidate on save. (test)
  • P2HashingWriter re-hashed the remainder on a short write → bad pack trailer for >256 KiB-compressing blobs. (test)
  • S2 [High] — local retention could delete the only copy of a still-referenced artifact; now protects the ref-reachable set each run (walk shared with remote GC). (test)
  • A4 — a push during an in-flight (claimed) build was coalesced and dropped; coalesce only onto queued jobs, allow a queued job behind a claimed one. (test)
  • M6 — MySQL VARCHAR keys silently truncated/collided; reject over-long keys.
  • M7 — networked queue + per-host file metadata silently lost refs; warn loudly at startup (not refuse — single-host shared-filesystem farm-out is valid).
  • S3 — GC continued past a manifest read error (could delete live objects); fail the pass closed.
  • F2 — unbounded no-dictionary decompress; cap at the frame's raw length.
  • U6 — racy exists()-then-create_dir; use the atomic form.
  • AU6 — artifact redirect ignored private TTL; use the visibility-aware TTL.
  • R8 — HTTP client builder silently dropped auth headers on build error; fail fast.
  • AU7 — token file briefly world-readable + swallowed chmod error; create 0o600 up front, surface errors.
  • U3 — opt-in fsync durability barrier (RIPCLONE_FSYNC=1): flush the staged tree, publish, then flush the parent dir so a crash can't leave a torn tree git calls clean. Off by default. (test)
  • U4 — io_uring write_regular_batch_deferred submitted one window for the whole batch (slot-range overrun past MAX_BATCH_FILES); chunk into windows like the synchronous path. (Linux)
  • U5 — io_uring normal-fd fallback could double-close an fd the kernel already closed (and another thread reused); drop fd handles right after a successful submit, keep the sync close only on submit failure. (Linux)
  • U7 — removed dead write_entry (the archive clone uses the owned-write scheduler) and its now-unused imports.
  • AU8 — credential header name was hardcoded to Authorization; add an optional auth_header_name (config + CLI flag), validated as an HTTP field token so a bad config can't smuggle a second header. (tests)

Deferred (with reasons)

  • P6 (empty-repo clone) — larger than the bullet implies: an empty repo fails at server resolve (404) and build ("no objects to pack") before the client's index_pack is even reached. Real support touches the central resolve/build path. Documented as a roadmap limitation instead of half-fixing one symptom.
  • S4 (download range/resume), P4 (sha256 object format) — features.
  • R4 + R5–R12 (perf) — want before/after benchmarks (like perf(client/extract): cut hot-path copies + a per-file lock (R1/R2/R3) #65).
  • P3 (head-delta cold-base fallback), P5 (LSM ancestry) — deep build-path logic whose force-push/GC failure scenarios can't be reproduced here to verify.
  • The report's "areas needing deeper review" — investigations.

🤖 Generated with Claude Code

@russellromney russellromney changed the title fix: remaining adversarial-review findings (M5, P2, S2, A4, M6, M7, S3, F2, AU6, AU7, R8, U6) fix: remaining adversarial-review findings (M4/M5/M6/M7, S2/S3, A4, F2, P2, AU6/AU7, R8, U6) Jun 26, 2026
russellromney and others added 12 commits June 26, 2026 20:42
- The read cache wrote every ref into the cache even when the durable store
  skipped it as older, so a write that lost the ordering check could still be
  served from cache for the TTL window. Invalidate the cache entry on save
  instead; the next load reads the kept value through.
- The pack HashingWriter hashed the whole buffer before a possibly-short inner
  write, so write_all's retry re-hashed the unwritten remainder — a wrong pack
  trailer that index-pack rejects for blobs whose compressed form spans a short
  write. Hash only the bytes the inner writer accepted.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md M5, P2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eject mismatch (S2, S3, M7)

- Local retention only protected a best-effort side list, and on a local-only
  backend the durability check is a no-op — so a stale/incomplete list could
  delete a still-referenced artifact (its only copy). Retention now also
  protects the set the live refs actually point at, computed each run from the
  ref store (the reachable walk is shared with remote GC).
- Remote GC continued past a manifest read error, leaving that manifest's
  chunks out of the reachable set — a brief failure could delete live objects.
  Fail the whole pass instead.
- A network queue paired with a per-host file metadata store silently lost
  refs (server and workers each read their own files). Refuse that combination
  at startup with a clear message.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md S2, S3, M7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…A4, M6)

- A push that arrived while the prior build for the same key was already
  claimed (and had already fetched) was coalesced onto that in-flight build, so
  the newer commit wasn't built until the next push. Coalescing now targets
  only queued jobs, and the unique index is relaxed to one queued job per key
  so a claimed build and a fresh queued one can coexist — the newer commit gets
  its own job and builds next.
- MySQL key columns are VARCHAR (the composite PK can't be TEXT); an over-long
  repo/branch/queue key would silently truncate and collide. Reject it with a
  clear error instead.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md A4, M6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (F2, U6)

- The no-dictionary frame decompress used unbounded `decode_all`, so a corrupt
  or hostile frame could expand without limit before the length check. Cap it
  at the frame's declared raw length, like the dictionary branch.
- The directory-creation helper probed with `exists()` before `create_dir`,
  which races concurrent producers (one wins, the rest hit EEXIST). Use the
  atomic create-then-tolerate-AlreadyExists form.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md F2, U6

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…0 token (AU6, R8, AU7)

- The direct-artifact redirect signed URLs with a flat 15-minute TTL, ignoring
  repo visibility. Use the same shorter TTL for private repos as the ref path.
- The HTTP client builder fell back to a header-less client on a build error,
  silently dropping the auth headers so every request went out unauthenticated.
  Fail fast instead.
- The token file was written with the default umask then chmod'd 0o600 (briefly
  world-readable) and the chmod error was ignored. Create it 0o600 up front and
  surface a permission error.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md AU6, R8, AU7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (M7)

Refusing to start broke valid single-host farm-out (a networked queue with the
server and workers sharing one filesystem and its file metadata). We can't tell
shared-filesystem from multi-host here, so warn loudly instead of bailing — the
goal is to end the silent footgun, not block a working setup.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md M7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
"A newer sync never loses" ordered by synced_at — the builder's per-host
wall clock — so cross-host clock skew or two writes in the same second could
pick the wrong winner. Order by the commit's history depth instead (its
`generation`, from `git rev-list --count`): recency follows the commit's place
in git history, which is the same for every builder, so skew can't reorder it.

- RefInfo carries `generation`; the build computes it once via `git rev-list
  --count` (the commit-graph we write makes it cheap), set wherever synced_at is.
- The compare (shared `should_replace_ref` + every SQL `save_ordered`) now: same
  commit wins; else higher-or-equal generation wins; else (a side without a
  generation) fall back to synced_at. `refs` tables gain a `generation` column
  (with a best-effort migration); the comparison stays a single atomic upsert.
- synced_at is kept as the fallback for refs/repos without a generation.

Known limitation (documented at the compare): a force-push that rewinds a
branch to an older commit has a lower generation, so it lags until the next
forward push (a same-commit re-sync still updates). That is the cost of ordering
by history rather than by observation time.

Tests: higher generation wins over a newer wall clock (file store + the SQL
lifecycle across sqlite/pg/mysql).

Refs: ADVERSARIAL_REVIEW_2026-06-25.md M4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The archive clone went through the owned-write scheduler; write_entry was
left over and never called. Remove it and the now-unused imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The credential header name was hardcoded to "Authorization". Add an
optional auth_header_name on the provider config (and a CLI flag) so hosts
that expect a custom name (e.g. PRIVATE-TOKEN, a proxy header) work. The
name is validated as an HTTP field token, so a bad config can't smuggle a
second header.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A crash right after a clone could leave a torn working tree that git status
calls clean: the index records stat info but the file data wasn't durable.
With RIPCLONE_FSYNC=1 the client flushes the staged tree, publishes it, then
flushes the parent directory so the rename itself is durable. Off by default
to keep the fast path fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
U4: write_regular_batch_deferred submitted one window for the whole batch,
which overruns the fixed-file slot range if a caller passes more than
MAX_BATCH_FILES. Split into windows like the synchronous path already does.

U5: in the normal-fd fallback, after submit_and_wait the kernel has already
run the Close op for every fd, so the error-path close_open_fds_sync could
double-close an fd another thread has since reused. Drop our fd handles right
after a successful submit; keep the sync close only on the submit-failure
path, where the kernel never took the SQEs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cloning a repo with zero commits isn't supported end to end (resolve 404,
"no objects to pack", index_pack bail). Record what real support requires
rather than half-fixing one symptom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@russellromney russellromney force-pushed the fix/adversarial-remaining branch from 26a4b07 to 8804848 Compare June 27, 2026 03:54
russellromney and others added 3 commits June 27, 2026 00:06
reuse_existing_build re-points a branch when the upstream tip is confirmed to
be an already-built commit. M4's history-depth ordering rejected that re-point
when the tip rewound to a shallower commit, so the persisted ref stayed at the
old commit and fresh clones failed ("missing clonepack manifest") until the
next forward push.

Clear the generation on the authoritative re-point so the fresh synced_at
wins; the next build of that commit re-stamps the depth. The reuse test now
uses generation-bearing refs (what production always has) and fails without
this fix. The remaining residual (rewind to a never-built commit) is noted on
should_replace_ref.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Neither method has a production caller. The editable clone (CloneMode::Editable)
covers full_clone; skeleton clone mode is no longer exposed (mode.rs points
skeleton-backed access at snapshot/mount). Remove both, their now-orphaned
helpers fetch_pack and fetch_sizes (no other callers), and the lone test that
only exercised the dead skeleton_clone method — no live feature loses coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ndow id

collect_completions reaped the whole completion queue and counted every entry
as one of its own. The normal-fd fallback inside flush_deferred_writes can run
while deferred direct windows are still in flight, so a sibling window's
completion could be miscounted (and stolen from that window). Tag-match on
window id: id 0 is this batch (normal-fd/synchronous), ids >= 1 are deferred
windows, which now get routed to their pending window like the direct path
already does. Removes the latent miscount that the verify-first protocol was
the only thing preventing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant