test(uffd): add deterministic race tests for stale-source / madvise-deadlock / faulted-short-circuit#2511
Conversation
PR SummaryHigh Risk Overview In the UFFD handler, introduces explicit Reviewed by Cursor Bugbot for commit 763945a. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 40be15028f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| dup, err := syscall.Dup(int(uffdFd)) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
Release childForkMu when Dup fails
childForkMu is acquired before syscall.Dup, but the Dup error path uses require.NoError without an unlock. If Dup fails (for example under transient fd pressure), FailNow exits this test while leaving the global mutex locked, and subsequent cross-process tests block forever on childForkMu.Lock, turning one failure into a suite-wide deadlock.
Useful? React with 👍 / 👎.
| exit, err := fdexit.New() | ||
| if err != nil { | ||
| fmt.Fprintln(os.Stderr, "fdexit.New:", err) | ||
|
|
||
| return |
There was a problem hiding this comment.
Propagate startServe setup failures
When fdexit.New() fails, startServe only logs and returns, but callers still treat the helper as healthy (e.g., WaitReady can succeed even though no serve loop started). In that state, later page-fault operations can block until outer test timeouts, which obscures the root cause and makes failures non-deterministic under low-fd conditions.
Useful? React with 👍 / 👎.
…e from feat/free-page-reporting Lift the production-side UFFD subsystem closure that the upcoming RPC test harness and race tests depend on, taken verbatim from the tip of feat/free-page-reporting (commit f310273). The closure brings in: - pageTracker gains the `removed` pageState - Userfaultfd.Serve() splits readEvents into removes + pagefaults, drains the REMOVE batch under settleRequests.Lock(), then dispatches the pagefault batch - Worker fault dispatch with switch on pageTracker state (note: the state-snapshot read happens in the parent loop here — that is the bug the stacked PR #2512 fixes) - wakeupPipe self-pipe to wake the poll loop when a goroutine defers a page fault - defaultCopyMode override, MapMemoryReadyAddr lifecycle bits - deferred.go (deferred pagefault list) - prefault.go updates for prefetch / WP coordination - remove_test.go (REMOVE-event integration tests) - cross_process_helpers_test.go / helpers_test.go / fd_helpers_test.go REMOVE-aware variants This commit is purely a state lift of packages/orchestrator/pkg/sandbox/uffd/userfaultfd/ — no other paths touched, no Firecracker bumps, no feature flags, no proto regen, no template-manager / API plumbing. Required so that test/uffd-rpc-harness-and-race-tests can target main (via feat/uffd-test-scaffolding) instead of being stacked on top of the full free-page-reporting feature PR (#1896).
Replace the parent↔child cross-process userfaultfd test harness's pile of
single-purpose pipes (offsets, ready, gate-cmd, gate-sync) plus a SIGUSR1
shutdown signal and a SIGUSR2 page-state-snapshot trigger with one
bidirectional Unix domain socket carrying stdlib net/rpc + net/rpc/jsonrpc.
Per-call request IDs are handled by the standard library, so the parent can
issue concurrent RPCs while a handler is parked (which the new fault-barrier
methods need). The only fd we still hand off out-of-band is the userfaultfd
itself (kernel object, has to go through ExtraFiles); the initial source
data is written to a temp file under t.TempDir() because base64-stuffing
megabytes through JSON would be silly.
Replaced surface:
- 4 pipes + 2 signals → 1 socket + 1 RPC channel
- SIGUSR1 graceful exit → Service.Shutdown
- SIGUSR2 + offsets pipe + binary.Write → Service.PageStates
- ready pipe + ReadAll-on-close handshake → Service.WaitReady
- gate-cmd / gate-sync byte protocol → Service.ServePause / Service.ServeResume
Added on top of the same channel:
- Service.InstallFaultBarrier(addr, point) → token: arms a deterministic
barrier in the child's worker goroutine at one of two test-only hook
points (beforeWorkerRLockHook, beforeFaultPageHook).
- Service.WaitFaultHeld(token): blocks until the worker reaches the
barrier — the RPC reply IS the rendezvous.
- Service.ReleaseFault(token): lets the parked worker proceed.
The two hook fields on Userfaultfd default to nil and are nil-checked on
the hot path, so production sees zero behavioral change. They are only
assigned in the child's crossProcessServe wiring.
testConfig gains a sourcePatcher hook so race tests can plant a deterministic
sentinel byte into the random source data BEFORE the content file is written,
without depending on the happenstance value of any randomly-generated byte.
Also serialise the gated cross-process tests (TestRemoveThenWriteGated,
TestWriteThenRemoveGated) by removing t.Parallel(). While the handler is in
the gated `paused` state, any user thread that triggers a queued pagefault
on the registered region is suspended in the kernel's pagefault path. From
the Go runtime's perspective that goroutine is "running" (not in syscall,
since it's a plain memory store) and cannot be preempted until the fault is
served. If a CONCURRENT cross-process test in the same binary triggers a
stop-the-world GC pause during this window, STW will wait forever for the
suspended goroutine to reach a safe point — the kernel cannot deliver the
SIGURG preempt signal until the pagefault is served, and the handler is
paused. Running the gated tests sequentially avoids that interleaving while
leaving every other test (including the race suite) on t.Parallel().
…ed-short-circuit race tests
Three new race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None of them use sleeps, retries, or soak loops — each
test installs explicit barriers on the child's worker goroutine, drives the
racing kernel operation from the parent, and asserts on a concrete post-state.
- TestStaleSourceRaceMissingAndRemove: regression test for the production
fix. Plants a non-zero sentinel into the source page, parks the worker
via barrierBeforeRLock, fires madvise, waits for the REMOVE batch to
commit, releases the worker, then asserts the page is zero-filled.
Pre-fix the worker UFFDIO_COPYs the planted sentinel because it captured
`source = u.src` in the parent loop before the REMOVE landed; post-fix
it re-reads state under RLock and zero-faults. Wallclock < 50ms / variant.
- TestNoMadviseDeadlockWithInflightCopy: liveness regression test for the
user-visible symptom that originally surfaced the race. Parks the worker
via barrierBeforeFaultPage (i.e. holding RLock), fires madvise, asserts
madvise returns within 2s. Catches any future change that accidentally
couples readEvents() to settleRequests as a fast assertion failure
rather than a 30m CI timeout. Wallclock < 50ms / variant.
- TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-pagefault
batch ordering using the gated harness. Pins the invariant that REMOVE
batches drain before pagefault dispatch in a single Serve iteration.
Wallclock ~120ms / variant.
These tests are intentionally failing on this PR — they are the bug
demonstration. The production fix lives in a stacked follow-up PR; with
that fix applied all three pass -count=20 -timeout=30s deterministically.
40be150 to
763945a
Compare
|
Reopening to force GitHub Actions to re-run the full test workflow. The previous force-push (rebase onto #2475's branch) did not retrigger the |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 763945a. Configure here.
| err = h.executeRemove(operation{offset: pageOffset, mode: operationModeRemove}) | ||
| madviseCancel() | ||
| _ = madviseCtx | ||
| require.NoError(t, err, "MADV_DONTNEED on page %d did not return — handler dispatch wedged", pageIdx) |
There was a problem hiding this comment.
Madvise budget context not enforced in race test
Low Severity
In TestStaleSourceRaceMissingAndRemove, the madviseBudget isn't enforced for the MADV_DONTNEED call. The madviseCtx is created but not used to bound executeRemove or the synchronous unix.Madvise syscall. This allows a hanging MADV_DONTNEED to exceed its intended 2s deadline, instead hitting the outer 5s raceHappyPathBudget and producing a generic "handler is wedged" error.
Reviewed by Cursor Bugbot for commit 763945a. Configure here.
| if _, err := unix.FcntlInt(uintptr(dup), unix.F_SETFD, unix.FD_CLOEXEC); err != nil { | ||
| childForkMu.Unlock() | ||
| require.NoError(t, err) | ||
| } |
There was a problem hiding this comment.
Fork mutex leaked on syscall.Dup failure
Low Severity
After acquiring childForkMu, syscall.Dup is checked with require.NoError, which calls t.FailNow (via runtime.Goexit) when the dup fails — without first releasing the mutex. The subsequent FcntlInt failure path correctly unlocks before failing, but the dup path does not. If syscall.Dup ever fails (e.g., RLIMIT_NOFILE exhaustion under high parallelism), childForkMu stays locked for the lifetime of the test binary and every other concurrent cross-process test that reaches this section will deadlock until the suite timeout.
Reviewed by Cursor Bugbot for commit 763945a. Configure here.
| signal.Notify(exitSignal, syscall.SIGUSR1) | ||
| defer signal.Stop(exitSignal) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
ServeResume can spawn duplicate serve goroutines
Low Severity
ServeResume unconditionally calls s.startServe(), which always allocates a fresh fdexit and launches a new uffd.Serve goroutine, then overwrites s.stop with a closure pointing at the new exit. If ServeResume is ever invoked when serve is already running (no preceding ServePause, or two consecutive resumes), the old serve goroutine and its fdexit are orphaned — s.stop no longer references them and they'll never be signaled — while two goroutines simultaneously poll the same uffd fd. Reading from a single uffd fd from two pollers produces undefined event delivery and, on shutdown, will hang wg.Wait forever waiting for the orphaned worker.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 763945a. Configure here.
|
Closing in favor of the cleaner four-PR split:
The lift commit is now decomposed across #2519 (harness) and #2520 (REMOVE handling + matrix). The race tests live in #2521; the fix lives in #2512. Bug demonstration is built into the stack — #2521's CI fails with |


Summary
Adds a
net/rpc/jsonrpc-over-unix-socket RPC harness for the userfaultfd cross-process tests, plus three deterministic race tests that demonstrate stale-source / madvise-deadlock / faulted-short-circuit bugs in the REMOVE-event handling path.These tests are intentionally failing on this PR — that is the bug demonstration. The production fix lands in stacked PR #2512.
Stack / dependency
feat/uffd-test-scaffolding(PR test(uffd): add cross-process scaffolding for gated and async ops #2475 — generic gated/async cross-process scaffolding).feat/free-page-reporting).Commits
test(uffd): lift REMOVE-event handling and pageTracker.removed closure from feat/free-page-reporting— verbatim state lift ofpackages/orchestrator/pkg/sandbox/uffd/userfaultfd/from PR uffd: add support for UFFD_EVENT_REMOVE events #1896 tip (commitf31027327). 8 files, +764/-213. Includes:pageTrackerremovedpageStateServe()(undersettleRequests.Lock())wakeupPipeself-pipe +defaultCopyMode+MapMemoryReadyAddrlifecycledeferred.go(deferred pagefault list)prefault.goupdates for prefetch / WP coordinationremove_test.go(REMOVE-event integration tests)cross_process_helpers_test.go/helpers_test.go/fd_helpers_test.gotest(uffd): add net/rpc/jsonrpc test harness over unix socket— replaces the parent↔child harness's pile of single-purpose pipes (offsets, ready, gate-cmd, gate-sync) plusSIGUSR1/SIGUSR2signaling with one bidirectional Unix socket carrying stdlibnet/rpc+net/rpc/jsonrpc. Per-call request IDs are handled by the standard library, so the parent can issue concurrent RPCs while a handler is parked.Replaced surface:
SIGUSR1graceful exitService.ShutdownSIGUSR2+ offsets pipe +binary.WriteService.PageStatesService.WaitReadyService.ServePause/Service.ServeResumeNew on the same channel:
Service.InstallFaultBarrier(addr, point) → token— arms a deterministic barrier in the child's worker goroutine at one of two test-only hook points (beforeWorkerRLockHook,beforeFaultPageHook).Service.WaitFaultHeld(token)— blocks until the worker reaches the barrier; the RPC reply IS the rendezvous.Service.ReleaseFault(token)— lets the parked worker proceed.The two hook fields on
Userfaultfddefault toniland are nil-checked on the hot path, so production sees zero behavioral change. They are only assigned in the child'scrossProcessServewiring.testConfiggains asourcePatcherhook so race tests can plant a deterministic sentinel byte into the random source data BEFORE the content file is written, instead of depending on the happenstance value of any randomly-generated byte.test(uffd): add deterministic stale-source / madvise-deadlock / faulted-short-circuit race tests— three race tests built on the harness above. None use sleeps, retries, or soak loops; each installs explicit barriers on the child's worker goroutine, drives the racing kernel operation from the parent, and asserts on a concrete post-state.TestStaleSourceRaceMissingAndRemove— regression for the production fix. Plants sentinel byte0xc3into the source page, parks the worker viabarrierBeforeRLock, firesMADV_DONTNEED, waits for the REMOVE batch to commit, releases the worker, then asserts the page is zero-filled. Pre-fix, the workerUFFDIO_COPYs the planted sentinel because it capturedsource = u.srcin the parent loop before the REMOVE landed; post-fix it re-reads state under RLock and zero-faults. Wallclock < 50ms / variant.Pre-fix failure on this branch (
4kandhugepagevariants), captured from prior Linux sudo runs:TestNoMadviseDeadlockWithInflightCopy— liveness regression for the user-visible symptom that originally surfaced the race. Parks the worker viabarrierBeforeFaultPage(i.e. holding RLock), firesMADV_DONTNEED, assertsmadvisereturns within 2s. Catches any future change that accidentally couplesreadEvents()tosettleRequestsas a fast assertion failure rather than a 30m CI timeout. Wallclock < 50ms / variant.Assertion shape on regression:
TestFaultedShortCircuitOrdering— smoke test on the REMOVE-then-pagefault batch ordering using the gated harness. Pins the invariant that REMOVE batches drain before pagefault dispatch in a singleServeiteration. Wallclock ~120ms / variant.The
4kandhugepagevariants ofTestStaleSourceRaceMissingAndRemoveare the bug-demonstration failures; the other two currently pass and are forward-looking belt-and-suspenders so any future regression surfaces as an instant assertion rather than a 30-minute CI hang.Verification
userfaultfd(2)+MADV_DONTNEEDsemantics) is Linux-only and CGO cross-compile from Darwin is not set up here, sogo build/go vet/ sudo test runs were not executed locally on this push. Assertion text above is from prior Linux sudo runs.-count=20 -timeout=30sdeterministically.Diff size
9 files, ~+1791 / -503 lines vs
feat/uffd-test-scaffolding(the lift accounts for the bulk; harness rewrite + race tests are the rest).Out of scope
No Firecracker version bump, no free-page-reporting feature flag, no template-manager / API plumbing, no proto regen, no CI timeout tweak. Strictly the UFFD test closure.