Skip to content

test(uffd): add deterministic race tests for stale-source / madvise-deadlock / faulted-short-circuit#2511

Closed
ValentaTomas wants to merge 3 commits intofeat/uffd-test-scaffoldingfrom
test/uffd-rpc-harness-and-race-tests
Closed

test(uffd): add deterministic race tests for stale-source / madvise-deadlock / faulted-short-circuit#2511
ValentaTomas wants to merge 3 commits intofeat/uffd-test-scaffoldingfrom
test/uffd-rpc-harness-and-race-tests

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented Apr 28, 2026

Summary

Adds a net/rpc/jsonrpc-over-unix-socket RPC harness for the userfaultfd cross-process tests, plus three deterministic race tests that demonstrate stale-source / madvise-deadlock / faulted-short-circuit bugs in the REMOVE-event handling path.

These tests are intentionally failing on this PR — that is the bug demonstration. The production fix lands in stacked PR #2512.

Stack / dependency

Commits

  1. test(uffd): lift REMOVE-event handling and pageTracker.removed closure from feat/free-page-reporting — verbatim state lift of packages/orchestrator/pkg/sandbox/uffd/userfaultfd/ from PR uffd: add support for UFFD_EVENT_REMOVE events #1896 tip (commit f31027327). 8 files, +764/-213. Includes:

    • pageTracker removed pageState
    • REMOVE-event drain in Serve() (under settleRequests.Lock())
    • Worker-fault dispatch with switch on tracker state (in its current, buggy, parent-loop snapshot form — that is what fix(uffd): read page state inside worker under settleRequests.RLock #2512 fixes)
    • wakeupPipe self-pipe + defaultCopyMode + MapMemoryReadyAddr lifecycle
    • deferred.go (deferred pagefault list)
    • prefault.go updates for prefetch / WP coordination
    • remove_test.go (REMOVE-event integration tests)
    • REMOVE-aware variants of cross_process_helpers_test.go / helpers_test.go / fd_helpers_test.go
  2. test(uffd): add net/rpc/jsonrpc test harness over unix socket — replaces the parent↔child harness's pile of single-purpose pipes (offsets, ready, gate-cmd, gate-sync) plus SIGUSR1/SIGUSR2 signaling with one bidirectional Unix socket carrying stdlib net/rpc + net/rpc/jsonrpc. Per-call request IDs are handled by the standard library, so the parent can issue concurrent RPCs while a handler is parked.

    Replaced surface:

    Old New
    SIGUSR1 graceful exit Service.Shutdown
    SIGUSR2 + offsets pipe + binary.Write Service.PageStates
    ready pipe + ReadAll-on-close handshake Service.WaitReady
    gate-cmd / gate-sync byte protocol Service.ServePause / Service.ServeResume

    New on the same channel:

    • Service.InstallFaultBarrier(addr, point) → token — arms a deterministic barrier in the child's worker goroutine at one of two test-only hook points (beforeWorkerRLockHook, beforeFaultPageHook).
    • Service.WaitFaultHeld(token) — blocks until the worker reaches the barrier; the RPC reply IS the rendezvous.
    • Service.ReleaseFault(token) — lets the parked worker proceed.

    The two hook fields on Userfaultfd default to nil and are nil-checked on the hot path, so production sees zero behavioral change. They are only assigned in the child's crossProcessServe wiring.

    testConfig gains a sourcePatcher hook so race tests can plant a deterministic sentinel byte into the random source data BEFORE the content file is written, instead of depending on the happenstance value of any randomly-generated byte.

  3. test(uffd): add deterministic stale-source / madvise-deadlock / faulted-short-circuit race tests — three race tests built on the harness above. None use sleeps, retries, or soak loops; each installs explicit barriers on the child's worker goroutine, drives the racing kernel operation from the parent, and asserts on a concrete post-state.

    • TestStaleSourceRaceMissingAndRemove — regression for the production fix. Plants sentinel byte 0xc3 into the source page, parks the worker via barrierBeforeRLock, fires MADV_DONTNEED, waits for the REMOVE batch to commit, releases the worker, then asserts the page is zero-filled. Pre-fix, the worker UFFDIO_COPYs the planted sentinel because it captured source = u.src in the parent loop before the REMOVE landed; post-fix it re-reads state under RLock and zero-faults. Wallclock < 50ms / variant.

      Pre-fix failure on this branch (4k and hugepage variants), captured from prior Linux sudo runs:

      race_test.go:184:
          Error: Not equal:
              expected: 0x0
              actual  : 0xc3
          Test:  TestStaleSourceRaceMissingAndRemove/4k
          Messages: page 1 first byte: want 0 (post-fix zero-fault for `removed` state),
                    got 0xc3 — if this equals the sentinel 0xc3, the worker used a stale
                    `source = u.src` snapshot (regression)
      
    • TestNoMadviseDeadlockWithInflightCopy — liveness regression for the user-visible symptom that originally surfaced the race. Parks the worker via barrierBeforeFaultPage (i.e. holding RLock), fires MADV_DONTNEED, asserts madvise returns within 2s. Catches any future change that accidentally couples readEvents() to settleRequests as a fast assertion failure rather than a 30m CI timeout. Wallclock < 50ms / variant.

      Assertion shape on regression:

      DEADLOCK: madvise(MADV_DONTNEED) on page %d did not return within 2s
      while a worker is parked holding settleRequests.RLock — readEvents
      must drain the REMOVE event WITHOUT acquiring settleRequests
      
    • TestFaultedShortCircuitOrdering — smoke test on the REMOVE-then-pagefault batch ordering using the gated harness. Pins the invariant that REMOVE batches drain before pagefault dispatch in a single Serve iteration. Wallclock ~120ms / variant.

    The 4k and hugepage variants of TestStaleSourceRaceMissingAndRemove are the bug-demonstration failures; the other two currently pass and are forward-looking belt-and-suspenders so any future regression surfaces as an instant assertion rather than a 30-minute CI hang.

Verification

  • This branch is on a Mac workstation; UFFD (userfaultfd(2) + MADV_DONTNEED semantics) is Linux-only and CGO cross-compile from Darwin is not set up here, so go build / go vet / sudo test runs were not executed locally on this push. Assertion text above is from prior Linux sudo runs.
  • CI on Linux is the source of truth for the pre-fix failure signatures.
  • With the fix in fix(uffd): read page state inside worker under settleRequests.RLock #2512 applied on top, all three tests pass -count=20 -timeout=30s deterministically.

Diff size

9 files, ~+1791 / -503 lines vs feat/uffd-test-scaffolding (the lift accounts for the bulk; harness rewrite + race tests are the rest).

Out of scope

No Firecracker version bump, no free-page-reporting feature flag, no template-manager / API plumbing, no proto regen, no CI timeout tweak. Strictly the UFFD test closure.

@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 28, 2026

PR Summary

High Risk
Touches production userfaultfd fault/REMOVE handling, concurrency (locks, goroutines), and kernel IO paths (poll, ioctl, madvise), so regressions could manifest as hangs or incorrect page contents. The new test harness uses child processes and fd passing, which can introduce flaky/parallelism-sensitive failures if coordination or cleanup is off.

Overview
Reworks the cross-process UFFD test harness to use a single unix-socket net/rpc/jsonrpc control plane (readiness, page-state snapshots, pause/resume, shutdown) and adds deterministic fault-barrier coordination to reliably reproduce races.

In the UFFD handler, introduces explicit removed page tracking and REMOVE-event draining, adds EAGAIN fault deferral with a self-pipe wakeup, and adjusts prefault/fault handling to short-circuit already-faulted/removed pages while supporting zero-fill after REMOVE; new tests exercise MADV_DONTNEED/REMOVE semantics and three targeted race regressions (stale source selection, madvise liveness with in-flight copy, and batch ordering invariants).

Reviewed by Cursor Bugbot for commit 763945a. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 40be15028f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +231 to 232
dup, err := syscall.Dup(int(uffdFd))
require.NoError(t, err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Release childForkMu when Dup fails

childForkMu is acquired before syscall.Dup, but the Dup error path uses require.NoError without an unlock. If Dup fails (for example under transient fd pressure), FailNow exits this test while leaving the global mutex locked, and subsequent cross-process tests block forever on childForkMu.Lock, turning one failure into a suite-wide deadlock.

Useful? React with 👍 / 👎.

Comment on lines +499 to +503
exit, err := fdexit.New()
if err != nil {
fmt.Fprintln(os.Stderr, "fdexit.New:", err)

return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate startServe setup failures

When fdexit.New() fails, startServe only logs and returns, but callers still treat the helper as healthy (e.g., WaitReady can succeed even though no serve loop started). In that state, later page-fault operations can block until outer test timeouts, which obscures the root cause and makes failures non-deterministic under low-fd conditions.

Useful? React with 👍 / 👎.

…e from feat/free-page-reporting

Lift the production-side UFFD subsystem closure that the upcoming RPC
test harness and race tests depend on, taken verbatim from the tip of
feat/free-page-reporting (commit f310273). The closure brings in:

  - pageTracker gains the `removed` pageState
  - Userfaultfd.Serve() splits readEvents into removes + pagefaults,
    drains the REMOVE batch under settleRequests.Lock(), then dispatches
    the pagefault batch
  - Worker fault dispatch with switch on pageTracker state (note: the
    state-snapshot read happens in the parent loop here — that is the
    bug the stacked PR #2512 fixes)
  - wakeupPipe self-pipe to wake the poll loop when a goroutine defers a
    page fault
  - defaultCopyMode override, MapMemoryReadyAddr lifecycle bits
  - deferred.go (deferred pagefault list)
  - prefault.go updates for prefetch / WP coordination
  - remove_test.go (REMOVE-event integration tests)
  - cross_process_helpers_test.go / helpers_test.go / fd_helpers_test.go
    REMOVE-aware variants

This commit is purely a state lift of
packages/orchestrator/pkg/sandbox/uffd/userfaultfd/ — no other paths
touched, no Firecracker bumps, no feature flags, no proto regen, no
template-manager / API plumbing.

Required so that test/uffd-rpc-harness-and-race-tests can target main
(via feat/uffd-test-scaffolding) instead of being stacked on top of the
full free-page-reporting feature PR (#1896).
Replace the parent↔child cross-process userfaultfd test harness's pile of
single-purpose pipes (offsets, ready, gate-cmd, gate-sync) plus a SIGUSR1
shutdown signal and a SIGUSR2 page-state-snapshot trigger with one
bidirectional Unix domain socket carrying stdlib net/rpc + net/rpc/jsonrpc.
Per-call request IDs are handled by the standard library, so the parent can
issue concurrent RPCs while a handler is parked (which the new fault-barrier
methods need). The only fd we still hand off out-of-band is the userfaultfd
itself (kernel object, has to go through ExtraFiles); the initial source
data is written to a temp file under t.TempDir() because base64-stuffing
megabytes through JSON would be silly.

Replaced surface:
  - 4 pipes + 2 signals  →  1 socket + 1 RPC channel
  - SIGUSR1 graceful exit                      →  Service.Shutdown
  - SIGUSR2 + offsets pipe + binary.Write      →  Service.PageStates
  - ready pipe + ReadAll-on-close handshake    →  Service.WaitReady
  - gate-cmd / gate-sync byte protocol         →  Service.ServePause / Service.ServeResume

Added on top of the same channel:
  - Service.InstallFaultBarrier(addr, point) → token: arms a deterministic
    barrier in the child's worker goroutine at one of two test-only hook
    points (beforeWorkerRLockHook, beforeFaultPageHook).
  - Service.WaitFaultHeld(token): blocks until the worker reaches the
    barrier — the RPC reply IS the rendezvous.
  - Service.ReleaseFault(token): lets the parked worker proceed.

The two hook fields on Userfaultfd default to nil and are nil-checked on
the hot path, so production sees zero behavioral change. They are only
assigned in the child's crossProcessServe wiring.

testConfig gains a sourcePatcher hook so race tests can plant a deterministic
sentinel byte into the random source data BEFORE the content file is written,
without depending on the happenstance value of any randomly-generated byte.

Also serialise the gated cross-process tests (TestRemoveThenWriteGated,
TestWriteThenRemoveGated) by removing t.Parallel(). While the handler is in
the gated `paused` state, any user thread that triggers a queued pagefault
on the registered region is suspended in the kernel's pagefault path. From
the Go runtime's perspective that goroutine is "running" (not in syscall,
since it's a plain memory store) and cannot be preempted until the fault is
served. If a CONCURRENT cross-process test in the same binary triggers a
stop-the-world GC pause during this window, STW will wait forever for the
suspended goroutine to reach a safe point — the kernel cannot deliver the
SIGURG preempt signal until the pagefault is served, and the handler is
paused. Running the gated tests sequentially avoids that interleaving while
leaving every other test (including the race suite) on t.Parallel().
…ed-short-circuit race tests

Three new race tests built on the unix-socket RPC harness and the test-only
fault-barrier hooks. None of them use sleeps, retries, or soak loops — each
test installs explicit barriers on the child's worker goroutine, drives the
racing kernel operation from the parent, and asserts on a concrete post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the production
    fix. Plants a non-zero sentinel into the source page, parks the worker
    via barrierBeforeRLock, fires madvise, waits for the REMOVE batch to
    commit, releases the worker, then asserts the page is zero-filled.
    Pre-fix the worker UFFDIO_COPYs the planted sentinel because it captured
    `source = u.src` in the parent loop before the REMOVE landed; post-fix
    it re-reads state under RLock and zero-faults. Wallclock < 50ms / variant.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test for the
    user-visible symptom that originally surfaced the race. Parks the worker
    via barrierBeforeFaultPage (i.e. holding RLock), fires madvise, asserts
    madvise returns within 2s. Catches any future change that accidentally
    couples readEvents() to settleRequests as a fast assertion failure
    rather than a 30m CI timeout. Wallclock < 50ms / variant.

  - TestFaultedShortCircuitOrdering: smoke test on the REMOVE-then-pagefault
    batch ordering using the gated harness. Pins the invariant that REMOVE
    batches drain before pagefault dispatch in a single Serve iteration.
    Wallclock ~120ms / variant.

These tests are intentionally failing on this PR — they are the bug
demonstration. The production fix lives in a stacked follow-up PR; with
that fix applied all three pass -count=20 -timeout=30s deterministically.
@ValentaTomas
Copy link
Copy Markdown
Member Author

Reopening to force GitHub Actions to re-run the full test workflow. The previous force-push (rebase onto #2475's branch) did not retrigger the pull_request workflow — only the trivial pull_request_target reviewer-assigner ran, which is why all checks falsely showed green. See thread for details.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 763945a. Configure here.

err = h.executeRemove(operation{offset: pageOffset, mode: operationModeRemove})
madviseCancel()
_ = madviseCtx
require.NoError(t, err, "MADV_DONTNEED on page %d did not return — handler dispatch wedged", pageIdx)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Madvise budget context not enforced in race test

Low Severity

In TestStaleSourceRaceMissingAndRemove, the madviseBudget isn't enforced for the MADV_DONTNEED call. The madviseCtx is created but not used to bound executeRemove or the synchronous unix.Madvise syscall. This allows a hanging MADV_DONTNEED to exceed its intended 2s deadline, instead hitting the outer 5s raceHappyPathBudget and producing a generic "handler is wedged" error.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 763945a. Configure here.

if _, err := unix.FcntlInt(uintptr(dup), unix.F_SETFD, unix.FD_CLOEXEC); err != nil {
childForkMu.Unlock()
require.NoError(t, err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fork mutex leaked on syscall.Dup failure

Low Severity

After acquiring childForkMu, syscall.Dup is checked with require.NoError, which calls t.FailNow (via runtime.Goexit) when the dup fails — without first releasing the mutex. The subsequent FcntlInt failure path correctly unlocks before failing, but the dup path does not. If syscall.Dup ever fails (e.g., RLIMIT_NOFILE exhaustion under high parallelism), childForkMu stays locked for the lifetime of the test binary and every other concurrent cross-process test that reaches this section will deadlock until the suite timeout.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 763945a. Configure here.

signal.Notify(exitSignal, syscall.SIGUSR1)
defer signal.Stop(exitSignal)
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServeResume can spawn duplicate serve goroutines

Low Severity

ServeResume unconditionally calls s.startServe(), which always allocates a fresh fdexit and launches a new uffd.Serve goroutine, then overwrites s.stop with a closure pointing at the new exit. If ServeResume is ever invoked when serve is already running (no preceding ServePause, or two consecutive resumes), the old serve goroutine and its fdexit are orphaned — s.stop no longer references them and they'll never be signaled — while two goroutines simultaneously poll the same uffd fd. Reading from a single uffd fd from two pollers produces undefined event delivery and, on shutdown, will hang wg.Wait forever waiting for the orphaned worker.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 763945a. Configure here.

@ValentaTomas
Copy link
Copy Markdown
Member Author

Closing in favor of the cleaner four-PR split:

The lift commit is now decomposed across #2519 (harness) and #2520 (REMOVE handling + matrix). The race tests live in #2521; the fix lives in #2512. Bug demonstration is built into the stack — #2521's CI fails with page 1 first byte: want 0 ... got 0xc3 and #2512's CI passes with the same tests.

@ValentaTomas ValentaTomas deleted the test/uffd-rpc-harness-and-race-tests branch April 29, 2026 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants