uffd: add support for UFFD_EVENT_REMOVE events#1896
Conversation
aef7cd8 to
eab27ea
Compare
a1d0a02 to
03e2966
Compare
b766eb3 to
4d197e4
Compare
e639acf to
88420f2
Compare
d2e5d09 to
8ea97e4
Compare
… feat/free-page-reporting # Conflicts: # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/page_tracker.go
Resolve conflicts in favor of HEAD so the REMOVE-event feature on top of
the now-merged pageTracker refactor keeps its additions:
- page_tracker.go: keep `removed` pageState and `pageTracker.get(addr)`.
- userfaultfd.go: keep `faultPage` returning `(bool, error)`, the
deferred-fault retry path, wakeup pipe helpers, and the expanded
`settleRequests` comment; drop main's duplicate field/lock inside
`faultPage` (the lock is now held around the retry batch in Serve).
- cross_process_helpers_test.go: keep HEAD's exported
`pageStateEntry{State uint8, Offset uint64}` wire format plus the
settle-then-RLock snapshot, and emit every entry (not just faulted
ones) so REMOVE tests can observe `removed`.
- missing_test.go / missing_write_test.go: keep
`states.allAccessed()` + `assert.Equal` style instead of
`assert.ElementsMatch` on raw offsets.
Resolves conflicts after #2472 (test harness split-out) merged into main: - helpers_test.go: keep handlerPageStates with both faulted and removed fields, but adopt main's no-bitset approach via a sorted merge of the two already-sorted per-state slices. Carry forward servePause/serveResume hooks needed by the REMOVE event tests. - cross_process_helpers_test.go: keep main's binary.Read decoding loop and re-add the case removed branch. Restore HEAD's gated pause/resume plumbing on the testHandler. - userfaultfd.go / prefault.go: convert plain-string fmt.Errorf calls to errors.New (perfsprint lint introduced by main).
# Conflicts: # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/fd_helpers_test.go
The cross-process UFFD tests in packages/orchestrator/pkg/sandbox/uffd/userfaultfd each fork a child helper process and exercise userfaultfd page-fault servicing across processes. With ~20 t.Parallel() tests across this package plus their parallel subtests, ~29 cross-process tests can be in flight at once on a busy CI runner, starving each other for CPU and pipe I/O. This shows up as repeated 24m SIGQUIT timeouts where the parent goroutines are stuck at: - cross_process_helpers_test.go:130 (writing content into the child pipe) - cross_process_helpers_test.go:152 (waiting for the child's ready signal) - cross_process_helpers_test.go:268 (waiting on readySignal) Pass -parallel=4 to cap concurrent t.Parallel() tests within each package to 4, which serializes the heavyweight cross-process UFFD tests through 4 slots without slowing down lightweight unit-test packages much. Bump the test timeout from 20m to 30m as a safety margin so slow-but-not-stuck runs can complete.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c8b3d63. Configure here.
- userfaultfd.go: correct the `case faulted` branch comment to refer to the `removed` page state (previously claimed REMOVE marks the page as `missing`) and drop the duplicated "a a" typo. - userfaultfd.go: restore the colon in the "failed uffdio copy: %w" error message that was lost when faultPage was refactored to return (bool, error). - fcversion: simplify HasFreePageReporting to a single boolean return.
…uests.RLock The Serve() loop previously read pageTracker state and captured `source = u.src` in the parent loop, then dispatched a worker goroutine. A REMOVE event for the same page that arrived between the state read and the worker actually acquiring settleRequests.RLock() would silently leave the worker with a stale `source = u.src` snapshot. The worker would then UFFDIO_COPY src bytes into a page the kernel had just MADV_DONTNEED'd, leaving pageTracker == removed and the kernel page mapped with stale src data — and observably deadlocking parent madvise() in the orchestrator unit-test suite. Move the state lookup and source capture inside the goroutine, after RLock(). The read+act+commit sequence is now atomic with respect to the REMOVE batch (which takes settleRequests.Lock()).
…cket JSON-RPC harness
The cross-process userfaultfd test harness used a pile of single-
purpose pipes (offsets, ready, gate-cmd, gate-sync) plus a SIGUSR1
shutdown signal and a SIGUSR2 page-state-snapshot trigger. Every new
piece of test-only inspection or coordination meant adding another
fd, another env var, and another ad-hoc encoder/decoder. That pattern
also made it hard to add deterministic fault-barrier primitives (the
earlier attempt at race-test reproducibility kept reaching for soak
loops because there was no clean place for "park here, fire racing
op, release" handshakes).
Replace all of that with one bidirectional Unix domain socket and a
small length-prefixed JSON-RPC envelope (rpc_test.go). Per-call
request IDs let the parent issue concurrent RPCs while a handler is
parked (which is what the new fault-barrier methods need). The only
fd we still hand off out-of-band is the userfaultfd itself (kernel
object, has to go through ExtraFiles); the initial source data is
written to a temp file under t.TempDir() because base64-stuffing
megabytes through JSON would be silly.
Replaced surface:
- 4 pipes + 2 signals → 1 socket + 1 RPC channel
- SIGUSR1 graceful exit → Shutdown RPC
- SIGUSR2 + offsets pipe + binary.Write protocol → PageStates RPC
- ready pipe + ReadAll-on-close handshake → WaitReady RPC
- gate-cmd / gate-sync byte protocol → ServePause / ServeResume RPCs
Added on top of the same channel:
- InstallFaultBarrier(addr, point) → token: arms a deterministic
barrier in the child's worker goroutine at one of two test-only
hook points (beforeWorkerRLockHook, beforeFaultPageHook).
- WaitFaultHeld(token): blocks until the worker reaches the
barrier — the RPC reply IS the rendezvous.
- ReleaseFault(token): lets the parked worker proceed.
The two hook fields on Userfaultfd default to nil and are nil-checked
on the hot path, so production sees zero behavioral change. They are
only assigned in the child's crossProcessServe wiring.
testConfig gains a sourcePatcher hook so race tests can plant a
deterministic sentinel byte into the random source data BEFORE the
content file is written, without having to depend on the happenstance
value of any randomly-generated byte.
…ces deterministically
Three new race tests built on the unix-socket RPC harness and the
test-only fault-barrier hooks. None of them use sleeps, retries, or
soak loops — each test installs explicit barriers on the child's
worker goroutine, drives the racing kernel operation from the
parent, and asserts on a concrete post-state.
- TestStaleSourceRaceMissingAndRemove: regression test for the
production fix. Plants a non-zero sentinel into the source page,
parks the worker via barrierBeforeRLock, fires madvise, waits
for the REMOVE batch to commit, releases the worker, then
asserts the page is zero-filled. Pre-fix the worker UFFDIO_COPYs
the planted sentinel because it captured `source = u.src` in the
parent loop before the REMOVE landed; post-fix it re-reads state
under RLock and zero-faults. Wallclock < 50ms / variant.
- TestNoMadviseDeadlockWithInflightCopy: liveness regression test
for the user-visible symptom that originally surfaced the race.
Parks the worker via barrierBeforeFaultPage (i.e. holding RLock),
fires madvise, asserts madvise returns within 2s. Catches any
future change that accidentally couples readEvents() to
settleRequests as a fast assertion failure rather than a 30m CI
timeout. Wallclock < 50ms / variant.
- TestFaultedShortCircuitOrdering: smoke test on the
REMOVE-then-pagefault batch ordering using the gated harness.
Pins the invariant that REMOVE batches drain before pagefault
dispatch in a single Serve iteration. Wallclock ~120ms / variant.
All three pass -count=20 -timeout=30s deterministically.
…tionale The previous 30m timeout was load-bearing for the userfaultfd deadlock symptom (the orchestrator suite would silently spin until the runner killed it). With the stale-source race fixed and the deterministic race tests in place, the full orchestrator package now runs in well under a minute even on a contended runner. Drop to 10m, which still leaves ~20× headroom over the steady-state runtime but caps any future regression at a manageable wait. Keep -parallel 4 but rewrite the comment: it is a runner-resource cap (cross-process UFFD tests fork child handler processes; we don't want N-CPU copies racing for the same kernel resources), not a workaround for the deadlock that is now fixed.
…lope Drop the ~330-line hand-rolled length-prefixed JSON RPC layer (rpc_test.go) in favour of stdlib `net/rpc` + `net/rpc/jsonrpc` over the same unix socket. Concurrent in-flight calls and request-id correlation are handled by the standard library; the test harness only needs to register a single Service struct and dial. Also serialise the gated cross-process tests (`TestRemoveThenWriteGated`, `TestWriteThenRemoveGated`, `TestFaultedShortCircuitOrdering`) by removing `t.Parallel()`. While the handler is in the gated `paused` state, any user thread that triggers a queued pagefault on the registered region is suspended in the kernel's pagefault path. From the Go runtime's perspective that goroutine is "running" (not in syscall, since it's a plain memory store) and cannot be preempted until the fault is served. If a CONCURRENT cross-process test in the same binary triggers a stop-the-world GC pause during this window, STW will wait forever for the suspended goroutine to reach a safe point — the kernel cannot deliver the SIGURG preempt signal until the pagefault is served, and the handler is paused. Running the gated tests sequentially avoids that interleaving while leaving every other test (including the rest of the race suite) on `t.Parallel()`.
removed state, and deterministic race tests
#2513
UFFD test harness + REMOVE handling + race tests + fix — restructured stackFor reviewability, the test infra and REMOVE-handling work that was previously a monolithic PR (#2511) plus its stacked fix (#2512) has been split into four small PRs that each land an independently reviewable chunk. The bug demonstration is now built into the stack itself — PR 3's CI fails on the deterministic race tests, and PR 4's CI passes with the exact same tests. Merge order:
On-PR bug demonstration: PR #2521's CI is expected to fail with The original monolithic PR #2511 has been closed in favor of this split. |
|
Closing in favor of the split stack. All work originally in this PR has been re-landed in smaller, independently-reviewable pieces:
The 13 FC-integration files in #2541 are byte-identical to this PR's versions (verified by blob-hash match). Branch |

Enabling Firecracker free-page-reporting feature requires us to handle remove events (UFFD_EVENT_REMOVE) in our userfaultfd handler. These events are triggered whenever Firecracker calls madvise(MADV_DONTNEED) (or similar) on a range of guest memory addresses.
The main thing that changes on our logic is that page faults in a page that has previously been removed need to be served with a zero page rather than a page from the snapshot file.
This commit changes the page fault serving logic to:
This is dependent on the part of #1858 that enables free page reporting on the Firecracker side.
Note
High Risk
High risk because it changes low-level userfaultfd event handling and page population semantics (including new ioctls and concurrency/defer logic), which can affect VM stability and memory correctness. It also introduces a new template/sandbox configuration flag that changes runtime behavior based on Firecracker version and feature flags.
Overview
This PR upgrades the default Firecracker version to
v1.14.1and introduces opt-in free page reporting, wiring a newfreePageReportingsetting from template creation through template build/sandbox startup to Firecracker balloon configuration. To support this, the userfaultfd handler is reworked to read and processUFFD_EVENT_REMOVEevents, track per-page state, zero-fill pages after removal, and defer/retry faults that race with removes, with tests expanded to validate remove, write-protect, and gated ordering scenarios.Written by Cursor Bugbot for commit b35c9f7. This will update automatically on new commits. Configure here.