Skip to content

uffd: add support for UFFD_EVENT_REMOVE events#1896

Closed
bchalios wants to merge 86 commits intomainfrom
feat/free-page-reporting
Closed

uffd: add support for UFFD_EVENT_REMOVE events#1896
bchalios wants to merge 86 commits intomainfrom
feat/free-page-reporting

Conversation

@bchalios
Copy link
Copy Markdown
Contributor

@bchalios bchalios commented Feb 11, 2026

Enabling Firecracker free-page-reporting feature requires us to handle remove events (UFFD_EVENT_REMOVE) in our userfaultfd handler. These events are triggered whenever Firecracker calls madvise(MADV_DONTNEED) (or similar) on a range of guest memory addresses.

The main thing that changes on our logic is that page faults in a page that has previously been removed need to be served with a zero page rather than a page from the snapshot file.

This commit changes the page fault serving logic to:

  1. Introduce tracking of the state of every page in the guest's memory mappings.
  2. Add logic to handle the new UFFD_EVENT_REMOVE event
  3. Modify existing logic to take into account current state when deciding how to handle each page fault

This is dependent on the part of #1858 that enables free page reporting on the Firecracker side.


Note

High Risk
High risk because it changes low-level userfaultfd event handling and page population semantics (including new ioctls and concurrency/defer logic), which can affect VM stability and memory correctness. It also introduces a new template/sandbox configuration flag that changes runtime behavior based on Firecracker version and feature flags.

Overview
This PR upgrades the default Firecracker version to v1.14.1 and introduces opt-in free page reporting, wiring a new freePageReporting setting from template creation through template build/sandbox startup to Firecracker balloon configuration. To support this, the userfaultfd handler is reworked to read and process UFFD_EVENT_REMOVE events, track per-page state, zero-fill pages after removal, and defer/retry faults that race with removes, with tests expanded to validate remove, write-protect, and gated ordering scenarios.

Written by Cursor Bugbot for commit b35c9f7. This will update automatically on new commits. Configure here.

Comment thread packages/orchestrator/internal/sandbox/uffd/memory/mapping.go Outdated
Comment thread packages/orchestrator/internal/sandbox/uffd/userfaultfd/userfaultfd.go Outdated
Comment thread packages/orchestrator/internal/sandbox/uffd/userfaultfd/userfaultfd.go Outdated
Comment thread packages/orchestrator/internal/sandbox/uffd/memory/region.go Outdated
@bchalios bchalios force-pushed the feat/free-page-reporting branch 2 times, most recently from a1d0a02 to 03e2966 Compare February 11, 2026 21:20
@bchalios bchalios force-pushed the feat/free-page-reporting branch 5 times, most recently from b766eb3 to 4d197e4 Compare February 12, 2026 23:48
Comment thread packages/orchestrator/internal/sandbox/uffd/userfaultfd/userfaultfd.go Outdated
Comment thread packages/orchestrator/internal/sandbox/uffd/userfaultfd/userfaultfd.go Outdated
Comment thread packages/orchestrator/internal/sandbox/fc/process.go Outdated
Comment thread packages/shared/pkg/fc/firecracker.yml
Comment thread packages/orchestrator/internal/sandbox/uffd/memory/region.go Outdated
@bchalios bchalios force-pushed the feat/free-page-reporting branch 2 times, most recently from e639acf to 88420f2 Compare February 13, 2026 00:45
Comment thread packages/orchestrator/internal/sandbox/uffd/memory/region.go Outdated
@bchalios bchalios force-pushed the feat/free-page-reporting branch 8 times, most recently from d2e5d09 to 8ea97e4 Compare February 17, 2026 15:49
ValentaTomas and others added 6 commits April 20, 2026 23:02
… feat/free-page-reporting

# Conflicts:
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/page_tracker.go
Resolve conflicts in favor of HEAD so the REMOVE-event feature on top of
the now-merged pageTracker refactor keeps its additions:

- page_tracker.go: keep `removed` pageState and `pageTracker.get(addr)`.
- userfaultfd.go: keep `faultPage` returning `(bool, error)`, the
  deferred-fault retry path, wakeup pipe helpers, and the expanded
  `settleRequests` comment; drop main's duplicate field/lock inside
  `faultPage` (the lock is now held around the retry batch in Serve).
- cross_process_helpers_test.go: keep HEAD's exported
  `pageStateEntry{State uint8, Offset uint64}` wire format plus the
  settle-then-RLock snapshot, and emit every entry (not just faulted
  ones) so REMOVE tests can observe `removed`.
- missing_test.go / missing_write_test.go: keep
  `states.allAccessed()` + `assert.Equal` style instead of
  `assert.ElementsMatch` on raw offsets.
Resolves conflicts after #2472 (test harness split-out) merged into main:
- helpers_test.go: keep handlerPageStates with both faulted and removed
  fields, but adopt main's no-bitset approach via a sorted merge of the
  two already-sorted per-state slices. Carry forward servePause/serveResume
  hooks needed by the REMOVE event tests.
- cross_process_helpers_test.go: keep main's binary.Read decoding loop and
  re-add the case removed branch. Restore HEAD's gated pause/resume
  plumbing on the testHandler.
- userfaultfd.go / prefault.go: convert plain-string fmt.Errorf calls to
  errors.New (perfsprint lint introduced by main).
# Conflicts:
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/cross_process_helpers_test.go
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/fd_helpers_test.go
Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
The cross-process UFFD tests in packages/orchestrator/pkg/sandbox/uffd/userfaultfd
each fork a child helper process and exercise userfaultfd page-fault
servicing across processes. With ~20 t.Parallel() tests across this package
plus their parallel subtests, ~29 cross-process tests can be in flight at
once on a busy CI runner, starving each other for CPU and pipe I/O. This
shows up as repeated 24m SIGQUIT timeouts where the parent goroutines are
stuck at:
  - cross_process_helpers_test.go:130 (writing content into the child pipe)
  - cross_process_helpers_test.go:152 (waiting for the child's ready signal)
  - cross_process_helpers_test.go:268 (waiting on readySignal)

Pass -parallel=4 to cap concurrent t.Parallel() tests within each package
to 4, which serializes the heavyweight cross-process UFFD tests through
4 slots without slowing down lightweight unit-test packages much. Bump
the test timeout from 20m to 30m as a safety margin so slow-but-not-stuck
runs can complete.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit c8b3d63. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
- userfaultfd.go: correct the `case faulted` branch comment to refer to
  the `removed` page state (previously claimed REMOVE marks the page as
  `missing`) and drop the duplicated "a a" typo.
- userfaultfd.go: restore the colon in the "failed uffdio copy: %w"
  error message that was lost when faultPage was refactored to return
  (bool, error).
- fcversion: simplify HasFreePageReporting to a single boolean return.
…uests.RLock

The Serve() loop previously read pageTracker state and captured
`source = u.src` in the parent loop, then dispatched a worker
goroutine. A REMOVE event for the same page that arrived between the
state read and the worker actually acquiring settleRequests.RLock()
would silently leave the worker with a stale `source = u.src`
snapshot. The worker would then UFFDIO_COPY src bytes into a page the
kernel had just MADV_DONTNEED'd, leaving pageTracker == removed and
the kernel page mapped with stale src data — and observably
deadlocking parent madvise() in the orchestrator unit-test suite.

Move the state lookup and source capture inside the goroutine, after
RLock(). The read+act+commit sequence is now atomic with respect to
the REMOVE batch (which takes settleRequests.Lock()).
…cket JSON-RPC harness

The cross-process userfaultfd test harness used a pile of single-
purpose pipes (offsets, ready, gate-cmd, gate-sync) plus a SIGUSR1
shutdown signal and a SIGUSR2 page-state-snapshot trigger. Every new
piece of test-only inspection or coordination meant adding another
fd, another env var, and another ad-hoc encoder/decoder. That pattern
also made it hard to add deterministic fault-barrier primitives (the
earlier attempt at race-test reproducibility kept reaching for soak
loops because there was no clean place for "park here, fire racing
op, release" handshakes).

Replace all of that with one bidirectional Unix domain socket and a
small length-prefixed JSON-RPC envelope (rpc_test.go). Per-call
request IDs let the parent issue concurrent RPCs while a handler is
parked (which is what the new fault-barrier methods need). The only
fd we still hand off out-of-band is the userfaultfd itself (kernel
object, has to go through ExtraFiles); the initial source data is
written to a temp file under t.TempDir() because base64-stuffing
megabytes through JSON would be silly.

Replaced surface:
  - 4 pipes + 2 signals  →  1 socket + 1 RPC channel
  - SIGUSR1 graceful exit  →  Shutdown RPC
  - SIGUSR2 + offsets pipe + binary.Write protocol  →  PageStates RPC
  - ready pipe + ReadAll-on-close handshake  →  WaitReady RPC
  - gate-cmd / gate-sync byte protocol  →  ServePause / ServeResume RPCs

Added on top of the same channel:
  - InstallFaultBarrier(addr, point) → token: arms a deterministic
    barrier in the child's worker goroutine at one of two test-only
    hook points (beforeWorkerRLockHook, beforeFaultPageHook).
  - WaitFaultHeld(token): blocks until the worker reaches the
    barrier — the RPC reply IS the rendezvous.
  - ReleaseFault(token): lets the parked worker proceed.

The two hook fields on Userfaultfd default to nil and are nil-checked
on the hot path, so production sees zero behavioral change. They are
only assigned in the child's crossProcessServe wiring.

testConfig gains a sourcePatcher hook so race tests can plant a
deterministic sentinel byte into the random source data BEFORE the
content file is written, without having to depend on the happenstance
value of any randomly-generated byte.
…ces deterministically

Three new race tests built on the unix-socket RPC harness and the
test-only fault-barrier hooks. None of them use sleeps, retries, or
soak loops — each test installs explicit barriers on the child's
worker goroutine, drives the racing kernel operation from the
parent, and asserts on a concrete post-state.

  - TestStaleSourceRaceMissingAndRemove: regression test for the
    production fix. Plants a non-zero sentinel into the source page,
    parks the worker via barrierBeforeRLock, fires madvise, waits
    for the REMOVE batch to commit, releases the worker, then
    asserts the page is zero-filled. Pre-fix the worker UFFDIO_COPYs
    the planted sentinel because it captured `source = u.src` in the
    parent loop before the REMOVE landed; post-fix it re-reads state
    under RLock and zero-faults. Wallclock < 50ms / variant.

  - TestNoMadviseDeadlockWithInflightCopy: liveness regression test
    for the user-visible symptom that originally surfaced the race.
    Parks the worker via barrierBeforeFaultPage (i.e. holding RLock),
    fires madvise, asserts madvise returns within 2s. Catches any
    future change that accidentally couples readEvents() to
    settleRequests as a fast assertion failure rather than a 30m CI
    timeout. Wallclock < 50ms / variant.

  - TestFaultedShortCircuitOrdering: smoke test on the
    REMOVE-then-pagefault batch ordering using the gated harness.
    Pins the invariant that REMOVE batches drain before pagefault
    dispatch in a single Serve iteration. Wallclock ~120ms / variant.

All three pass -count=20 -timeout=30s deterministically.
…tionale

The previous 30m timeout was load-bearing for the userfaultfd
deadlock symptom (the orchestrator suite would silently spin until
the runner killed it). With the stale-source race fixed and the
deterministic race tests in place, the full orchestrator package
now runs in well under a minute even on a contended runner. Drop
to 10m, which still leaves ~20× headroom over the steady-state
runtime but caps any future regression at a manageable wait.

Keep -parallel 4 but rewrite the comment: it is a runner-resource
cap (cross-process UFFD tests fork child handler processes; we
don't want N-CPU copies racing for the same kernel resources), not
a workaround for the deadlock that is now fixed.
…lope

Drop the ~330-line hand-rolled length-prefixed JSON RPC layer
(rpc_test.go) in favour of stdlib `net/rpc` + `net/rpc/jsonrpc` over
the same unix socket. Concurrent in-flight calls and request-id
correlation are handled by the standard library; the test harness
only needs to register a single Service struct and dial.

Also serialise the gated cross-process tests
(`TestRemoveThenWriteGated`, `TestWriteThenRemoveGated`,
`TestFaultedShortCircuitOrdering`) by removing `t.Parallel()`. While
the handler is in the gated `paused` state, any user thread that
triggers a queued pagefault on the registered region is suspended
in the kernel's pagefault path. From the Go runtime's perspective
that goroutine is "running" (not in syscall, since it's a plain
memory store) and cannot be preempted until the fault is served.
If a CONCURRENT cross-process test in the same binary triggers a
stop-the-world GC pause during this window, STW will wait forever
for the suspended goroutine to reach a safe point — the kernel
cannot deliver the SIGURG preempt signal until the pagefault is
served, and the handler is paused. Running the gated tests
sequentially avoids that interleaving while leaving every other
test (including the rest of the race suite) on `t.Parallel()`.
@ValentaTomas
Copy link
Copy Markdown
Member

UFFD test harness + REMOVE handling + race tests + fix — restructured stack

For reviewability, the test infra and REMOVE-handling work that was previously a monolithic PR (#2511) plus its stacked fix (#2512) has been split into four small PRs that each land an independently reviewable chunk. The bug demonstration is now built into the stack itself — PR 3's CI fails on the deterministic race tests, and PR 4's CI passes with the exact same tests.

Merge order:

  1. test(uffd): add cross-process scaffolding for gated and async ops #2475 — feat(uffd): cross-process test scaffolding (already in review; unchanged)
  2. refactor(uffd-tests): replace SIGUSR/env-var/pipe IPC with net/rpc/jsonrpc over unix socket #2519 — refactor(uffd-tests): replace SIGUSR/env-var/pipe IPC with net/rpc/jsonrpc over unix socket
  3. feat(uffd): handle UFFD_EVENT_REMOVE; track per-page state; race-safe COPY #2520 — feat(uffd): UFFD_EVENT_REMOVE handling, removed pageState, and matrix-mode tests
  4. test(uffd): add deterministic UFFD stale-source race tests #2521 — test(uffd): deterministic stale-source / madvise-deadlock / faulted-short-circuit race tests
  5. fix(uffd): read page state inside worker under settleRequests.RLock #2512 — fix(uffd): re-read page state inside worker under settleRequests.RLock

On-PR bug demonstration: PR #2521's CI is expected to fail with page 1 first byte: want 0 ... got 0xc3 on TestStaleSourceRaceMissingAndRemove/{4k,hugepage}. PR #2512's CI is expected to pass the same tests. The gap between the two CIs IS the proof that the fix closes the race.

The original monolithic PR #2511 has been closed in favor of this split.

@ValentaTomas
Copy link
Copy Markdown
Member

Closing in favor of the split stack. All work originally in this PR has been re-landed in smaller, independently-reviewable pieces:

The 13 FC-integration files in #2541 are byte-identical to this PR's versions (verified by blob-hash match). Branch feat/free-page-reporting left intact for reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants