feat(fc): drain virtio-balloon free-page-hinting before pause#2552
feat(fc): drain virtio-balloon free-page-hinting before pause#2552ValentaTomas wants to merge 15 commits intofeat/uffd-fc-free-page-reporting-integrationfrom
Conversation
A small two-state-plus-default tracker backed by roaring bitmaps. Used by upcoming UFFD work to track page states (Missing/Faulted/Removed) and by NBD to track zero pages, replacing ad-hoc map-based trackers with O(1) range ops and cheap snapshot exports.
…state Replace the map-based pageTracker with block.StateTracker[pageState], a roaring-bitmap-backed tracker with O(1) range ops. pageState gains a third value, removed, which is wired at the type level but not yet written anywhere -- #2520 adds the REMOVE-event handler that produces it. Page indices are computed at the call site via header.BlockIdx. pageStateEntries is updated to iterate the exported bitmaps so the cross-process test harness keeps working. Inline the 3-line pageState enum into userfaultfd.go and drop the dedicated page_tracker.go now that pageTracker is gone. Convert block.StateTracker's NewStateTracker / SetRange API from panics to errors. Distinct-state validation and unsupported-state checks now return fmt.Errorf descriptors; the userfaultfd-side init propagates the constructor error through NewUserfaultfdFromFd, and the SetRange call in the worker path logs and continues since these errors only fire on programming bugs.
Production:
- UFFDIO_REGISTER_MODE_REMOVE is requested so the kernel reports
MADV_DONTNEED'd pages via UFFD_EVENT_REMOVE.
- Userfaultfd.Serve splits read events into removes + pagefaults,
drains the REMOVE batch under settleRequests.Lock (calling
pageTracker.SetRange(.., removed) with BlockIdx-computed indices),
then dispatches the pagefault batch.
- Worker dispatch switches on pageTracker.Get(idx): faulted ->
short-circuit, removed -> zero-fill (source = nil), missing ->
copy from u.src. The state read happens inside the worker under
settleRequests.RLock so a concurrent REMOVE can't slip between
the read and the install.
- faultPage gains zero-fill paths for source == nil (4K read =
DONTWAKE zero + WP + wake; 4K write = zero + wake; hugepage =
copy(EmptyHugePage)) and returns (handled, err) so the worker can
defer UFFDIO_COPY EAGAIN back into a deferredFaults queue.
- wakeupPipe + deferredFaults wake the poll loop when a worker
defers, so a deferred fault doesn't sit waiting for an unrelated
UFFD event. The received uffd fd is marked FD_CLOEXEC.
- Prefault short-circuits for faulted || removed.
Tests:
- testConfig gains removeEnabled; the parent unregisters the UFFD
region on cleanup when REMOVE is on so munmap doesn't block on
un-acked events.
- Page-state wire format exposes removed via helpers_test.go.
- operationModeRemove + executeRemove (madvise MADV_DONTNEED).
- runMatrix wraps every existing generic test in remove-off and
remove-on subtests so the no-REMOVE path (still used by
production templates) stays covered while the new path is
exercised. The matrix-level t.Parallel() is intentionally
omitted to cap peak concurrency in CI.
- remove_test.go: TestRemove, TestRemoveThenFault,
TestRemoveThenWriteGated, TestWriteThenRemoveGated. Gated tests
are //nolint:tparallel — a paused gated handler keeps a faulting
goroutine suspended in the kernel pagefault path; a STW GC pause
from a parallel test would wait forever for that goroutine to
reach a safe point.
- race_test.go: deterministic stale-source / madvise-deadlock /
faulted-short-circuit regressions, serialised, with the
FD_CLOEXEC and UFFDIO_COPY-EAGAIN fixes covered.
A worker holding settleRequests.RLock must never block readEvents, because madvise(MADV_DONTNEED) blocks the producer until userspace reads the UFFD_EVENT_REMOVE — and the producer can be the FC balloon thread that other syscalls depend on. Use a dedicated readSerial mutex (not settleRequests) to serialize serve-loop iterations with snapshot-time Export, while keeping the existing settleRequests discipline (workers RLock, REMOVE batch Lock) intact so readEvents remains lock-free relative to workers. Restores liveness for TestNoMadviseDeadlockWithInflightCopy while closing the read-vs-apply race that motivated the prior buggy commit (345f7e9, now amended).
…loon Adds the FC-side integration plumbing for free page reporting on top of the UFFD REMOVE-event handling in #2520: - template-manager proto: optional bool freePageReporting (field 17). - TemplateConfig + sandbox.Config gain a FreePageReporting bool that flows from template create → build phases (base/steps/finalize) → sandbox factory → fc.Process.Create. - fc.apiClient.enableFreePageReporting calls PUT /balloon with free_page_reporting=true after entropy setup and before VM start. - fcversion.HasFreePageReporting gates rollout to FC v1.14+. - Adds free-page-reporting LaunchDarkly feature flag. - create-build CLI: --free-page-reporting flag, defaults to enabled when FC version supports it. - smoketest: opportunistically enables FPR when the FC version under test supports it. UFFD-side changes (REMOVE handling, page tracker, race tests, fix) remain in #2520; this PR is purely the production rollout path.
…rchestrator HasFreePageReporting() was added to fcversion.Info but had zero callers in the production path. Mirror the HasHugePages() pattern: let the orchestrator derive the value from the FC version (authoritative), gated by the FreePageReportingFlag LaunchDarkly flag (default false). Also emit an env.free_page_reporting span attribute alongside the existing env.huge_pages one.
Read the Removed bitmap from PageTracker and emit it as DiffMetadata.Empty so REMOVE'd pages become uuid.Nil mappings in the snapshot header (read as zero on resume). Defensively AndNot the empty set out of dirty: settle drains make these disjoint in practice (Removed pages have no PTE, WP-async only sees present pages with WP cleared), but if the invariant ever breaks the guest's last intent for a Removed page is "free, read zero on restore" — so empty must win, not stale dirty content.
PR SummaryMedium Risk Overview
Reviewed by Cursor Bugbot for commit 18c592f. Bugbot is set up for automated code reviews on this repo. Configure here. |
e8bd708 to
bf00edc
Compare
263a0d0 to
f4e3ab0
Compare
Arm free-page-hinting on the existing balloon device (always set when the balloon is installed; pure runtime toggle), and on pause do a host-initiated hint+wait so MADV_DONTNEED-reclaimed pages are settled before the snapshot. Pages reclaimed this way generate UFFD_EVENT_REMOVE, which the orchestrator already tracks (parent FPR PR), so the snapshot captures them as removed instead of zero-filled. - fc/client.go: rename enableFreePageReporting -> installBalloon; always set FreePageHinting=true; add startBalloonHinting + describeBalloonHinting helpers. - fc/process.go: track balloonInstalled; add DrainBalloon (start + poll guest_cmd >= host_cmd, with host>0 guard against transient nil/zero responses). - sandbox.go: wire featureFlags into Sandbox; call DrainBalloon from Pause behind the flag. Failures are logged but non-fatal. Gated by free-page-hinting-timeout-ms (LD int flag, ms; default 0 = disabled). resume-build gains --fph-timeout-ms for local exercise.
f4e3ab0 to
7619cc9
Compare
920e8ec to
7f22709
Compare
…-integration' into feat/sandbox-pause-fph # Conflicts: # packages/orchestrator/pkg/sandbox/fc/client.go # packages/orchestrator/pkg/sandbox/fc/process.go # packages/orchestrator/pkg/sandbox/uffd/uffd.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/helpers_test.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/prefault.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/rpc_services_test.go # packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go # packages/orchestrator/pkg/template/build/config/config.go # packages/orchestrator/pkg/template/server/create_template.go # packages/shared/pkg/fcversion/sandbox_features.go # packages/shared/pkg/featureflags/flags.go
❌ 8 Tests Failed:
View the full list of 9 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: FPR conflicts with hugepages
- Added
!hugePagescondition to FPR auto-enable logic, matching the server build path's conflict prevention.
- Added
Or push these changes by commenting:
@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
})
}
- // Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+ // Default FPR on for FC v1.14+ unless hugepages is enabled.
+ // Firecracker rejects balloon (free-page-reporting) together with hugepages.
var fprEnabled bool
if freePageReporting != nil {
fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
versionOnly, _, _ := strings.Cut(fcVersion, "_")
supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
if err == nil {
- fprEnabled = supported
+ fprEnabled = !hugePages && supported
}
}You can send follow-ups to the cloud agent here.
Reviewed by Cursor Bugbot for commit a1b3a8f. Configure here.
- Drop FPR-related changes superseded by parent PR (create-build, smoketest, template-manager.proto, generated pb.go). - Delete unused block.StateTracker (parent PR added block.Tracker). - Trim verbose comments in fph_gates, fc/client, fc/process, sandbox.go, featureflags, sandbox_features.
…/sandbox-pause-fph
|
Waiting for the merge of #2541, but otherwise should be ready. |
|
Before enabling in prod we need to deploy the kernel fix though. |
| // MinFreePageHintingKernelVersion is the minimum guest kernel version that | ||
| // contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships. | ||
| const MinFreePageHintingKernelVersion = "999.0.0" | ||
|
|
||
| func kernelSupportsFreePageHinting(kernelVersion string) bool { | ||
| v := strings.TrimPrefix(kernelVersion, "vmlinux-") | ||
| ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion) | ||
|
|
||
| return ok |
There was a problem hiding this comment.
1. Fph kernel gate disables 🐞 Bug ≡ Correctness
MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never enable FreePageHinting for normal guest kernels and installBalloon() will always configure the balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.
Agent Prompt
### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.
### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.
### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| } | ||
|
|
||
| backoff := 5 * time.Millisecond | ||
| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| outcome = "timeout" | ||
|
|
||
| return ctx.Err() | ||
| case <-time.After(backoff): | ||
| } | ||
|
|
||
| host, guest, err := p.client.describeBalloonHinting(ctx) | ||
| if err != nil { | ||
| outcome = "describe-failed" | ||
|
|
||
| return fmt.Errorf("balloon hinting status: %w", err) | ||
| } | ||
| // host_cmd is monotonic; require host > 0 to avoid a false-positive | ||
| // completion before FC has accepted the start. | ||
| if host > 0 && guest >= host { | ||
| return nil | ||
| } | ||
| backoff = min(backoff*2, 50*time.Millisecond) | ||
| } | ||
| } |
There was a problem hiding this comment.
🔴 DrainBalloon's completion check host > 0 && guest >= host only protects against the API-ack/VMM-bump race for the first drain in a fresh FC process (where host_cmd starts at 0). For sandboxes resumed from a snapshot taken after a previous successful drain, if Firecracker's virtio-balloon serializer persists the hinting counters (host_cmd=guest_cmd=N from the prior drain), the first describeBalloonHinting after startBalloonHinting returns 200 can read those stale values during the (author-acknowledged) window before FC's VMM thread bumps host_cmd — trivially satisfying N > 0 && N >= N and returning a false-positive completion before the drain has actually run. Currently latent because MinFreePageHintingKernelVersion is the 999.0.0 sentinel; becomes active the moment that constant is bumped to enable the feature, at which point the drain silently no-ops on the resume-then-pause flow this PR primarily targets. Fix: capture hostBefore := host (via describeBalloonHinting) before calling startBalloonHinting and require host > hostBefore && guest >= host in the loop.
Extended reasoning...
What the bug is
Process.DrainBalloon (process.go:720-770) starts free-page-hinting via startBalloonHinting and then polls describeBalloonHinting until completion. The check at line 765 is:
// host_cmd is monotonic; require host > 0 to avoid a false-positive
// completion before FC has accepted the start.
if host > 0 && guest >= host {
return nil
}The accompanying comment makes the author's intent explicit: they were aware that startBalloonHinting returning HTTP 200 (API thread accepted the request) does not mean FC's VMM thread has yet bumped host_cmd — there is a window where the first describeBalloonHinting reads pre-start state. The host > 0 guard correctly handles that race only when host_cmd starts at 0 (fresh FC process, never drained).
The case the guard misses
The PR description says the primary use case is the resume-then-pause flow (sandbox.go:1064 calls DrainBalloon from Sandbox.Pause, which runs on sandboxes from both CreateSandbox and ResumeSandbox). Walk through the sequence on a sandbox resumed from a snapshot that was previously paused with a successful drain:
- Pause Deploy infra in a new project #1: fresh FC,
host_cmd = 0 → N,guest_cmd = 0 → N. Snapshot taken withhost_cmd = guest_cmd = Npersisted in the virtio-balloon device state. - Resume in a new FC process:
loadSnapshotrestores virtio-balloon device state.host_cmdandguest_cmdcome back asN. - Pause Bump golang.org/x/net from 0.2.0 to 0.7.0 in /packages/api #2:
Sandbox.PausecallsDrainBalloon.p.client.startBalloonHinting(ctx, true)returns 200 (API thread accepted).- 5ms initial backoff fires.
- First
describeBalloonHintingruns. The VMM thread may not have processed the start action yet (this is exactly the race the author's comment calls out). Counters still readhost = N,guest = N. - Completion check:
N > 0✓ andN >= N✓ → returnsnil. - The drain effectively never ran. The subsequent snapshot captures pages the guest already considers free — defeating the entire point of the feature.
Step-by-step proof
| step | host_cmd | guest_cmd | check host > 0 && guest >= host |
result |
|---|---|---|---|---|
| Fresh FC, pre-start | 0 | 0 | 0 > 0 = false | loop continues ✓ |
| Fresh FC, post-VMM-bump | 1 | 0 | 1 > 0 ∧ 0 >= 1 = false | loop continues ✓ |
| Fresh FC, drain done | 1 | 1 | true ∧ true | returns ✓ |
| Resumed FC, pre-VMM-bump (counters restored to N=1) | 1 | 1 | true ∧ true | returns immediately, drain skipped ✗ |
| Resumed FC, post-VMM-bump | 2 | 1 | true ∧ 1 >= 2 = false | (would loop, but already exited) |
Addressing the refutation
The strongest counter-argument is that the bug rests on an unverified assumption: that FC's virtio-balloon serializer persists the host_cmd/guest_cmd counters across snapshot/restore. That is genuinely a Firecracker implementation detail and not directly observable from this codebase. However:
- The author's comment establishes the race window between API ack and VMM bump exists within a single FC process. The
host > 0guard exists specifically to address that race for the host_cmd=0 case. It is a hole-by-construction for any non-zero starting value. - If snapshot serialization does not preserve these counters, the proposed fix (
hostBefore := host; require host > hostBefore) is a strict no-op vs. today'shost > 0(sincehostBefore == 0). There is no regression risk. - If snapshot serialization does preserve them (consistent with virtio-balloon needing to remain functional post-resume), today's check fails silently in production once the kernel sentinel is bumped, defeating the feature for its primary use case.
The cost of being defensive here is one extra describeBalloonHinting call before start. The cost of being wrong about FC's serialization in the optimistic direction is a silent feature regression that's hard to detect (a successful no-op produces no error and leaves the snapshot still containing freeable pages).
Suggested fix
hostBefore, _, err := p.client.describeBalloonHinting(ctx)
if err != nil { /* handle */ }
if err := p.client.startBalloonHinting(ctx, true); err != nil { /* ... */ }
backoff := 5 * time.Millisecond
for {
select { /* ... */ }
host, guest, err := p.client.describeBalloonHinting(ctx)
if err != nil { /* ... */ }
if host > hostBefore && guest >= host {
return nil
}
backoff = min(backoff*2, 50*time.Millisecond)
}The pre-start describe also handles the "no balloon configured" case the same way as today (FC will return 400 from describe — though the startBalloonHinting 400 path already covers it for the active code path).


Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free. The balloon (from parent FPR PR) always arms
FreePageHinting=true; on pause we callstart_balloon_hintingand polldescribe_balloon_hintinguntilguest_cmd >= host_cmd(withhost > 0guard). Reclaimed pages emitUFFD_EVENT_REMOVE, already tracked by parent.Gated by
free-page-hinting-timeout-msLD flag (ms; default 0 = disabled). Operator opts in once the kernel has the FPH race fix. Stacked on parent FPR branch for the shared balloon-install path; split out from #2550.