Skip to content

feat(fc): drain virtio-balloon free-page-hinting before pause#2552

Open
ValentaTomas wants to merge 15 commits intofeat/uffd-fc-free-page-reporting-integrationfrom
feat/sandbox-pause-fph
Open

feat(fc): drain virtio-balloon free-page-hinting before pause#2552
ValentaTomas wants to merge 15 commits intofeat/uffd-fc-free-page-reporting-integrationfrom
feat/sandbox-pause-fph

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented May 4, 2026

Drains virtio-balloon free-page-hinting before pause so snapshots don't capture pages the guest already considers free. The balloon (from parent FPR PR) always arms FreePageHinting=true; on pause we call start_balloon_hinting and poll describe_balloon_hinting until guest_cmd >= host_cmd (with host > 0 guard). Reclaimed pages emit UFFD_EVENT_REMOVE, already tracked by parent.

Gated by free-page-hinting-timeout-ms LD flag (ms; default 0 = disabled). Operator opts in once the kernel has the FPH race fix. Stacked on parent FPR branch for the shared balloon-install path; split out from #2550.

ValentaTomas and others added 11 commits May 3, 2026 01:26
A small two-state-plus-default tracker backed by roaring bitmaps. Used by
upcoming UFFD work to track page states (Missing/Faulted/Removed) and by
NBD to track zero pages, replacing ad-hoc map-based trackers with O(1)
range ops and cheap snapshot exports.
…state

Replace the map-based pageTracker with block.StateTracker[pageState], a
roaring-bitmap-backed tracker with O(1) range ops. pageState gains a
third value, removed, which is wired at the type level but not yet
written anywhere -- #2520 adds the REMOVE-event handler that produces
it. Page indices are computed at the call site via header.BlockIdx.
pageStateEntries is updated to iterate the exported bitmaps so the
cross-process test harness keeps working.

Inline the 3-line pageState enum into userfaultfd.go and drop the
dedicated page_tracker.go now that pageTracker is gone.

Convert block.StateTracker's NewStateTracker / SetRange API from panics
to errors. Distinct-state validation and unsupported-state checks now
return fmt.Errorf descriptors; the userfaultfd-side init propagates the
constructor error through NewUserfaultfdFromFd, and the SetRange call
in the worker path logs and continues since these errors only fire on
programming bugs.
Production:
  - UFFDIO_REGISTER_MODE_REMOVE is requested so the kernel reports
    MADV_DONTNEED'd pages via UFFD_EVENT_REMOVE.
  - Userfaultfd.Serve splits read events into removes + pagefaults,
    drains the REMOVE batch under settleRequests.Lock (calling
    pageTracker.SetRange(.., removed) with BlockIdx-computed indices),
    then dispatches the pagefault batch.
  - Worker dispatch switches on pageTracker.Get(idx): faulted ->
    short-circuit, removed -> zero-fill (source = nil), missing ->
    copy from u.src. The state read happens inside the worker under
    settleRequests.RLock so a concurrent REMOVE can't slip between
    the read and the install.
  - faultPage gains zero-fill paths for source == nil (4K read =
    DONTWAKE zero + WP + wake; 4K write = zero + wake; hugepage =
    copy(EmptyHugePage)) and returns (handled, err) so the worker can
    defer UFFDIO_COPY EAGAIN back into a deferredFaults queue.
  - wakeupPipe + deferredFaults wake the poll loop when a worker
    defers, so a deferred fault doesn't sit waiting for an unrelated
    UFFD event. The received uffd fd is marked FD_CLOEXEC.
  - Prefault short-circuits for faulted || removed.

Tests:
  - testConfig gains removeEnabled; the parent unregisters the UFFD
    region on cleanup when REMOVE is on so munmap doesn't block on
    un-acked events.
  - Page-state wire format exposes removed via helpers_test.go.
  - operationModeRemove + executeRemove (madvise MADV_DONTNEED).
  - runMatrix wraps every existing generic test in remove-off and
    remove-on subtests so the no-REMOVE path (still used by
    production templates) stays covered while the new path is
    exercised. The matrix-level t.Parallel() is intentionally
    omitted to cap peak concurrency in CI.
  - remove_test.go: TestRemove, TestRemoveThenFault,
    TestRemoveThenWriteGated, TestWriteThenRemoveGated. Gated tests
    are //nolint:tparallel — a paused gated handler keeps a faulting
    goroutine suspended in the kernel pagefault path; a STW GC pause
    from a parallel test would wait forever for that goroutine to
    reach a safe point.
  - race_test.go: deterministic stale-source / madvise-deadlock /
    faulted-short-circuit regressions, serialised, with the
    FD_CLOEXEC and UFFDIO_COPY-EAGAIN fixes covered.
A worker holding settleRequests.RLock must never block readEvents,
because madvise(MADV_DONTNEED) blocks the producer until userspace
reads the UFFD_EVENT_REMOVE — and the producer can be the FC balloon
thread that other syscalls depend on. Use a dedicated readSerial
mutex (not settleRequests) to serialize serve-loop iterations with
snapshot-time Export, while keeping the existing settleRequests
discipline (workers RLock, REMOVE batch Lock) intact so readEvents
remains lock-free relative to workers.

Restores liveness for TestNoMadviseDeadlockWithInflightCopy while
closing the read-vs-apply race that motivated the prior buggy commit
(345f7e9, now amended).
…loon

Adds the FC-side integration plumbing for free page reporting on top of
the UFFD REMOVE-event handling in #2520:

- template-manager proto: optional bool freePageReporting (field 17).
- TemplateConfig + sandbox.Config gain a FreePageReporting bool that
  flows from template create → build phases (base/steps/finalize) →
  sandbox factory → fc.Process.Create.
- fc.apiClient.enableFreePageReporting calls PUT /balloon with
  free_page_reporting=true after entropy setup and before VM start.
- fcversion.HasFreePageReporting gates rollout to FC v1.14+.
- Adds free-page-reporting LaunchDarkly feature flag.
- create-build CLI: --free-page-reporting flag, defaults to enabled
  when FC version supports it.
- smoketest: opportunistically enables FPR when the FC version under
  test supports it.

UFFD-side changes (REMOVE handling, page tracker, race tests, fix)
remain in #2520; this PR is purely the production rollout path.
…rchestrator

HasFreePageReporting() was added to fcversion.Info but had zero callers
in the production path. Mirror the HasHugePages() pattern: let the
orchestrator derive the value from the FC version (authoritative),
gated by the FreePageReportingFlag LaunchDarkly flag (default false).

Also emit an env.free_page_reporting span attribute alongside the
existing env.huge_pages one.
Read the Removed bitmap from PageTracker and emit it as DiffMetadata.Empty
so REMOVE'd pages become uuid.Nil mappings in the snapshot header (read as
zero on resume). Defensively AndNot the empty set out of dirty: settle drains
make these disjoint in practice (Removed pages have no PTE, WP-async only
sees present pages with WP cleared), but if the invariant ever breaks the
guest's last intent for a Removed page is "free, read zero on restore" — so
empty must win, not stale dirty content.
@cursor
Copy link
Copy Markdown

cursor Bot commented May 4, 2026

PR Summary

Medium Risk
Touches snapshot/pause flow and adds new Firecracker API calls with polling and timeouts, so failures could impact snapshot latency or correctness when enabled. Gating defaults off, but misconfiguration or version-detection edge cases could still cause unexpected behavior.

Overview
The new free-page-hinting-timeout-ms flag is wired into the pause path to optionally run a virtio-balloon free-page-hinting “drain” before pausing, but MinFreePageHintingKernelVersion is set to 999.0.0 so hinting will effectively never be enabled via kernel gating as written.

DrainBalloon busy-polls describe_balloon_hinting using time.After in a loop (allocates each iteration) and treats a Firecracker 400 as “not configured”, which could mask other bad-request causes if the API returns 400 for reasons beyond “no balloon configured”.

Reviewed by Cursor Bugbot for commit 18c592f. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread packages/orchestrator/pkg/sandbox/fc/process.go Outdated
Comment thread packages/orchestrator/pkg/sandbox/fc/process.go
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch 3 times, most recently from 263a0d0 to f4e3ab0 Compare May 4, 2026 00:51
Arm free-page-hinting on the existing balloon device (always set when
the balloon is installed; pure runtime toggle), and on pause do a
host-initiated hint+wait so MADV_DONTNEED-reclaimed pages are settled
before the snapshot. Pages reclaimed this way generate UFFD_EVENT_REMOVE,
which the orchestrator already tracks (parent FPR PR), so the snapshot
captures them as removed instead of zero-filled.

- fc/client.go: rename enableFreePageReporting -> installBalloon;
  always set FreePageHinting=true; add startBalloonHinting +
  describeBalloonHinting helpers.
- fc/process.go: track balloonInstalled; add DrainBalloon (start +
  poll guest_cmd >= host_cmd, with host>0 guard against transient
  nil/zero responses).
- sandbox.go: wire featureFlags into Sandbox; call DrainBalloon from
  Pause behind the flag. Failures are logged but non-fatal.

Gated by free-page-hinting-timeout-ms (LD int flag, ms; default 0 =
disabled). resume-build gains --fph-timeout-ms for local exercise.
@ValentaTomas ValentaTomas force-pushed the feat/sandbox-pause-fph branch from f4e3ab0 to 7619cc9 Compare May 4, 2026 00:55
@ValentaTomas ValentaTomas force-pushed the feat/uffd-fc-free-page-reporting-integration branch 2 times, most recently from 920e8ec to 7f22709 Compare May 5, 2026 08:19
…-integration' into feat/sandbox-pause-fph

# Conflicts:
#	packages/orchestrator/pkg/sandbox/fc/client.go
#	packages/orchestrator/pkg/sandbox/fc/process.go
#	packages/orchestrator/pkg/sandbox/uffd/uffd.go
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/helpers_test.go
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/prefault.go
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/race_test.go
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/rpc_services_test.go
#	packages/orchestrator/pkg/sandbox/uffd/userfaultfd/userfaultfd.go
#	packages/orchestrator/pkg/template/build/config/config.go
#	packages/orchestrator/pkg/template/server/create_template.go
#	packages/shared/pkg/fcversion/sandbox_features.go
#	packages/shared/pkg/featureflags/flags.go
@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 6, 2026

@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

❌ 8 Tests Failed:

Tests completed Failed Passed Skipped
2586 8 2578 7
View the full list of 9 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.96% (Passed 43 times, Failed 116 times)

Stack Traces | 191s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (191.40s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.20% (Passed 41 times, Failed 112 times)

Stack Traces | 1.28s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox iwn5u65fpt568dfbwnicv
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1354}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox iwn5u65fpt568dfbwnicv
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1355}}
Executing command curl in sandbox ilt5llmwac0vta2rey29t
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1356}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 07 May 2026 06:41:24 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox iwn5u65fpt568dfbwnicv
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (1.28s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 54.20% (Passed 60 times, Failed 71 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.43% (Passed 37 times, Failed 52 times)

Stack Traces | 7.26s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
Executing command python in sandbox igm47hrvs213kuj94oxq9
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1252}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.26s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 52.50% (Passed 38 times, Failed 42 times)

Stack Traces | 7.29s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox i07piyzh0zmwq0j0xezb9
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1252}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.29s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 59.78% (Passed 37 times, Failed 55 times)

Stack Traces | 8.63s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1252}}
Executing command python in sandbox il6phw2a28x8vc9w7yvf1
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.63s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.22% (Passed 37 times, Failed 56 times)

Stack Traces | 7.99s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1257}}
Executing command python in sandbox ie11ocp845jr0ob24tupo
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.99s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 60.50% (Passed 47 times, Failed 72 times)

Stack Traces | 84.7s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (84.69s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 64.08% (Passed 37 times, Failed 66 times)

Stack Traces | 27.1s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1262}}
Executing command bash in sandbox iy9751t7prez9eaxh7euh (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 183 MB\nFree memory before tmpfs mount: 801 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
Executing command bash in sandbox iy9751t7prez9eaxh7euh (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.2802 s, 205 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.26\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.28\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kb"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ytes): 0\n\tAverage total size (kb"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ytes): 0\n\tMaxi"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"mum resident "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"set size (kbyte"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"s): 2656\n\tAverag"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 342\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 15\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages rece"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ived: 0\n\tSignals delivered:"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" 0\n\tPage size ("}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"bytes): 4096\n\t"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"Exit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 828 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox iixt0hud5fwnai2yjpb73
Executing command bash in sandbox iixt0hud5fwnai2yjpb73 (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1278}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"7aee752768028dec118f68aa2dca44ad40c008b04c4666ee51624f3e7f525c4d\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox iixt0hud5fwnai2yjpb73
Executing command bash in sandbox iixt0hud5fwnai2yjpb73 (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1281}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iixt0hud5fwnai2yjpb73: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (27.11s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: FPR conflicts with hugepages
    • Added !hugePages condition to FPR auto-enable logic, matching the server build path's conflict prevention.

Create PR

Or push these changes by commenting:

@cursor push 7c518d0d3e
Preview (7c518d0d3e)
diff --git a/packages/orchestrator/cmd/create-build/main.go b/packages/orchestrator/cmd/create-build/main.go
--- a/packages/orchestrator/cmd/create-build/main.go
+++ b/packages/orchestrator/cmd/create-build/main.go
@@ -358,7 +358,8 @@
 		})
 	}
 
-	// Default FPR on for FC v1.14+; explicit --free-page-reporting overrides.
+	// Default FPR on for FC v1.14+ unless hugepages is enabled.
+	// Firecracker rejects balloon (free-page-reporting) together with hugepages.
 	var fprEnabled bool
 	if freePageReporting != nil {
 		fprEnabled = *freePageReporting
@@ -366,7 +367,7 @@
 		versionOnly, _, _ := strings.Cut(fcVersion, "_")
 		supported, err := utils.IsGTEVersion(versionOnly, "v1.14.0")
 		if err == nil {
-			fprEnabled = supported
+			fprEnabled = !hugePages && supported
 		}
 	}

You can send follow-ups to the cloud agent here.

Reviewed by Cursor Bugbot for commit a1b3a8f. Configure here.

Comment thread packages/orchestrator/cmd/create-build/main.go Outdated
- Drop FPR-related changes superseded by parent PR (create-build,
  smoketest, template-manager.proto, generated pb.go).
- Delete unused block.StateTracker (parent PR added block.Tracker).
- Trim verbose comments in fph_gates, fc/client, fc/process,
  sandbox.go, featureflags, sandbox_features.
@ValentaTomas
Copy link
Copy Markdown
Member Author

Waiting for the merge of #2541, but otherwise should be ready.

@ValentaTomas ValentaTomas marked this pull request as ready for review May 7, 2026 06:28
@ValentaTomas
Copy link
Copy Markdown
Member Author

Before enabling in prod we need to deploy the kernel fix though.

@qodo-code-review
Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider

Issue findings (1)

Findings have been published as inline review comments

Grey Divider

 ⓘ Findings that repeat in the code will be published as multiple comments

Grey Divider

Qodo Logo

Comment on lines +10 to +18
// MinFreePageHintingKernelVersion is the minimum guest kernel version that
// contains the FPH/MADV_DONTNEED race fix. Bump once the fixed kernel ships.
const MinFreePageHintingKernelVersion = "999.0.0"

func kernelSupportsFreePageHinting(kernelVersion string) bool {
v := strings.TrimPrefix(kernelVersion, "vmlinux-")
ok, _ := utils.IsGTEVersion(v, MinFreePageHintingKernelVersion)

return ok
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Fph kernel gate disables 🐞 Bug ≡ Correctness

MinFreePageHintingKernelVersion is set to 999.0.0, so kernelSupportsFreePageHinting() will never
enable FreePageHinting for normal guest kernels and installBalloon() will always configure the
balloon with hinting disabled. With hinting disabled, DrainBalloon() will consistently no-op as “not
configured”, so enabling free-page-hinting-timeout-ms won’t actually drain anything before pause.
Agent Prompt
### Issue description
Free-page-hinting is effectively impossible to enable because `MinFreePageHintingKernelVersion` is hardcoded to `999.0.0`, making `kernelSupportsFreePageHinting()` always return false for real kernel versions; this causes the balloon to be configured without hinting and makes `DrainBalloon()` a no-op.

### Issue Context
The pre-pause drain is guarded by a timeout feature flag, but the balloon hinting capability is separately gated by the kernel version check; with the current constant, the drain cannot ever perform useful work.

### Fix Focus Areas
- packages/orchestrator/pkg/sandbox/fc/fph_gates.go[10-18]
- packages/orchestrator/pkg/sandbox/fc/process.go[446-454]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +745 to +770
}

backoff := 5 * time.Millisecond
for {
select {
case <-ctx.Done():
outcome = "timeout"

return ctx.Err()
case <-time.After(backoff):
}

host, guest, err := p.client.describeBalloonHinting(ctx)
if err != nil {
outcome = "describe-failed"

return fmt.Errorf("balloon hinting status: %w", err)
}
// host_cmd is monotonic; require host > 0 to avoid a false-positive
// completion before FC has accepted the start.
if host > 0 && guest >= host {
return nil
}
backoff = min(backoff*2, 50*time.Millisecond)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 DrainBalloon's completion check host > 0 && guest >= host only protects against the API-ack/VMM-bump race for the first drain in a fresh FC process (where host_cmd starts at 0). For sandboxes resumed from a snapshot taken after a previous successful drain, if Firecracker's virtio-balloon serializer persists the hinting counters (host_cmd=guest_cmd=N from the prior drain), the first describeBalloonHinting after startBalloonHinting returns 200 can read those stale values during the (author-acknowledged) window before FC's VMM thread bumps host_cmd — trivially satisfying N > 0 && N >= N and returning a false-positive completion before the drain has actually run. Currently latent because MinFreePageHintingKernelVersion is the 999.0.0 sentinel; becomes active the moment that constant is bumped to enable the feature, at which point the drain silently no-ops on the resume-then-pause flow this PR primarily targets. Fix: capture hostBefore := host (via describeBalloonHinting) before calling startBalloonHinting and require host > hostBefore && guest >= host in the loop.

Extended reasoning...

What the bug is

Process.DrainBalloon (process.go:720-770) starts free-page-hinting via startBalloonHinting and then polls describeBalloonHinting until completion. The check at line 765 is:

// host_cmd is monotonic; require host > 0 to avoid a false-positive
// completion before FC has accepted the start.
if host > 0 && guest >= host {
    return nil
}

The accompanying comment makes the author's intent explicit: they were aware that startBalloonHinting returning HTTP 200 (API thread accepted the request) does not mean FC's VMM thread has yet bumped host_cmd — there is a window where the first describeBalloonHinting reads pre-start state. The host > 0 guard correctly handles that race only when host_cmd starts at 0 (fresh FC process, never drained).

The case the guard misses

The PR description says the primary use case is the resume-then-pause flow (sandbox.go:1064 calls DrainBalloon from Sandbox.Pause, which runs on sandboxes from both CreateSandbox and ResumeSandbox). Walk through the sequence on a sandbox resumed from a snapshot that was previously paused with a successful drain:

  1. Pause Deploy infra in a new project #1: fresh FC, host_cmd = 0 → N, guest_cmd = 0 → N. Snapshot taken with host_cmd = guest_cmd = N persisted in the virtio-balloon device state.
  2. Resume in a new FC process: loadSnapshot restores virtio-balloon device state. host_cmd and guest_cmd come back as N.
  3. Pause Bump golang.org/x/net from 0.2.0 to 0.7.0 in /packages/api #2: Sandbox.Pause calls DrainBalloon.
    • p.client.startBalloonHinting(ctx, true) returns 200 (API thread accepted).
    • 5ms initial backoff fires.
    • First describeBalloonHinting runs. The VMM thread may not have processed the start action yet (this is exactly the race the author's comment calls out). Counters still read host = N, guest = N.
    • Completion check: N > 0 ✓ and N >= N ✓ → returns nil.
    • The drain effectively never ran. The subsequent snapshot captures pages the guest already considers free — defeating the entire point of the feature.

Step-by-step proof

step host_cmd guest_cmd check host > 0 && guest >= host result
Fresh FC, pre-start 0 0 0 > 0 = false loop continues ✓
Fresh FC, post-VMM-bump 1 0 1 > 0 ∧ 0 >= 1 = false loop continues ✓
Fresh FC, drain done 1 1 true ∧ true returns ✓
Resumed FC, pre-VMM-bump (counters restored to N=1) 1 1 true ∧ true returns immediately, drain skipped ✗
Resumed FC, post-VMM-bump 2 1 true ∧ 1 >= 2 = false (would loop, but already exited)

Addressing the refutation

The strongest counter-argument is that the bug rests on an unverified assumption: that FC's virtio-balloon serializer persists the host_cmd/guest_cmd counters across snapshot/restore. That is genuinely a Firecracker implementation detail and not directly observable from this codebase. However:

  • The author's comment establishes the race window between API ack and VMM bump exists within a single FC process. The host > 0 guard exists specifically to address that race for the host_cmd=0 case. It is a hole-by-construction for any non-zero starting value.
  • If snapshot serialization does not preserve these counters, the proposed fix (hostBefore := host; require host > hostBefore) is a strict no-op vs. today's host > 0 (since hostBefore == 0). There is no regression risk.
  • If snapshot serialization does preserve them (consistent with virtio-balloon needing to remain functional post-resume), today's check fails silently in production once the kernel sentinel is bumped, defeating the feature for its primary use case.

The cost of being defensive here is one extra describeBalloonHinting call before start. The cost of being wrong about FC's serialization in the optimistic direction is a silent feature regression that's hard to detect (a successful no-op produces no error and leaves the snapshot still containing freeable pages).

Suggested fix

hostBefore, _, err := p.client.describeBalloonHinting(ctx)
if err != nil { /* handle */ }

if err := p.client.startBalloonHinting(ctx, true); err != nil { /* ... */ }

backoff := 5 * time.Millisecond
for {
    select { /* ... */ }
    host, guest, err := p.client.describeBalloonHinting(ctx)
    if err != nil { /* ... */ }
    if host > hostBefore && guest >= host {
        return nil
    }
    backoff = min(backoff*2, 50*time.Millisecond)
}

The pre-start describe also handles the "no balloon configured" case the same way as today (FC will return 400 from describe — though the startBalloonHinting 400 path already covers it for the active code path).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants