v0.4 foundation: WpBranch primitive + 3 PoCs + wp-bench CLI #157
Merged
Sketches the implementation plan for cutting BRANCH pause from ~150 ms (v0.3.4 floor on ext4) to < 10 ms by removing the synchronous memory write entirely.

Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT over the full guest range before BRANCH, copy dirty pages async via a uffd handler. The pause window then contains only vCPU + device state dump (microseconds) and the WP-arming syscall (sub-millisecond).

Doc covers: motivation, goal/non-goals, alternatives considered (status quo, pre-copy migration, full memcpy, block-device CoW), open questions (THP interaction, KVM-direct memory access paths, format compatibility), implementation phases, risks.

Tracking issue: #101.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
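The copy-before-write ordering described in this commit can be illustrated without any kernel support. The following is a userspace simulation only, under stated assumptions: `SimBranch`, its fields, and its methods are hypothetical names for illustration, a `bool` per page stands in for write protection, and an explicit `write` call stands in for the uffd fault path. It is not the forkd implementation.

```rust
// Userspace simulation: no userfaultfd and no real page protection.
const PAGE: usize = 4096;

/// Hypothetical stand-in for one in-flight branch (not the forkd API).
struct SimBranch {
    live: Vec<u8>,        // "guest RAM" that keeps running
    snapshot: Vec<u8>,    // point-in-time copy being assembled
    protected: Vec<bool>, // true = page is still write-protected (not yet copied)
}

impl SimBranch {
    /// "Arm": mark every page protected. No page data is copied here.
    /// Assumes `live.len()` is a multiple of PAGE.
    fn begin(live: Vec<u8>) -> Self {
        let pages = live.len() / PAGE;
        let snapshot = vec![0u8; live.len()];
        SimBranch { live, snapshot, protected: vec![true; pages] }
    }

    /// Copy one page's current content into the snapshot, then unprotect it.
    fn preserve(&mut self, page: usize) {
        let r = page * PAGE..(page + 1) * PAGE;
        self.snapshot[r.clone()].copy_from_slice(&self.live[r]);
        self.protected[page] = false;
    }

    /// A guest write: the "fault handler" runs first if the page is protected,
    /// so the snapshot always receives the pre-write content.
    fn write(&mut self, addr: usize, value: u8) {
        let page = addr / PAGE;
        if self.protected[page] {
            self.preserve(page);
        }
        self.live[addr] = value;
    }

    /// Background pass over pages that were never written.
    /// Returns how many pages it copied.
    fn bulk_copy_clean(&mut self) -> usize {
        let mut copied = 0;
        for page in 0..self.protected.len() {
            if self.protected[page] {
                self.preserve(page);
                copied += 1;
            }
        }
        copied
    }
}
```

In this model the only work done at arm time is setting the flags, which mirrors why the pause window excludes the memory copy: every page reaches the snapshot either through the write path or through the background pass, and in both cases before its first post-arm modification.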
Standalone binary that exercises the kernel mechanics v0.4 depends on, outside the KVM/Firecracker context. Allocates a 64 MiB memfd, arms UFFDIO_WRITEPROTECT, runs a writer thread + uffd handler thread, then validates that the resulting snapshot file is a consistent point-in-time view (every page starts with its BEFORE label).

Goal: prove the kernel side of the v0.4 design is feasible. If this PoC passes (and prints sub-ms WP arm latency), the design proposed in DESIGN-v0.4.md can move to Phase 2 (integrate into forkd-uffd).

Tracking #101, related to PR #156.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
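The consistency check described above (every page starts with its BEFORE label) can be sketched as a small pure function. This is a minimal sketch, not the PoC's code: the function name is hypothetical, and the exact label bytes and page size the PoC uses are not stated in the commit message, so both are parameters here.

```rust
/// Returns the index of the first page whose leading bytes differ from `label`,
/// or `None` when every page starts with it.
/// `page_size` must be non-zero (`chunks` panics on a zero chunk size).
fn first_inconsistent_page(snapshot: &[u8], page_size: usize, label: &[u8]) -> Option<usize> {
    snapshot
        .chunks(page_size)
        // A trailing partial page shorter than the label also counts as inconsistent.
        .position(|page| !page.starts_with(label))
}
```

A caller would read the snapshot file into a buffer and treat any `Some(index)` as a failed run, since it means a post-arm write leaked into the snapshot.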
Minimal raw-KVM Rust binary that answers open question #1 in DESIGN-v0.4.md: does UFFD_WP armed on a memfd-backed host VMA catch writes when the guest accesses memory through EPT (not the host MMU)?

Setup: 1 MiB memfd, pre-write BEFORE marker at GPA 0x1000, tiny real-mode guest that does `mov [0x1000], al` with AL=0x42, arm UFFDIO_WRITEPROTECT before vcpu.run().

Validate:
- handler caught a write fault at GPA 0x1000
- snapshot byte at 0x1000 is the BEFORE marker (0xBE)
- live memfd byte at 0x1000 is the AFTER marker (0x42)

If all three hold, EPT-mediated guest writes propagate through MMU notifiers to UFFD_WP on the host VMA, and v0.4's snapshot mechanism is sound under KVM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
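For reference, the guest store named in the setup is small enough to write as raw machine code. The sketch below is an illustration under assumptions not stated in the commit message: the guest runs in 16-bit real mode with a DS base of 0 (so offset 0x1000 is GPA 0x1000), AL is preloaded with 0x42 through the vCPU registers, and a trailing `hlt` ends the run. `GUEST_CODE` and `store_offset` are hypothetical names.

```rust
/// 16-bit real-mode guest: store AL to DS:0x1000, then halt.
/// Assumes AL was preloaded (e.g. with 0x42) and DS base is 0.
const GUEST_CODE: [u8; 4] = [
    0xA2, 0x00, 0x10, // mov [0x1000], al  (opcode A2, 16-bit little-endian offset)
    0xF4,             // hlt
];

/// Decodes the store target of a leading `mov [moffs16], al`, if present.
fn store_offset(code: &[u8]) -> Option<u16> {
    match code {
        [0xA2, lo, hi, ..] => Some(u16::from_le_bytes([*lo, *hi])),
        _ => None,
    }
}
```

Decoding the offset back out of the bytes is a cheap sanity check that the address the handler is expected to report matches the address the guest actually writes.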
Empirical answer to DESIGN-v0.4.md open question #1: yes. EPT-mediated guest writes propagate through MMU notifiers to UFFD_WP on the host VMA. Handler captures pre-write content (0xBE), guest write lands on live memory (0x42), and the ordering invariant holds.

Measured (1 MiB region, single guest write):
- WP arm latency: 9.3 µs
- Total vcpu runtime including 1 trap-and-resume: 211 µs
- Fault flags: 0x3 = WRITE|WP (kernel reports both)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
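The reported flag value 0x3 decodes against the page-fault flag bits in the Linux userfaultfd UAPI header (`UFFD_PAGEFAULT_FLAG_WRITE` is bit 0, `UFFD_PAGEFAULT_FLAG_WP` is bit 1, `UFFD_PAGEFAULT_FLAG_MINOR` is bit 2). A minimal sketch of that decoding follows; the helper name `describe_fault_flags` is hypothetical and not taken from the PoC.

```rust
// Bit values as defined in the Linux userfaultfd UAPI header.
const UFFD_PAGEFAULT_FLAG_WRITE: u64 = 1 << 0;
const UFFD_PAGEFAULT_FLAG_WP: u64 = 1 << 1;
const UFFD_PAGEFAULT_FLAG_MINOR: u64 = 1 << 2;

/// Renders the set flag bits as a `|`-separated list (empty string if none are set).
fn describe_fault_flags(flags: u64) -> String {
    let mut names = Vec::new();
    if flags & UFFD_PAGEFAULT_FLAG_WRITE != 0 {
        names.push("WRITE");
    }
    if flags & UFFD_PAGEFAULT_FLAG_WP != 0 {
        names.push("WP");
    }
    if flags & UFFD_PAGEFAULT_FLAG_MINOR != 0 {
        names.push("MINOR");
    }
    names.join("|")
}
```

With the measured value, `describe_fault_flags(0x3)` yields `WRITE|WP`, matching the result recorded above.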
Answers DESIGN-v0.4.md open question #2. Compares WP arm latency and first-fault behavior between MADV_HUGEPAGE and MADV_NOHUGEPAGE on the same 64 MiB region. Reads /proc/self/smaps to verify AnonHugePages allocation before/after WP arm and after first write fault.

Possible outcomes:
- If THP is split at arm time, the arm latency for the hugepage case will be visibly higher than the no-hugepage case, and AnonHugePages will drop after arm.
- If split happens at first-fault time, the arm will be cheap but the first write will be expensive.
- If neither, the kernel handles WP at hugepage granularity (faults report the 2 MiB base, not the 4 KiB sub-page).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
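The smaps check can be sketched as a parser over the text of `/proc/self/smaps`. This is a simplified illustration with a hypothetical function name: it sums the `AnonHugePages:` field across every mapping in the process, whereas a per-region check like the one described above would first have to locate the mapping that covers the test region.

```rust
/// Sums every `AnonHugePages:` field (reported in kB) found in smaps-formatted text.
/// Simplification: covers all mappings, not just one region.
fn anon_huge_kib(smaps: &str) -> u64 {
    smaps
        .lines()
        .filter_map(|line| line.strip_prefix("AnonHugePages:"))
        // The field looks like "      2048 kB"; take the first whitespace-separated token.
        .filter_map(|rest| rest.split_whitespace().next()?.parse::<u64>().ok())
        .sum()
}
```

On Linux the input would come from `std::fs::read_to_string("/proc/self/smaps")`, sampled before arming, after arming, and after the first write fault, so that a drop between samples indicates a THP split.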
Three-phase comparison: memfd+HUGEPAGE, memfd+NOHUGEPAGE, anon+HUGEPAGE.

Key findings:
- memfd + MADV_HUGEPAGE is a trap on stock kernels (shmem_enabled=never): the VM_HUGEPAGE marker raises WP arm cost roughly 4.3x (868 µs vs 202 µs baseline) but allocates zero hugepages.
- Real THPs (anonymous backing) cost ~2x WP arm; faults still report at 4 KiB granularity. The kernel uses a PMD-level WP marker + split-on-first-sub-page-fault rather than a synchronous split at arm time.
- Populate cost for anon+HUGEPAGE is more than 10x the memfd baseline (234 ms vs 20 ms for 64 MiB) due to contiguous-region defrag.

For forkd: use memfd + no MADV_HUGEPAGE on source VM memory. Fastest arm, most predictable, matches what Firecracker already does.

Answers DESIGN-v0.4.md open question #2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promotes the PoC machinery from experiments/v0.4-*-poc into the
production forkd-uffd crate as two new modules:
- raw.rs (pub(crate)): libc-level wrappers for the userfaultfd
ioctls we use, since the userfaultfd 0.8 crate doesn't yet
expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP.
- wp_snapshot.rs (pub): WpBranch struct owning one in-flight v0.4
BRANCH operation. WpBranch::begin arms WP and spawns a handler;
bulk_copy_clean does the still-clean pass; finalize stops the
handler and returns WpBranchStats.
Disjoint from the existing handshake module (v0.3 restore side).
Linux-only behind cfg gates. Includes one smoke test that arms +
finalizes a 16 KiB anon region with no writes; expected to skip on
CI sandboxes that don't allow userfaultfd to unprivileged users.
Tracking #101, PR #156, PR #157.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
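To make concrete what such libc-level wrappers need to supply, the sketch below derives the `UFFDIO_WRITEPROTECT` request number and argument layout from the Linux userfaultfd UAPI header. It is an illustration, not the contents of `raw.rs`: the Rust type names and the `iowr` helper are hypothetical, and the encoding shown is the generic Linux ioctl layout used on x86_64 and aarch64 (some other architectures encode direction bits differently).

```rust
use std::mem::size_of;

/// Mirrors `struct uffdio_range` from the kernel UAPI header.
#[repr(C)]
struct UffdioRange {
    start: u64,
    len: u64,
}

/// Mirrors `struct uffdio_writeprotect` from the kernel UAPI header.
#[repr(C)]
struct UffdioWriteprotect {
    range: UffdioRange,
    mode: u64,
}

/// Register-time mode bit requesting write-protect tracking.
const UFFDIO_REGISTER_MODE_WP: u64 = 1 << 1;
/// Mode bit for UFFDIO_WRITEPROTECT: set = protect, clear = unprotect.
const UFFDIO_WRITEPROTECT_MODE_WP: u64 = 1 << 0;

/// `_IOWR(ty, nr, size)` under the generic Linux ioctl encoding
/// (assumption: x86_64 / aarch64 style direction bits).
const fn iowr(ty: u64, nr: u64, size: usize) -> u64 {
    (3 << 30) | ((size as u64) << 16) | (ty << 8) | nr
}

/// ioctl type 0xAA, command number 0x06.
const UFFDIO_WRITEPROTECT: u64 = iowr(0xAA, 0x06, size_of::<UffdioWriteprotect>());

/// Builds the argument that arms write protection over `[start, start + len)`.
fn arm_request(start: u64, len: u64) -> UffdioWriteprotect {
    UffdioWriteprotect {
        range: UffdioRange { start, len },
        mode: UFFDIO_WRITEPROTECT_MODE_WP,
    }
}
```

A real wrapper would pass a pointer to this struct to `ioctl` on the userfaultfd file descriptor, after registering the range with `UFFDIO_REGISTER_MODE_WP`; both the start address and the length must be page-aligned.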
Creates a memfd of --region-mib MiB, arms UFFDIO_WRITEPROTECT, runs the bulk-copy + handler pair from forkd_uffd::wp_snapshot::WpBranch, and prints timing data:

forkd wp-bench [--region-mib 64] [--snapshot /tmp/...]

Output shape mirrors `forkd bench`: per-step durations + verification that the snapshot matches the pre-arm content (every byte 0x42).

Not the full BRANCH integration: that requires patching the forkd-controller branch_sandbox path to skip FC's synchronous memory.bin write, which is its own multi-day spike documented in DESIGN-v0.4.md. This subcommand is the CLI surface for benchmarking WpBranch on a given kernel/filesystem combo before that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v0.4's "live-fork" foundation: the snapshot-side `UFFD_WP` primitive, validated by three kernel-level PoCs, promoted into the `forkd-uffd` crate as `WpBranch`, and exposed via the `forkd wp-bench` subcommand. Tracking #101, design RFC in #156.

What this PR ships

1. Production library: `forkd_uffd::wp_snapshot::WpBranch`

A struct that owns one in-flight v0.4 BRANCH.

`raw.rs` (`pub(crate)`) wraps the userfaultfd ioctls since the `userfaultfd` 0.8 crate doesn't expose `UFFDIO_WRITEPROTECT` or `UFFDIO_REGISTER_MODE_WP`. Disjoint from the existing `handshake` module (v0.3 restore-side).

2. CLI surface: `forkd wp-bench`

3. Three kernel-level PoCs (verifying design assumptions)

- `experiments/v0.4-uffd-wp-poc/`
- `experiments/v0.4-kvm-uffd-wp-poc/`
- `experiments/v0.4-thp-uffd-wp-poc/`

Each has its own `RESULTS.md` with empirical numbers.

v0.4 vs v0.3.4 comparison (extrapolated from PoC data)

`v0.3.4 pause` includes ext4 metadata + memory.bin write. `v0.4 pause` is `UFFDIO_WRITEPROTECT` only; memory.bin writes happen async outside the critical section.

What's NOT in this PR

- `forkd-controller::branch_sandbox` integration. Firecracker's `/snapshot/create` always writes `memory.bin` synchronously inside the pause; replacing it requires either an FC patch (vmstate-only mode), bypassing FC's snapshot API entirely with raw `KVM_GET_REGS`, or accepting the existing pause for vmstate alongside async memory capture. This is a multi-day spike tracked in DESIGN-v0.4.md and the next session.
- `--live-fork` flag on `forkd snapshot` and the controller. Depends on the integration above.

Test plan

- `cargo test --release -p forkd-uffd wp_snapshot`: passes (sudo required for unprivileged_userfaultfd)
- `forkd wp-bench --region-mib 64`: 203 µs arm, consistent snapshot
- `forkd wp-bench --region-mib 1024`: 3.18 ms arm, consistent snapshot

Branches touched

- `crates/forkd-uffd`: `raw`, `wp_snapshot` (~625 lines, Linux-only behind cfg)
- `crates/forkd-cli`: `wp-bench` + dep on `forkd-uffd` (Linux-only)
- `Cargo.toml` (workspace): `experiments/v0.4-*-poc` members

🤖 Generated with Claude Code