RFC: v0.4 live-fork via userfaultfd write-protect #156
Sketches the implementation plan for cutting BRANCH pause from ~150ms (v0.3.4 floor on ext4) to < 10ms by removing the synchronous memory write entirely. Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT over the full guest range before BRANCH, copy dirty pages async via a uffd handler. The pause window then contains only vCPU + device state dump (microseconds) and the WP-arming syscall (sub-millisecond). Doc covers: motivation, goal/non-goals, alternatives considered (status quo, pre-copy migration, full memcpy, block-device CoW), open questions (THP interaction, KVM-direct memory access paths, format compatibility), implementation phases, risks. Tracking issue: #101. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
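The "WP-arming syscall" mentioned above is a single `UFFDIO_WRITEPROTECT` ioctl over the guest range. As a rough sketch of what that call takes, the block below mirrors the argument layout and request number from the Linux UAPI header `<linux/userfaultfd.h>`. The Rust type names and the `arm_request` helper are hypothetical and are not taken from forkd; the request-number encoding assumes the asm-generic ioctl layout (x86-64, aarch64). It only builds the argument and does not issue the ioctl, which would need `libc` and a registered userfaultfd.

```rust
use std::mem::size_of;

/// Mirrors `struct uffdio_range` from <linux/userfaultfd.h>.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct UffdioRange {
    start: u64, // page-aligned start address
    len: u64,   // length in bytes, a multiple of the page size
}

/// Mirrors `struct uffdio_writeprotect`.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct UffdioWriteprotect {
    range: UffdioRange,
    mode: u64,
}

/// Registration mode bit that enables write-protect tracking on a range.
const UFFDIO_REGISTER_MODE_WP: u64 = 1 << 1;
/// Mode bit that arms (rather than clears) write protection.
const UFFDIO_WRITEPROTECT_MODE_WP: u64 = 1 << 0;

/// `_IOWR(ty, nr, size)` under the asm-generic ioctl encoding.
/// Assumption: x86-64 or aarch64; some architectures encode this differently.
const fn iowr(ty: u64, nr: u64, size: usize) -> u64 {
    const READ_WRITE: u64 = 3;
    (READ_WRITE << 30) | ((size as u64) << 16) | (ty << 8) | nr
}

/// `UFFDIO_WRITEPROTECT` = `_IOWR(0xAA, 0x06, struct uffdio_writeprotect)`.
const UFFDIO_WRITEPROTECT: u64 = iowr(0xAA, 0x06, size_of::<UffdioWriteprotect>());

/// Hypothetical helper: build the argument that arms WP over one region.
fn arm_request(start: u64, len: u64) -> UffdioWriteprotect {
    UffdioWriteprotect {
        range: UffdioRange { start, len },
        mode: UFFDIO_WRITEPROTECT_MODE_WP,
    }
}

fn main() {
    // Illustrative address and a 64 MiB length; not values from the PoC.
    let req = arm_request(0x7f00_0000_0000, 64 << 20);
    println!("register mode bit: {UFFDIO_REGISTER_MODE_WP:#x}");
    println!("ioctl request:     {UFFDIO_WRITEPROTECT:#x}");
    println!("argument:          {req:?}");
}
```

Because the whole guest range goes into one `range`, the arm is one syscall regardless of region size; the cost scales with the number of page-table entries the kernel has to touch.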
Phase 1 PoC results in #157: kernel mechanics empirically work.
WP arm latency is linear at ~3 ms/GiB. Snapshot consistency invariant holds (0 violations across all sizes). A 1 GiB parent fits comfortably under the < 10 ms target; for >3 GiB parents the < 10 ms claim in the design will need to be qualified by size.
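The size qualification follows directly from the measured slope. A minimal sketch of the arithmetic, assuming the ~3 ms/GiB figure extrapolates linearly beyond the tested sizes and counting only the WP arm cost (not the vCPU + device state dump); the function names are illustrative:

```rust
/// Measured slope from the Phase 1 PoC: WP arm latency per GiB, in ms.
const ARM_MS_PER_GIB: f64 = 3.0;
/// Pause target from the RFC, in ms.
const PAUSE_TARGET_MS: f64 = 10.0;

/// Linear estimate of WP arm latency for a parent of `gib` GiB.
fn estimated_arm_ms(gib: f64) -> f64 {
    gib * ARM_MS_PER_GIB
}

/// True if the arm cost alone stays strictly under the pause target.
fn fits_pause_target(gib: f64) -> bool {
    estimated_arm_ms(gib) < PAUSE_TARGET_MS
}

fn main() {
    for gib in [1.0, 2.0, 3.0, 4.0, 8.0] {
        println!(
            "{gib:>4} GiB -> ~{:.1} ms arm, under target: {}",
            estimated_arm_ms(gib),
            fits_pause_target(gib)
        );
    }
}
```

Under this estimate a 3 GiB parent lands at roughly 9 ms, leaving very little headroom, and a 4 GiB parent is already over the target.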
Phase 2 PoC empirical result: open question #1 ✓ answered POSITIVELY
flags=0x3 = WRITE|WP (the kernel reports both). So: MMU notifiers → EPT invalidation → uffd_wp fault delivery chain works as the design assumes. The biggest "will the kernel even let us do this" risk is now empirically retired. Full code + results in #157.
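For reference, the reported flag value can be decoded with the page-fault flag bits defined in `<linux/userfaultfd.h>`. The sketch below is a standalone illustration; `decode_fault_flags` and `is_wp_write_fault` are hypothetical helper names, not part of the PoC.

```rust
// Bit values from the Linux UAPI header <linux/userfaultfd.h>.
const UFFD_PAGEFAULT_FLAG_WRITE: u64 = 1 << 0;
const UFFD_PAGEFAULT_FLAG_WP: u64 = 1 << 1;
const UFFD_PAGEFAULT_FLAG_MINOR: u64 = 1 << 2;

const FLAG_NAMES: [(u64, &str); 3] = [
    (UFFD_PAGEFAULT_FLAG_WRITE, "WRITE"),
    (UFFD_PAGEFAULT_FLAG_WP, "WP"),
    (UFFD_PAGEFAULT_FLAG_MINOR, "MINOR"),
];

/// Return the names of all known flag bits set in `flags`.
fn decode_fault_flags(flags: u64) -> Vec<&'static str> {
    let mut names = Vec::new();
    for (bit, name) in FLAG_NAMES {
        if flags & bit != 0 {
            names.push(name);
        }
    }
    names
}

/// True when the fault is a write that hit a write-protected page.
fn is_wp_write_fault(flags: u64) -> bool {
    let both = UFFD_PAGEFAULT_FLAG_WRITE | UFFD_PAGEFAULT_FLAG_WP;
    flags & both == both
}

fn main() {
    let flags = 0x3;
    println!("{flags:#x} = {}", decode_fault_flags(flags).join("|"));
    println!("wp write fault: {}", is_wp_write_fault(flags));
}
```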
Phase 3 PoC — open question #2 (THP) closed. Three backings tested on 64 MiB regions (full data in
The kernel preserves 4 KiB fault granularity in all three. WP arm cost is at most ~2× the baseline even with real hugepages. The trap is the marker-without-hugepages case (memfd+MADV_HUGEPAGE on stock systems): 4.3× slower with zero benefit. Design implication: forkd should use memfd + no MADV_HUGEPAGE on source VM memory. Phase 4 next: KVM_GET_DIRTY_LOG × UFFD_WP coordination (open question #3) and sustained-write storm throughput.
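The Phase 3 PoC verifies hugepage allocation by reading `AnonHugePages` from `/proc/self/smaps`. A minimal sketch of that kind of check is below. It is an assumption about the approach rather than the PoC's code: `sum_field_kb` is a hypothetical helper, and it sums the field across every mapping in the process instead of isolating the test region.

```rust
use std::fs;

/// Sum a `Name:   <n> kB` field across all mappings in smaps text.
/// `field` must include the trailing colon, e.g. "AnonHugePages:".
fn sum_field_kb(smaps: &str, field: &str) -> u64 {
    smaps
        .lines()
        .filter_map(|line| line.strip_prefix(field))
        .filter_map(|rest| rest.split_whitespace().next()?.parse::<u64>().ok())
        .sum()
}

fn main() {
    // Empty string on systems without /proc (the check then reports 0 kB).
    let smaps = fs::read_to_string("/proc/self/smaps").unwrap_or_default();
    let kb = sum_field_kb(&smaps, "AnonHugePages:");
    println!("AnonHugePages total: {kb} kB");
}
```

Sampling this before the arm, after the arm, and after the first write fault is one way to tell whether a hugepage was split and when. Note that `AnonHugePages` only covers anonymous memory; hugepages backing shmem/memfd mappings are reported in a separate smaps field (`ShmemPmdMapped`).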
Promotes the PoC machinery from experiments/v0.4-*-poc into the
production forkd-uffd crate as two new modules:
- raw.rs (pub(crate)): libc-level wrappers for the userfaultfd
ioctls we use, since the userfaultfd 0.8 crate doesn't yet
expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP.
- wp_snapshot.rs (pub): WpBranch struct owning one in-flight v0.4
BRANCH operation. WpBranch::begin arms WP and spawns a handler;
bulk_copy_clean does the still-clean pass; finalize stops the
handler and returns WpBranchStats.
Disjoint from the existing handshake module (v0.3 restore side).
Linux-only behind cfg gates. Includes one smoke test that arms +
finalizes a 16 KiB anon region with no writes; expected to skip on
CI sandboxes that don't allow userfaultfd to unprivileged users.
Tracking #101, PR #156, PR #157.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
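The begin / bulk_copy_clean / finalize flow described in this commit message can be modelled without any kernel support. The sketch below is a single-threaded, in-memory model of the invariant only. `ModelBranch` and its methods are hypothetical and do not reflect the real `WpBranch` signatures; a `Vec<bool>` stands in for the kernel's per-page write-protect state, and the real handler thread and its races with the bulk copier are not modelled.

```rust
const PAGE: usize = 4096;

/// Toy model of one in-flight BRANCH: live RAM, snapshot, per-page WP bits.
struct ModelBranch {
    live: Vec<u8>,     // stands in for the memfd-backed guest RAM
    snapshot: Vec<u8>, // stands in for the snapshot file
    armed: Vec<bool>,  // true while a page is still write-protected
    faults: usize,     // number of WP faults the "handler" served
}

impl ModelBranch {
    /// Arm write protection over the whole region.
    fn begin(live: Vec<u8>) -> Self {
        assert!(live.len() % PAGE == 0, "region must be page-aligned");
        let pages = live.len() / PAGE;
        Self {
            snapshot: vec![0; live.len()],
            armed: vec![true; pages],
            faults: 0,
            live,
        }
    }

    /// Copy one page into the snapshot, then drop its protection.
    fn save_page(&mut self, page: usize) {
        let range = page * PAGE..(page + 1) * PAGE;
        self.snapshot[range.clone()].copy_from_slice(&self.live[range]);
        self.armed[page] = false;
    }

    /// A guest write: if the page is still armed, the handler saves the
    /// pre-write content first, and only then does the write land.
    fn guest_write(&mut self, offset: usize, value: u8) {
        let page = offset / PAGE;
        if self.armed[page] {
            self.faults += 1;
            self.save_page(page);
        }
        self.live[offset] = value;
    }

    /// Copy every page that no write has touched yet; returns the count.
    fn bulk_copy_clean(&mut self) -> usize {
        let clean: Vec<usize> = (0..self.armed.len()).filter(|&p| self.armed[p]).collect();
        for &page in &clean {
            self.save_page(page);
        }
        clean.len()
    }
}

fn main() {
    let before = vec![0xBE_u8; 4 * PAGE];
    let mut branch = ModelBranch::begin(before.clone());
    branch.guest_write(PAGE + 7, 0x42); // faults, page 1 saved first
    branch.guest_write(PAGE + 8, 0x42); // same page, no second fault
    branch.guest_write(3 * PAGE, 0x42); // faults, page 3 saved first
    let clean = branch.bulk_copy_clean();
    assert_eq!(branch.snapshot, before); // point-in-time view holds
    println!("faults={} clean_pages={clean}", branch.faults);
}
```

The point of the model is the ordering: a page's pre-arm content always reaches the snapshot before protection is dropped, so the snapshot equals the pre-arm image no matter how writes interleave with the bulk copy.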
* RFC: v0.4 live-fork via userfaultfd write-protect

* experiment(v0.4): Phase 1 PoC, UFFDIO_WRITEPROTECT on memfd
  Standalone binary that exercises the kernel mechanics v0.4 depends on, outside the KVM/Firecracker context. Allocates a 64 MiB memfd, arms UFFDIO_WRITEPROTECT, runs a writer thread + uffd handler thread, then validates that the resulting snapshot file is a consistent point-in-time view (every page starts with its BEFORE label).
  Goal: prove the kernel side of the v0.4 design is feasible. If this PoC passes (and prints sub-ms WP arm latency), the design proposed in DESIGN-v0.4.md can move to Phase 2 (integrate into forkd-uffd).
  Tracking #101, related to PR #156.

* experiment(v0.4): rewrite PoC against raw libc (userfaultfd 0.8.1 lacks WP wrappers)

* experiment(v0.4): make region size configurable via REGION_MIB env var

* experiment(v0.4): empirical results: WP arm linear at ~3 ms/GiB, 0 violations

* experiment(v0.4): Phase 2 PoC, UFFD_WP × KVM guest writes
  Minimal raw-KVM Rust binary that answers open question #1 in DESIGN-v0.4.md: does UFFD_WP armed on a memfd-backed host VMA catch writes when the guest accesses memory through EPT (not the host MMU)?
  Setup: 1 MiB memfd, pre-write BEFORE marker at GPA 0x1000, tiny real-mode guest that does `mov [0x1000], al` with AL=0x42, arm UFFDIO_WRITEPROTECT before vcpu.run().
  Validate:
  - handler caught a write fault at GPA 0x1000
  - snapshot byte at 0x1000 is the BEFORE marker (0xBE)
  - live memfd byte at 0x1000 is the AFTER marker (0x42)
  If all three hold, EPT-mediated guest writes propagate through MMU notifiers to UFFD_WP on the host VMA, and v0.4's snapshot mechanism is sound under KVM.

* experiment(v0.4): pin kvm-bindings to 0.11 (matches kvm-ioctls 0.21)

* experiment(v0.4): vcpu needs mut binding for .run()

* experiment(v0.4): Phase 2 results, UFFD_WP catches KVM guest writes ✓
  Empirical answer to DESIGN-v0.4.md open question #1: yes. EPT-mediated guest writes propagate through MMU notifiers to UFFD_WP on the host VMA. Handler captures pre-write content (0xBE), guest write lands on live memory (0x42), and the ordering invariant holds.
  Measured (1 MiB region, single guest write):
  - WP arm latency: 9.3 µs
  - Total vcpu runtime including 1 trap-and-resume: 211 µs
  - Fault flags: 0x3 = WRITE|WP (kernel reports both)

* style: cargo fmt PoC sources (fixes CI)

* experiment(v0.4): Phase 3 PoC, UFFD_WP × transparent hugepages
  Answers DESIGN-v0.4.md open question #2. Compares WP arm latency and first-fault behavior between MADV_HUGEPAGE and MADV_NOHUGEPAGE on the same 64 MiB region. Reads /proc/self/smaps to verify AnonHugePages allocation before/after WP arm and after first write fault.
  If THP is split at arm time, the arm latency for the hugepage case will be visibly higher than the no-hugepage case, and AnonHugePages will drop after arm. If split happens at first-fault time, the arm will be cheap but the first write will be expensive. If neither, the kernel handles WP at hugepage granularity (faults report the 2 MiB base, not the 4 KiB sub-page).

* experiment(v0.4): add Phase C, MAP_ANONYMOUS for real THP allocation

* experiment(v0.4): Phase 3 results, THP × UFFD_WP characterized
  Three-phase comparison: memfd+HUGEPAGE, memfd+NOHUGEPAGE, anon+HUGEPAGE.
  Key findings:
  - memfd + MADV_HUGEPAGE is a trap on stock kernels (shmem_enabled=never): the VM_HUGEPAGE marker raises WP arm cost about 4.3× (868 µs vs 202 µs baseline) but allocates zero hugepages.
  - Real THPs (anonymous backing) cost ~2x WP arm; faults still report at 4 KiB granularity. The kernel uses PMD-level WP marker + split-on-first-sub-page-fault rather than synchronous split at arm time.
  - Populate cost for anon+HUGEPAGE is 10x the memfd baseline (234 ms vs 20 ms for 64 MiB) due to contiguous-region defrag.
  For forkd: use memfd + no MADV_HUGEPAGE on source VM memory. Fastest arm, most predictable, matches what Firecracker already does. Answers DESIGN-v0.4.md open question #2.

* feat(forkd-uffd): add WpBranch, snapshot-side write-protection (v0.4)

* fix(forkd-uffd): use is_multiple_of for clippy 1.95

* style: cargo fmt forkd-uffd WpBranch sources

* fix: clippy lints for stricter CI rustc (doc list blank lines + c-str literal)

* style: fmt collapse placeholder_fd let binding

* fix: mark raw.rs C-define snippet as text doctest (not Rust)

* feat(forkd-cli): add wp-bench subcommand, v0.4 WpBranch CLI surface
  Creates a memfd of --region-mib MiB, arms UFFDIO_WRITEPROTECT, runs the bulk-copy + handler pair from forkd_uffd::wp_snapshot::WpBranch, and prints timing data:
      forkd wp-bench [--region-mib 64] [--snapshot /tmp/...]
  Output shape mirrors `forkd bench`: per-step durations + verification that the snapshot matches the pre-arm content (every byte 0x42).
  Not the full BRANCH integration. That requires patching the forkd-controller branch_sandbox path to skip FC's synchronous memory.bin write, which is its own multi-day spike documented in DESIGN-v0.4.md. This subcommand is the CLI surface for benchmarking WpBranch on a given kernel/filesystem combo before that lands.

* style: cargo fmt wp_bench

* style: cargo fmt wp_bench (CI rust 1.95)

---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
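The wp-bench verification step ("every byte 0x42") amounts to a scan for the first byte that differs from the pre-arm fill. A minimal sketch, assuming the snapshot has already been read into memory; `first_mismatch` is a hypothetical helper and not the subcommand's actual code:

```rust
/// Fill byte that wp-bench writes before arming, per the commit message.
const PRE_ARM_FILL: u8 = 0x42;

/// Offset of the first byte that differs from `expected`, if any.
fn first_mismatch(snapshot: &[u8], expected: u8) -> Option<usize> {
    snapshot.iter().position(|&byte| byte != expected)
}

fn main() {
    // Stand-in for the snapshot file contents (4 pages of fill).
    let snapshot = vec![PRE_ARM_FILL; 4 * 4096];
    match first_mismatch(&snapshot, PRE_ARM_FILL) {
        None => println!("snapshot matches pre-arm content"),
        Some(offset) => println!("mismatch at byte {offset} (page {})", offset / 4096),
    }
}
```

Reporting the offset rather than a bare pass/fail makes it possible to map a violation back to the page whose pre-write copy was missed.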
Draft RFC for v0.4. Tracking #101.
Summary
Sketches the implementation plan for cutting BRANCH pause from ~150 ms (v0.3.4 floor on ext4) to < 10 ms by removing the synchronous memory write.
Approach: switch source RAM to `memfd_create`, arm `UFFDIO_WRITEPROTECT` over the full guest range before BRANCH, copy dirty pages async via a uffd handler. The pause window then contains only vCPU + device state dump (microseconds) and the WP-arming syscall (sub-millisecond).
What's in the doc
`UFFDIO_WRITEPROTECT` + async dirty-page copier)
Why a draft PR
This is meant to be visible while the implementation lands, not after. Comments / corrections / prior-art pointers especially welcome on the open questions, particularly behavior of `UFFD_WP` on memfd-backed VMAs under `KVM_RUN`.
Test plan
The doc itself doesn't need testing; the implementation will. Phase 3 (Week 5 in the plan) reuses the v0.3.4 multi-BRANCH sweep harness in `bench/pause-window/` to measure the new pause distribution.
🤖 Generated with Claude Code