RFC: v0.4 live-fork via userfaultfd write-protect by WaylandYang · Pull Request #156 · deeplethe/forkd

WaylandYang · 2026-05-24T14:28:04Z

Draft RFC for v0.4. Tracking #101.

Summary

Sketches the implementation plan for cutting BRANCH pause from ~150 ms (v0.3.4 floor on ext4) to < 10 ms by removing the synchronous memory write.

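Approach: back source RAM with memfd_create, arm UFFDIO_WRITEPROTECT over the full guest range before BRANCH, and copy memory asynchronously: a uffd handler saves each page's pre-write contents when the guest first writes to it, while a background pass copies the pages that stay clean. The pause window then contains only the vCPU + device state dump (microseconds) and the WP-arming ioctl (sub-millisecond for regions up to about 256 MiB, per the Phase 1 PoC measurements later in this thread).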
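The copy-before-write idea in the approach can be illustrated with a pure userspace model. This is only a sketch: it makes no kernel calls, and `WpSnapshotModel`, `guest_write`, and `save_page` are hypothetical names invented for the illustration (only `bulk_copy_clean` echoes a name used elsewhere in this PR). A per-page boolean stands in for the write-protect bit, and a "fault" is just the branch taken on the first write to a protected page.

```rust
// Userspace model of a copy-before-write snapshot. No userfaultfd involved.
const PAGE: usize = 4096;

struct WpSnapshotModel {
    live: Vec<u8>,        // stands in for the memfd-backed guest RAM
    snapshot: Vec<u8>,    // stands in for the snapshot file
    protected: Vec<bool>, // per-page write-protect bit
}

impl WpSnapshotModel {
    /// "Arm": mark every page write-protected. Nothing is copied yet.
    fn arm(live: Vec<u8>) -> Self {
        assert!(live.len() % PAGE == 0, "region must be page-aligned");
        let len = live.len();
        Self { live, snapshot: vec![0; len], protected: vec![true; len / PAGE] }
    }

    /// Copy one page into the snapshot and drop its protection.
    fn save_page(&mut self, page: usize) {
        let range = page * PAGE..(page + 1) * PAGE;
        self.snapshot[range.clone()].copy_from_slice(&self.live[range]);
        self.protected[page] = false;
    }

    /// A guest write. Writing to a protected page "faults": the old
    /// contents are saved first, then the write lands. Returns whether
    /// this write faulted.
    fn guest_write(&mut self, addr: usize, value: u8) -> bool {
        let page = addr / PAGE;
        let faulted = self.protected[page];
        if faulted {
            self.save_page(page);
        }
        self.live[addr] = value;
        faulted
    }

    /// Background pass: copy every page no write has touched yet.
    /// Returns the number of pages copied.
    fn bulk_copy_clean(&mut self) -> usize {
        let mut copied = 0;
        for page in 0..self.protected.len() {
            if self.protected[page] {
                self.save_page(page);
                copied += 1;
            }
        }
        copied
    }
}

fn main() {
    let before = vec![0xBE_u8; 4 * PAGE];
    let mut model = WpSnapshotModel::arm(before.clone());
    assert!(model.guest_write(0x1000, 0x42)); // first write to page 1 faults
    assert!(!model.guest_write(0x1001, 0x43)); // page 1 is already unprotected
    let copied = model.bulk_copy_clean();
    println!("bulk-copied {copied} clean pages");
    assert_eq!(model.snapshot, before); // snapshot is the point-in-time view
    assert_eq!(model.live[0x1000], 0x42); // live memory kept the guest write
}
```

The property the model checks is the one the design relies on: whatever order writes and the background pass run in, each page reaches the snapshot exactly once, with its contents as of arm time.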

What's in the doc

- Motivation (why v0.3.4's 150 ms is still too much)
- Goal: < 10 ms pause; stretch < 1 ms
- Mechanism (memfd + UFFDIO_WRITEPROTECT + async dirty-page copier)
- Alternatives considered (pre-copy à la live migration, full memcpy, block-device CoW)
- Open questions (THP interaction, KVM-direct memory paths, snapshot format compat)
- Phased implementation plan (~8 weeks, PoC → integrate → bench → harden → launch)
- Risks (kernel < 5.7, write-fault storms, consistency proof, restore regression)

Why a draft PR

This is meant to be visible while the implementation lands, not after. Comments, corrections, and prior-art pointers are especially welcome on the open questions, particularly the behavior of UFFD_WP on memfd-backed VMAs under KVM_RUN.

Test plan

The doc itself doesn't need testing; the implementation will. Phase 3 (Week 5 in the plan) reuses the v0.3.4 multi-BRANCH sweep harness in bench/pause-window/ to measure the new pause distribution.

🤖 Generated with Claude Code


WaylandYang · 2026-05-24T15:50:11Z

Phase 1 PoC results in #157 — kernel mechanics empirically work:

Region	WP arm latency
64 MiB	200 µs
256 MiB	815 µs
1024 MiB	3.19 ms

Linear at roughly 3.2 ms/GiB (3.19 to 3.26 ms/GiB across the three sizes). The snapshot consistency invariant holds (0 violations across all sizes). A 1 GiB parent fits comfortably under the < 10 ms target; for parents larger than about 3 GiB, the < 10 ms claim in the design will need to be qualified by size.
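The size qualification can be worked through from the three measurements. The sketch below assumes the linear scaling continues past 1 GiB, which the PoC did not measure; the function names are hypothetical.

```rust
/// (region size in MiB, measured WP arm latency in microseconds),
/// copied from the Phase 1 table.
const MEASURED: [(f64, f64); 3] = [(64.0, 200.0), (256.0, 815.0), (1024.0, 3190.0)];

/// Arm cost in ms per GiB implied by one measurement.
fn ms_per_gib(mib: f64, micros: f64) -> f64 {
    (micros / 1000.0) / (mib / 1024.0)
}

/// Largest region (in GiB) whose arm latency fits in `budget_ms`,
/// assuming latency stays linear in region size.
fn max_gib_within(budget_ms: f64, rate_ms_per_gib: f64) -> f64 {
    budget_ms / rate_ms_per_gib
}

fn main() {
    for (mib, micros) in MEASURED {
        println!("{mib:>6} MiB: {:.2} ms/GiB", ms_per_gib(mib, micros));
    }
    // Use the largest measured region as the rate estimate.
    let rate = ms_per_gib(1024.0, 3190.0);
    println!("10 ms budget covers about {:.1} GiB", max_gib_within(10.0, rate));
}
```

With the 1 GiB data point as the rate, the 10 ms budget is used up by the arm step alone at about 3.1 GiB, before the vCPU and device state dump is counted.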

WaylandYang · 2026-05-24T16:23:03Z

Phase 2 PoC empirical result — open question #1 ✓ answered POSITIVELY

experiments/v0.4-kvm-uffd-wp-poc/ runs a real-mode KVM guest that does mov [0x1000], al with UFFDIO_WRITEPROTECT armed on the memfd backing the memslot. Result:

[uffd] armed UFFDIO_WRITEPROTECT in 9.283µs
[kvm] running vcpu...
[handler] caught fault at GPA 0x1000 (flags=0x3, write=true)
[kvm] guest halted normally in 211.003µs (1 exits)

Live memfd[0x1000]:  0x42 (AFTER, guest write committed)
Snapshot[0x1000]:    0xbe (BEFORE, captured pre-write)

flags=0x3 = UFFD_PAGEFAULT_FLAG_WRITE | UFFD_PAGEFAULT_FLAG_WP — kernel correctly tags the event as a write-protect fault originating from a guest write.
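For readers who want to check the flag arithmetic, here is a small decoder. The bit values are the ones defined in `<linux/userfaultfd.h>`; `decode_pagefault_flags` and `is_wp_write_fault` are hypothetical helpers written for this illustration, not functions from the PoC.

```rust
// Pagefault flag bits as defined in <linux/userfaultfd.h>.
const UFFD_PAGEFAULT_FLAG_WRITE: u64 = 1 << 0;
const UFFD_PAGEFAULT_FLAG_WP: u64 = 1 << 1;
const UFFD_PAGEFAULT_FLAG_MINOR: u64 = 1 << 2;

/// Names of the flags set in a uffd pagefault event, lowest bit first.
fn decode_pagefault_flags(flags: u64) -> Vec<&'static str> {
    let table = [
        (UFFD_PAGEFAULT_FLAG_WRITE, "WRITE"),
        (UFFD_PAGEFAULT_FLAG_WP, "WP"),
        (UFFD_PAGEFAULT_FLAG_MINOR, "MINOR"),
    ];
    let mut names = Vec::new();
    for (bit, name) in table {
        if (flags & bit) != 0 {
            names.push(name);
        }
    }
    names
}

/// True when the event is a write that hit a write-protected page.
fn is_wp_write_fault(flags: u64) -> bool {
    let both = UFFD_PAGEFAULT_FLAG_WRITE | UFFD_PAGEFAULT_FLAG_WP;
    (flags & both) == both
}

fn main() {
    let flags = 0x3;
    println!("flags={:#x} -> {:?}", flags, decode_pagefault_flags(flags));
    assert!(is_wp_write_fault(flags));
}
```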

So: MMU notifiers → EPT invalidation → uffd_wp fault delivery chain works as the design assumes. The biggest "will the kernel even let us do this" risk is now empirically retired.

Full code + results in #157.

WaylandYang · 2026-05-24T16:42:46Z

Phase 3 PoC — open question #2 (THP) closed.

Three backings tested on 64 MiB regions (full data in v0.4-thp-uffd-wp-poc/RESULTS.md):

Backing	WP arm latency	THPs allocated	Fault granularity
memfd + MADV_HUGEPAGE	868 µs	0 (stock `shmem_enabled=never`)	4 KiB
memfd + MADV_NOHUGEPAGE	202 µs (baseline)	0	4 KiB
anon + MADV_HUGEPAGE	419 µs	17 × 2 MiB	4 KiB

The kernel preserves 4 KiB fault granularity in all three. With real hugepages (anon + MADV_HUGEPAGE), WP arm costs about 2× the baseline (419 µs vs 202 µs). The trap is the marker-without-hugepages case (memfd + MADV_HUGEPAGE on stock systems): 4.3× slower (868 µs vs 202 µs) with zero benefit.

Design implication: forkd should use memfd + no MADV_HUGEPAGE for source-VM RAM. Cheapest arm, predictable, matches Firecracker's existing memfd path.

Phase 4 next: KVM_GET_DIRTY_LOG × UFFD_WP coordination (open question #3) and sustained-write storm throughput.


* RFC: v0.4 live-fork via userfaultfd write-protect

  Sketches the implementation plan for cutting BRANCH pause from ~150ms (v0.3.4 floor on ext4) to < 10ms by removing the synchronous memory write entirely.

  Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT over the full guest range before BRANCH, copy dirty pages async via a uffd handler. The pause window then contains only vCPU + device state dump (microseconds) and the WP-arming syscall (sub-millisecond).

  Doc covers: motivation, goal/non-goals, alternatives considered (status quo, pre-copy migration, full memcpy, block-device CoW), open questions (THP interaction, KVM-direct memory access paths, format compatibility), implementation phases, risks.

  Tracking issue: #101.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): Phase 1 PoC — UFFDIO_WRITEPROTECT on memfd

  Standalone binary that exercises the kernel mechanics v0.4 depends on, outside the KVM/Firecracker context. Allocates a 64 MiB memfd, arms UFFDIO_WRITEPROTECT, runs a writer thread + uffd handler thread, then validates that the resulting snapshot file is a consistent point-in-time view (every page starts with its BEFORE label).

  Goal: prove the kernel side of the v0.4 design is feasible. If this PoC passes (and prints sub-ms WP arm latency), the design proposed in DESIGN-v0.4.md can move to Phase 2 (integrate into forkd-uffd).

  Tracking #101, related to PR #156.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): rewrite PoC against raw libc (userfaultfd 0.8.1 lacks WP wrappers)

* experiment(v0.4): make region size configurable via REGION_MIB env var

* experiment(v0.4): empirical results — WP arm linear at ~3 ms/GiB, 0 violations

* experiment(v0.4): Phase 2 PoC — UFFD_WP × KVM guest writes

  Minimal raw-KVM Rust binary that answers open question #1 in DESIGN-v0.4.md: does UFFD_WP armed on a memfd-backed host VMA catch writes when the guest accesses memory through EPT (not the host MMU)?

  Setup: 1 MiB memfd, pre-write BEFORE marker at GPA 0x1000, tiny real-mode guest that does `mov [0x1000], al` with AL=0x42, arm UFFDIO_WRITEPROTECT before vcpu.run().

  Validate:
  - handler caught a write fault at GPA 0x1000
  - snapshot byte at 0x1000 is the BEFORE marker (0xBE)
  - live memfd byte at 0x1000 is the AFTER marker (0x42)

  If all three hold, EPT-mediated guest writes propagate through MMU notifiers to UFFD_WP on the host VMA, and v0.4's snapshot mechanism is sound under KVM.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): pin kvm-bindings to 0.11 (matches kvm-ioctls 0.21)

* experiment(v0.4): vcpu needs mut binding for .run()

* experiment(v0.4): Phase 2 results — UFFD_WP catches KVM guest writes ✓

  Empirical answer to DESIGN-v0.4.md open question #1: yes. EPT-mediated guest writes propagate through MMU notifiers to UFFD_WP on the host VMA. Handler captures pre-write content (0xBE), guest write lands on live memory (0x42), and the ordering invariant holds.

  Measured (1 MiB region, single guest write):
  - WP arm latency: 9.3 µs
  - Total vcpu runtime including 1 trap-and-resume: 211 µs
  - Fault flags: 0x3 = WRITE|WP (kernel reports both)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: cargo fmt PoC sources (fixes CI)

* experiment(v0.4): Phase 3 PoC — UFFD_WP × transparent hugepages

  Answers DESIGN-v0.4.md open question #2. Compares WP arm latency and first-fault behavior between MADV_HUGEPAGE and MADV_NOHUGEPAGE on the same 64 MiB region. Reads /proc/self/smaps to verify AnonHugePages allocation before/after WP arm and after first write fault.

  If THP is split at arm time, the arm latency for the hugepage case will be visibly higher than the no-hugepage case, and AnonHugePages will drop after arm. If split happens at first-fault time, the arm will be cheap but the first write will be expensive. If neither, the kernel handles WP at hugepage granularity (faults report the 2 MiB base, not the 4 KiB sub-page).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): add Phase C — MAP_ANONYMOUS for real THP allocation

* experiment(v0.4): Phase 3 results — THP × UFFD_WP characterized

  Three-phase comparison: memfd+HUGEPAGE, memfd+NOHUGEPAGE, anon+HUGEPAGE.

  Key findings:
  - memfd + MADV_HUGEPAGE is a trap on stock kernels (shmem_enabled=never): the VM_HUGEPAGE marker makes WP arm about 4.3x more expensive (868 µs vs 202 µs baseline) but allocates zero hugepages.
  - Real THPs (anonymous backing) cost ~2x WP arm; faults still report at 4 KiB granularity. The kernel uses PMD-level WP marker + split-on-first-sub-page-fault rather than synchronous split at arm time.
  - Populate cost for anon+HUGEPAGE is roughly 12x the memfd baseline (234 ms vs 20 ms for 64 MiB) due to contiguous-region defrag.

  For forkd: use memfd + no MADV_HUGEPAGE on source VM memory. Fastest arm, most predictable, matches what Firecracker already does.

  Answers DESIGN-v0.4.md open question #2.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(forkd-uffd): add WpBranch — snapshot-side write-protection (v0.4)

  Promotes the PoC machinery from experiments/v0.4-*-poc into the production forkd-uffd crate as two new modules:
  - raw.rs (pub(crate)): libc-level wrappers for the userfaultfd ioctls we use, since the userfaultfd 0.8 crate doesn't yet expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP.
  - wp_snapshot.rs (pub): WpBranch struct owning one in-flight v0.4 BRANCH operation. WpBranch::begin arms WP and spawns a handler; bulk_copy_clean does the still-clean pass; finalize stops the handler and returns WpBranchStats.

  Disjoint from the existing handshake module (v0.3 restore side). Linux-only behind cfg gates.

  Includes one smoke test that arms + finalizes a 16 KiB anon region with no writes; expected to skip on CI sandboxes that don't allow userfaultfd to unprivileged users.

  Tracking #101, PR #156, PR #157.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(forkd-uffd): use is_multiple_of for clippy 1.95

* style: cargo fmt forkd-uffd WpBranch sources

* fix: clippy lints for stricter CI rustc — doc list blank lines + c-str literal

* style: fmt collapse placeholder_fd let binding

* fix: mark raw.rs C-define snippet as text doctest (not Rust)

* feat(forkd-cli): add wp-bench subcommand — v0.4 WpBranch CLI surface

  Creates a memfd of --region-mib MiB, arms UFFDIO_WRITEPROTECT, runs the bulk-copy + handler pair from forkd_uffd::wp_snapshot::WpBranch, and prints timing data:

      forkd wp-bench [--region-mib 64] [--snapshot /tmp/...]

  Output shape mirrors `forkd bench`: per-step durations + verification that the snapshot matches the pre-arm content (every byte 0x42).

  Not the full BRANCH integration; that requires patching the forkd-controller branch_sandbox path to skip FC's synchronous memory.bin write, which is its own multi-day spike documented in DESIGN-v0.4.md. This subcommand is the CLI surface for benchmarking WpBranch on a given kernel/filesystem combo before that lands.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: cargo fmt wp_bench

* style: cargo fmt wp_bench (CI rust 1.95)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
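The feat(forkd-uffd) commit mentions libc-level wrappers because the userfaultfd crate does not expose the write-protect ioctl. As a rough idea of what such a wrapper has to define, the sketch below derives the `UFFDIO_WRITEPROTECT` request number from the struct layout in `<linux/userfaultfd.h>`. It is not the contents of raw.rs: the Rust type names and the `iowr` helper are hypothetical, and the bit layout assumed is the asm-generic `_IOC` encoding used on x86-64 and aarch64 (some other architectures differ).

```rust
use std::mem::size_of;

/// Mirrors `struct uffdio_range` from <linux/userfaultfd.h>.
#[allow(dead_code)]
#[repr(C)]
struct UffdioRange {
    start: u64,
    len: u64,
}

/// Mirrors `struct uffdio_writeprotect`.
#[allow(dead_code)]
#[repr(C)]
struct UffdioWriteprotect {
    range: UffdioRange,
    mode: u64,
}

// Values from <linux/userfaultfd.h>.
const UFFDIO: u64 = 0xAA; // ioctl "type" byte for userfaultfd
const UFFDIO_WRITEPROTECT_NR: u64 = 0x06; // _UFFDIO_WRITEPROTECT
const UFFDIO_REGISTER_MODE_WP: u64 = 1 << 1;
const UFFDIO_WRITEPROTECT_MODE_WP: u64 = 1 << 0;

// asm-generic _IOC direction bits (assumption: x86-64 / aarch64 layout).
const IOC_WRITE: u64 = 1;
const IOC_READ: u64 = 2;

/// Equivalent of the C `_IOWR(type, nr, size)` macro:
/// direction in bits 30..31, size in 16..29, type in 8..15, nr in 0..7.
const fn iowr(ty: u64, nr: u64, size: usize) -> u64 {
    ((IOC_READ | IOC_WRITE) << 30) | ((size as u64) << 16) | (ty << 8) | nr
}

const UFFDIO_WRITEPROTECT: u64 =
    iowr(UFFDIO, UFFDIO_WRITEPROTECT_NR, size_of::<UffdioWriteprotect>());

fn main() {
    println!("UFFDIO_WRITEPROTECT        = {:#x}", UFFDIO_WRITEPROTECT);
    println!("UFFDIO_REGISTER_MODE_WP     = {:#x}", UFFDIO_REGISTER_MODE_WP);
    println!("UFFDIO_WRITEPROTECT_MODE_WP = {:#x}", UFFDIO_WRITEPROTECT_MODE_WP);
    assert_eq!(size_of::<UffdioWriteprotect>(), 24);
    assert_eq!(UFFDIO_WRITEPROTECT, 0xC018_AA06);
}
```

Deriving the number from `size_of` rather than hard-coding it keeps the constant tied to the struct definition, so a layout mistake shows up as a wrong request number in a test instead of a confusing `EINVAL` at runtime.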

This was referenced May 24, 2026

v0.4+ candidate: live-fork via memfd-backed source RAM + uffd_wp (deferred from v0.3) #101

Open

v0.4 foundation: WpBranch primitive + 3 PoCs + wp-bench CLI #157

Merged

WaylandYang marked this pull request as ready for review May 24, 2026 20:13

WaylandYang merged commit f48d395 into main May 24, 2026
2 checks passed

WaylandYang deleted the design/v0.4-live-fork branch May 24, 2026 20:14

