
RFC: v0.4 live-fork via userfaultfd write-protect #156

Merged: WaylandYang merged 1 commit into main from design/v0.4-live-fork on May 24, 2026

@WaylandYang
Contributor

Draft RFC for v0.4. Tracking #101.

Summary

Sketches the implementation plan for cutting BRANCH pause from ~150 ms (v0.3.4 floor on ext4) to < 10 ms by removing the synchronous memory write.

Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT over the full guest range before BRANCH, copy dirty pages async via a uffd handler. The pause window then contains only vCPU + device state dump (microseconds) and the WP-arming syscall (sub-millisecond).
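
The ordering this approach relies on (save a page before the first write to it lands, then sweep up whatever was never written) can be illustrated with a toy model. This is a minimal sketch with no kernel involvement: `SimBranch` and its methods are hypothetical names for illustration, the `protected` vector stands in for the write-protect bits the kernel would hold, and `guest_write` plays the role of both the faulting writer and the uffd handler.

```rust
// Toy model of copy-before-write snapshotting. Nothing here touches
// userfaultfd; all names are hypothetical and for illustration only.
const PAGE: usize = 4096;

struct SimBranch {
    live: Vec<u8>,        // stands in for the memfd-backed guest RAM
    snapshot: Vec<u8>,    // stands in for the snapshot file
    protected: Vec<bool>, // per-page "still write-protected" bit
    faults: usize,        // number of simulated write-protect faults
}

impl SimBranch {
    /// "Arm": mark every page protected. In the real design this is the
    /// only memory-related step inside the pause window.
    fn arm(live: Vec<u8>) -> Self {
        assert!(live.len() % PAGE == 0, "region must be a whole number of pages");
        let pages = live.len() / PAGE;
        SimBranch {
            snapshot: vec![0; live.len()],
            protected: vec![true; pages],
            faults: 0,
            live,
        }
    }

    /// Copy one page into the snapshot, then drop its protection.
    fn save_page(&mut self, page: usize) {
        let r = page * PAGE..(page + 1) * PAGE;
        self.snapshot[r.clone()].copy_from_slice(&self.live[r]);
        self.protected[page] = false;
    }

    /// A write to a protected page "faults": the old contents are saved
    /// first, and only then is the write allowed to land.
    fn guest_write(&mut self, addr: usize, value: u8) {
        let page = addr / PAGE;
        if self.protected[page] {
            self.faults += 1;
            self.save_page(page);
        }
        self.live[addr] = value;
    }

    /// Background pass: copy every page that was never written.
    fn copy_remaining(&mut self) -> usize {
        let mut copied = 0;
        for page in 0..self.protected.len() {
            if self.protected[page] {
                self.save_page(page);
                copied += 1;
            }
        }
        copied
    }
}

fn main() {
    let mut b = SimBranch::arm(vec![0xBE; 4 * PAGE]);
    b.guest_write(0x1000, 0x42); // first write to page 1: faults
    b.guest_write(0x1001, 0x43); // same page, already saved: no fault
    let swept = b.copy_remaining();
    let consistent = b.snapshot.iter().all(|&x| x == 0xBE);
    println!("faults={} swept={} consistent={}", b.faults, swept, consistent);
}
```

In the model, the snapshot ends up holding only pre-arm bytes while the live region carries the new writes, which is the invariant the PoCs in this thread check against the real kernel.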

What's in the doc

  • Motivation (why v0.3.4's 150 ms is still too much)
  • Goal: < 10 ms pause; stretch < 1 ms
  • Mechanism (memfd + UFFDIO_WRITEPROTECT + async dirty-page copier)
  • Alternatives considered (pre-copy à la live migration, full memcpy, block-device CoW)
  • Open questions (THP interaction, KVM-direct memory paths, snapshot format compat)
  • Phased implementation plan (~8 weeks, PoC → integrate → bench → harden → launch)
  • Risks (kernel < 5.7, write-fault storms, consistency proof, restore regression)

Why a draft PR

This is meant to be visible while the implementation lands, not after. Comments / corrections / prior-art pointers especially welcome on the open questions — particularly behavior of UFFD_WP on memfd-backed VMAs under KVM_RUN.

Test plan

The doc itself doesn't need testing; the implementation will. Phase 3 (Week 5 in the plan) reuses the v0.3.4 multi-BRANCH sweep harness in bench/pause-window/ to measure the new pause distribution.

🤖 Generated with Claude Code

Sketches the implementation plan for cutting BRANCH pause from ~150ms
(v0.3.4 floor on ext4) to < 10ms by removing the synchronous memory
write entirely.

Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT
over the full guest range before BRANCH, copy dirty pages async via a
uffd handler. The pause window then contains only vCPU + device state
dump (microseconds) and the WP-arming syscall (sub-millisecond).

Doc covers: motivation, goal/non-goals, alternatives considered (status
quo, pre-copy migration, full memcpy, block-device CoW), open questions
(THP interaction, KVM-direct memory access paths, format compatibility),
implementation phases, risks.

Tracking issue: #101.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang
Contributor Author

Phase 1 PoC results in #157 — kernel mechanics empirically work:

| Region | WP arm latency |
|---|---|
| 64 MiB | 200 µs |
| 256 MiB | 815 µs |
| 1024 MiB | 3.19 ms |

Arm latency is linear at ~3 ms/GiB (3.19 ms measured at 1 GiB). The snapshot consistency invariant holds (0 violations across all sizes). A 1 GiB parent fits comfortably under the < 10 ms target; at this slope the arm alone reaches 10 ms at roughly 3 GiB, so for >3 GiB parents the < 10 ms claim in the design will need to be qualified by size.
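
As a rough check on the size qualification, the three measurements can be fitted with a line through the origin. This is a back-of-the-envelope sketch, assuming the linear trend continues past 1 GiB (not measured here); the function names are made up for illustration.

```rust
/// (region size in MiB, measured WP arm latency in µs) from the table above.
const MEASURED: [(f64, f64); 3] = [(64.0, 200.0), (256.0, 815.0), (1024.0, 3190.0)];

/// Least-squares slope of a line through the origin, in µs per MiB.
fn slope_us_per_mib(points: &[(f64, f64)]) -> f64 {
    let num: f64 = points.iter().map(|(x, y)| x * y).sum();
    let den: f64 = points.iter().map(|(x, _)| x * x).sum();
    num / den
}

/// Largest region (in MiB) whose projected arm latency stays within the budget.
fn max_region_mib(budget_us: f64, slope: f64) -> f64 {
    budget_us / slope
}

fn main() {
    let slope = slope_us_per_mib(&MEASURED);
    // 1 GiB = 1024 MiB and 1 ms = 1000 µs.
    println!("fitted slope: {:.2} ms/GiB", slope * 1024.0 / 1000.0);
    println!(
        "10 ms budget covers about {:.2} GiB",
        max_region_mib(10_000.0, slope) / 1024.0
    );
}
```

The fit lands close to 3.2 ms/GiB, so the arm step alone uses up the 10 ms budget a little past 3 GiB, before any vCPU or device state dump is counted.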

@WaylandYang
Contributor Author

Phase 2 PoC empirical result — open question #1 ✓ answered POSITIVELY

experiments/v0.4-kvm-uffd-wp-poc/ runs a real-mode KVM guest that does mov [0x1000], al with UFFDIO_WRITEPROTECT armed on the memfd backing the memslot. Result:

[uffd] armed UFFDIO_WRITEPROTECT in 9.283µs
[kvm] running vcpu...
[handler] caught fault at GPA 0x1000 (flags=0x3, write=true)
[kvm] guest halted normally in 211.003µs (1 exits)

Live memfd[0x1000]:  0x42 (AFTER, guest write committed)
Snapshot[0x1000]:    0xbe (BEFORE, captured pre-write)

flags=0x3 = UFFD_PAGEFAULT_FLAG_WRITE | UFFD_PAGEFAULT_FLAG_WP — kernel correctly tags the event as a write-protect fault originating from a guest write.
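
For readers who want to decode the flags field themselves, here is a small sketch. The bit values mirror the `UFFD_PAGEFAULT_FLAG_*` definitions in `linux/userfaultfd.h`; the helper names `decode` and `is_wp_write_fault` are hypothetical and not part of forkd.

```rust
// Bit values as defined in linux/userfaultfd.h.
const UFFD_PAGEFAULT_FLAG_WRITE: u64 = 1 << 0; // fault was caused by a write
const UFFD_PAGEFAULT_FLAG_WP: u64 = 1 << 1; // fault hit a write-protected page
const UFFD_PAGEFAULT_FLAG_MINOR: u64 = 1 << 2; // minor fault (page cache present)

/// Hypothetical helper: list the names of the flag bits that are set.
fn decode(flags: u64) -> Vec<&'static str> {
    let table = [
        (UFFD_PAGEFAULT_FLAG_WRITE, "WRITE"),
        (UFFD_PAGEFAULT_FLAG_WP, "WP"),
        (UFFD_PAGEFAULT_FLAG_MINOR, "MINOR"),
    ];
    let mut names = Vec::new();
    for (bit, name) in table {
        if (flags & bit) != 0 {
            names.push(name);
        }
    }
    names
}

/// Hypothetical helper: true when the event is a write to a WP-armed page,
/// which is the only kind of event a snapshot handler needs to copy for.
fn is_wp_write_fault(flags: u64) -> bool {
    (flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0 && (flags & UFFD_PAGEFAULT_FLAG_WP) != 0
}

fn main() {
    let flags = 0x3; // value reported by the PoC handler above
    println!("{:?} wp_write={}", decode(flags), is_wp_write_fault(flags));
    // prints: ["WRITE", "WP"] wp_write=true
}
```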

So: MMU notifiers → EPT invalidation → uffd_wp fault delivery chain works as the design assumes. The biggest "will the kernel even let us do this" risk is now empirically retired.

Full code + results in #157.

@WaylandYang
Contributor Author

Phase 3 PoC — open question #2 (THP) closed.

Three backings tested on 64 MiB regions (full data in v0.4-thp-uffd-wp-poc/RESULTS.md):

| Backing | WP arm | THPs | Fault granularity |
|---|---|---|---|
| memfd + MADV_HUGEPAGE | 868 µs | 0 (stock shmem_enabled=never) | 4 KiB |
| memfd + MADV_NOHUGEPAGE | 202 µs (baseline) | 0 | 4 KiB |
| anon + MADV_HUGEPAGE | 419 µs | 17 × 2 MiB | 4 KiB |

The kernel preserves 4 KiB fault granularity in all three cases. With real hugepages (anon + MADV_HUGEPAGE), WP arm costs ~2× the baseline. The trap is the marker-without-hugepages case (memfd + MADV_HUGEPAGE on stock systems): 4.3× slower than the baseline with zero benefit.

Design implication: forkd should use memfd + no MADV_HUGEPAGE for source-VM RAM. Cheapest arm, predictable, matches Firecracker's existing memfd path.
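
One way a host could detect the trap configuration up front is to read the shmem THP policy from sysfs, where the active mode is the bracketed token. The sketch below assumes the usual `/sys/kernel/mm/transparent_hugepage/shmem_enabled` layout; `active_mode` and `shmem_thp_disabled` are hypothetical helpers, not forkd functions.

```rust
use std::fs;

/// Extract the bracketed (active) token from a sysfs policy line such as
/// "always within_size advise [never] deny force".
fn active_mode(sysfs_line: &str) -> Option<&str> {
    sysfs_line
        .split_whitespace()
        .find_map(|tok| tok.strip_prefix('[')?.strip_suffix(']'))
}

/// In these modes shmem/memfd mappings get no hugepages, so per the table
/// above MADV_HUGEPAGE on a memfd would only add arm cost.
fn shmem_thp_disabled(mode: &str) -> bool {
    matches!(mode, "never" | "deny")
}

fn main() {
    // Assumed path; absent on kernels built without THP support.
    let path = "/sys/kernel/mm/transparent_hugepage/shmem_enabled";
    match fs::read_to_string(path) {
        Ok(line) => match active_mode(&line) {
            Some(mode) => println!(
                "shmem_enabled={} (MADV_HUGEPAGE on memfd is {})",
                mode,
                if shmem_thp_disabled(mode) { "pure overhead" } else { "honored" }
            ),
            None => println!("could not parse {:?}", line),
        },
        Err(e) => println!("cannot read {}: {}", path, e),
    }
}
```

Since the recommendation here is to skip MADV_HUGEPAGE entirely, a check like this would be useful mainly as a diagnostic when comparing numbers across hosts.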

Phase 4 next: KVM_GET_DIRTY_LOG × UFFD_WP coordination (open question #3) and sustained-write storm throughput.

WaylandYang added a commit that referenced this pull request May 24, 2026
Promotes the PoC machinery from experiments/v0.4-*-poc into the
production forkd-uffd crate as two new modules:

  - raw.rs (pub(crate)): libc-level wrappers for the userfaultfd
    ioctls we use, since the userfaultfd 0.8 crate doesn't yet
    expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP.

  - wp_snapshot.rs (pub): WpBranch struct owning one in-flight v0.4
    BRANCH operation. WpBranch::begin arms WP and spawns a handler;
    bulk_copy_clean does the still-clean pass; finalize stops the
    handler and returns WpBranchStats.

Disjoint from the existing handshake module (v0.3 restore side).
Linux-only behind cfg gates. Includes one smoke test that arms +
finalizes a 16 KiB anon region with no writes; expected to skip on
CI sandboxes that don't allow userfaultfd to unprivileged users.

Tracking #101, PR #156, PR #157.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
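
For context on what libc-level wrappers of this kind have to define, here is a minimal sketch of the write-protect request layout and ioctl number, derived from `linux/userfaultfd.h`. It is not the contents of raw.rs: the Rust names are illustrative, and the `iowr` helper assumes the generic Linux ioctl encoding used on x86_64 and aarch64 (some other architectures encode the direction bits differently).

```rust
use std::mem::size_of;

/// Mirrors `struct uffdio_range` from linux/userfaultfd.h.
#[repr(C)]
#[allow(dead_code)]
struct UffdioRange {
    start: u64,
    len: u64,
}

/// Mirrors `struct uffdio_writeprotect` from linux/userfaultfd.h.
#[repr(C)]
#[allow(dead_code)]
struct UffdioWriteprotect {
    range: UffdioRange,
    mode: u64,
}

const UFFDIO: u64 = 0xAA; // ioctl "type" byte for userfaultfd
const NR_WRITEPROTECT: u64 = 0x06; // _UFFDIO_WRITEPROTECT
const UFFDIO_WRITEPROTECT_MODE_WP: u64 = 1 << 0; // set (rather than clear) protection
const UFFDIO_REGISTER_MODE_WP: u64 = 1 << 1; // register a range for WP tracking

/// _IOWR(ty, nr, size) under the generic encoding (assumption: x86_64/aarch64).
const fn iowr(ty: u64, nr: u64, size: usize) -> u64 {
    (3 << 30) | ((size as u64) << 16) | (ty << 8) | nr
}

const UFFDIO_WRITEPROTECT: u64 = iowr(UFFDIO, NR_WRITEPROTECT, size_of::<UffdioWriteprotect>());

fn main() {
    // The request a wrapper would pass to ioctl(2) to arm a 64 MiB range
    // starting at a hypothetical mapping address.
    let req = UffdioWriteprotect {
        range: UffdioRange { start: 0x7f00_0000_0000, len: 64 << 20 },
        mode: UFFDIO_WRITEPROTECT_MODE_WP,
    };
    println!("UFFDIO_WRITEPROTECT = {:#x}", UFFDIO_WRITEPROTECT);
    println!("request size = {} bytes, mode = {:#x}", size_of::<UffdioWriteprotect>(), req.mode);
    println!("UFFDIO_REGISTER_MODE_WP = {:#x}", UFFDIO_REGISTER_MODE_WP);
}
```

Deriving the number from the struct size keeps the constant and the layout from drifting apart, which is one reason to compute it rather than hard-code it.
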
WaylandYang marked this pull request as ready for review May 24, 2026 20:13
WaylandYang merged commit f48d395 into main May 24, 2026 (2 checks passed)
WaylandYang deleted the design/v0.4-live-fork branch May 24, 2026 20:14
WaylandYang added a commit that referenced this pull request May 24, 2026
* RFC: v0.4 live-fork via userfaultfd write-protect

Sketches the implementation plan for cutting BRANCH pause from ~150ms
(v0.3.4 floor on ext4) to < 10ms by removing the synchronous memory
write entirely.

Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT
over the full guest range before BRANCH, copy dirty pages async via a
uffd handler. The pause window then contains only vCPU + device state
dump (microseconds) and the WP-arming syscall (sub-millisecond).

Doc covers: motivation, goal/non-goals, alternatives considered (status
quo, pre-copy migration, full memcpy, block-device CoW), open questions
(THP interaction, KVM-direct memory access paths, format compatibility),
implementation phases, risks.

Tracking issue: #101.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): Phase 1 PoC — UFFDIO_WRITEPROTECT on memfd

Standalone binary that exercises the kernel mechanics v0.4 depends on,
outside the KVM/Firecracker context. Allocates a 64 MiB memfd, arms
UFFDIO_WRITEPROTECT, runs a writer thread + uffd handler thread, then
validates that the resulting snapshot file is a consistent
point-in-time view (every page starts with its BEFORE label).

Goal: prove the kernel side of the v0.4 design is feasible. If this
PoC passes (and prints sub-ms WP arm latency), the design proposed in
DESIGN-v0.4.md can move to Phase 2 (integrate into forkd-uffd).

Tracking #101, related to PR #156.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): rewrite PoC against raw libc (userfaultfd 0.8.1 lacks WP wrappers)

* experiment(v0.4): make region size configurable via REGION_MIB env var

* experiment(v0.4): empirical results — WP arm linear at ~3 ms/GiB, 0 violations

* experiment(v0.4): Phase 2 PoC — UFFD_WP × KVM guest writes

Minimal raw-KVM Rust binary that answers open question #1 in
DESIGN-v0.4.md: does UFFD_WP armed on a memfd-backed host VMA catch
writes when the guest accesses memory through EPT (not the host MMU)?

Setup: 1 MiB memfd, pre-write BEFORE marker at GPA 0x1000, tiny
real-mode guest that does `mov [0x1000], al` with AL=0x42, arm
UFFDIO_WRITEPROTECT before vcpu.run(). Validate:
- handler caught a write fault at GPA 0x1000
- snapshot byte at 0x1000 is the BEFORE marker (0xBE)
- live memfd byte at 0x1000 is the AFTER marker (0x42)

If all three hold, EPT-mediated guest writes propagate through MMU
notifiers to UFFD_WP on the host VMA, and v0.4's snapshot mechanism
is sound under KVM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): pin kvm-bindings to 0.11 (matches kvm-ioctls 0.21)

* experiment(v0.4): vcpu needs mut binding for .run()

* experiment(v0.4): Phase 2 results — UFFD_WP catches KVM guest writes ✓

Empirical answer to DESIGN-v0.4.md open question #1: yes. EPT-mediated
guest writes propagate through MMU notifiers to UFFD_WP on the host
VMA. Handler captures pre-write content (0xBE), guest write lands on
live memory (0x42), and the ordering invariant holds.

Measured (1 MiB region, single guest write):
- WP arm latency: 9.3 µs
- Total vcpu runtime including 1 trap-and-resume: 211 µs
- Fault flags: 0x3 = WRITE|WP (kernel reports both)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: cargo fmt PoC sources (fixes CI)

* experiment(v0.4): Phase 3 PoC — UFFD_WP × transparent hugepages

Answers DESIGN-v0.4.md open question #2. Compares WP arm latency and
first-fault behavior between MADV_HUGEPAGE and MADV_NOHUGEPAGE on the
same 64 MiB region. Reads /proc/self/smaps to verify AnonHugePages
allocation before/after WP arm and after first write fault.

If THP is split at arm time, the arm latency for the hugepage case
will be visibly higher than the no-hugepage case, and AnonHugePages
will drop after arm. If split happens at first-fault time, the arm
will be cheap but the first write will be expensive. If neither, the
kernel handles WP at hugepage granularity (faults report the 2 MiB
base, not the 4 KiB sub-page).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* experiment(v0.4): add Phase C — MAP_ANONYMOUS for real THP allocation

* experiment(v0.4): Phase 3 results — THP × UFFD_WP characterized

Three-phase comparison: memfd+HUGEPAGE, memfd+NOHUGEPAGE, anon+HUGEPAGE.

Key findings:
- memfd + MADV_HUGEPAGE is a trap on stock kernels (shmem_enabled=never):
  VM_HUGEPAGE marker raises WP arm cost ~4.3x (868 µs vs 202 µs baseline)
  but allocates zero hugepages.
- Real THPs (anonymous backing) cost ~2x WP arm; faults still report at
  4 KiB granularity. The kernel uses PMD-level WP marker + split-on-first-
  sub-page-fault rather than synchronous split at arm time.
- Populate cost for anon+HUGEPAGE is more than 10x the memfd baseline
  (234 ms vs 20 ms for 64 MiB) due to contiguous-region defrag.

For forkd: use memfd + no MADV_HUGEPAGE on source VM memory. Fastest
arm, most predictable, matches what Firecracker already does.

Answers DESIGN-v0.4.md open question #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(forkd-uffd): add WpBranch — snapshot-side write-protection (v0.4)

Promotes the PoC machinery from experiments/v0.4-*-poc into the
production forkd-uffd crate as two new modules:

  - raw.rs (pub(crate)): libc-level wrappers for the userfaultfd
    ioctls we use, since the userfaultfd 0.8 crate doesn't yet
    expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP.

  - wp_snapshot.rs (pub): WpBranch struct owning one in-flight v0.4
    BRANCH operation. WpBranch::begin arms WP and spawns a handler;
    bulk_copy_clean does the still-clean pass; finalize stops the
    handler and returns WpBranchStats.

Disjoint from the existing handshake module (v0.3 restore side).
Linux-only behind cfg gates. Includes one smoke test that arms +
finalizes a 16 KiB anon region with no writes; expected to skip on
CI sandboxes that don't allow userfaultfd to unprivileged users.

Tracking #101, PR #156, PR #157.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(forkd-uffd): use is_multiple_of for clippy 1.95

* style: cargo fmt forkd-uffd WpBranch sources

* fix: clippy lints for stricter CI rustc — doc list blank lines + c-str literal

* style: fmt collapse placeholder_fd let binding

* fix: mark raw.rs C-define snippet as text doctest (not Rust)

* feat(forkd-cli): add wp-bench subcommand — v0.4 WpBranch CLI surface

Creates a memfd of --region-mib MiB, arms UFFDIO_WRITEPROTECT, runs
the bulk-copy + handler pair from forkd_uffd::wp_snapshot::WpBranch,
and prints timing data:

  forkd wp-bench [--region-mib 64] [--snapshot /tmp/...]

Output shape mirrors `forkd bench`: per-step durations + verification
that the snapshot matches the pre-arm content (every byte 0x42).

Not the full BRANCH integration — that requires patching the
forkd-controller branch_sandbox path to skip FC's synchronous
memory.bin write, which is its own multi-day spike documented in
DESIGN-v0.4.md. This subcommand is the CLI surface for benchmarking
WpBranch on a given kernel/filesystem combo before that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: cargo fmt wp_bench

* style: cargo fmt wp_bench (CI rust 1.95)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>