v0.4 foundation: WpBranch primitive + 3 PoCs + wp-bench CLI #157

Merged
WaylandYang merged 22 commits into main from experiment/v0.4-uffd-wp-poc on May 24, 2026

@WaylandYang WaylandYang commented May 24, 2026

v0.4's "live-fork" foundation: the snapshot-side UFFD_WP primitive, validated by three kernel-level PoCs, promoted into the forkd-uffd crate as WpBranch, and exposed via the forkd wp-bench subcommand. Tracking #101, design RFC in #156.

What this PR ships

1. Production library: forkd_uffd::wp_snapshot::WpBranch

A struct that owns one in-flight v0.4 BRANCH:

let branch = unsafe {
    WpBranch::begin(memfd, region, region_size, snapshot_path)?
};
// arm_duration() is the BRANCH pause-window analog
let arm = branch.arm_duration();
// bulk-copy still-clean pages from main thread while handler thread
// captures dirty pages in the background
let bulk_copied = unsafe { branch.bulk_copy_clean() }?;
// stop handler, fsync, return stats
let stats = branch.finalize()?;

raw.rs (pub(crate)) wraps the userfaultfd ioctls since the userfaultfd 0.8 crate doesn't expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP. Disjoint from the existing handshake module (v0.3 restore-side).
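As a rough illustration of what such a wrapper has to define, the sketch below derives the `UFFDIO_WRITEPROTECT` request number and argument layout from the kernel's `<linux/userfaultfd.h>` definitions. It is not the contents of raw.rs: the Rust names, the `iowr` helper and `arm_request` are hypothetical, and the encoding shown is the asm-generic one used on x86_64 and aarch64.

```rust
use std::mem::size_of;

/// Mirrors `struct uffdio_range` from <linux/userfaultfd.h>.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UffdioRange {
    pub start: u64,
    pub len: u64,
}

/// Mirrors `struct uffdio_writeprotect` from <linux/userfaultfd.h>.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct UffdioWriteprotect {
    pub range: UffdioRange,
    pub mode: u64,
}

/// Register-mode bit asking the kernel to report write-protect faults.
pub const UFFDIO_REGISTER_MODE_WP: u64 = 1 << 1;
/// Mode bit for UFFDIO_WRITEPROTECT: set protects the range, clear unprotects it.
pub const UFFDIO_WRITEPROTECT_MODE_WP: u64 = 1 << 0;

const UFFDIO_TYPE: u64 = 0xAA; // ioctl type byte used by userfaultfd
const WRITEPROTECT_NR: u64 = 0x06; // _UFFDIO_WRITEPROTECT

/// `_IOWR(ty, nr, size)` in the asm-generic ioctl encoding.
/// Hypothetical helper name; other architectures may encode differently.
pub const fn iowr(ty: u64, nr: u64, size: usize) -> u64 {
    (3u64 << 30) | ((size as u64) << 16) | (ty << 8) | nr
}

pub const UFFDIO_WRITEPROTECT: u64 =
    iowr(UFFDIO_TYPE, WRITEPROTECT_NR, size_of::<UffdioWriteprotect>());

/// Hypothetical helper: the ioctl argument that arms write protection
/// over `[start, start + len)`.
pub fn arm_request(start: u64, len: u64) -> UffdioWriteprotect {
    UffdioWriteprotect {
        range: UffdioRange { start, len },
        mode: UFFDIO_WRITEPROTECT_MODE_WP,
    }
}

fn main() {
    let req = arm_request(0x7f00_0000_0000, 64 << 20);
    println!("UFFDIO_WRITEPROTECT = {:#010x}", UFFDIO_WRITEPROTECT);
    println!("request = {:?}", req);
}
```

A real wrapper would pass a pointer to the request struct to `ioctl(2)` on the userfaultfd file descriptor; that call is left out here so the sketch stays free of external crates.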

2. CLI surface: forkd wp-bench

$ sudo forkd wp-bench --region-mib 1024
  region: 1024 MiB (262144 pages of 4096 bytes)
  populated in 379ms
  arm UFFDIO_WRITEPROTECT: 3.18ms       ← v0.4 pause-window analog
  bulk_copy_clean: 262144 pages in 424ms
  finalize: 6.58s                        ← fsync 1 GiB on ext4
  total: 7.02s
  ✓ snapshot consistent (1073741824 bytes all 0x42)
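The figures in this output can be cross-checked with a few lines of arithmetic. The sketch below assumes 4096-byte pages, as printed above; the function names are illustrative and not part of the CLI.

```rust
const PAGE_SIZE: u64 = 4096;
const MIB: u64 = 1024 * 1024;

/// Size of the benchmark region in bytes.
fn region_bytes(region_mib: u64) -> u64 {
    region_mib * MIB
}

/// Number of 4 KiB pages in the region.
fn region_pages(region_mib: u64) -> u64 {
    region_bytes(region_mib) / PAGE_SIZE
}

/// Average arm cost per page, in nanoseconds.
fn arm_ns_per_page(arm_micros: f64, pages: u64) -> f64 {
    arm_micros * 1000.0 / pages as f64
}

/// Bulk-copy throughput in MiB/s.
fn copy_mib_per_sec(region_mib: u64, copy_millis: f64) -> f64 {
    region_mib as f64 / (copy_millis / 1000.0)
}

fn main() {
    let pages = region_pages(1024);
    println!("bytes: {}", region_bytes(1024));
    println!("pages: {}", pages);
    println!("arm: {:.2} ns/page", arm_ns_per_page(3180.0, pages));
    println!("bulk copy: {:.0} MiB/s", copy_mib_per_sec(1024, 424.0));
}
```

With the numbers from the run above, the 3.18 ms arm works out to roughly 12 ns per page, and the bulk copy to roughly 2.4 GiB/s.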

3. Three kernel-level PoCs (verifying design assumptions)

| Experiment | What it validates | Result |
| --- | --- | --- |
| experiments/v0.4-uffd-wp-poc/ | UFFD_WP on memfd, host writes | 3 ms/GiB linear, 0 consistency violations |
| experiments/v0.4-kvm-uffd-wp-poc/ | UFFD_WP catches KVM guest writes through EPT (open question #1) | ✓ flags=0x3, pre-write content captured |
| experiments/v0.4-thp-uffd-wp-poc/ | UFFD_WP × transparent hugepages (open question #2) | 4 KiB fault granularity preserved; memfd + NO_HUGEPAGE is cheapest |

Each has its own RESULTS.md with empirical numbers.
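The consistency property the PoCs check (the snapshot holds every page's pre-arm content, whichever path copied it) can be illustrated with a single-threaded toy model. This is a sketch only: `SimBranch` and its methods are hypothetical and do not mirror the `WpBranch` API, and the real mechanism relies on the kernel blocking the writer until the handler thread has captured the page.

```rust
const PAGE: usize = 4096;

/// Toy model of one BRANCH. Hypothetical type, not forkd_uffd's WpBranch.
struct SimBranch {
    live: Vec<u8>,        // stands in for the memfd-backed guest memory
    snapshot: Vec<u8>,    // stands in for the snapshot file
    protected: Vec<bool>, // per-page write-protect bit
    fault_copied: usize,  // pages captured by the fault path
}

impl SimBranch {
    /// "Arm": mark every page write-protected.
    fn begin(live: Vec<u8>) -> Self {
        assert_eq!(live.len() % PAGE, 0, "region must be page-aligned");
        let pages = live.len() / PAGE;
        SimBranch {
            snapshot: vec![0; live.len()],
            protected: vec![true; pages],
            fault_copied: 0,
            live,
        }
    }

    /// Copy one page into the snapshot, then drop its protection.
    fn capture(&mut self, page: usize) {
        let range = page * PAGE..(page + 1) * PAGE;
        self.snapshot[range.clone()].copy_from_slice(&self.live[range]);
        self.protected[page] = false;
    }

    /// A write to a protected page "faults": the pre-write content is
    /// captured before the write is allowed to land.
    fn guest_write(&mut self, offset: usize, value: u8) {
        let page = offset / PAGE;
        if self.protected[page] {
            self.capture(page);
            self.fault_copied += 1;
        }
        self.live[offset] = value;
    }

    /// Copy every page that is still protected, i.e. still clean.
    fn bulk_copy_clean(&mut self) -> usize {
        let mut copied = 0;
        for page in 0..self.protected.len() {
            if self.protected[page] {
                self.capture(page);
                copied += 1;
            }
        }
        copied
    }
}

fn main() {
    let mut b = SimBranch::begin(vec![0x42; 8 * PAGE]);
    b.guest_write(2 * PAGE + 10, 0xFF);
    b.guest_write(2 * PAGE + 11, 0xFF); // same page: no second fault
    b.guest_write(5 * PAGE, 0xFF);
    let bulk = b.bulk_copy_clean();
    assert!(b.snapshot.iter().all(|&x| x == 0x42));
    println!("fault-copied {} pages, bulk-copied {} pages", b.fault_copied, bulk);
}
```

Every page is copied exactly once, by whichever path reaches it first, so the snapshot stays all 0x42 even though the live region has since been modified.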

v0.4 vs v0.3.4 comparison (extrapolated from PoC data)

| Parent VM size | v0.3.4 pause | v0.4 pause | Speedup |
| --- | --- | --- | --- |
| 1 GiB | ~150 ms | 3 ms | 50× |
| 4 GiB | ~150 ms | 13 ms | 12× |
| 8 GiB | ~150 ms | 26 ms | 6× |

v0.3.4 pause includes ext4 metadata + memory.bin write. v0.4 pause is UFFDIO_WRITEPROTECT only; memory.bin writes happen async outside the critical section.
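One way to reproduce these extrapolated figures is to scale a measured arm latency linearly with region size. The sketch below assumes the 203 µs arm time measured for a 64 MiB region as the baseline and a flat 150 ms v0.3.4 pause; this is a guess at how the table was derived, not something the PoC RESULTS.md files are confirmed to state.

```rust
/// Assumed v0.3.4 pause floor, in milliseconds.
const V034_PAUSE_MS: f64 = 150.0;

/// Per-GiB arm cost in ms, derived from one measurement.
fn arm_ms_per_gib(measured_arm_us: f64, region_mib: f64) -> f64 {
    (measured_arm_us / 1000.0) * (1024.0 / region_mib)
}

/// Linear extrapolation of the v0.4 pause to a larger VM.
fn extrapolated_pause_ms(ms_per_gib: f64, vm_gib: f64) -> f64 {
    ms_per_gib * vm_gib
}

/// Speedup of a v0.4 pause over the assumed v0.3.4 pause.
fn speedup(v04_pause_ms: f64) -> f64 {
    V034_PAUSE_MS / v04_pause_ms
}

fn main() {
    let rate = arm_ms_per_gib(203.0, 64.0);
    for gib in [1.0, 4.0, 8.0] {
        let pause = extrapolated_pause_ms(rate, gib).round();
        println!("{} GiB: ~{} ms pause, ~{:.1}x speedup", gib, pause, speedup(pause));
    }
}
```

Under these assumptions the rate is about 3.25 ms/GiB, which rounds to 3, 13 and 26 ms for 1, 4 and 8 GiB, and gives a speedup of about 5.8× for the 8 GiB case.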

What's NOT in this PR

  • The full forkd-controller::branch_sandbox integration. Firecracker's /snapshot/create always writes memory.bin synchronously inside the pause. Replacing it requires one of: an FC patch (vmstate-only mode); bypassing FC's snapshot API entirely with raw KVM_GET_REGS; or accepting the existing pause for vmstate alongside async memory capture. This is a multi-day spike, tracked in DESIGN-v0.4.md and planned for the next session.
  • The --live-fork flag on forkd snapshot and the controller. Depends on the integration above.
  • Kernel < 5.7 fallback (graceful degradation to the v0.3.4 path).
  • Multi-vcpu race coverage (the PoCs use a single vcpu).

Test plan

  • cargo test --release -p forkd-uffd wp_snapshot — passes (sudo required when the vm.unprivileged_userfaultfd sysctl is 0)
  • forkd wp-bench --region-mib 64 — 203 µs arm, consistent snapshot
  • forkd wp-bench --region-mib 1024 — 3.18 ms arm, consistent snapshot
  • CI green: rust (fmt + clippy + build + test) + bench-python

Crates touched

| Crate | What changed |
| --- | --- |
| crates/forkd-uffd | new modules raw, wp_snapshot (~625 lines, Linux-only behind cfg) |
| crates/forkd-cli | new subcommand wp-bench + dep on forkd-uffd (Linux-only) |
| Cargo.toml (workspace) | three new experiments/v0.4-*-poc members |

🤖 Generated with Claude Code

WaylandYang and others added 5 commits May 24, 2026 22:27
Sketches the implementation plan for cutting BRANCH pause from ~150ms
(v0.3.4 floor on ext4) to < 10ms by removing the synchronous memory
write entirely.

Approach: switch source RAM to memfd_create, arm UFFDIO_WRITEPROTECT
over the full guest range before BRANCH, copy dirty pages async via a
uffd handler. The pause window then contains only vCPU + device state
dump (microseconds) and the WP-arming syscall (sub-millisecond).

Doc covers: motivation, goal/non-goals, alternatives considered (status
quo, pre-copy migration, full memcpy, block-device CoW), open questions
(THP interaction, KVM-direct memory access paths, format compatibility),
implementation phases, risks.

Tracking issue: #101.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Standalone binary that exercises the kernel mechanics v0.4 depends on,
outside the KVM/Firecracker context. Allocates a 64 MiB memfd, arms
UFFDIO_WRITEPROTECT, runs a writer thread + uffd handler thread, then
validates that the resulting snapshot file is a consistent
point-in-time view (every page starts with its BEFORE label).

Goal: prove the kernel side of the v0.4 design is feasible. If this
PoC passes (and prints sub-ms WP arm latency), the design proposed in
DESIGN-v0.4.md can move to Phase 2 (integrate into forkd-uffd).

Tracking #101, related to PR #156.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WaylandYang and others added 4 commits May 25, 2026 00:20
Minimal raw-KVM Rust binary that answers open question #1 in
DESIGN-v0.4.md: does UFFD_WP armed on a memfd-backed host VMA catch
writes when the guest accesses memory through EPT (not the host MMU)?

Setup: 1 MiB memfd, pre-write BEFORE marker at GPA 0x1000, tiny
real-mode guest that does `mov [0x1000], al` with AL=0x42, arm
UFFDIO_WRITEPROTECT before vcpu.run(). Validate:
- handler caught a write fault at GPA 0x1000
- snapshot byte at 0x1000 is the BEFORE marker (0xBE)
- live memfd byte at 0x1000 is the AFTER marker (0x42)

If all three hold, EPT-mediated guest writes propagate through MMU
notifiers to UFFD_WP on the host VMA, and v0.4's snapshot mechanism
is sound under KVM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Empirical answer to DESIGN-v0.4.md open question #1: yes. EPT-mediated
guest writes propagate through MMU notifiers to UFFD_WP on the host
VMA. Handler captures pre-write content (0xBE), guest write lands on
live memory (0x42), and the ordering invariant holds.

Measured (1 MiB region, single guest write):
- WP arm latency: 9.3 µs
- Total vcpu runtime including 1 trap-and-resume: 211 µs
- Fault flags: 0x3 = WRITE|WP (kernel reports both)
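The 0x3 value can be decoded against the page-fault flag bits defined in `<linux/userfaultfd.h>`. The sketch below uses the kernel's bit values; the Rust function names are illustrative and are not taken from the PoC.

```rust
// Bit values from <linux/userfaultfd.h>.
const UFFD_PAGEFAULT_FLAG_WRITE: u64 = 1 << 0;
const UFFD_PAGEFAULT_FLAG_WP: u64 = 1 << 1;
const UFFD_PAGEFAULT_FLAG_MINOR: u64 = 1 << 2;

/// Names of the flag bits set in a uffd page-fault message.
fn decode_fault_flags(flags: u64) -> Vec<&'static str> {
    let known = [
        (UFFD_PAGEFAULT_FLAG_WRITE, "WRITE"),
        (UFFD_PAGEFAULT_FLAG_WP, "WP"),
        (UFFD_PAGEFAULT_FLAG_MINOR, "MINOR"),
    ];
    known
        .iter()
        .filter(|(bit, _)| (flags & *bit) != 0)
        .map(|(_, name)| *name)
        .collect()
}

/// True when the fault is a write that hit a write-protected page.
fn is_wp_write_fault(flags: u64) -> bool {
    let both = UFFD_PAGEFAULT_FLAG_WRITE | UFFD_PAGEFAULT_FLAG_WP;
    (flags & both) == both
}

fn main() {
    let flags = 0x3;
    println!("{:#x} = {}", flags, decode_fault_flags(flags).join("|"));
    println!("write-protect fault: {}", is_wp_write_fault(flags));
}
```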

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang changed the title from "experiment: v0.4 Phase 1 PoC — UFFDIO_WRITEPROTECT on memfd (passes)" to "experiment: v0.4 PoCs Phase 1 + 2 — UFFD_WP works on memfd AND under KVM" on May 24, 2026
WaylandYang and others added 4 commits May 25, 2026 00:25
Answers DESIGN-v0.4.md open question #2. Compares WP arm latency and
first-fault behavior between MADV_HUGEPAGE and MADV_NOHUGEPAGE on the
same 64 MiB region. Reads /proc/self/smaps to verify AnonHugePages
allocation before/after WP arm and after first write fault.
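A minimal sketch of the smaps check, assuming the usual `AnonHugePages:   <n> kB` line format. It sums the field across every mapping in the process, whereas the PoC may well inspect only the mapping for its test region; the function name is illustrative.

```rust
use std::fs;

/// Sum of all `AnonHugePages:` fields in an smaps dump, in kB.
fn anon_hugepages_kib(smaps: &str) -> u64 {
    smaps
        .lines()
        .filter_map(|line| line.strip_prefix("AnonHugePages:"))
        .filter_map(|rest| rest.trim().strip_suffix("kB"))
        .filter_map(|num| num.trim().parse::<u64>().ok())
        .sum()
}

fn main() {
    // /proc/self/smaps only exists on Linux; report the error elsewhere.
    match fs::read_to_string("/proc/self/smaps") {
        Ok(text) => println!("AnonHugePages total: {} kB", anon_hugepages_kib(&text)),
        Err(e) => eprintln!("could not read /proc/self/smaps: {}", e),
    }
}
```

Calling this before arming, after arming, and after the first write fault would show whether the hugepage count drops at any of those points.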

If THP is split at arm time, the arm latency for the hugepage case
will be visibly higher than the no-hugepage case, and AnonHugePages
will drop after arm. If split happens at first-fault time, the arm
will be cheap but the first write will be expensive. If neither, the
kernel handles WP at hugepage granularity (faults report the 2 MiB
base, not the 4 KiB sub-page).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three-phase comparison: memfd+HUGEPAGE, memfd+NOHUGEPAGE, anon+HUGEPAGE.

Key findings:
- memfd + MADV_HUGEPAGE is a trap on stock kernels (shmem_enabled=never):
  VM_HUGEPAGE marker triples WP arm cost (868 µs vs 202 µs baseline) but
  allocates zero hugepages.
- Real THPs (anonymous backing) cost ~2x WP arm; faults still report at
  4 KiB granularity. The kernel uses PMD-level WP marker + split-on-first-
  sub-page-fault rather than synchronous split at arm time.
- Populate cost for anon+HUGEPAGE is 10x the memfd baseline (234 ms vs
  20 ms for 64 MiB) due to contiguous-region defrag.

For forkd: use memfd + no MADV_HUGEPAGE on source VM memory. Fastest
arm, most predictable, matches what Firecracker already does.

Answers DESIGN-v0.4.md open question #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang marked this pull request as ready for review May 24, 2026 18:36
@WaylandYang WaylandYang changed the title from "experiment: v0.4 PoCs Phase 1 + 2 — UFFD_WP works on memfd AND under KVM" to "experiment: v0.4 PoCs Phase 1+2+3 — UFFD_WP on memfd, under KVM, under THP" on May 24, 2026
WaylandYang and others added 9 commits May 25, 2026 02:40
Promotes the PoC machinery from experiments/v0.4-*-poc into the
production forkd-uffd crate as two new modules:

  - raw.rs (pub(crate)): libc-level wrappers for the userfaultfd
    ioctls we use, since the userfaultfd 0.8 crate doesn't yet
    expose UFFDIO_WRITEPROTECT or UFFDIO_REGISTER_MODE_WP.

  - wp_snapshot.rs (pub): WpBranch struct owning one in-flight v0.4
    BRANCH operation. WpBranch::begin arms WP and spawns a handler;
    bulk_copy_clean does the still-clean pass; finalize stops the
    handler and returns WpBranchStats.

Disjoint from the existing handshake module (v0.3 restore side).
Linux-only behind cfg gates. Includes one smoke test that arms +
finalizes a 16 KiB anon region with no writes; expected to skip on
CI sandboxes that don't allow userfaultfd to unprivileged users.

Tracking #101, PR #156, PR #157.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Creates a memfd of --region-mib MiB, arms UFFDIO_WRITEPROTECT, runs
the bulk-copy + handler pair from forkd_uffd::wp_snapshot::WpBranch,
and prints timing data:

  forkd wp-bench [--region-mib 64] [--snapshot /tmp/...]

Output shape mirrors `forkd bench`: per-step durations + verification
that the snapshot matches the pre-arm content (every byte 0x42).

Not the full BRANCH integration — that requires patching the
forkd-controller branch_sandbox path to skip FC's synchronous
memory.bin write, which is its own multi-day spike documented in
DESIGN-v0.4.md. This subcommand is the CLI surface for benchmarking
WpBranch on a given kernel/filesystem combo before that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WaylandYang WaylandYang changed the title from "experiment: v0.4 PoCs Phase 1+2+3 — UFFD_WP on memfd, under KVM, under THP" to "v0.4 foundation: WpBranch primitive + 3 PoCs + wp-bench CLI" on May 24, 2026
@WaylandYang WaylandYang merged commit 3e39301 into main May 24, 2026
2 checks passed
@WaylandYang WaylandYang deleted the experiment/v0.4-uffd-wp-poc branch May 24, 2026 20:14