Muislam/pmem memdev region by russell-islam · Pull Request #137 · russell-islam/cloud-hypervisor

russell-islam · 2026-05-18T22:05:57Z

No description provided.

QEMU anchors its machine `device_memory` region (where pc-dimm, nvdimm, virtio-pmem and virtio-mem regions live) at a 1 GiB-aligned GPA immediately above the top of guest RAM. The alignment is applied by `pc_get_device_memory_range()` in `hw/i386/pc.c` for x86 and by the equivalent helpers on other targets.

Cloud Hypervisor does not yet have a single dedicated memory-device region; the upcoming refactor will introduce one. Define a shared `DEVICE_MEMORY_ALIGN = 1 GiB` constant in the arch crate so the rest of the series can reference it without adding more arch knobs.

No behaviour change: the constant is unused at this point.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
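As a rough illustration of what the 1 GiB alignment means in practice, the sketch below rounds an address up to the next 1 GiB boundary. Only the constant name and value come from the commit message; the `align_up` helper is hypothetical and is not claimed to match the code in the arch crate.

```rust
/// 1 GiB alignment for the device-memory region (value taken from the commit message).
pub const DEVICE_MEMORY_ALIGN: u64 = 1 << 30;

/// Hypothetical helper: round `addr` up to the next multiple of `align`.
/// `align` must be a power of two. Returns `None` if the result would overflow.
pub fn align_up(addr: u64, align: u64) -> Option<u64> {
    if !align.is_power_of_two() {
        return None;
    }
    addr.checked_add(align - 1).map(|a| a & !(align - 1))
}
```

With this helper, a top of RAM at 4.5 GiB would yield a region base of 5 GiB, and an address that is already 1 GiB-aligned is returned unchanged.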

`AddressAllocator::allocate(None, ...)` performs a top-down search and packs allocations at the high end of the managed range. That is the desired policy for PCI64 BAR placement but a poor fit for memory devices such as virtio-pmem, which work better when packed near the start of a dedicated window (mirroring the QEMU memory-device region behaviour).

Add `allocate_first_fit` alongside `allocate` for callers that want bottom-up placement on the same allocator. The new method delegates to a new private helper, `first_available_range_low`, which scans ranges in forward order and returns the first hole that satisfies size and alignment. When the caller passes `Some(base)`, behaviour is identical to `allocate(Some(base), ...)`, so the snapshot-restore path can be shared verbatim with the existing API.

No existing caller is changed by this commit; the new method is a pure addition. Subsequent commits wire it into a dedicated memory-device region in the VMM and place virtio-pmem there.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
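The bottom-up first-fit policy can be sketched with a toy allocator. This is a simplified model under stated assumptions: `ToyAllocator`, its fields, and its `free` method are hypothetical and do not reflect the internals of the real `AddressAllocator` or the signature of `first_available_range_low`. It only shows the idea of scanning holes in ascending order and taking the first one that fits.

```rust
/// Toy stand-in for an address allocator (hypothetical, for illustration only).
pub struct ToyAllocator {
    base: u64,
    end: u64,              // exclusive upper bound of the managed range
    used: Vec<(u64, u64)>, // allocated [start, end) ranges, sorted by start
}

impl ToyAllocator {
    pub fn new(base: u64, size: u64) -> Option<Self> {
        if size == 0 {
            return None;
        }
        let end = base.checked_add(size)?;
        Some(Self { base, end, used: Vec::new() })
    }

    /// Scan holes from low to high and return the first aligned fit.
    pub fn allocate_first_fit(&mut self, size: u64, align: u64) -> Option<u64> {
        if size == 0 || !align.is_power_of_two() {
            return None;
        }
        let mut cursor = self.base;
        let mut idx = 0;
        loop {
            // Align the candidate start inside the current hole.
            let start = cursor.checked_add(align - 1)? & !(align - 1);
            let stop = start.checked_add(size)?;
            // The hole ends at the next used range, or at the end of the window.
            let hole_end = self.used.get(idx).map_or(self.end, |r| r.0);
            if stop <= hole_end {
                self.used.insert(idx, (start, stop));
                return Some(start);
            }
            match self.used.get(idx) {
                Some(r) => {
                    cursor = r.1;
                    idx += 1;
                }
                None => return None, // no hole left
            }
        }
    }

    /// Release the allocation that begins at `start`, if any.
    pub fn free(&mut self, start: u64) {
        self.used.retain(|r| r.0 != start);
    }
}
```

Because the scan always starts at the bottom of the window, freeing a low allocation and allocating again reuses the low hole instead of growing upward, which is the packing behaviour the commit wants for memory devices.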

Add a `DeviceMemoryRegion` type inside `memory_manager` that bundles a base GPA, size, and a bottom-up `AddressAllocator` scoped to that range. It mirrors QEMU's machine `device_memory` region (`hw/i386/pc.c::pc_get_device_memory_range` + `hw/mem/memory-device.c`), which anchors a dedicated GPA window immediately above guest RAM and allocates virtio-pmem / virtio-mem / pc-dimm devices inside it bottom-up.

The new type is currently unused; subsequent commits in this series build it during `MemoryManager::new`, route virtio-mem zones and virtio-pmem allocations through it, and re-base the per-segment PCI64 mem64 allocator to start above it. A temporary `#[allow(dead_code)]` covers the helpers until those commits land in the same series.

Also add a dedicated `Error::CreateDeviceMemoryAllocator` variant for the construction failure path, parallel to `CreateSystemAllocator`.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Expose an optional `device_memory_size: Option<u64>` on `MemoryConfig`, plumbed through the `--memory device_memory_size=...` CLI option, the OpenAPI schema, and every struct-literal call site.

The knob mirrors QEMU's `maxmem - mem` indirectly: it gives operators explicit control over the size of the dedicated guest-physical "device memory" region introduced by the upcoming refactor (where virtio-pmem, virtio-mem zones, and future pc-dimm-equivalents live). When left unset (the default for unmodified callers and snapshots that pre-date this change), the region size will be derived from the existing `hotplug_size` plus pmem headroom in a later commit, preserving today's behaviour.

The field is wired but not yet consumed; the next commits build the region using this value and route memory-device allocations through it.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Add an `Option<DeviceMemoryRegion>` field to `MemoryManager` plus a `device_memory()` accessor. The field is initialised to `None` in `MemoryManager::new` for now; the next commit in this series will compute the base/size, build the region from the new `MemoryConfig::device_memory_size` knob, and store it here.

This scaffolding commit keeps the layout untouched: `MemoryManager` still hands out PCI64 BAR address space and virtio-mem zones exactly as before, so behaviour is identical for both fresh boots and snapshot restore. A temporary `#[allow(dead_code)]` covers the accessor until later commits start routing memory-device allocations through it.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Anchor a dedicated guest-physical "device memory" region 1 GiB-aligned above the top of guest RAM and place memory devices inside it, matching QEMU's `pc_get_device_memory_range()` / `hw/mem/memory-device.c::memory_device_get_free_addr()` layout where pc-dimm, nvdimm, virtio-pmem and virtio-mem all share a single bottom-up first-fit window.

Fresh-boot path:
- Compute the region base by rounding the top of RAM up to `arch::DEVICE_MEMORY_ALIGN` (1 GiB).
- Size from `MemoryConfig::device_memory_size` when set, otherwise sum the per-zone `hotplug_size`s plus declared pmem device sizes so the region is large enough for all memory devices without requiring the user to set `device_memory_size` manually.
- Validate the resulting top against the platform-MMIO ceiling and fail boot with a clear message when `phys_bits` is too low.
- Skip region creation entirely when the computed size is zero so plain VMs without any memory devices keep their current layout bit-for-bit.

Per-zone placement:
- virtio-mem zones now allocate from the region via `allocate_first_fit(None, hotplug_size, VIRTIO_MEM_ALIGN_SIZE)`, instead of bumping from `start_of_device_area`. Without a region the legacy bump still applies, preserving today's behaviour for plain hotplug-free VMs.
- ACPI RAM hotplug zones reserve their capacity at the bottom of the region (128 MiB-aligned) so the address window matches the one declared up front.

After all zones are placed, `start_of_device_area` is set to `region.top()` so the per-segment PCI64 BAR allocator starts above the device-memory region (matching QEMU's `pci_hole64` placement above `device_memory`). This pushes PCI64 BARs higher, which the next commit codifies for the per-segment allocator.

Restore paths pass 0 for `pmem_device_sizes` since they rebuild the region from snapshot data.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
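The fresh-boot steps listed in this commit (align the base, pick the size, check the ceiling, skip when empty) can be sketched as one function. This is a minimal sketch under assumptions: the function name `plan_device_memory`, its parameters, the `String` error type, and the error texts are all hypothetical and are not taken from the patch. Only the 1 GiB alignment and the order of the steps follow the commit message.

```rust
/// 1 GiB, the alignment named in the commit message.
const DEVICE_MEMORY_ALIGN: u64 = 1 << 30;

/// Hypothetical planner: returns `Ok(None)` when no region is needed,
/// `Ok(Some((base, size)))` otherwise, or an error when the region does not fit.
pub fn plan_device_memory(
    ram_top: u64,
    device_memory_size: Option<u64>,
    hotplug_sizes: &[u64],
    pmem_sizes: &[u64],
    mmio_ceiling: u64,
) -> Result<Option<(u64, u64)>, String> {
    // An explicit size wins; otherwise sum hotplug and pmem sizes.
    let size = match device_memory_size {
        Some(explicit) => explicit,
        None => hotplug_sizes
            .iter()
            .chain(pmem_sizes)
            .try_fold(0u64, |acc, s| acc.checked_add(*s))
            .ok_or_else(|| "device sizes overflow u64".to_string())?,
    };
    // No memory devices: keep the legacy layout.
    if size == 0 {
        return Ok(None);
    }
    // Round the top of RAM up to the next 1 GiB boundary.
    let base = ram_top
        .checked_add(DEVICE_MEMORY_ALIGN - 1)
        .map(|a| a & !(DEVICE_MEMORY_ALIGN - 1))
        .ok_or_else(|| "RAM top too high to align".to_string())?;
    let top = base
        .checked_add(size)
        .ok_or_else(|| "region top overflows u64".to_string())?;
    // Reject layouts that would run past the platform MMIO ceiling.
    if top > mmio_ceiling {
        return Err(format!(
            "device-memory top {top:#x} exceeds MMIO ceiling {mmio_ceiling:#x}; phys_bits may be too low"
        ));
    }
    Ok(Some((base, size)))
}
```

For example, with RAM ending at 4.5 GiB, one 1 GiB hotplug zone, and one 2 GiB pmem device, this sketch would place a 3 GiB region at 5 GiB.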

Route virtio-pmem allocation through the dedicated device-memory region in `MemoryManager` when one is present, falling back to the per-segment 64-bit MMIO window when the region is absent or full (e.g. entirely consumed by virtio-mem zones).

This brings the layout in line with QEMU's `hw/i386/pc.c::pc_get_device_memory_range` + `hw/mem/memory-device.c::memory_device_get_free_addr`: pc-dimm, nvdimm and virtio-pmem all live inside a single bottom-up first-fit window anchored just above guest RAM, while PCI64 BARs sit in the separate `pci_hole64` above it.

Fresh-allocation path:
- Use an `and_then`/`or_else` chain: try `device_memory.allocator().allocate_first_fit(...)` first so pmem GPAs are packed near the start of the region (bottom-up), mirroring `memory_device_get_free_addr`. If that returns `None` (region full or absent), gracefully fall back to the per-segment mem64 allocator.
- virtio-pmem is now segment-agnostic when the region is in use, matching QEMU's single per-machine `device_memory` window. The `pci_segment` selector still chooses which PCI segment the virtio device attaches to; only the GPA backing the region is drawn from the shared pool.

Restore path:
- Honour the GPA recorded in the snapshot. Try the device-memory region first when present and the saved base falls inside it (newer snapshots), and fall back to the per-segment mem64 allocator otherwise (snapshots taken before the device-memory region was introduced map back at their original GPAs).

This fixes the bug where 48-bit `phys_bits` hosts placed pmem near ~64 TiB and guests with smaller MAXMEM ceilings (e.g. Ubuntu Jammy 5.15 x86_64, MAXMEM ~11 TiB) rejected the hot-add with `nd_pmem: probe of namespace0.0 failed with error -22`.

Plain VMs without declared memory devices preserve their previous layout because `device_memory` is `None` and the legacy per-segment path remains active.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
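The `and_then`/`or_else` fallback described in the fresh-allocation path can be sketched as follows. The `Bump` type and the `allocate_pmem` function are hypothetical stand-ins: a trivial bump allocator replaces both the device-memory region allocator and the per-segment mem64 allocator, so only the shape of the fallback chain is meant to resemble the patch.

```rust
/// Trivial bump allocator used as a stand-in for both allocators (hypothetical).
pub struct Bump {
    next: u64,
    end: u64, // exclusive
}

impl Bump {
    pub fn new(start: u64, size: u64) -> Self {
        Self { next: start, end: start.saturating_add(size) }
    }

    /// Hand out `size` bytes from the current position, or `None` when full.
    pub fn alloc(&mut self, size: u64) -> Option<u64> {
        let start = self.next;
        let stop = start.checked_add(size)?;
        if size == 0 || stop > self.end {
            return None;
        }
        self.next = stop;
        Some(start)
    }
}

/// Try the device-memory region first; fall back to the mem64 window when the
/// region is absent (`None`) or cannot satisfy the request.
pub fn allocate_pmem(region: Option<&mut Bump>, mem64: &mut Bump, size: u64) -> Option<u64> {
    region
        .and_then(|r| r.alloc(size))
        .or_else(|| mem64.alloc(size))
}
```

The chain keeps the fallback lazy: the mem64 allocator is only touched when the region path yields `None`, so a successful region allocation never consumes address space in the high MMIO window.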

Add an optional `device_memory: Option<(u64, u64)>` field to `MemoryManagerSnapshotData` and use `#[serde(default)]` so snapshots produced before this commit still deserialize and continue to take the legacy `start_of_device_area` fallback on restore.

Save path:
- `snapshot_data()` records `Some((base, size))` when the `MemoryManager` is using a `DeviceMemoryRegion`, and `None` otherwise.

Restore path:
- When the snapshot carries a `device_memory` entry, rebuild the `DeviceMemoryRegion` at the saved `(base, size)` and re-reserve every virtio-mem zone at its saved GPA via the explicit-base branch of `AddressAllocator::allocate`. This keeps the region's allocator and the actual guest layout consistent so subsequent virtio-pmem hot-adds via `allocate_first_fit` cannot hand out overlapping addresses.
- When the field is absent (older snapshots or VMs that never used the region), keep `device_memory = None` and fall through to the existing legacy path that was already restoring fine with the bump-pointer `start_of_device_area`.

Together with the per-zone hot-plug reservation in the fresh-boot path (earlier in this series), this closes the loop: virtio-pmem allocations on a restored VM continue to land inside the same shared device-memory window they came from on the source VM, and live migration / live upgrade between builds with and without this series keeps working in both directions.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
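The restore-time decision between the device-memory region and the legacy path can be sketched as a small pure function over the saved `Option<(u64, u64)>`. Everything here is hypothetical: the `RestoreTarget` enum and `restore_target` function are illustrative names, and the containment check requires the whole saved range to fit inside the region, which is a slightly stricter assumption than the "saved base falls inside it" wording in the series.

```rust
/// Which allocator a saved GPA should be re-reserved from (hypothetical).
#[derive(Debug, PartialEq)]
pub enum RestoreTarget {
    DeviceMemory,
    LegacyMem64,
}

/// Decide where a saved allocation `[saved_gpa, saved_gpa + len)` belongs,
/// given the optional `(base, size)` recorded in the snapshot.
pub fn restore_target(
    device_memory: Option<(u64, u64)>,
    saved_gpa: u64,
    len: u64,
) -> RestoreTarget {
    match device_memory {
        Some((base, size)) => {
            let region_end = base.checked_add(size);
            let alloc_end = saved_gpa.checked_add(len);
            match (region_end, alloc_end) {
                // Saved range lies fully inside the recorded region.
                (Some(re), Some(ae)) if saved_gpa >= base && ae <= re => {
                    RestoreTarget::DeviceMemory
                }
                _ => RestoreTarget::LegacyMem64,
            }
        }
        // Older snapshot, or a VM that never used the region.
        None => RestoreTarget::LegacyMem64,
    }
}
```

A snapshot without the field maps to `None` here, which is the same value `#[serde(default)]` would produce for an `Option` when the key is missing, so older snapshots always take the legacy branch.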

Add a short device_memory_size section to docs/memory.md describing the dedicated device-memory region introduced in this series:
- what it is (a single QEMU-style window above guest RAM hosting virtio-pmem, virtio-mem zones, and ACPI RAM hot-plug slots),
- how it is sized (sum of declared hotplug_sizes by default, overridable via device_memory_size=...),
- how to disable it (device_memory_size=0 keeps the legacy bump-pointer layout for plain VMs).

Also update the struct snippet and the --memory CLI synopsis at the top of the page to include the new field.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

russell-islam added 8 commits May 18, 2026 15:49

russell-islam force-pushed the muislam/pmem-memdev-region branch from 7195bfb to e28058a Compare May 18, 2026 23:03

russell-islam force-pushed the muislam/pmem-memdev-region branch 2 times, most recently from b860914 to 7dfd7fa Compare May 19, 2026 02:14

russell-islam mentioned this pull request May 19, 2026

Pmem device does not show up on some host with phys_bits=48 on mshv or nested kvm on Hyper-V cloud-hypervisor/cloud-hypervisor#8191

Open

russell-islam wants to merge 9 commits into main from muislam/pmem-memdev-region
