Muislam/pmem memdev region#137
Open
russell-islam wants to merge 9 commits into
QEMU anchors its machine `device_memory` region (where pc-dimm, nvdimm,
virtio-pmem and virtio-mem regions live) at a 1 GiB-aligned GPA
immediately above the top of guest RAM. The alignment is applied by
`pc_get_device_memory_range()` in `hw/i386/pc.c` for x86 and by the
equivalent helpers on other targets.
Cloud Hypervisor does not yet have a single dedicated memory-device
region; the upcoming refactor will introduce one. Define a shared
`DEVICE_MEMORY_ALIGN = 1 GiB` constant in the arch crate so the rest of
the series can reference it without adding more arch knobs.
No behaviour change: the constant is unused at this point.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
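As a rough illustration of how later commits might use the constant, the sketch below rounds the top of guest RAM up to the next 1 GiB boundary. Only the name `DEVICE_MEMORY_ALIGN` and its value come from the commit message; `align_up` is a hypothetical helper and the real code may compute the base differently.

```rust
/// Alignment of the device-memory region base (1 GiB), as named in the commit.
pub const DEVICE_MEMORY_ALIGN: u64 = 1 << 30;

/// Round `addr` up to the next multiple of `align`, which must be a power
/// of two. Returns `None` if the rounding would overflow `u64`.
/// Hypothetical helper for illustration; not taken from the series.
pub fn align_up(addr: u64, align: u64) -> Option<u64> {
    debug_assert!(align.is_power_of_two());
    addr.checked_add(align - 1).map(|a| a & !(align - 1))
}
```

For example, a guest whose RAM ends at 4.5 GiB (`0x1_2000_0000`) would get a region base of 5 GiB (`0x1_4000_0000`), while an already aligned top is left unchanged.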
`AddressAllocator::allocate(None, ...)` performs a top-down search and
packs allocations at the high end of the managed range. That is the
desired policy for PCI64 BAR placement but a poor fit for memory devices
such as virtio-pmem, which work better when packed near the start of a
dedicated window (mirroring the QEMU memory-device region behaviour).
Add `allocate_first_fit` alongside `allocate` for callers that want
bottom-up placement on the same allocator. The new method delegates to a
new private helper `first_available_range_low` which scans ranges in
forward order and returns the first hole that satisfies size and
alignment. When the caller passes `Some(base)`, behaviour is identical
to `allocate(Some(base), ...)` so the snapshot-restore path can be
shared verbatim with the existing API.
No existing caller is changed by this commit; the new method is a pure
addition. Subsequent commits wire it into a dedicated memory-device
region in the VMM and place virtio-pmem there.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
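The bottom-up policy described above can be sketched with a toy allocator. This is not the `AddressAllocator` from the series: `ToyAllocator` and its fields are hypothetical, the explicit-base (`Some(base)`) branch is left out, and the real `first_available_range_low` helper may track ranges differently. The sketch only shows the forward scan that returns the first hole satisfying size and alignment.

```rust
/// Toy allocator over [base, end); allocations are kept sorted by start.
/// Hypothetical stand-in for illustration only.
pub struct ToyAllocator {
    base: u64,
    end: u64,              // exclusive upper bound of the managed range
    used: Vec<(u64, u64)>, // (start, len), sorted by start
}

impl ToyAllocator {
    pub fn new(base: u64, size: u64) -> Option<Self> {
        let end = base.checked_add(size)?;
        if size == 0 {
            return None;
        }
        Some(Self { base, end, used: Vec::new() })
    }

    /// Bottom-up first fit: walk the holes in ascending address order and
    /// return the first aligned start at which `len` bytes fit.
    pub fn allocate_first_fit(&mut self, len: u64, align: u64) -> Option<u64> {
        if len == 0 || !align.is_power_of_two() {
            return None;
        }
        let mut cursor = self.base;
        let mut idx = 0;
        loop {
            // The current hole ends at the next allocation, or at the range end.
            let hole_end = self.used.get(idx).map_or(self.end, |&(s, _)| s);
            let start = cursor.checked_add(align - 1)? & !(align - 1);
            if let Some(end) = start.checked_add(len) {
                if end <= hole_end {
                    self.used.insert(idx, (start, len));
                    return Some(start);
                }
            }
            // No fit in this hole: skip past the allocation that closes it,
            // or give up when there are no more allocations to skip.
            let &(s, l) = self.used.get(idx)?;
            cursor = s + l;
            idx += 1;
        }
    }
}
```

Because the scan starts at the bottom each time, a hole left between two earlier allocations is reused before any space higher in the range, which is the packing behaviour the commit wants for memory devices.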
Add a `DeviceMemoryRegion` type inside `memory_manager` that bundles a
base GPA, size, and a bottom-up `AddressAllocator` scoped to that range.
It mirrors QEMU's machine `device_memory` region
(`hw/i386/pc.c::pc_get_device_memory_range` +
`hw/mem/memory-device.c`) which anchors a dedicated GPA window
immediately above guest RAM and allocates virtio-pmem / virtio-mem /
pc-dimm devices inside it bottom-up.
The new type is currently unused; subsequent commits in this series
build it during `MemoryManager::new`, route virtio-mem zones and
virtio-pmem allocations through it, and re-base the per-segment PCI64
mem64 allocator to start above it. A temporary `#[allow(dead_code)]`
covers the helpers until those commits land in the same series.
Also add a dedicated `Error::CreateDeviceMemoryAllocator` variant for
the construction failure path, parallel to `CreateSystemAllocator`.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
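A minimal sketch of what such a type could look like follows. The names `DeviceMemoryRegion`, `top()` and `Error::CreateDeviceMemoryAllocator` appear in the series; the field layout, the payload-free error variant, and the `contains` helper are assumptions made for illustration, and the embedded `AddressAllocator` is omitted.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    /// Variant named in the commit; its exact shape here is an assumption.
    CreateDeviceMemoryAllocator,
}

#[derive(Debug)]
pub struct DeviceMemoryRegion {
    base: u64,
    size: u64,
    // The real type also owns a bottom-up AddressAllocator; omitted here.
}

impl DeviceMemoryRegion {
    /// Reject empty regions and regions whose top would overflow u64.
    pub fn new(base: u64, size: u64) -> Result<Self, Error> {
        if size == 0 || base.checked_add(size).is_none() {
            return Err(Error::CreateDeviceMemoryAllocator);
        }
        Ok(Self { base, size })
    }

    pub fn base(&self) -> u64 {
        self.base
    }

    pub fn size(&self) -> u64 {
        self.size
    }

    /// First address above the region (exclusive upper bound).
    pub fn top(&self) -> u64 {
        self.base + self.size
    }

    /// Hypothetical helper: true when [gpa, gpa + len) lies inside the region.
    pub fn contains(&self, gpa: u64, len: u64) -> bool {
        gpa >= self.base
            && gpa.checked_add(len).map_or(false, |end| end <= self.top())
    }
}
```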
Expose an optional `device_memory_size: Option<u64>` on `MemoryConfig`
plumbed through the `--memory device_memory_size=...` CLI option, the
OpenAPI schema, and every struct-literal call site.
The knob mirrors QEMU's `maxmem - mem` indirectly: it gives operators
explicit control over the size of the dedicated guest-physical "device
memory" region introduced by the upcoming refactor (where virtio-pmem,
virtio-mem zones, and future pc-dimm-equivalents live). When left unset
(the default for unmodified callers and snapshots that pre-date this
change), the region size will be derived from the existing
`hotplug_size` plus pmem headroom in a later commit, preserving today's
behaviour.
The field is wired but not yet consumed; the next commits build the
region using this value and route memory-device allocations through it.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Add an `Option<DeviceMemoryRegion>` field to `MemoryManager` plus a
`device_memory()` accessor. The field is initialised to `None` in
`MemoryManager::new` for now; the next commit in this series will
compute the base/size, build the region from the new
`MemoryConfig::device_memory_size` knob, and store it here.
This scaffolding commit keeps the layout untouched: `MemoryManager`
still hands out PCI64 BAR address space and virtio-mem zones exactly as
before, so behaviour is identical for both fresh boots and snapshot
restore. A temporary `#[allow(dead_code)]` covers the accessor until
later commits start routing memory-device allocations through it.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Anchor a dedicated guest-physical "device memory" region 1 GiB-aligned
above the top of guest RAM and place memory devices inside it, matching
QEMU's `pc_get_device_memory_range()` /
`hw/mem/memory-device.c::memory_device_get_free_addr()` layout where
pc-dimm, nvdimm, virtio-pmem and virtio-mem all share a single
bottom-up first-fit window.
Fresh-boot path:
- Compute the region base by rounding the top of RAM up to
`arch::DEVICE_MEMORY_ALIGN` (1 GiB).
- Size from `MemoryConfig::device_memory_size` when set, otherwise sum
the per-zone `hotplug_size`s plus declared pmem device sizes so the
region is large enough for all memory devices without requiring the
user to set device_memory_size manually.
- Validate the resulting top against the platform-MMIO ceiling and fail
boot with a clear message when `phys_bits` is too low.
- Skip region creation entirely when the computed size is zero so plain
VMs without any memory devices keep their current layout bit-for-bit.
Per-zone placement:
- virtio-mem zones now allocate from the region via
`allocate_first_fit(None, hotplug_size, VIRTIO_MEM_ALIGN_SIZE)`,
instead of bumping from `start_of_device_area`. Without a region the
legacy bump still applies, preserving today's behaviour for plain
hotplug-free VMs.
- ACPI RAM hotplug zones reserve their capacity at the bottom of the
region (128 MiB-aligned) so the address window matches the one
declared up front.
After all zones are placed, `start_of_device_area` is set to
`region.top()` so the per-segment PCI64 BAR allocator starts above the
device-memory region (matching QEMU's `pci_hole64` placement above
`device_memory`). This pushes PCI64 BARs higher, which the next commit
codifies for the per-segment allocator.
Restore paths pass 0 for pmem_device_sizes since they rebuild the
region from snapshot data.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
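The fresh-boot steps above can be sketched as one function. The function name, its parameters and the `PlacementError` type are hypothetical; only `DEVICE_MEMORY_ALIGN`, the sizing rule, the ceiling check and the zero-size skip come from the commit message, and the real code may structure them differently.

```rust
pub const DEVICE_MEMORY_ALIGN: u64 = 1 << 30; // 1 GiB, as named in the series

#[derive(Debug, PartialEq, Eq)]
pub enum PlacementError {
    Overflow,
    ExceedsCeiling { top: u64, ceiling: u64 },
}

/// Returns `Ok(None)` when no region is needed, else `Ok(Some((base, size)))`.
/// Hypothetical helper for illustration; not the series' actual API.
pub fn place_device_memory(
    ram_top: u64,
    device_memory_size: Option<u64>,
    hotplug_sizes: &[u64],
    pmem_sizes: &[u64],
    mmio_ceiling: u64,
) -> Result<Option<(u64, u64)>, PlacementError> {
    // Explicit size wins; otherwise sum hotplug and declared pmem sizes.
    let size = match device_memory_size {
        Some(s) => s,
        None => hotplug_sizes
            .iter()
            .chain(pmem_sizes)
            .try_fold(0u64, |acc, &s| acc.checked_add(s))
            .ok_or(PlacementError::Overflow)?,
    };
    // Plain VMs without memory devices keep the legacy layout.
    if size == 0 {
        return Ok(None);
    }
    // Round the top of RAM up to the 1 GiB boundary.
    let base = ram_top
        .checked_add(DEVICE_MEMORY_ALIGN - 1)
        .ok_or(PlacementError::Overflow)?
        & !(DEVICE_MEMORY_ALIGN - 1);
    let top = base.checked_add(size).ok_or(PlacementError::Overflow)?;
    // Fail early when the region would not fit below the MMIO ceiling.
    if top > mmio_ceiling {
        return Err(PlacementError::ExceedsCeiling { top, ceiling: mmio_ceiling });
    }
    Ok(Some((base, size)))
}
```

With RAM ending at 4.5 GiB, one 1 GiB hotplug zone and one 512 MiB pmem device, this yields a region at 5 GiB of size 1.5 GiB, and the PCI64 allocator would then start at 6.5 GiB or above.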
Route virtio-pmem allocation through the dedicated device-memory
region in `MemoryManager` when one is present, falling back to the
per-segment 64-bit MMIO window when the region is absent or full
(e.g. entirely consumed by virtio-mem zones).
This brings the layout in line with QEMU's
`hw/i386/pc.c::pc_get_device_memory_range` +
`hw/mem/memory-device.c::memory_device_get_free_addr`: pc-dimm,
nvdimm and virtio-pmem all live inside a single bottom-up first-fit
window anchored just above guest RAM, while PCI64 BARs sit in the
separate `pci_hole64` above it.
Fresh-allocation path:
- Use an `and_then/or_else` chain: try
`device_memory.allocator().allocate_first_fit(...)` first so
pmem GPAs are packed near the start of the region (bottom-up),
mirroring `memory_device_get_free_addr`. If that returns None
(region full or absent), gracefully fall back to the per-segment
mem64 allocator.
- virtio-pmem is now segment-agnostic when the region is in use,
matching QEMU's single per-machine `device_memory` window. The
`pci_segment` selector still chooses which PCI segment the
virtio device attaches to; only the GPA backing the region is
drawn from the shared pool.
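The `and_then/or_else` chain can be illustrated with a toy model. `Bump` is a hypothetical stand-in for both real allocators (the series uses a first-fit `AddressAllocator` for the region and the per-segment mem64 allocator for the fallback), and `allocate_pmem` is an illustrative name; only the control flow is meant to match the description.

```rust
/// Toy bump allocator over [next, end). Hypothetical stand-in for the
/// region allocator and the per-segment mem64 allocator.
pub struct Bump {
    next: u64,
    end: u64,
}

impl Bump {
    pub fn new(start: u64, end: u64) -> Self {
        Self { next: start, end }
    }

    pub fn alloc(&mut self, size: u64) -> Option<u64> {
        let start = self.next;
        let end = start.checked_add(size)?;
        if size == 0 || end > self.end {
            return None;
        }
        self.next = end;
        Some(start)
    }
}

/// Try the device-memory region first; if it is absent or full, fall back
/// to the per-segment 64-bit MMIO allocator.
pub fn allocate_pmem(
    device_memory: Option<&mut Bump>,
    mem64: &mut Bump,
    size: u64,
) -> Option<u64> {
    device_memory
        .and_then(|region| region.alloc(size))
        .or_else(|| mem64.alloc(size))
}
```

The chain returns `None` only when both allocators fail, so a full region degrades to the previous placement instead of failing the hot-add.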
Restore path:
- Honour the GPA recorded in the snapshot. Try the device-memory
region first when present and the saved base falls inside it
(newer snapshots), and fall back to the per-segment mem64
allocator otherwise (snapshots taken before the device-memory
region was introduced map back at their original GPAs).
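A sketch of the restore-path decision, assuming the region is described by a `(base, size)` pair: the saved GPA goes to the device-memory region only when the whole mapping lies inside it, and to the per-segment mem64 allocator otherwise. `RestoreTarget` and `restore_target` are hypothetical names used for illustration.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RestoreTarget {
    DeviceMemory,
    SegmentMem64,
}

/// Pick the allocator that should re-reserve a pmem mapping saved at
/// `saved_base` with length `len`. `region` is `(base, size)` when a
/// device-memory region exists. Hypothetical helper for illustration.
pub fn restore_target(region: Option<(u64, u64)>, saved_base: u64, len: u64) -> RestoreTarget {
    let inside = region.map_or(false, |(base, size)| {
        let top = base.saturating_add(size);
        saved_base >= base
            && saved_base.checked_add(len).map_or(false, |end| end <= top)
    });
    if inside {
        RestoreTarget::DeviceMemory
    } else {
        RestoreTarget::SegmentMem64
    }
}
```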
This fixes the bug where hosts with 48-bit `phys_bits` placed pmem near
64 TiB and guests with smaller MAXMEM ceilings (e.g. Ubuntu Jammy
5.15 x86_64, MAXMEM ~11 TiB) rejected the hot-add with
`nd_pmem: probe of namespace0.0 failed with error -22`. Plain VMs
without declared memory devices preserve their previous layout
because `device_memory` is `None` and the legacy per-segment path
remains active.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Add an optional `device_memory: Option<(u64, u64)>` field to
`MemoryManagerSnapshotData` and use `#[serde(default)]` so snapshots
produced before this commit still deserialize and continue to take
the legacy `start_of_device_area` fallback on restore.
Save path:
- `snapshot_data()` records `Some((base, size))` when the
`MemoryManager` is using a `DeviceMemoryRegion`, and `None`
otherwise.
Restore path:
- When the snapshot carries a `device_memory` entry, rebuild the
`DeviceMemoryRegion` at the saved `(base, size)` and re-reserve
every virtio-mem zone at its saved GPA via the explicit-base
branch of `AddressAllocator::allocate`. This keeps the region's
allocator and the actual guest layout consistent so subsequent
virtio-pmem hot-adds via `allocate_first_fit` cannot hand out
overlapping addresses.
- When the field is absent (older snapshots or VMs that never
used the region), keep `device_memory = None` and fall through
to the existing legacy path that was already restoring fine
with the bump-pointer `start_of_device_area`.
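The restore logic above can be modelled without serde as a function over an optional `(base, size)` pair and the saved zone GPAs. Everything in this sketch is hypothetical (names, error type, the overlap check); it only illustrates why re-reserving each zone at its saved address keeps later first-fit allocations from overlapping them.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RestoreError {
    OutsideRegion(u64),
    Overlap(u64),
}

/// Returns `Ok(None)` for snapshots without a `device_memory` entry (legacy
/// path), otherwise the reserved `(gpa, len)` ranges sorted by address.
/// Hypothetical helper for illustration; not the series' actual API.
pub fn rebuild_reservations(
    device_memory: Option<(u64, u64)>,
    zones: &[(u64, u64)],
) -> Result<Option<Vec<(u64, u64)>>, RestoreError> {
    let (base, size) = match device_memory {
        Some(region) => region,
        None => return Ok(None), // older snapshot: keep the legacy layout
    };
    let top = base.saturating_add(size);
    let mut reserved: Vec<(u64, u64)> = Vec::new();
    for &(gpa, len) in zones {
        let end = gpa
            .checked_add(len)
            .ok_or(RestoreError::OutsideRegion(gpa))?;
        if gpa < base || end > top {
            return Err(RestoreError::OutsideRegion(gpa));
        }
        // Explicit-base reservation: refuse anything already taken.
        if reserved.iter().any(|&(s, l)| gpa < s + l && s < end) {
            return Err(RestoreError::Overlap(gpa));
        }
        reserved.push((gpa, len));
    }
    reserved.sort_unstable();
    Ok(Some(reserved))
}
```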
Together with the per-zone hot-plug reservation in the fresh-boot
path (earlier in this series), this closes the loop: virtio-pmem
allocations on a restored VM continue to land inside the same
shared device-memory window they came from on the source VM, and
live migration / live upgrade between builds with and without this
series keeps working in both directions.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
7195bfb to
e28058a
Compare
Add a short device_memory_size section to docs/memory.md describing
the dedicated device-memory region introduced in this series:
- what it is (a single QEMU-style window above guest RAM hosting
virtio-pmem, virtio-mem zones, and ACPI RAM hot-plug slots),
- how it is sized (sum of declared hotplug_sizes by default,
overridable via device_memory_size=...),
- how to disable it (device_memory_size=0 keeps the legacy
bump-pointer layout for plain VMs).
Also update the struct snippet and the --memory CLI synopsis at
the top of the page to include the new field.
Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>
b860914 to
7dfd7fa
Compare
No description provided.