Muislam/pmem memdev region #137

Open
russell-islam wants to merge 9 commits into main from muislam/pmem-memdev-region
@russell-islam (Owner)

No description provided.

QEMU anchors its machine `device_memory` region (where pc-dimm,
nvdimm, virtio-pmem and virtio-mem regions live) at a 1 GiB-aligned
GPA immediately above the top of guest RAM. The alignment is
applied in `pc_get_device_memory_range()` in `hw/i386/pc.c` for x86
and in the equivalent helpers on other targets.

Cloud Hypervisor does not yet have a single dedicated memory-device
region; the upcoming refactor will introduce one. Define a shared
`DEVICE_MEMORY_ALIGN = 1 GiB` constant in the arch crate so the
rest of the series can reference it without adding more arch knobs.

No behaviour change: the constant is unused at this point.
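As a rough illustration of how such a constant could be used later in the series, the sketch below rounds a top-of-RAM address up to the 1 GiB boundary. `align_up` is a hypothetical helper name, not something taken from the arch crate; only `DEVICE_MEMORY_ALIGN` and its value come from the text above.

```rust
/// 1 GiB alignment for the device-memory region base (value from the
/// commit message above).
pub const DEVICE_MEMORY_ALIGN: u64 = 1 << 30;

/// Hypothetical helper: round `addr` up to the next multiple of `align`.
/// `align` must be a power of two. Returns `None` if rounding overflows.
pub fn align_up(addr: u64, align: u64) -> Option<u64> {
    debug_assert!(align.is_power_of_two());
    let mask = align - 1;
    addr.checked_add(mask).map(|a| a & !mask)
}
```

With this sketch, a guest whose RAM ends one byte past 5 GiB would get a region base of 6 GiB, while an already aligned top of RAM is returned unchanged.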

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

`AddressAllocator::allocate(None, ...)` performs a top-down search and
packs allocations at the high end of the managed range. That is the
desired policy for PCI64 BAR placement but a poor fit for memory
devices such as virtio-pmem, which work better when packed near the
start of a dedicated window (mirroring the QEMU memory-device region
behaviour).

Add `allocate_first_fit` alongside `allocate` for callers that want
bottom-up placement on the same allocator. The new method delegates to
a new private helper `first_available_range_low` which scans ranges in
forward order and returns the first hole that satisfies size and
alignment. When the caller passes `Some(base)`, behaviour is identical
to `allocate(Some(base), ...)` so the snapshot-restore path can be
shared verbatim with the existing API.
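The bottom-up policy can be sketched with a toy allocator. This is not the real `AddressAllocator`: `ToyAllocator` and its fields are made up for illustration, and only the method names `allocate_first_fit` and `first_available_range_low` follow the description above. The explicit-base checks (range, alignment, overlap) are assumptions about what a reasonable allocator would enforce.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for an address allocator (hypothetical, for illustration).
pub struct ToyAllocator {
    base: u64,
    end: u64,                 // exclusive end of the managed range
    used: BTreeMap<u64, u64>, // start -> size, non-overlapping
}

impl ToyAllocator {
    pub fn new(base: u64, size: u64) -> Option<Self> {
        let end = base.checked_add(size)?;
        Some(Self { base, end, used: BTreeMap::new() })
    }

    /// Scan holes in ascending address order and return the first aligned fit.
    fn first_available_range_low(&self, size: u64, align: u64) -> Option<u64> {
        let mut cursor = self.base;
        // Each used range closes one hole; the final entry closes the tail hole.
        let bounds = self
            .used
            .iter()
            .map(|(&s, &len)| (s, s + len))
            .chain(std::iter::once((self.end, self.end)));
        for (hole_end, next_cursor) in bounds {
            let start = cursor.checked_add(align - 1)? & !(align - 1);
            if start.checked_add(size)? <= hole_end {
                return Some(start);
            }
            cursor = next_cursor;
        }
        None
    }

    /// `Some(base)` reserves exactly that address; `None` searches bottom-up.
    pub fn allocate_first_fit(&mut self, base: Option<u64>, size: u64, align: u64) -> Option<u64> {
        if size == 0 || !align.is_power_of_two() {
            return None;
        }
        let start = match base {
            Some(b) => {
                let end = b.checked_add(size)?;
                let overlaps = self.used.iter().any(|(&s, &len)| b < s + len && s < end);
                if b < self.base || end > self.end || b % align != 0 || overlaps {
                    return None;
                }
                b
            }
            None => self.first_available_range_low(size, align)?,
        };
        self.used.insert(start, size);
        Some(start)
    }
}
```

In this sketch, a `None` request lands in the lowest hole that fits, so a gap left between two explicit reservations is reused before any address above them.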

No existing caller is changed by this commit; the new method is a pure
addition. Subsequent commits wire it into a dedicated memory-device
region in the VMM and place virtio-pmem there.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Add a `DeviceMemoryRegion` type inside `memory_manager` that bundles
a base GPA, size, and a bottom-up `AddressAllocator` scoped to that
range. It mirrors QEMU's machine `device_memory` region
(`hw/i386/pc.c::pc_get_device_memory_range` + `hw/mem/memory-device.c`)
which anchors a dedicated GPA window immediately above guest RAM and
allocates virtio-pmem / virtio-mem / pc-dimm devices inside it
bottom-up.

The new type is currently unused; subsequent commits in this series
build it during `MemoryManager::new`, route virtio-mem zones and
virtio-pmem allocations through it, and re-base the per-segment PCI64
mem64 allocator to start above it. A temporary `#[allow(dead_code)]`
covers the helpers until those commits land in the same series.

Also add a dedicated `Error::CreateDeviceMemoryAllocator` variant for
the construction failure path, parallel to `CreateSystemAllocator`.
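A minimal sketch of what such a type might look like follows. The allocator field is left out to keep the example self-contained, and the failure conditions mapped to `CreateDeviceMemoryAllocator` (zero size, address overflow) are assumptions. The accessors `base()`, `size()` and `contains()` are hypothetical; `top()` is the name used elsewhere in this series.

```rust
#[derive(Debug, PartialEq)]
pub enum Error {
    /// Variant name from the commit message; when it fires is assumed here.
    CreateDeviceMemoryAllocator,
}

#[derive(Debug)]
pub struct DeviceMemoryRegion {
    base: u64,
    size: u64,
    // The type described above also owns a bottom-up `AddressAllocator`
    // scoped to [base, base + size); it is omitted from this sketch.
}

impl DeviceMemoryRegion {
    pub fn new(base: u64, size: u64) -> Result<Self, Error> {
        if size == 0 || base.checked_add(size).is_none() {
            return Err(Error::CreateDeviceMemoryAllocator);
        }
        Ok(Self { base, size })
    }

    pub fn base(&self) -> u64 {
        self.base
    }

    pub fn size(&self) -> u64 {
        self.size
    }

    /// First GPA above the region (exclusive end).
    pub fn top(&self) -> u64 {
        self.base + self.size
    }

    /// True when `[addr, addr + len)` lies entirely inside the region.
    pub fn contains(&self, addr: u64, len: u64) -> bool {
        addr >= self.base
            && addr.checked_add(len).map_or(false, |end| end <= self.top())
    }
}
```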

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Expose an optional `device_memory_size: Option<u64>` on `MemoryConfig`
plumbed through the `--memory device_memory_size=...` CLI option, the
OpenAPI schema, and every struct-literal call site.

The knob is loosely analogous to QEMU's `maxmem - mem`: it gives operators
explicit control over the size of the dedicated guest-physical
"device memory" region introduced by the upcoming refactor (where
virtio-pmem, virtio-mem zones, and future pc-dimm-equivalents live).
When left unset (the default for unmodified callers and snapshots
that pre-date this change), the region size will be derived from the
existing `hotplug_size` plus pmem headroom in a later commit,
preserving today's behaviour.

The field is wired but not yet consumed; the next commits build the
region using this value and route memory-device allocations through
it.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Add an `Option<DeviceMemoryRegion>` field to `MemoryManager` plus a
`device_memory()` accessor. The field is initialised to `None` in
`MemoryManager::new` for now; the next commit in this series will
compute the base/size, build the region from the new
`MemoryConfig::device_memory_size` knob, and store it here.

This scaffolding commit keeps the layout untouched: `MemoryManager`
still hands out PCI64 BAR address space and virtio-mem zones exactly
as before, so behaviour is identical for both fresh boots and
snapshot restore.

A temporary `#[allow(dead_code)]` covers the accessor until later
commits start routing memory-device allocations through it.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Anchor a dedicated guest-physical "device memory" region 1 GiB-aligned
above the top of guest RAM and place memory devices inside it, matching
QEMU's `pc_get_device_memory_range()` /
`hw/mem/memory-device.c::memory_device_get_free_addr()` layout where
pc-dimm, nvdimm, virtio-pmem and virtio-mem all share a single
bottom-up first-fit window.

Fresh-boot path:

 - Compute the region base by rounding the top of RAM up to
   `arch::DEVICE_MEMORY_ALIGN` (1 GiB).
 - Size from `MemoryConfig::device_memory_size` when set, otherwise
   sum the per-zone `hotplug_size`s plus declared pmem device sizes
   so the region is large enough for all memory devices without
   requiring the user to set `device_memory_size` manually.
 - Validate the resulting top against the platform-MMIO ceiling and
   fail boot with a clear message when `phys_bits` is too low.
 - Skip region creation entirely when the computed size is zero so
   plain VMs without any memory devices keep their current layout
   bit-for-bit.
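The fresh-boot steps above can be sketched as a single planning function. Everything except `DEVICE_MEMORY_ALIGN` is hypothetical here: the function name, the `PlanError` type, and the `ceiling` parameter (standing in for whatever limit the platform derives from `phys_bits`) are assumptions made for illustration.

```rust
pub const DEVICE_MEMORY_ALIGN: u64 = 1 << 30; // 1 GiB

#[derive(Debug, PartialEq)]
pub enum PlanError {
    /// Region top would exceed the addressable ceiling (hypothetical variant).
    ExceedsAddressSpace { top: u64, limit: u64 },
    Overflow,
}

/// Returns `Ok(None)` when no region is needed, else `Ok(Some((base, size)))`.
pub fn plan_device_memory(
    top_of_ram: u64,                 // first GPA above guest RAM
    device_memory_size: Option<u64>, // explicit override, if any
    hotplug_sizes: &[u64],           // per-zone `hotplug_size` values
    pmem_sizes: &[u64],              // declared pmem device sizes
    ceiling: u64,                    // assumed limit derived from `phys_bits`
) -> Result<Option<(u64, u64)>, PlanError> {
    // Explicit size wins; otherwise sum hotplug and pmem sizes.
    let size = match device_memory_size {
        Some(s) => s,
        None => hotplug_sizes
            .iter()
            .chain(pmem_sizes)
            .try_fold(0u64, |acc, &s| acc.checked_add(s))
            .ok_or(PlanError::Overflow)?,
    };
    // Zero size: skip the region so plain VMs keep their layout.
    if size == 0 {
        return Ok(None);
    }
    // Round the top of RAM up to the 1 GiB boundary.
    let mask = DEVICE_MEMORY_ALIGN - 1;
    let base = top_of_ram.checked_add(mask).ok_or(PlanError::Overflow)? & !mask;
    let top = base.checked_add(size).ok_or(PlanError::Overflow)?;
    if top > ceiling {
        return Err(PlanError::ExceedsAddressSpace { top, limit: ceiling });
    }
    Ok(Some((base, size)))
}
```

Returning `Ok(None)` for a zero size keeps the "no region" case distinct from a planning failure, which matches the intent that plain VMs are left untouched rather than rejected.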

Per-zone placement:

 - virtio-mem zones now allocate from the region via
   `allocate_first_fit(None, hotplug_size, VIRTIO_MEM_ALIGN_SIZE)`,
   instead of bumping from `start_of_device_area`. Without a region
   the legacy bump still applies, preserving today's behaviour for
   plain hotplug-free VMs.
 - ACPI RAM hotplug zones reserve their capacity at the bottom of
   the region (128 MiB-aligned) so the address window matches the
   one declared up front.

After all zones are placed, `start_of_device_area` is set to
`region.top()` so the per-segment PCI64 BAR allocator starts above
the device-memory region (matching QEMU's `pci_hole64` placement
above `device_memory`). This pushes PCI64 BARs higher, which the
next commit codifies for the per-segment allocator.

Restore paths pass 0 for `pmem_device_sizes` since they rebuild the
region from snapshot data.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Route virtio-pmem allocation through the dedicated device-memory
region in `MemoryManager` when one is present, falling back to the
per-segment 64-bit MMIO window when the region is absent or full
(e.g. entirely consumed by virtio-mem zones).

This brings the layout in line with QEMU's
`hw/i386/pc.c::pc_get_device_memory_range` +
`hw/mem/memory-device.c::memory_device_get_free_addr`: pc-dimm,
nvdimm and virtio-pmem all live inside a single bottom-up first-fit
window anchored just above guest RAM, while PCI64 BARs sit in the
separate `pci_hole64` above it.

Fresh-allocation path:

  - Use an `and_then`/`or_else` chain: try
    `device_memory.allocator().allocate_first_fit(...)` first so
    pmem GPAs are packed near the start of the region (bottom-up),
    mirroring `memory_device_get_free_addr`. If that returns `None`
    (region full or absent), fall back to the per-segment
    mem64 allocator.
  - virtio-pmem is now segment-agnostic when the region is in use,
    matching QEMU's single per-machine `device_memory` window. The
    `pci_segment` selector still chooses which PCI segment the
    virtio device attaches to; only the GPA backing the region is
    drawn from the shared pool.
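The fallback chain described above can be sketched with closures standing in for the two allocators. `allocate_pmem_gpa` and both closure parameters are hypothetical names; the real code calls into the region's allocator and the per-segment mem64 allocator rather than taking closures.

```rust
/// Hypothetical sketch: `region_alloc` is `Some` only when a device-memory
/// region exists; `segment_alloc` models the per-segment mem64 allocator.
pub fn allocate_pmem_gpa<R, S>(
    region_alloc: Option<R>,
    segment_alloc: S,
    size: u64,
) -> Option<u64>
where
    R: FnOnce(u64) -> Option<u64>,
    S: FnOnce(u64) -> Option<u64>,
{
    region_alloc
        // Region present: try bottom-up first-fit inside it.
        .and_then(|alloc| alloc(size))
        // Region absent or full: fall back to the per-segment window.
        .or_else(|| segment_alloc(size))
}
```

Because `or_else` takes a closure, the per-segment allocator is only touched when the region path yields `None`, so a successful region allocation never consumes mem64 address space.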

Restore path:

  - Honour the GPA recorded in the snapshot. Try the device-memory
    region first when present and the saved base falls inside it
    (newer snapshots), and fall back to the per-segment mem64
    allocator otherwise (snapshots taken before the device-memory
    region was introduced map back at their original GPAs).
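The restore-side choice can be sketched as a small decision function. `RestoreTarget` and `restore_target` are hypothetical names, and checking that the whole saved range (not just its base) fits inside the region is an assumption that goes slightly beyond the wording above.

```rust
#[derive(Debug, PartialEq)]
pub enum RestoreTarget {
    /// Re-reserve the saved GPA inside the device-memory region.
    DeviceMemory,
    /// Re-reserve the saved GPA in the per-segment mem64 allocator.
    SegmentMem64,
}

/// `region` is `(base, size)` of the device-memory region, if one exists.
pub fn restore_target(region: Option<(u64, u64)>, saved_base: u64, size: u64) -> RestoreTarget {
    let inside = region.map_or(false, |(base, len)| {
        let top = base.saturating_add(len);
        saved_base >= base
            && saved_base.checked_add(size).map_or(false, |end| end <= top)
    });
    if inside {
        RestoreTarget::DeviceMemory
    } else {
        RestoreTarget::SegmentMem64
    }
}
```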

This fixes the bug where hosts with 48-bit `phys_bits` placed pmem at
roughly 64 TiB, and guests with smaller MAXMEM ceilings (e.g. Ubuntu
Jammy 5.15 x86_64, MAXMEM ~11 TiB) rejected the hot-add with
`nd_pmem: probe of namespace0.0 failed with error -22`. Plain VMs
without declared memory devices preserve their previous layout
because `device_memory` is `None` and the legacy per-segment path
remains active.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

Add an optional `device_memory: Option<(u64, u64)>` field to
`MemoryManagerSnapshotData` and use `#[serde(default)]` so snapshots
produced before this commit still deserialize and continue to take
the legacy `start_of_device_area` fallback on restore.

Save path:

  - `snapshot_data()` records `Some((base, size))` when the
    `MemoryManager` is using a `DeviceMemoryRegion`, and `None`
    otherwise.

Restore path:

  - When the snapshot carries a `device_memory` entry, rebuild the
    `DeviceMemoryRegion` at the saved `(base, size)` and re-reserve
    every virtio-mem zone at its saved GPA via the explicit-base
    branch of `AddressAllocator::allocate`. This keeps the region's
    allocator and the actual guest layout consistent so subsequent
    virtio-pmem hot-adds via `allocate_first_fit` cannot hand out
    overlapping addresses.

  - When the field is absent (older snapshots or VMs that never
    used the region), keep `device_memory = None` and fall through
    to the existing legacy path, which already restores correctly
    using the bump-pointer `start_of_device_area`.
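The two restore branches above can be sketched as follows. `SnapshotData` here is a simplified stand-in for `MemoryManagerSnapshotData`: the `virtio_mem_zones` field, the `RestoreError` type and `rebuild_reservations` are hypothetical, and the sketch tracks reservations in a plain `Vec` instead of a real allocator.

```rust
pub struct SnapshotData {
    /// `(base, size)`; `None` for older snapshots or VMs without a region.
    pub device_memory: Option<(u64, u64)>,
    /// Saved virtio-mem zones as `(gpa, size)` (hypothetical layout).
    pub virtio_mem_zones: Vec<(u64, u64)>,
}

#[derive(Debug, PartialEq)]
pub enum RestoreError {
    OutOfRegion(u64),
    Overlap(u64),
}

/// Returns the ranges re-reserved inside the rebuilt region, sorted by GPA,
/// or `None` when the legacy path should be taken.
pub fn rebuild_reservations(
    snap: &SnapshotData,
) -> Result<Option<Vec<(u64, u64)>>, RestoreError> {
    let (base, size) = match snap.device_memory {
        Some(r) => r,
        None => return Ok(None), // legacy bump-pointer path
    };
    let top = base.saturating_add(size);
    let mut reserved: Vec<(u64, u64)> = Vec::new();
    for &(gpa, len) in &snap.virtio_mem_zones {
        // Each zone must sit entirely inside the saved region.
        let end = gpa
            .checked_add(len)
            .filter(|&e| gpa >= base && e <= top)
            .ok_or(RestoreError::OutOfRegion(gpa))?;
        // Explicit-base reservation: reject anything that overlaps.
        if reserved.iter().any(|&(s, l)| gpa < s + l && s < end) {
            return Err(RestoreError::Overlap(gpa));
        }
        reserved.push((gpa, len));
    }
    reserved.sort_unstable();
    Ok(Some(reserved))
}
```

Once every saved zone is recorded this way, a later bottom-up search only sees the remaining holes, which is what prevents a hot-added device from overlapping a restored zone.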

Together with the per-zone hot-plug reservation in the fresh-boot
path (earlier in this series), this closes the loop: virtio-pmem
allocations on a restored VM continue to land inside the same
shared device-memory window they came from on the source VM, and
live migration / live upgrade between builds with and without this
series keeps working in both directions.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>

@russell-islam force-pushed the muislam/pmem-memdev-region branch from 7195bfb to e28058a on May 18, 2026 23:03

Add a short `device_memory_size` section to `docs/memory.md` describing
the dedicated device-memory region introduced in this series:

  - what it is (a single QEMU-style window above guest RAM hosting
    virtio-pmem, virtio-mem zones, and ACPI RAM hot-plug slots),
  - how it is sized (sum of declared `hotplug_size`s by default,
    overridable via `device_memory_size=...`),
  - how to disable it (`device_memory_size=0` keeps the legacy
    bump-pointer layout for plain VMs).

Also update the struct snippet and the `--memory` CLI synopsis at
the top of the page to include the new field.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Muminul Islam <muislam@microsoft.com>