Skip to content

Muislam/fix mshv cvm boot#124

Open
russell-islam wants to merge 838 commits into
muislam/v51.1from
muislam/fix-mshv-cvm-boot
Open

Muislam/fix mshv cvm boot#124
russell-islam wants to merge 838 commits into
muislam/v51.1from
muislam/fix-mshv-cvm-boot

Conversation

@russell-islam
Copy link
Copy Markdown
Owner

No description provided.

weltling and others added 30 commits April 14, 2026 22:11
Single allocated cluster reads are submitted to io_uring for true
async completion. Mixed mapping reads (zero, compressed, backing,
multi cluster) fall back to synchronous pread64 with synthetic
completions.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Synchronous per cluster write path - gather guest data from iovecs,
map each cluster through QcowMetadata, and pwrite to the allocated
host offset. Partial cluster writes with a backing file read the
backing data first so map_cluster_for_write can perform COW.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Flush dirty metadata caches and sync the underlying file via
QcowMetadata::flush, then signal synthetic completion.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Deallocate clusters through QcowMetadata::deallocate_bytes, then
apply the resulting DeallocAction list (punch hole or write zeroes
at host offsets). write_zeroes delegates to punch_hole since
unallocated QCOW2 clusters inherently read as zero.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
try_clone shares the Arc wrapped metadata and backing file.
new_async_io creates a QcowAsync worker with its own io_uring
instance for the given ring depth.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When io_uring is available and not disabled, open QCOW2 images
with QcowDiskAsync for asynchronous reads. Falls back to
QcowDiskSync otherwise.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement batch_requests_enabled() and submit_batch_requests() for
QcowAsync. Without batching, each read_vectored call performs its own
io_uring submit() syscall. With batching, the virtio queue handler
collects all pending requests and submits them in a single call,
pushing multiple SQEs before one submit() syscall.

Each request in the batch is classified through the metadata layer.
Requests that hit the fast path (single allocated cluster mapping)
are pushed to the io_uring submission queue. Requests that require
the slow path (compressed, backing, zero fill, or mixed mappings)
are completed synchronously and queued as synthetic completions.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for punch hole completion. The test
verifies that a punch hole request reports successful completion
and that the deallocated range reads back as zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for write zeroes completion. The test
verifies that a write zeroes request reports successful completion
and that the zeroed range reads back as zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes a byte pattern through
write_vectored, reads it back through read_vectored, and verifies
the data matches. This exercises the core async write and read
paths end to end.

Also add an async_write helper for future tests.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes distinct patterns into two
adjacent clusters, then issues a single read spanning the cluster
boundary. Verifies that multi mapping read resolution returns the
correct data from both clusters.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that submits a batch of interleaved
write and read requests via submit_batch_requests. Verifies that
all completions arrive with the correct user_data and that the
read back data matches the written data.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that reads from a range that was never
written. Verifies the fundamental QCOW contract that unallocated
clusters return zeroes.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes 4K into the middle of a
cluster, then reads the entire cluster back. Verifies that the
written region matches and surrounding bytes remain zero. This
exercises the COW path where unwritten parts of a newly allocated
cluster must be zero filled.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write data, punch hole to deallocate, then rewrite the same
range and verify the new contents read back correctly.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write a distinct byte pattern into each of eight consecutive
clusters in a single operation, then read the full range back
and verify per cluster contents.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The unit test container runs with Docker default seccomp
profile which blocks io_uring_setup, io_uring_enter and
io_uring_register. This causes all qcow_async unit tests to
fail with EPERM when creating an io_uring instance.

Add --security-opt seccomp=unconfined to the unit test docker
run invocation. The container already has --device access and
cap_net_admin, so this does not materially change the security
posture.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The address that is passed from the guest should be treated as
untrusted. Currently an invalid address will panic the VMM. This only
allows the guest to hurt itself, but we shouldn't have the VMM crashing.
Instead let's return an error if possible or invalidate the queue if it
happen during setup.

The data flow from guest to translate_gva/translate_gpa is:

  1. Guest writes a raw u64 address into a virtio descriptor in the
     shared descriptor table (guest memory).
  2. The virtio-queue crate reads this descriptor via read_obj() and
     returns the addr field as-is in a GuestAddress — no validation.
  3. Device code calls .translate_gva(access_platform, len) on the
     GuestAddress.
  4. With IOMMU (access_platform is Some): the address is an IOVA that
     must be translated to a GPA via the IOMMU mapping table. If the
     guest provides an unmapped IOVA, translation returns Err.
     Previously, .unwrap() here panicked the VMM.
  5. Without IOMMU (access_platform is None): translate_gva is a no-op
     (returns self). The raw address flows to GuestMemory::read_obj()
     which validates it — out-of-range addresses return
     Err(InvalidGuestAddress), so no host memory corruption is possible.

Signed-off-by: Dylan Reid <dgreid@fb.com>
Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 2 to 3.
- [Release notes](https://github.com/softprops/action-gh-release/releases)
- [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md)
- [Commits](softprops/action-gh-release@v2...v3)

---
updated-dependencies:
- dependency-name: softprops/action-gh-release
  dependency-version: '3'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
This is a prerequisite for the next commit where we need shared access.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Extract AcpiCpuHotplugController from CpuManager and move the BusDevice
implementation to the new type. This separates VMM-internal vCPU
management from the guest-visible ACPI CPU hotplug MMIO interface.

Besides clarifying responsibilities and reducing technical debt, this
fixes a rare deadlock involving pause handling and MMIO access.

New responsibilities:
- CpuManager manages VMM-internal vCPU lifecycle and coordination
- AcpiCpuHotplugController implements the guest-visible ACPI CPU hotplug
  MMIO interface

A vCPU thread may exit KVM_RUN to perform an MMIO access previously
handled by CpuManager. If the VMM thread begins processing a `pause`
event before that MMIO operation acquires access to CpuManager,
CpuManager::pause() will block waiting for the vCPU thread to ACK
the pause, while the vCPU thread is blocked waiting to complete the MMIO
operation through the same CpuManager - which it can never lock - the
VMM is deadlocked.

This can occur during early boot or CPU hotplug when pause events race
with MMIO accesses. The issue is rare and timing-dependent, but real.
For reproducing: run `ch-remote pause|resume` in a loop while booting
a Linux VM (via direct kernel boot).

With the new design, these MMIO operations no longer depend on
CpuManager, which removes the deadlock path entirely.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This improves the documentation at various places.

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
TL;DR: Add note about how we expect code comments/documentation

This updates the coding standards as discussed [0]. The general
guideline is to write down as little process as possible and leave room
for pragmatic exceptions, maintainer and contributor preferences while
still striving for excellent code quality.

[0]: cloud-hypervisor#7990 (comment)

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The old name read as 'metadata for a QCOW2 backing file' rather
than what it actually is: a QCOW2 backing file reader. Rename to
Qcow2Backing to parallel RawBacking and clarify intent.

Suggested-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add pub fn alignment() to RawFile so that callers can
query the O_DIRECT buffer alignment requirement probed
at file open time.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When the data file is opened with O_DIRECT, buffer address, length,
and file offset must satisfy the device alignment.

Add AlignedBuf RAII wrapper and aligned_pread/aligned_pwrite helpers
in qcow_common that use bounce buffers when alignment constraints
are not met. For writes with misaligned offset, a read modify write
is performed on the aligned region.

gather_from_iovecs_into gathers iovec data directly into a caller
provided buffer, avoiding an intermediate Vec allocation.

Fixes: cloud-hypervisor#8007
Signed-off-by: CMGS <ilskdw@gmail.com>
Co-authored-by: Anatol Belski <anbelski@linux.microsoft.com>
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Store the alignment from the data file in QcowSync. Use AlignedBuf
directly in read_vectored and write_vectored as the intermediate
buffer so that aligned_pread/aligned_pwrite can skip the bounce
copy when offset and length are naturally aligned.

Use gather_from_iovecs_into to gather iovec data directly into the
aligned buffer.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Store the alignment from the data file in QcowAsync. Use
aligned_pread in scatter_read_sync and aligned_pwrite with
gather_from_iovecs_into in cow_write_sync, matching the
QcowSync approach.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add direct_io variants for suitable tests by extracting
test bodies into _impl(direct_io: bool) functions. Each
original test calls _impl(false) and a new _direct_io test
calls _impl(true).

When direct_io is true, RawFile probes alignment and QcowSync
exercises the AlignedBuf and bounce buffer paths in
read_vectored and write_vectored.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
russell-islam and others added 30 commits April 25, 2026 19:14
The GvaGpaValid flag in the intercept message indicates whether
the provided GPA corresponds to the decoded GVA. Without checking
this flag, the emulator may incorrectly use a stale GPA mapping
when the hypervisor invalidates it.

Add a check for the GvaGpaValid flag before using the cached
(GVA, GPA) mapping. If the flag is clear, use a sentinel value
to force translate() to perform a proper hypercall-based
translation instead of using an invalid cached mapping.

Signed-off-by: Pedro Barbuda <pbarbuda@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This daemon is only a testing tool used during integration testing. The
previous code auto-detected the image type from the file's magic bytes
and opened qcow2 images via QcowFile. Such behaviour has been the cause
of security issues in the past.

Drop the qcow2 detection and open path so only raw images are handled.

Signed-off-by: Rob Bradford <rbradford@meta.com>
The aarch64 integration script hardcoded `LAST_RELEASE_VERSION="v39.0"`
for the live-upgrade binary download, while the live-migration runner
already accepts a `MIGRATABLE_VERSION` env override with a `vxx.0`
regex check. Standardise the aarch64 script on the same env-override
block so both arches honour the same knob with the same validation.

Default is unchanged (v39.0).

Assisted-by: Claude:Opus-4.7
Signed-off-by: Rob Bradford <rbradford@meta.com>
Move the live migration test running from their own script into the
x86-64 script (on aarch64 they were already in the same script.) They
were historically separate as they were new. Now they are established it
makes sense for them to be combined.

The timeout in the GitHub workflow has been extended to accommodate the
extra work in the same step.

The Rust test scopes are unchanged - the running of the tests has been
moved.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Rob Bradford <rbradford@meta.com>
Move the helper methods used for live migration to the common utils
(like many other tests use).

Assisted-by: Claude:Opus-4.7
Signed-off-by: Rob Bradford <rbradford@meta.com>
Move the live migration tests themselves into the common scopes allowing
the tests to now run interleaved together hopefully reducing CI time.

On MSHV the live migration tests are now not compiled in rather than
compiled in and skipped (as the helpers are not compiled in for MSHV.)

Assisted-by: Claude:Opus-4.7
Signed-off-by: Rob Bradford <rbradford@meta.com>
Reduce `nr_hugepages` from 12 GB to 6 GB on both architectures. The
number if huge pages needed (if all the tests run at once) is 4GiB so
this gives 50% headroom.

This should reduce the number of tests that fail/flake out due to lack
of memory.

Signed-off-by: Rob Bradford <rbradford@meta.com>
`test_live_migration_virtio_fs` and its `_local` variant lived in
`common_sequential` because they shared `~/workloads/shared_dir` as
the virtiofsd backing and wrote/deleted the same `migration_test_file`
and `post_migration_file` paths inside it. Two instances running
concurrently would race on those files.

Give each invocation its own backing directory under `guest.tmp_dir`,
which is already per-test unique and gets cleaned up by the `TempDir`
drop. The test logic is otherwise unchanged. Move both wrappers and
the helper from `common_sequential` to `common_parallel` and update
the sequential-tests comment accordingly.

The boot footprint is small (512 MB src + 512 MB dest), so two
concurrent instances comfortably fit alongside the rest of the
parallel suite. The remaining sequential live-migration tests
(balloon, NUMA) genuinely need their isolation slot for memory
headroom.

Signed-off-by: Rob Bradford <rbradford@meta.com>
To aid debugging of this test failing print the output from the restored
VMM instance too.

Signed-off-by: Rob Bradford <rbradford@meta.com>
The four vhost-user device wrappers (blk, fs, generic_vhost_user, net)
each carried an identical reset() body that resumed the worker thread,
asked the backend to reset, signalled kill_evt and dropped interrupt_cb.
Move the shared body into VhostUserCommon::reset() so behaviour stays in
one place.

No behavioural change.

Signed-off-by: Rob Bradford <rbradford@meta.com>
Assisted-by: Claude:claude-opus-4-7
The virtio specification treats reset as the recovery operation and so
must take the device back to a fresh state, and the driver waits for the
status read-back to converge before continuing. There is no defined way
for the device to report a reset failure to the driver.

Previously the implementations of reset() would return early and not
complete all their cleanup leaving them in an inconsistent state. Now
log errors and continue through the execution.

Signed-off-by: Rob Bradford <rbradford@meta.com>
Assisted-by: Claude:claude-opus-4-7
Per the virtio spec a device reset must return the device to its
power-on state. The reset path was only zeroing queue_select. Add
VirtioPciCommonConfig::reset() and call it from the transport's reset
path so the configuration is cleared. Also relax the condition to allow
the device to be reset at any time to match the spec.

Signed-off-by: Rob Bradford <rbradford@meta.com>
Previously the interrupt was created in VirtioPciDevice, moved via the
Option::take() to the VirtioPciDeviceActivator and then moved to the
VirtioDevice upon activation. On reset it would be moved back ready for
reactivation.

Since this already an Arc type remove the wrapping Option and instead
refcount it such that the VirtioPciDevice can continue to hold onto it
for later activations.

This significantly simplifies the reset() logic as there is no need to
hand back the interrupt.

A few devices used whether the interrupt was Some to make triggering an
interrupt a no-op. However the MSI-X interrupt routing already drops the
interrupt if the driver hasn't yet configured the vector so it is safe
to trigger the interrupt before device activation (e.g. balloon resize
request before driver loaded).

VirtioCommon still retains an Option<..> for the interrupt as the
interrupt is not known until activation time (after this has been
created). A helper VirtioCommon::trigger_interrupt() has been added to
handle this.

Signed-off-by: Rob Bradford <rbradford@meta.com>
Rather than calculate in the DeviceManager and pass it through do it in
the device where it already has all the required information.

Signed-off-by: Rob Bradford <rbradford@meta.com>
msix_num is guaranteed to be at least 1 so this check (and the Option)
that it returns can be removed.

Signed-off-by: Rob Bradford <rbradford@meta.com>
Since this is always created there is no need to make it an Option type
simplifying the code. Historically it was an Option to support INTx
based virtio but that was removed long ago.

Signed-off-by: Rob Bradford <rbradford@meta.com>
The VirtioInterrupt is now always created so the Option<..> can always
be removed.

As a side effect the interrupt_source_group can also be removed from the
struct.

Signed-off-by: Rob Bradford <rbradford@meta.com>
Introduce QcowDisk, a unified DiskFile implementation for QCOW2
disk images that handles backend selection at runtime via a
use_io_uring flag, matching the pattern used by FixedVhdDisk.

The wrapper delegates to QcowSync or QcowAsync based on the flag
and includes a compile time guard that returns an error when
io_uring is requested but the feature is not enabled.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace QcowDiskSync and QcowDiskAsync with QcowDisk in all QCOW2
benchmark helpers. The sync helpers pass use_io_uring=false, the
async helpers pass use_io_uring=true.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Update open_qcow2 to construct QcowDisk instead of choosing
between QcowDiskAsync and QcowDiskSync. The backend decision
is now made inside QcowDisk::create_async_io.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace QcowDiskSync and QcowDiskAsync constructors in the
qcow_sync and qcow_async test modules with QcowDisk::new,
passing use_io_uring=false and use_io_uring=true respectively.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Delete QcowDiskSync and QcowDiskAsync wrapper structs along with
their DiskFile trait impls. Only the AsyncIo worker structs
QcowSync and QcowAsync remain. Reduce module visibility of
qcow_sync and qcow_async to pub(crate).

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that QcowDisk reports the expected virtual size.

Assisted-by: Claude:Opus-4.6
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that the sync backend disables batch requests and the
io_uring backend enables them.

Assisted-by: Claude:Opus-4.6
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that try_clone preserves the backend dispatch for both
sync and io_uring backends.

Assisted-by: Claude:Opus-4.6
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that a freshly created sparse QCOW2 image reports a
physical size smaller than its logical size.

Assisted-by: Claude:Opus-4.6
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for backend that is connected via the vhost-user-generic
frontend to generate an interrupt into the guest when it has made a
change to the configuration. This is useful for devices that can change
the exposed configuration at runtime.

Assisted-by: Claude:Opus-4.7
Signed-off-by: Rob Bradford <rbradford@meta.com>
The PageTypeConfig introduced for multi-hypervisor support
incorrectly mapped MSHV CPUID pages to NORMAL (type 0) and
secrets pages to UNMEASURED (type 3). The correct MSHV values
are CPUID (type 5) and SECRETS (type 4).

This caused MSHV CVM guests to fail during boot with a
segmentation fault, since the SNP firmware could not
recognize the CPUID or secrets pages, leading to missing
CPU feature data and invalid attestation secrets.

Fixes a regression introduced in commit 75ed2c9 ("vmm:
add KVM SEV-SNP support to IGVM loader").

Signed-off-by: Muminul Islam <muislam@microsoft.com>
Commit 7d65187 ("vmm: make RSDP address optional in
configure_system") changed the boot path to always call
configure_system(), even when rsdp_addr is None. Previously,
the call was guarded by `if let Some(rsdp_adr) = rsdp_addr`,
which meant SEV-SNP guests (where rsdp_addr is None because
ACPI tables come from IGVM) skipped configure_system entirely.

After the refactor, configure_system runs unconditionally and
writes EBDA pointers, SMBIOS tables, MP tables, and boot
parameters into guest memory. For SEV-SNP guests this corrupts
encrypted/validated pages set up by the IGVM loader, causing a
segmentation fault during boot.

Restore the original behavior by gating configure_system on
rsdp_addr.is_some(), so SEV-SNP guests skip the call as they
did before the refactor.

Signed-off-by: Muminul Islam <muislam@microsoft.com>
The SnpCpuidInfo allocation and guest_memory.read() in the
page import loop are only needed for the KVM CPUID retry
path. On MSHV, reading back from protected guest memory at
this point is invalid and can cause a segmentation fault.

Gate both the SnpCpuidInfo variable and the read behind
cfg(feature = "kvm") and a runtime hypervisor type check
so MSHV skips this code entirely.

Signed-off-by: Muminul Islam <muislam@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.