Muislam/fix mshv cvm boot#124
Open
russell-islam wants to merge 838 commits into
Open
Conversation
Single allocated cluster reads are submitted to io_uring for true async completion. Mixed mapping reads (zero, compressed, backing, multi cluster) fall back to synchronous pread64 with synthetic completions. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Synchronous per cluster write path - gather guest data from iovecs, map each cluster through QcowMetadata, and pwrite to the allocated host offset. Partial cluster writes with a backing file read the backing data first so map_cluster_for_write can perform COW. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Flush dirty metadata caches and sync the underlying file via QcowMetadata::flush, then signal synthetic completion. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Deallocate clusters through QcowMetadata::deallocate_bytes, then apply the resulting DeallocAction list (punch hole or write zeroes at host offsets). write_zeroes delegates to punch_hole since unallocated QCOW2 clusters inherently read as zero. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
try_clone shares the Arc wrapped metadata and backing file. new_async_io creates a QcowAsync worker with its own io_uring instance for the given ring depth. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When io_uring is available and not disabled, open QCOW2 images with QcowDiskAsync for asynchronous reads. Falls back to QcowDiskSync otherwise. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Implement batch_requests_enabled() and submit_batch_requests() for QcowAsync. Without batching, each read_vectored call performs its own io_uring submit() syscall. With batching, the virtio queue handler collects all pending requests and submits them in a single call, pushing multiple SQEs before one submit() syscall. Each request in the batch is classified through the metadata layer. Requests that hit the fast path (single allocated cluster mapping) are pushed to the io_uring submission queue. Requests that require the slow path (compressed, backing, zero fill, or mixed mappings) are completed synchronously and queued as synthetic completions. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for punch hole completion. The test verifies that a punch hole request reports successful completion and that the deallocated range reads back as zeroes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test for write zeroes completion. The test verifies that a write zeroes request reports successful completion and that the zeroed range reads back as zeroes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes a byte pattern through write_vectored, reads it back through read_vectored, and verifies the data matches. This exercises the core async write and read paths end to end. Also add an async_write helper for future tests. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes distinct patterns into two adjacent clusters, then issues a single read spanning the cluster boundary. Verifies that multi mapping read resolution returns the correct data from both clusters. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that submits a batch of interleaved write and read requests via submit_batch_requests. Verifies that all completions arrive with the correct user_data and that the read back data matches the written data. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that reads from a range that was never written. Verifies the fundamental QCOW contract that unallocated clusters return zeroes. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add a QcowAsync unit test that writes 4K into the middle of a cluster, then reads the entire cluster back. Verifies that the written region matches and surrounding bytes remain zero. This exercises the COW path where unwritten parts of a newly allocated cluster must be zero filled. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write data, punch hole to deallocate, then rewrite the same range and verify the new contents read back correctly. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Write a distinct byte pattern into each of eight consecutive clusters in a single operation, then read the full range back and verify per cluster contents. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The unit test container runs with Docker default seccomp profile which blocks io_uring_setup, io_uring_enter and io_uring_register. This causes all qcow_async unit tests to fail with EPERM when creating an io_uring instance. Add --security-opt seccomp=unconfined to the unit test docker run invocation. The container already has --device access and cap_net_admin, so this does not materially change the security posture. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The address that is passed from the guest should be treated as
untrusted. Currently an invalid address will panic the VMM. This only
allows the guest to hurt itself, but we shouldn't have the VMM crashing.
Instead let's return an error if possible or invalidate the queue if it
happen during setup.
The data flow from guest to translate_gva/translate_gpa is:
1. Guest writes a raw u64 address into a virtio descriptor in the
shared descriptor table (guest memory).
2. The virtio-queue crate reads this descriptor via read_obj() and
returns the addr field as-is in a GuestAddress — no validation.
3. Device code calls .translate_gva(access_platform, len) on the
GuestAddress.
4. With IOMMU (access_platform is Some): the address is an IOVA that
must be translated to a GPA via the IOMMU mapping table. If the
guest provides an unmapped IOVA, translation returns Err.
Previously, .unwrap() here panicked the VMM.
5. Without IOMMU (access_platform is None): translate_gva is a no-op
(returns self). The raw address flows to GuestMemory::read_obj()
which validates it — out-of-range addresses return
Err(InvalidGuestAddress), so no host memory corruption is possible.
Signed-off-by: Dylan Reid <dgreid@fb.com>
Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 2 to 3. - [Release notes](https://github.com/softprops/action-gh-release/releases) - [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md) - [Commits](softprops/action-gh-release@v2...v3) --- updated-dependencies: - dependency-name: softprops/action-gh-release dependency-version: '3' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
This is a prerequisite for the next commit where we need shared access. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Extract AcpiCpuHotplugController from CpuManager and move the BusDevice implementation to the new type. This separates VMM-internal vCPU management from the guest-visible ACPI CPU hotplug MMIO interface. Besides clarifying responsibilities and reducing technical debt, this fixes a rare deadlock involving pause handling and MMIO access. New responsibilities: - CpuManager manages VMM-internal vCPU lifecycle and coordination - AcpiCpuHotplugController implements the guest-visible ACPI CPU hotplug MMIO interface A vCPU thread may exit KVM_RUN to perform an MMIO access previously handled by CpuManager. If the VMM thread begins processing a `pause` event before that MMIO operation acquires access to CpuManager, CpuManager::pause() will block waiting for the vCPU thread to ACK the pause, while the vCPU thread is blocked waiting to complete the MMIO operation through the same CpuManager - which it can never lock - the VMM is deadlocked. This can occur during early boot or CPU hotplug when pause events race with MMIO accesses. The issue is rare and timing-dependent, but real. For reproducing: run `ch-remote pause|resume` in a loop while booting a Linux VM (via direct kernel boot). With the new design, these MMIO operations no longer depend on CpuManager, which removes the deadlock path entirely. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This improves the documentation at various places. On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
TL;DR: Add note about how we expect code comments/documentation This updates the coding standards as discussed [0]. The general guideline is to write down as little process as possible and leave room for pragmatic exceptions, maintainer and contributor preferences while still striving for excellent code quality. [0]: cloud-hypervisor#7990 (comment) On-behalf-of: SAP philipp.schuster@sap.com Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The old name read as 'metadata for a QCOW2 backing file' rather than what it actually is: a QCOW2 backing file reader. Rename to Qcow2Backing to parallel RawBacking and clarify intent. Suggested-by: Philipp Schuster <philipp.schuster@cyberus-technology.de> Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add pub fn alignment() to RawFile so that callers can query the O_DIRECT buffer alignment requirement probed at file open time. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
When the data file is opened with O_DIRECT, buffer address, length, and file offset must satisfy the device alignment. Add AlignedBuf RAII wrapper and aligned_pread/aligned_pwrite helpers in qcow_common that use bounce buffers when alignment constraints are not met. For writes with misaligned offset, a read modify write is performed on the aligned region. gather_from_iovecs_into gathers iovec data directly into a caller provided buffer, avoiding an intermediate Vec allocation. Fixes: cloud-hypervisor#8007 Signed-off-by: CMGS <ilskdw@gmail.com> Co-authored-by: Anatol Belski <anbelski@linux.microsoft.com> Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Store the alignment from the data file in QcowSync. Use AlignedBuf directly in read_vectored and write_vectored as the intermediate buffer so that aligned_pread/aligned_pwrite can skip the bounce copy when offset and length are naturally aligned. Use gather_from_iovecs_into to gather iovec data directly into the aligned buffer. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Store the alignment from the data file in QcowAsync. Use aligned_pread in scatter_read_sync and aligned_pwrite with gather_from_iovecs_into in cow_write_sync, matching the QcowSync approach. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add direct_io variants for suitable tests by extracting test bodies into _impl(direct_io: bool) functions. Each original test calls _impl(false) and a new _direct_io test calls _impl(true). When direct_io is true, RawFile probes alignment and QcowSync exercises the AlignedBuf and bounce buffer paths in read_vectored and write_vectored. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The GvaGpaValid flag in the intercept message indicates whether the provided GPA corresponds to the decoded GVA. Without checking this flag, the emulator may incorrectly use a stale GPA mapping when the hypervisor invalidates it. Add a check for the GvaGpaValid flag before using the cached (GVA, GPA) mapping. If the flag is clear, use a sentinel value to force translate() to perform a proper hypercall-based translation instead of using an invalid cached mapping. Signed-off-by: Pedro Barbuda <pbarbuda@microsoft.com> Signed-off-by: Muminul Islam <muislam@microsoft.com>
This daemon is only a testing tool used during integration testing. The previous code auto-detected the image type from the file's magic bytes and opened qcow2 images via QcowFile. Such behaviour has been the cause of security issues in the past. Drop the qcow2 detection and open path so only raw images are handled. Signed-off-by: Rob Bradford <rbradford@meta.com>
The aarch64 integration script hardcoded `LAST_RELEASE_VERSION="v39.0"` for the live-upgrade binary download, while the live-migration runner already accepts a `MIGRATABLE_VERSION` env override with a `vxx.0` regex check. Standardise the aarch64 script on the same env-override block so both arches honour the same knob with the same validation. Default is unchanged (v39.0). Assisted-by: Claude:Opus-4.7 Signed-off-by: Rob Bradford <rbradford@meta.com>
Move the live migration test running from their own script into the x86-64 script (on aarch64 they were already in the same script.) They were historically separate as they were new. Now they are established it makes sense for them to be combined. The timeout in the GitHub workflow has been extended to accommodate the extra work in the same step. The Rust test scopes are unchanged - the running of the tests has been moved. Assisted-by: Claude:Opus-4.7 Signed-off-by: Rob Bradford <rbradford@meta.com>
Move the helper methods used for live migration to the common utils (like many other tests use). Assisted-by: Claude:Opus-4.7 Signed-off-by: Rob Bradford <rbradford@meta.com>
Move the live migration tests themselves into the common scopes allowing the tests to now run interleaved together hopefully reducing CI time. On MSHV the live migration tests are now not compiled in rather than compiled in and skipped (as the helpers are not compiled in for MSHV.) Assisted-by: Claude:Opus-4.7 Signed-off-by: Rob Bradford <rbradford@meta.com>
Reduce `nr_hugepages` from 12 GB to 6 GB on both architectures. The number if huge pages needed (if all the tests run at once) is 4GiB so this gives 50% headroom. This should reduce the number of tests that fail/flake out due to lack of memory. Signed-off-by: Rob Bradford <rbradford@meta.com>
`test_live_migration_virtio_fs` and its `_local` variant lived in `common_sequential` because they shared `~/workloads/shared_dir` as the virtiofsd backing and wrote/deleted the same `migration_test_file` and `post_migration_file` paths inside it. Two instances running concurrently would race on those files. Give each invocation its own backing directory under `guest.tmp_dir`, which is already per-test unique and gets cleaned up by the `TempDir` drop. The test logic is otherwise unchanged. Move both wrappers and the helper from `common_sequential` to `common_parallel` and update the sequential-tests comment accordingly. The boot footprint is small (512 MB src + 512 MB dest), so two concurrent instances comfortably fit alongside the rest of the parallel suite. The remaining sequential live-migration tests (balloon, NUMA) genuinely need their isolation slot for memory headroom. Signed-off-by: Rob Bradford <rbradford@meta.com>
To aid debugging of this test failing print the output from the restored VMM instance too. Signed-off-by: Rob Bradford <rbradford@meta.com>
The four vhost-user device wrappers (blk, fs, generic_vhost_user, net) each carried an identical reset() body that resumed the worker thread, asked the backend to reset, signalled kill_evt and dropped interrupt_cb. Move the shared body into VhostUserCommon::reset() so behaviour stays in one place. No behavioural change. Signed-off-by: Rob Bradford <rbradford@meta.com> Assisted-by: Claude:claude-opus-4-7
The virtio specification treats reset as the recovery operation and so must take the device back to a fresh state, and the driver waits for the status read-back to converge before continuing. There is no defined way for the device to report a reset failure to the driver. Previously the implementations of reset() would return early and not complete all their cleanup leaving them in an inconsistent state. Now log errors and continue through the execution. Signed-off-by: Rob Bradford <rbradford@meta.com> Assisted-by: Claude:claude-opus-4-7
Per the virtio spec a device reset must return the device to its power-on state. The reset path was only zeroing queue_select. Add VirtioPciCommonConfig::reset() and call it from the transport's reset path so the configuration is cleared. Also relax the condition to allow the device to be reset at any time to match the spec. Signed-off-by: Rob Bradford <rbradford@meta.com>
Previously the interrupt was created in VirtioPciDevice, moved via the Option::take() to the VirtioPciDeviceActivator and then moved to the VirtioDevice upon activation. On reset it would be moved back ready for reactivation. Since this already an Arc type remove the wrapping Option and instead refcount it such that the VirtioPciDevice can continue to hold onto it for later activations. This significantly simplifies the reset() logic as there is no need to hand back the interrupt. A few devices used whether the interrupt was Some to make triggering an interrupt a no-op. However the MSI-X interrupt routing already drops the interrupt if the driver hasn't yet configured the vector so it is safe to trigger the interrupt before device activation (e.g. balloon resize request before driver loaded). VirtioCommon still retains an Option<..> for the interrupt as the interrupt is not known until activation time (after this has been created). A helper VirtioCommon::trigger_interrupt() has been added to handle this. Signed-off-by: Rob Bradford <rbradford@meta.com>
Rather than calculate in the DeviceManager and pass it through do it in the device where it already has all the required information. Signed-off-by: Rob Bradford <rbradford@meta.com>
msix_num is guaranteed to be at least 1 so this check (and the Option) that it returns can be removed. Signed-off-by: Rob Bradford <rbradford@meta.com>
Since this is always created there is no need to make it an Option type simplifying the code. Historically it was an Option to support INTx based virtio but that was removed long ago. Signed-off-by: Rob Bradford <rbradford@meta.com>
The VirtioInterrupt is now always created so the Option<..> can always be removed. As a side effect the interrupt_source_group can also be removed from the struct. Signed-off-by: Rob Bradford <rbradford@meta.com>
Introduce QcowDisk, a unified DiskFile implementation for QCOW2 disk images that handles backend selection at runtime via a use_io_uring flag, matching the pattern used by FixedVhdDisk. The wrapper delegates to QcowSync or QcowAsync based on the flag and includes a compile time guard that returns an error when io_uring is requested but the feature is not enabled. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace QcowDiskSync and QcowDiskAsync with QcowDisk in all QCOW2 benchmark helpers. The sync helpers pass use_io_uring=false, the async helpers pass use_io_uring=true. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Update open_qcow2 to construct QcowDisk instead of choosing between QcowDiskAsync and QcowDiskSync. The backend decision is now made inside QcowDisk::create_async_io. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Replace QcowDiskSync and QcowDiskAsync constructors in the qcow_sync and qcow_async test modules with QcowDisk::new, passing use_io_uring=false and use_io_uring=true respectively. Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Delete QcowDiskSync and QcowDiskAsync wrapper structs along with their DiskFile trait impls. Only the AsyncIo worker structs QcowSync and QcowAsync remain. Reduce module visibility of qcow_sync and qcow_async to pub(crate). Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that QcowDisk reports the expected virtual size. Assisted-by: Claude:Opus-4.6 Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that the sync backend disables batch requests and the io_uring backend enables them. Assisted-by: Claude:Opus-4.6 Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that try_clone preserves the backend dispatch for both sync and io_uring backends. Assisted-by: Claude:Opus-4.6 Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Verify that a freshly created sparse QCOW2 image reports a physical size smaller than its logical size. Assisted-by: Claude:Opus-4.6 Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Add support for backend that is connected via the vhost-user-generic frontend to generate an interrupt into the guest when it has made a change to the configuration. This is useful for devices that can change the exposed configuration at runtime. Assisted-by: Claude:Opus-4.7 Signed-off-by: Rob Bradford <rbradford@meta.com>
The PageTypeConfig introduced for multi-hypervisor support incorrectly mapped MSHV CPUID pages to NORMAL (type 0) and secrets pages to UNMEASURED (type 3). The correct MSHV values are CPUID (type 5) and SECRETS (type 4). This caused MSHV CVM guests to fail during boot with a segmentation fault, since the SNP firmware could not recognize the CPUID or secrets pages, leading to missing CPU feature data and invalid attestation secrets. Fixes a regression introduced in commit 75ed2c9 ("vmm: add KVM SEV-SNP support to IGVM loader"). Signed-off-by: Muminul Islam <muislam@microsoft.com>
Commit 7d65187 ("vmm: make RSDP address optional in configure_system") changed the boot path to always call configure_system(), even when rsdp_addr is None. Previously, the call was guarded by `if let Some(rsdp_adr) = rsdp_addr`, which meant SEV-SNP guests (where rsdp_addr is None because ACPI tables come from IGVM) skipped configure_system entirely. After the refactor, configure_system runs unconditionally and writes EBDA pointers, SMBIOS tables, MP tables, and boot parameters into guest memory. For SEV-SNP guests this corrupts encrypted/validated pages set up by the IGVM loader, causing a segmentation fault during boot. Restore the original behavior by gating configure_system on rsdp_addr.is_some(), so SEV-SNP guests skip the call as they did before the refactor. Signed-off-by: Muminul Islam <muislam@microsoft.com>
The SnpCpuidInfo allocation and guest_memory.read() in the page import loop are only needed for the KVM CPUID retry path. On MSHV, reading back from protected guest memory at this point is invalid and can cause a segmentation fault. Gate both the SnpCpuidInfo variable and the read behind cfg(feature = "kvm") and a runtime hypervisor type check so MSHV skips this code entirely. Signed-off-by: Muminul Islam <muislam@microsoft.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.