
[linux-nvidia-6.17-next] CXL VFIO: Add CXL Type-2 device passthrough support#407

Open
JiandiAnNVIDIA wants to merge 51 commits into NVIDIA:24.04_linux-nvidia-6.17-next from JiandiAnNVIDIA:cxl-vfio_2026-04-23

@JiandiAnNVIDIA JiandiAnNVIDIA commented May 6, 2026

Description

This patch series adds VFIO CXL Type-2 device passthrough support to the nvidia-6.17 kernel, enabling CXL-capable accelerator devices to be assigned to virtual machines via VFIO. It includes:

  1. VFIO get_region_info refactoring - Upstream series that splits VFIO_DEVICE_GET_REGION_INFO into its own driver op and introduces get_region_info_caps, which is a prerequisite for the CXL VFIO region implementation
  2. VFIO CXL Type-2 passthrough - Manish Honap's series adding CXL awareness to vfio-pci-core, including HDM decoder register emulation, DPA region mapping with demand-fault mmap, CXL DVSEC config virtualization, and CXL region management
  3. VFIO CXL guest-initiated reset - Manish Honap's RFC-v2 series enabling guest-initiated CXL protocol reset with HDM decoder base address preservation and DVSEC STATUS2 virtualization
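The refactoring in item 1 can be pictured with a small userspace model: rather than each driver decoding the ioctl itself, the core recognizes the region-info command and calls a dedicated op. This is a minimal sketch, not the kernel code. Every `ex_` name, the command numbers, and the struct layout are hypothetical stand-ins.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the region-info payload; not the UAPI struct. */
struct ex_region_info {
	uint32_t index;
	uint64_t size;
	uint64_t offset;
};

struct ex_device;

/* Hypothetical ops table: region info has its own op next to the generic ioctl. */
struct ex_device_ops {
	long (*ioctl)(struct ex_device *dev, unsigned int cmd, void *arg);
	int (*get_region_info)(struct ex_device *dev, struct ex_region_info *info);
};

struct ex_device {
	const struct ex_device_ops *ops;
	uint64_t region_size[2];
};

enum { EX_CMD_GET_REGION_INFO = 1, EX_CMD_OTHER = 2 }; /* illustrative numbers */

/* Core dispatch: decode the region-info command once, here, for all drivers. */
static long ex_core_ioctl(struct ex_device *dev, unsigned int cmd, void *arg)
{
	assert(dev && dev->ops);

	if (cmd == EX_CMD_GET_REGION_INFO) {
		if (!dev->ops->get_region_info)
			return -EINVAL;
		return dev->ops->get_region_info(dev, arg);
	}
	if (!dev->ops->ioctl)
		return -ENOTTY;
	return dev->ops->ioctl(dev, cmd, arg);
}

/* Example driver op: only fills in the answer, no command decoding. */
static int ex_drv_get_region_info(struct ex_device *dev, struct ex_region_info *info)
{
	if (info->index >= 2)
		return -EINVAL;
	info->size = dev->region_size[info->index];
	info->offset = (uint64_t)info->index << 40; /* arbitrary per-region offset */
	return 0;
}

static const struct ex_device_ops ex_drv_ops = {
	.get_region_info = ex_drv_get_region_info,
};
```

In this model a driver that only needs to report regions supplies one small op and leaves the generic `ioctl` hook unset.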

Key Features Added:

  • CXL Type-2 device detection and initialization within vfio-pci-core
  • HDM decoder register emulation framework for guest access
  • DPA (Device Physical Address) VFIO region with demand-fault mmap and reset zap
  • CXL DVSEC configuration space write virtualization
  • CXL component BAR sparse mmap advertisement to userspace
  • Guest-initiated CXL protocol reset (cxl_dev_reset)
  • HDM decoder base address preservation across reset
  • DVSEC STATUS2 register virtualization in vconfig shadow
  • Module parameter disable_cxl for per-device opt-out
  • UAPI header (include/uapi/cxl/cxl_regs.h) for CXL register defines
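To make the "HDM decoder base address preservation across reset" item concrete, here is a toy model of the save, reset, restore sequence. It is a sketch under assumptions: the struct layout, the decoder count, and the idea that reset clears all decoder programming are illustrative, and the mechanism in the actual series may differ. All `ex_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define EX_NR_DECODERS 4 /* illustrative decoder count */

/* Hypothetical per-decoder state; not the CXL register layout. */
struct ex_hdm_decoder {
	uint64_t base;
	uint64_t size;
	bool committed;
};

struct ex_cxl_dev {
	struct ex_hdm_decoder hw[EX_NR_DECODERS]; /* models device registers */
	uint64_t saved_base[EX_NR_DECODERS];      /* snapshot taken before reset */
	bool saved_valid[EX_NR_DECODERS];
};

/* Record the base of every committed decoder before the reset is issued. */
static void ex_save_bases(struct ex_cxl_dev *d)
{
	for (int i = 0; i < EX_NR_DECODERS; i++) {
		d->saved_valid[i] = d->hw[i].committed;
		d->saved_base[i] = d->hw[i].base;
	}
}

/* Model of a reset that clears all decoder programming (an assumption). */
static void ex_hw_reset(struct ex_cxl_dev *d)
{
	memset(d->hw, 0, sizeof(d->hw));
}

/* Write the recorded bases back so the guest sees the same addresses. */
static void ex_restore_bases(struct ex_cxl_dev *d)
{
	for (int i = 0; i < EX_NR_DECODERS; i++)
		if (d->saved_valid[i])
			d->hw[i].base = d->saved_base[i];
}

/* Guest-initiated reset path in this model: save, reset, restore. */
static void ex_guest_reset(struct ex_cxl_dev *d)
{
	assert(d);
	ex_save_bases(d);
	ex_hw_reset(d);
	ex_restore_bases(d);
}
```

Only decoders that were committed before the reset get their base written back; uncommitted ones stay cleared.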

LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2152222


Justification

VFIO CXL passthrough is required for assigning CXL Type-2 accelerator devices (GPUs, SmartNICs) to virtual machines.


Source

Patch Breakdown (51 patches):

| # | Category | Count | Source |
|---|----------|-------|--------|
| 1 | Upstream VFIO prerequisites (Hisilicon + nvgrace-gpu fix) | 3 | Upstream torvalds/master (merged) |
| 2 | VFIO get_region_info series | 22 | Upstream torvalds/master (merged in v6.19) |
| 3 | Manish Honap's VFIO CXL Type-2 series v2 (19/20, selftest skipped) | 19 | LKML (v2, not yet merged) |
| 4 | Manish Honap's VFIO CXL reset series RFC-v2 | 6 | Internal (RFC-v2, not yet merged) |
| 5 | Config annotations update | 1 | OOT (build config) |
| | TOTAL | 51 | |

Notes on upstream prerequisites (item 1):

Three upstream commits cherry-picked:

  • 4868d2d52df6 — crypto: hisilicon - qm updates BAR configuration
  • 2131c1517f30 — hisi_acc_vfio_pci: adapt to new migration configuration
  • 767b1ed8b980 — vfio/nvgrace-gpu: fix grammatical error

The first two resolve a dependency for e238f147d517 ("vfio/hisi: Convert
to the get_region_info op"). The third fixes a pre-existing comment typo in
the nvgrace-gpu driver that would otherwise cause a patch-ID mismatch with
upstream 1b0ecb5baf4a ("vfio/pci: Convert all PCI drivers to
get_region_info_caps").

Notes on the VFIO get_region_info series (item 2):

22 upstream commits from Jason Gunthorpe's series, already merged in v6.19:

https://lore.kernel.org/all/0-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com/

These refactor the VFIO region info infrastructure that the CXL VFIO
passthrough series depends on.

Notes on Manish's VFIO CXL series (item 3):

19 out of 20 patches ported from:

https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/

Patch 20/20 (selftests) was skipped as the upstream VFIO selftest
infrastructure (tools/testing/selftests/vfio/) is not present in
the NV-Kernels base.

Conflict resolutions were required for 10 of 19 patches due to the
NV-Kernels base diverging from upstream in two ways:

  1. CXL PCI function declarations (cxl_find_regblock,
    cxl_probe_component_regs, cxl_await_range_active,
    cxl_regblock_get_bar_info) are in include/cxl/pci.h
    unconditionally (per the Srirangan/Alejandro convention from PR #342, "Add CXL Type-2 device support, RAS error handling, reset, state save/restore, and interleaving support"),
    rather than in include/cxl/cxl.h with CONFIG_CXL_BUS guards
    as Manish's patches expect.
  2. Missing upstream xe driver, dmabuf, and p2pdma support causes
    context mismatches in Kconfig, Makefiles, and VFIO headers.
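The first divergence is purely about where and how the declarations are exposed. The sketch below contrasts the two header styles in plain C. It is illustrative only: `EX_CONFIG_CXL_BUS` stands in for `CONFIG_CXL_BUS`, the `ex_` functions are hypothetical, and the inline stub returning `-ENODEV` is an assumed (though common) kernel-header pattern, not something taken from either tree.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical symbol standing in for CONFIG_CXL_BUS; left undefined here. */
/* #define EX_CONFIG_CXL_BUS 1 */

/* Guarded style: callers always compile, but get a stub when the bus is off. */
#ifdef EX_CONFIG_CXL_BUS
int ex_guarded_find_regblock(int type);
#else
static inline int ex_guarded_find_regblock(int type)
{
	(void)type;
	return -ENODEV; /* assumed stub behaviour */
}
#endif

/* Unconditional style: one declaration, the definition must always be linked. */
int ex_plain_find_regblock(int type);

int ex_plain_find_regblock(int type)
{
	assert(type >= 0);
	return type == 0 ? 0 : -ENOENT; /* toy lookup: only type 0 exists */
}
```

A patch written against one style adds its declarations inside a different file and a different preprocessor context than the other, which is enough to make an otherwise clean cherry-pick conflict.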

Notes on Manish's CXL reset series (item 4):

6 patches from internal RFC-v2 posting:

  • [RFC-v2 0/6] vfio/cxl: Guest-initiated CXL protocol reset

Patch 1/6 had a conflict resolution identical to item 3 (declarations
added to include/cxl/pci.h instead of include/cxl/cxl.h).

Lore Links:

Upstream Status:

| Series | Status |
|--------|--------|
| 3 upstream prerequisites (Hisilicon + nvgrace-gpu) | ✅ Merged in torvalds/master |
| 22 VFIO get_region_info commits | ✅ Merged in torvalds/master (v6.19) |
| Manish VFIO CXL v2 (19 patches) | ⏳ Under review, not yet merged |
| Manish VFIO CXL reset RFC-v2 (6 patches) | ⏳ Internal, not yet posted upstream |

Testing

Build Validation:

  • Built successfully for ARM64 4K page size kernel
  • Built successfully for ARM64 64K page size kernel

Config Verification:

CXL VFIO config enabled:

CONFIG_VFIO_CXL_CORE=y

Runtime Testing:

  • Boot test on ARM64 system
  • CXL Type-2 device enumeration via VFIO
  • CXL DPA region mmap from guest
  • CXL guest-initiated reset test

Notes

  • CONFIG_VFIO_CXL_CORE is a new bool config enabled for both amd64 and
    arm64. It depends on VFIO_PCI_CORE (module), CXL_BUS (built-in), and
    CXL_MEM (built-in). As a bool, it compiles into the vfio-pci-core module.
  • This series depends on the CXL infrastructure established in PR #342
    (Alejandro's v23, Srirangan's save/restore and reset series).
  • A new UAPI header include/uapi/cxl/cxl_regs.h is introduced for CXL
    component and HDM register defines, using UAPI-safe macros (__GENMASK,
    _BITUL) and raw hex sizes instead of kernel-internal SZ_* macros.
  • Patch 20/20 of Manish's series (CXL Type-2 VFIO assignment selftest) was
    intentionally skipped as the upstream VFIO selftest infrastructure is not
    present in the NV-Kernels base.
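The UAPI-header note above is about avoiding kernel-internal helpers in a header that userspace also includes. A rough sketch of that style follows, using local equivalents so it builds anywhere. The `EX_` macros are hypothetical stand-ins for `_BITUL` and `__GENMASK`, and the field positions and block size are made up for illustration, not copied from `cxl_regs.h`.

```c
#include <assert.h>
#include <stdint.h>

/* Local equivalents of the UAPI helpers; EX_ names are hypothetical. */
#define EX_BITUL(n)       (1UL << (n))
#define EX_GENMASK(h, l)  (((~0UL) << (l)) & \
			   (~0UL >> (sizeof(unsigned long) * 8 - 1 - (h))))

/* Illustrative register fields; positions are invented for the example. */
#define EX_CAP_COUNT_MASK EX_GENMASK(3, 0)
#define EX_CTRL_COMMIT    EX_BITUL(9)
#define EX_BLOCK_SIZE     0x1000 /* raw hex instead of a kernel-only SZ_4K */

static_assert(EX_CAP_COUNT_MASK == 0xfUL, "mask covers bits 3:0");
static_assert(EX_CTRL_COMMIT == 0x200UL, "bit 9");

/* Extract a field; mask must be non-zero. */
static inline unsigned long ex_field_get(unsigned long mask, unsigned long reg)
{
	/* Dividing by the mask's lowest set bit shifts the field down to bit 0. */
	return (reg & mask) / (mask & -mask);
}
```

Because nothing here depends on kernel-only headers, the same definitions can be shared by the driver and by a VMM that emulates or inspects the registers.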

Longfang Liu added 2 commits May 4, 2026 20:44
crypto: hisilicon - qm updates BAR configuration

On new platforms greater than QM_HW_V3, the configuration region for the
live migration function of the accelerator device is no longer
placed in the VF, but is instead placed in the PF.

Therefore, the configuration region of the live migration function
needs to be opened when the QM driver is loaded. When the QM driver
is uninstalled, the driver needs to clear this configuration.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20251030015744.131771-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 4868d2d)
Signed-off-by: Jiandi An <jan@nvidia.com>

hisi_acc_vfio_pci: adapt to new migration configuration

On new platforms greater than QM_HW_V3, the migration region has been
relocated from the VF to the PF. The VF's own configuration space is
restored to the complete 64KB, and there is no need to divide the
size of the BAR configuration space equally. The driver should be
modified accordingly to adapt to the new hardware device.

On the older hardware platform QM_HW_V3, the live migration configuration
region is placed in the latter 32K portion of the VF's BAR2 configuration
space. On the new hardware platform QM_HW_V4, the live migration
configuration region also exists in the same 32K area immediately following
the VF's BAR2, just like on QM_HW_V3.

However, access to this region is now controlled by hardware. Additionally,
a copy of the live migration configuration region is present in the PF's
BAR2 configuration space. On the new hardware platform QM_HW_V4, when an
older version of the driver is loaded, it behaves like QM_HW_V3 and uses
the configuration region in the VF, ensuring that the live migration
function continues to work normally. When the new version of the driver is
loaded, it directly uses the configuration region in the PF. Meanwhile,
hardware configuration disables the live migration configuration region
in the VF's BAR2: reads return all 0xF values, and writes are silently
ignored.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Link: https://lore.kernel.org/r/20251030015744.131771-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 2131c15)
Signed-off-by: Jiandi An <jan@nvidia.com>
github-actions Bot commented May 6, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ✅ All checks passed

Details
Checking 51 commits...

Cherry-pick digest:
┌──────────────┬──────────────────────────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject                              │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ aef7e33d0e74 │ [SAUCE] config: enable config_vfio_cxl_core for cxl type-2 passt │ N/A        │ N/A     │ jan                       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e8c83314f4be │ [SAUCE] vfio/cxl: implement vfio_cxl_reset()                     │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 37fca8563f93 │ [SAUCE] vfio/cxl: virtualize dvsec status2 register in vconfig s │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 15ef3e96a36c │ [SAUCE] vfio/cxl: preserve hdm decoder base addresses across res │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 8c92d195d1d0 │ [SAUCE] vfio/cxl: ensure pci memory space is enabled before post │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e5183b49c784 │ [SAUCE] vfio/pci: wire cxl dpa reset handling                    │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d53532800534 │ [SAUCE] cxl: export the cxl reset helpers for vfio users         │ N/A        │ N/A     │ mhonap, jan               │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 646f12a6a8f8 │ docs: vfio-pci: document cxl type-2 device passthrough           │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5bc0b3ea82af │ vfio/cxl: provide opt-out for cxl feature                        │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 534faac8aa63 │ vfio/pci: advertise cxl cap and sparse component bar to userspac │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 24dd6678d476 │ vfio/cxl: register regions with vfio layer                       │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 1447b99bccc3 │ vfio/cxl: virtualize cxl dvsec config writes                     │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 05b9195202cc │ vfio/cxl: dpa vfio region with demand fault mmap and reset zap   │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fb580ac046d5 │ vfio/cxl: cxl region management support                          │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d64c61c2ba91 │ vfio/cxl: wait for hdm ranges and create memdev                  │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ ad3979839aba │ vfio/cxl: introduce hdm decoder register emulation framework     │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0fbd7b2effd7 │ vfio/pci: export config access helpers                           │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 84fbfbcead31 │ vfio/cxl: detect cxl dvsec and probe hdm block                   │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ cb87876e8e1d │ vfio/pci: add config_vfio_cxl_core and stub cxl hooks            │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ de3e1a60a995 │ vfio/pci: add cxl state to vfio_pci_core_device                  │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 05c1da9d786a │ vfio: uapi for cxl-capable pci device assignment                 │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d3141453f48b │ cxl: record bir and bar offset in cxl_register_map               │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d0fde9879972 │ cxl: split cxl_await_range_active() from media-ready wait        │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 199d5d2f2ca4 │ cxl: move component/hdm register defines to uapi/cxl/cxl_regs.h  │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e02c1b7ac02a │ cxl: declare cxl_find_regblock and cxl_probe_component_regs in p │ noted      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fd317b86093e │ cxl: add cxl_get_hdm_info() for hdm decoder metadata             │ match      │ found   │ ok, backporter: jan       │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 54d50bbc6111 │ 56c069307dfd vfio: Remove the get_region_info op                 │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 21085759fbcd │ dc10734610e2 vfio: Move the remaining drivers to get_region_info │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ c0ad388ba741 │ 182c62861ba5 vfio/platform: Convert to get_region_info_caps      │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2bf5a2cbb154 │ 1b0ecb5baf4a vfio/pci: Convert all PCI drivers to get_region_inf │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ bc1c993e783d │ 973af0c40eaf vfio/ccw: Convert to get_region_info_caps           │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0282af066b10 │ 93165757c023 vfio/gvt: Convert to get_region_info_caps           │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 29e1217fd909 │ 45f9fa18109d vfio/mbochs: Convert mbochs to use vfio_info_add_ca │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 7dd77b841190 │ 775f726a742a vfio: Add get_region_info_caps op                   │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e7da10685f7f │ f97859503859 vfio: Require drivers to implement get_region_info  │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 6c250ce18f9e │ e664067b6035 vfio/gvt: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 76b5171d117d │ 61b3f7b5a729 vfio/ccw: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 619333df0ce8 │ b9827eff6b4a vfio/cdx: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 8ba94bf6a94e │ 6cdae5d0c326 vfio/fsl: Provide a get_region_info op              │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 073f13c17982 │ d4635df279f5 vfio/platform: Provide a get_region_info op         │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 554dca9a1de1 │ 8339fccda837 vfio/mbochs: Provide a get_region_info op           │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0fbfd736592c │ cf16acc0af09 vfio/mdpy: Provide a get_region_info op             │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 4df20815cb64 │ 078775527109 vfio/mtty: Provide a get_region_info op             │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ e54b8e086acd │ f3fddb71dd50 vfio/pci: Fill in the missing get_region_info ops   │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 702622746ce4 │ 5ac720647477 vfio/nvgrace: Convert to the get_region_info op     │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ fad0d0d38ca4 │ c044eefa4786 vfio/virtio: Convert to the get_region_info op      │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 6b97c1b33bef │ e238f147d517 vfio/hisi: Convert to the get_region_info op        │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 897cefa739f7 │ 113557b04068 vfio: Provide a get_region_info op                  │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 449e051b54c2 │ 767b1ed8b980 vfio/nvgrace-gpu: fix grammatical error             │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 0c7d38232410 │ 2131c1517f30 hisi_acc_vfio_pci: adapt to new migration configura │ match      │ match   │ preserved + jan added     │
├──────────────┼──────────────────────────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 38c6eb3eed52 │ 4868d2d52df6 crypto: hisilicon - qm updates BAR configuration    │ match      │ match   │ preserved + jan added     │
└──────────────┴──────────────────────────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

@JiandiAnNVIDIA JiandiAnNVIDIA changed the title from "CXL VFIO: Add CXL Type-2 device passthrough support" to "[linux-nvidia-6.17-next] CXL VFIO: Add CXL Type-2 device passthrough support" May 6, 2026
Morduan Zang and others added 26 commits May 6, 2026 02:03
vfio/nvgrace-gpu: fix grammatical error

The word "as" in the comment should be replaced with "is",
and there is an extra space in the comment.

Signed-off-by: Morduan Zang <zhangdandan@uniontech.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/54E1ED6C5A2682C8+20250814110358.285412-1-zhangdandan@uniontech.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
(cherry picked from commit 767b1ed)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio: Provide a get_region_info op

Instead of hooking the general ioctl op, have the core code directly
decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it.

This is intended to allow mechanical changes to the drivers to pull their
VFIO_DEVICE_GET_REGION_INFO into a function. Later patches will improve
the function signature to consolidate more code.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 113557b)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/hisi: Convert to the get_region_info op

Change the function signature of hisi_acc_vfio_pci_ioctl()
and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(backported from commit e238f14)
[jan: resolve minor conflict in hisi_acc_vfio_pci_ioctl()]
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/virtio: Convert to the get_region_info op

Remove virtiovf_vfio_pci_core_ioctl() and change the signature of
virtiovf_pci_ioctl_get_region_info().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit c044eef)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/nvgrace: Convert to the get_region_info op

Change the signature of nvgrace_gpu_ioctl_get_region_info().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 5ac7206)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/pci: Fill in the missing get_region_info ops

Now that every variant driver provides a get_region_info op, remove the
ioctl based dispatch from vfio_pci_core_ioctl().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit f3fddb7)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/mtty: Provide a get_region_info op

Move it out of mtty_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 0787755)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/mdpy: Provide a get_region_info op

Move it out of mdpy_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit cf16acc)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/mbochs: Provide a get_region_info op

Move it out of mbochs_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 8339fcc)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/platform: Provide a get_region_info op

Move it out of vfio_platform_ioctl() and re-indent it. Add it to all
platform drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit d4635df)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/fsl: Provide a get_region_info op

Move it out of vfio_fsl_mc_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 6cdae5d)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/cdx: Provide a get_region_info op

Change the signature of vfio_cdx_ioctl_get_region_info() and hook it to
the op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit b9827ef)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/ccw: Provide a get_region_info op

Move it out of vfio_ccw_mdev_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/12-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 61b3f7b)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/gvt: Provide a get_region_info op

Move it out of intel_vgpu_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit e664067)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio: Require drivers to implement get_region_info

Remove the fallback through the ioctl callback; no drivers use this now.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit f978595)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio: Add get_region_info_caps op

This op does the copy to/from user for the info and can return back
a cap chain through a vfio_info_cap * result.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 775f726)
Signed-off-by: Jiandi An <jan@nvidia.com>
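
The cap chain returned through this op is a linked list of headers inside the info buffer, each giving the offset of the next one. As a rough illustration of how a consumer might walk such a chain, here is a self-contained sketch. `struct ex_cap_header` is a local mirror of the header shape (id, version, next offset) and `ex_find_cap()` is a hypothetical helper, not a kernel or libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the capability header shape (id, version, next offset). */
struct ex_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next; /* offset of next capability from buffer start, 0 = end */
};

/*
 * Walk a capability chain inside buf, starting at first_off, and return the
 * offset of the capability with the wanted id, or 0 if it is absent.
 */
static uint32_t ex_find_cap(const void *buf, size_t len, uint32_t first_off,
			    uint16_t id)
{
	uint32_t off = first_off;
	size_t hops = 0;

	assert(buf);
	while (off != 0 && hops++ < len) { /* hop limit guards against loops */
		struct ex_cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return 0; /* header would run past the buffer */
		memcpy(&hdr, (const uint8_t *)buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

The bounds check and hop limit matter because the chain offsets come from another component; a consumer should not trust them to stay inside the buffer or to terminate.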
This driver open codes the cap chain manipulations. Instead use
vfio_info_add_capability() and the get_region_info_caps() op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/16-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 45f9fa1)
Signed-off-by: Jiandi An <jan@nvidia.com>

vfio/gvt: Convert to get_region_info_caps

Remove the duplicate code and change info to a pointer.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/17-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 9316575)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and flatten the call chain.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/18-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 973af0c)
Signed-off-by: Jiandi An <jan@nvidia.com>
Since the core function signature changes it has to flow up to all
drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 1b0ecb5)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 182c628)
Signed-off-by: Jiandi An <jan@nvidia.com>
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit dc10734)
Signed-off-by: Jiandi An <jan@nvidia.com>
No driver uses it now, all are using get_region_info_caps().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/22-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
(cherry picked from commit 56c0693)
Signed-off-by: Jiandi An <jan@nvidia.com>
cxl_probe_component_regs() finds the HDM decoder block during device probe
and caches its location, but does not record the decoder count and does
not expose the result outside drivers/cxl/.

vfio-cxl needs the decoder count and the byte offset and size of the HDM
block without re-running the probe sequence. Record decoder_cnt in
rmap->count when parsing the HDM capability in cxl_probe_component_regs(),
extend struct cxl_reg_map with a count member, and add cxl_get_hdm_info()
to return offset, size, and count from the cached map.

Export under the CXL namespace; stub to -EOPNOTSUPP when CONFIG_CXL_BUS
is off.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…nent_regs in public header

vfio-cxl lives outside drivers/cxl/ but still needs to locate the
component register block and fill cxl_component_reg_map. Those
prototypes were stuck in the internal drivers/cxl/cxl.h.

Move the declarations to include/cxl/cxl.h next to the other
vfio-facing hooks, with stubs when CXL bus support is disabled.
Drop the duplicate prototypes from the private header.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Move cxl_probe_component_regs() to include/cxl/pci.h instead of include/cxl/cxl.h to align with existing Srirangan/Alejandro convention; skip cxl_find_regblock() move as it is already in include/cxl/pci.h; add struct cxl_component_reg_map forward declaration]
Signed-off-by: Jiandi An <jan@nvidia.com>
…xl/cxl_regs.h

VFIO and other code outside the CXL core needs the same offset/mask
constants the core uses for the component register block and HDM
decoders.

Pull them into a new include/uapi/cxl/cxl_regs.h
(GPL-2.0 WITH Linux-syscall-note) and include it from
include/cxl/cxl.h. Use the uapi-friendly __GENMASK helpers where
needed. Section comments in the new file reference CXL spec r4.0 numbering.

For the UAPI change, replace SZ_64K with the literal size, as the macro
is not available to userspace programs.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Remove defines from include/cxl/cxl.h instead of drivers/cxl/cxl.h as they were already moved there by Srirangan's SAUCE commit]
Signed-off-by: Jiandi An <jan@nvidia.com>
mmhonap added 12 commits May 6, 2026 02:03
Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
build rules to compile CXL.mem passthrough infrastructure for
vendor-specific CXL devices into the vfio-pci-core module.  The new
option depends on VFIO_PCI_CORE, CXL_BUS and CXL_MEM.

Wire up the detection and cleanup entry-point stubs in
vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
so that subsequent patches can fill in the CXL-specific logic without
touching the vfio-pci-core flow again.

The vfio_cxl_core.c file added here is an empty skeleton; the actual
CXL detection and initialisation code is introduced in the following
patch to keep this build-system patch reviewable on its own.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatches in Kconfig, Makefile, and vfio_pci_priv.h due to missing upstream xe/dmabuf support in NV-Kernels base]
Signed-off-by: Jiandi An <jan@nvidia.com>
Detect a vendor-specific CXL device at vfio-pci bind time and probe
its HDM decoder register block.

vfio_cxl_create_device_state() allocates per-device state via devm and
reads MEM_CAPABLE and CACHE_CAPABLE from the CXL DVSEC.

vfio_cxl_setup_regs() locates the component register block, temporarily
maps it, calls cxl_probe_component_regs() to find the HDM block, then
releases the mapping.

vfio_pci_cxl_detect_and_init() chains these two steps. If either fails,
vdev->cxl stays NULL and the device falls back to plain vfio-pci.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Promote vfio_raw_config_write() and vfio_raw_config_read() to non-static so
that the CXL DVSEC write handler in the next patch can call them.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
… framework

Add HDM decoder register emulation for CXL devices assigned to a guest.

New file vfio_cxl_emu.c allocates comp_reg_virt[] covering the full
component register block (CXL_COMPONENT_REG_BLOCK_SIZE), snapshots it
from MMIO after probe, and registers a VFIO device region
(VFIO_REGION_SUBTYPE_CXL_COMP_REGS) with read/write ops but no mmap,
so every access hits the emulated buffer and write dispatchers.

vfio_cxl_setup_virt_regs() is called from the tail of
vfio_cxl_setup_regs(); vfio_cxl_clean_virt_regs() runs on cleanup.

HDM decoder register defines come from include/uapi/cxl/cxl_regs.h.
Bits with no hardware equivalent stay in vfio_cxl_priv.h.

hdm_decoder_n_ctrl_write() allows the guest to clear the LOCK bit.
A firmware-committed decoder arrives with LOCK=1; the guest driver
must clear it before reprogramming BASE and SIZE with the VM's GPA.
Such a write clears the bit in the shadow while preserving all other
fields.
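
The LOCK handling described above can be modelled in user space as follows. This is a minimal sketch, not the code from vfio_cxl_emu.c: the function name `hdm_ctrl_write_model()` is hypothetical, the LOCK bit position (bit 8) is an assumption based on the HDM Decoder n Control register layout, and the behaviour for writes other than the LOCK-clearing one is a simplification chosen for the model.

```c
#include <stdint.h>

/* Assumed bit position of LOCK in HDM Decoder n Control. */
#define HDM_CTRL_LOCK (UINT32_C(1) << 8)

/*
 * Hypothetical model of a guest write to HDM Decoder n Control.
 * shadow: current emulated register value.
 * val:    value written by the guest.
 * Returns the new shadow value.
 */
static uint32_t hdm_ctrl_write_model(uint32_t shadow, uint32_t val)
{
    /* Locked decoder and the guest writes LOCK=0: drop only LOCK. */
    if ((shadow & HDM_CTRL_LOCK) && !(val & HDM_CTRL_LOCK))
        return shadow & ~HDM_CTRL_LOCK;

    /* Locked decoder and LOCK still set in the write: ignored (model choice). */
    if (shadow & HDM_CTRL_LOCK)
        return shadow;

    /* Unlocked decoder: this model simply accepts the written value. */
    return val;
}
```

In this model a firmware-committed decoder keeps its other control fields when the guest clears LOCK, which matches the "preserving all other fields" wording above.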

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve Makefile context mismatch due to missing upstream dmabuf support in NV-Kernels base]
Signed-off-by: Jiandi An <jan@nvidia.com>
After the HDM registers are mapped, call cxl_await_range_active() so we
only proceed once the DVSEC ranges report active, without touching the
memdev register group that Type-2 devices may lack.

Re-snapshot the component registers (vfio_cxl_reinit_comp_regs()) once
MEM_ACTIVE is set, so the firmware's final values (SIZE_HIGH etc.) land
in comp_reg_virt.

Read the committed decoder size from hardware, set the capacity via
cxl_set_capacity(), and add the memdev via devm_cxl_add_memdev().

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Region Management makes use of APIs provided by CXL_CORE as below:

CREATE_REGION flow:
1. Validate request (size, decoder availability)
2. Allocate HPA via cxl_get_hpa_freespace()
3. Allocate DPA via cxl_request_dpa()
4. Create region via cxl_create_region() - commits HDM decoder
5. Get HPA range via cxl_get_region_range()

DESTROY_REGION flow:
1. Detach decoder via cxl_decoder_detach()
2. Free DPA via cxl_dpa_free()
3. Release root decoder via cxl_put_root_decoder()

Use DEFINE_FREE scope helpers so error paths unwind cleanly.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…nd reset zap

Wire the CXL DPA range up as a VFIO demand-paged region so QEMU can
mmap guest device memory directly. Faults call vmf_insert_pfn() to
insert one PFN at a time rather than mapping the full range upfront.

CXL region lifecycle:
- The CXL memory region is registered with the VFIO layer during
  vfio_pci_open_device()
- mmap() establishes the VMA with vm_ops but inserts no PTEs
- Each guest page fault calls vfio_cxl_region_page_fault() which
  inserts a single PFN under the memory_lock read side
- On device reset, vfio_cxl_zap_region_locked() sets region_active=false
  and calls unmap_mapping_range() to invalidate all DPA PTEs atomically
  while holding memory_lock for writing
- Faults racing with reset see region_active==false and return
  VM_FAULT_SIGBUS
- vfio_cxl_reactivate_region() restores region_active after successful
  hardware reset

Also integrate the zap/reactivate calls into vfio_pci_ioctl_reset() so
that FLR correctly invalidates DPA mappings and restores them on success.
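
The lifecycle described above can be illustrated with a single-threaded user-space model. This is a sketch under stated assumptions: every struct and function name below is hypothetical, memory_lock is left out because the model has no concurrency, and installed PTEs are represented by a plain counter.

```c
#include <stdbool.h>
#include <stddef.h>

enum fault_result { FAULT_NOPAGE, FAULT_SIGBUS };

struct dpa_region_model {
    bool region_active;   /* cleared by zap, set again after a good reset */
    size_t mapped_pages;  /* stand-in for PTEs installed by faults */
};

/* Region registration: active, but nothing is mapped yet. */
static void model_init(struct dpa_region_model *r)
{
    r->region_active = true;
    r->mapped_pages = 0;
}

/* Page fault: insert exactly one page, or refuse while the region is inactive. */
static enum fault_result model_fault(struct dpa_region_model *r)
{
    if (!r->region_active)
        return FAULT_SIGBUS;
    r->mapped_pages++;
    return FAULT_NOPAGE;
}

/* Reset entry: mark the region inactive and drop every installed PTE. */
static void model_zap(struct dpa_region_model *r)
{
    r->region_active = false;
    r->mapped_pages = 0;
}

/* Reset exit: only a successful reset makes the region usable again. */
static void model_reactivate(struct dpa_region_model *r, bool reset_ok)
{
    if (reset_ok)
        r->region_active = true;
}
```

The point of the flag is that a fault arriving between zap and reactivate fails instead of re-installing a mapping to memory whose state the reset has invalidated.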

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatches in vfio_pci_core.c and vfio_pci_priv.h due to missing upstream dmabuf support in NV-Kernels base]
Signed-off-by: Jiandi An <jan@nvidia.com>
CXL devices have CXL DVSEC registers in the configuration space.
Many of them affect the behaviors of the devices, e.g. enabling
CXL.io/CXL.mem/CXL.cache. However, these configurations are owned by
the host and a virtualization policy should be applied when handling
the access from the guest.

Introduce the emulation of CXL configuration space to handle the access
of the virtual CXL configuration space from the guest.

vfio-pci-core already allocates vdev->vconfig as the authoritative
virtual config space shadow. Directly use vdev->vconfig:
  - DVSEC reads return data from vdev->vconfig (already populated by
    vfio_config_init() via vfio_ecap_init())
  - DVSEC writes go through new CXL-aware write handlers that update
    vdev->vconfig in place
  - The writable DVSEC registers are marked virtual in vdev->pci_config_map

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatches in Makefile and vfio_pci_core.h due to missing upstream dmabuf/p2pdma forward declarations in NV-Kernels base]
Signed-off-by: Jiandi An <jan@nvidia.com>
Register the DPA and component register regions with the VFIO layer.
The region indices for both are cached for quick lookup.

vfio_cxl_register_cxl_region()
- memremap(WB) the region HPA (treat CXL.mem as RAM, not MMIO)
- Register VFIO_REGION_SUBTYPE_CXL
- Records dpa_region_idx.

vfio_cxl_register_comp_regs_region()
- Registers VFIO_REGION_SUBTYPE_CXL_COMP_REGS with size
  hdm_reg_offset + hdm_reg_size
- Records comp_reg_region_idx.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
…AR to userspace

Expose CXL device capability through the VFIO device info ioctl and give
userspace access to the GPU/accelerator register windows in the component
BAR while protecting the CXL component register block.

vfio_cxl_get_info() fills VFIO_DEVICE_INFO_CAP_CXL with the HDM register
BAR index and byte offset, commit flags, and VFIO region indices for the
DPA and COMP_REGS regions. HDM decoder count and the HDM block offset
within COMP_REGS are not populated; both are derivable from the CXL
Capability Array in the COMP_REGS region itself.

vfio_cxl_get_region_info() handles VFIO_DEVICE_GET_REGION_INFO for the
component register BAR. It builds a sparse-mmap capability that advertises
only the GPU/accelerator register windows, carving out the CXL component
register block. Three physical layouts are handled:

  Topology A  comp block at BAR end:    one area [0, comp_reg_offset)
  Topology B  comp block at BAR start:  one area [comp_end, bar_len)
  Topology C  comp block in the middle: two areas, one on each side

vfio_cxl_mmap_overlaps_comp_regs() checks whether an mmap request overlaps
[comp_reg_offset, comp_reg_offset + comp_reg_size). vfio_pci_core_mmap()
calls it to reject access to the component register block while allowing
mmap of the GPU register windows in the sparse capability. This replaces
the earlier blanket rejection of any mmap on the component BAR index.
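
The three topologies and the overlap check can be sketched in user-space C. This is an illustration only: `sparse_areas_model()` and `mmap_overlaps_comp_regs_model()` are hypothetical names, not the functions in the patch, and the sketch assumes the component block lies entirely inside the BAR and that none of the sums overflow.

```c
#include <stdbool.h>
#include <stdint.h>

struct sparse_area {
    uint64_t offset;
    uint64_t size;
};

/*
 * Fill areas[] with the mmap-able windows of a BAR of length bar_len that
 * contains a component register block at [comp_off, comp_off + comp_size).
 * Returns the number of areas written (0, 1 or 2).
 */
static int sparse_areas_model(uint64_t bar_len, uint64_t comp_off,
                              uint64_t comp_size, struct sparse_area areas[2])
{
    uint64_t comp_end = comp_off + comp_size;
    int n = 0;

    if (comp_off > 0)        /* window before the block: topologies A and C */
        areas[n++] = (struct sparse_area){ 0, comp_off };
    if (comp_end < bar_len)  /* window after the block: topologies B and C */
        areas[n++] = (struct sparse_area){ comp_end, bar_len - comp_end };
    return n;
}

/* True if [req_off, req_off + req_len) intersects the component block. */
static bool mmap_overlaps_comp_regs_model(uint64_t req_off, uint64_t req_len,
                                          uint64_t comp_off, uint64_t comp_size)
{
    return req_off < comp_off + comp_size && comp_off < req_off + req_len;
}
```

Both helpers use half-open intervals, so a request that ends exactly at comp_off is not treated as an overlap.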

Hook both helpers into vfio_pci_ioctl_get_info() and
vfio_pci_ioctl_get_region_info() in vfio_pci_core.c.

The component BAR cannot be claimed exclusively since the CXL subsystem
holds persistent sub-range iomem claims during HDM decoder setup.
pci_request_selected_regions() returns EBUSY; pass bars=0 to skip the
request and map directly via pci_iomap(). Physical ownership is assured
by driver binding.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
Provide an opt-out mechanism to disable CXL support in the vfio module.
The opt-out is available at both build time and module load time.

The build-time option CONFIG_VFIO_CXL_CORE enables or disables CXL
support in the vfio-pci module.

To disable CXL support at runtime, use the module parameter
disable_cxl. This is a per-device opt-out on the core device, set by
the driver before registration.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
[jan: Resolve context mismatch in vfio_pci.c probe function due to missing upstream pci_ops assignment in NV-Kernels base]
Signed-off-by: Jiandi An <jan@nvidia.com>
…ough

Add Documentation/driver-api/vfio-pci-cxl.rst describing the architecture,
VFIO interfaces, and operational constraints for CXL Type-2 (cache-coherent
accelerator) passthrough via vfio-pci-core, and link it from the driver-api
index.

The document covers:
- VFIO_DEVICE_FLAGS_CXL and VFIO_DEVICE_INFO_CAP_CXL: what the capability
  struct contains and what the FIRMWARE_COMMITTED and CACHE_CAPABLE flags mean
- How to derive hdm_decoder_offset and hdm_count from the COMP_REGS region
  by traversing the CXL Capability Array to find cap ID 0x5 and reading the
  HDM Decoder Capability register
- Topology-aware sparse mmap on the component BAR (topologies A, B, C
  covering comp block at end, start, or middle of the BAR)
- Two extra VFIO device regions: COMP_REGS for the emulated HDM register
  state and the DPA memory window
- DVSEC config write virtualization: what the guest sees vs. hardware
- FLR coordination: DPA PTEs zapped before reset, restored after

Signed-off-by: Manish Honap <mhonap@nvidia.com>
(backported from https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/)
Signed-off-by: Jiandi An <jan@nvidia.com>
@JiandiAnNVIDIA JiandiAnNVIDIA force-pushed the cxl-vfio_2026-04-23 branch from 3eede80 to 502020b Compare May 6, 2026 07:04
@nvmochs
Collaborator

nvmochs commented May 6, 2026

@JiandiAnNVIDIA

FYI, these 2 commits have already been merged to 6.17-HWE via Richard's PR a week or 2 back and can be removed from this PR:
b5c0462 fwctl: Fix class init ordering to avoid NULL pointer dereference on device removal
f7d2825 PCI/NPEM: Set LED_HW_PLUGGABLE for hotplug-capable ports

@nvmochs
Collaborator

nvmochs commented May 6, 2026

@JiandiAnNVIDIA

502020b (HEAD, jiandi/cxl-vfio_2026-04-23) NVIDIA: VR: SAUCE: config: Enable CONFIG_VFIO_CXL_CORE for CXL Type-2 passthrough
acb85c3 NVIDIA: VR: SAUCE: vfio/cxl: Implement vfio_cxl_reset()
d862868 NVIDIA: VR: SAUCE: vfio/cxl: virtualize DVSEC STATUS2 register in vconfig shadow
fcc7b7b NVIDIA: VR: SAUCE: vfio/cxl: preserve HDM decoder base addresses across reset
f5c8879 NVIDIA: VR: SAUCE: vfio/cxl: Ensure PCI Memory Space is enabled before post-reset BAR access
979f0aa NVIDIA: VR: SAUCE: vfio/pci: Wire CXL DPA reset handling
9543998 NVIDIA: VR: SAUCE: cxl: Export the CXL reset helpers for VFIO users

For these new patches that have not been posted to LKML yet, I reviewed with Codex, specifically focusing on if there are any potential regressions introduced.

Here are the 5 findings with their related commits and the concern.


1. STATUS2 shadowing can break Cache WBI polling

Related commit:
d862868 - vfio/cxl: virtualize DVSEC STATUS2 register in vconfig shadow

The patch makes CXL_DVSEC_STATUS2_OFFSET reads return from vdev->vconfig instead of hardware in drivers/vfio/pci/cxl/vfio_cxl_config.c:282.

That is needed for virtualized reset result bits, but STATUS2 also contains live hardware status, especially:

CXL_DVSEC_STATUS2_CACHE_INVALID

INITIATE_CACHE_WBI is still forwarded to hardware in drivers/vfio/pci/cxl/vfio_cxl_config.c:178, but there is no refresh of STATUS2.Cache_Invalid into the shadow. A guest that triggers Cache WBI and polls
STATUS2.Cache_Invalid may read stale vconfig forever.

Suggested fix: make STATUS2 reads merge hardware and shadow state. For example, read hardware STATUS2, preserve live hardware bits like CACHE_INVALID, and override only the virtualized reset outcome bits from shadow.
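
A minimal sketch of that merge, as a user-space model. The macro names are hypothetical and the bit positions (Cache Invalid at bit 0, reset complete at bit 1, reset error at bit 2) are an assumption about the DVSEC CXL Status2 layout that should be checked against the spec and the series' own defines.

```c
#include <stdint.h>

/* Assumed DVSEC CXL Status2 bits; hypothetical macro names. */
#define STATUS2_CACHE_INVALID  (UINT16_C(1) << 0)
#define STATUS2_RESET_COMPLETE (UINT16_C(1) << 1)
#define STATUS2_RESET_ERROR    (UINT16_C(1) << 2)

/* Bits whose guest-visible value comes from the vconfig shadow. */
#define STATUS2_VIRT_MASK (STATUS2_RESET_COMPLETE | STATUS2_RESET_ERROR)

/*
 * Merge a fresh hardware read with the shadow: live bits such as
 * Cache Invalid come from hardware, reset outcome bits from the shadow.
 */
static uint16_t status2_merge_model(uint16_t hw, uint16_t shadow)
{
    return (uint16_t)((hw & ~STATUS2_VIRT_MASK) |
                      (shadow & STATUS2_VIRT_MASK));
}
```

With this shape a guest polling Cache_Invalid after INITIATE_CACHE_WBI sees the hardware value on every read, while the virtualized reset result stays stable.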


2. Existing VFIO reset paths can leave PCI_COMMAND_MEMORY enabled

Related commits:
979f0aa - vfio/pci: Wire CXL DPA reset handling
f5c8879 - vfio/cxl: Ensure PCI Memory Space is enabled before post-reset BAR access

979f0aa wires vfio_cxl_finish_reset() into existing reset paths, including VFIO_DEVICE_RESET in drivers/vfio/pci/vfio_pci_core.c:1238 and guest FLR config writes in drivers/vfio/pci/vfio_pci_config.c:902.

f5c8879 adds vfio_cxl_enable_memory_space() in drivers/vfio/pci/cxl/vfio_cxl_core.c:704, which sets PCI_COMMAND_MEMORY before reading BAR-backed HDM registers. It does not restore the previous command value.

So if the guest had Memory Space disabled before reset, the reset path can return with Memory Space enabled. That changes guest-visible PCI config behavior.

Suggested fix: save the original command value, temporarily enable Memory Space for HDM register reads, then restore the original command value before returning.
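
The save/enable/restore pattern can be sketched against a fake device in user space. `struct fake_pdev`, `fake_read_hdm()` and `read_hdm_preserving_command()` are hypothetical stand-ins for illustration; a kernel version would use the PCI config accessors on the real device instead.

```c
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2  /* Memory Space Enable bit of PCI_COMMAND */

/* Hypothetical stand-in for the device: command register plus one HDM register. */
struct fake_pdev {
    uint16_t command;
    uint32_t hdm_reg;
};

/* BAR-backed read: fails while Memory Space decoding is disabled. */
static int fake_read_hdm(const struct fake_pdev *pdev, uint32_t *val)
{
    if (!(pdev->command & PCI_COMMAND_MEMORY))
        return -1;
    *val = pdev->hdm_reg;
    return 0;
}

/* Enable Memory Space only for the duration of the read, then restore. */
static int read_hdm_preserving_command(struct fake_pdev *pdev, uint32_t *val)
{
    uint16_t saved = pdev->command;
    int ret;

    pdev->command = (uint16_t)(saved | PCI_COMMAND_MEMORY);
    ret = fake_read_hdm(pdev, val);
    pdev->command = saved;  /* the guest-visible state is left unchanged */
    return ret;
}
```

The restore runs on both the success and the failure path, so a guest that had Memory Space disabled before reset still sees it disabled afterwards.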


3. Guest CXL reset bypasses CXL core reset coordination

Related commits:
9543998 - cxl: Export the CXL reset helpers for VFIO users
acb85c3 - vfio/cxl: Implement vfio_cxl_reset()

acb85c3 implements guest CXL reset by directly calling cxl_dev_reset() in drivers/vfio/pci/cxl/vfio_cxl_config.c:131.

The CXL sysfs reset path does more than call cxl_dev_reset(): it takes cxl_reset_mutex, saves/disables sibling CXL functions, runs reset, then restores siblings in drivers/cxl/core/pci.c:1385.

The VFIO path skips that coordination. On multi-function CXL devices, or with concurrent reset paths, a guest-initiated reset can run without quiescing sibling functions or serializing against another CXL reset.

Suggested fix: expose a CXL helper that performs the same global reset serialization and sibling save/disable/restore, while allowing VFIO to skip host memory offlining.


4. HDM BASE preservation only handles 16 decoders

Related commit:
fcc7b7b - vfio/cxl: preserve HDM decoder base addresses across reset

vfio_cxl_reinit_hdm_shadow() snapshots BASE registers into fixed arrays of 16 entries in drivers/vfio/pci/cxl/vfio_cxl_core.c:728:

__le32 saved_lo[16] = {}, saved_hi[16] = {};
u8 n, count = min_t(u8, cxl->hdm_count, ARRAY_SIZE(saved_lo));

But hdm_count comes from the HDM decoder count field, and local CXL code allows more than 16 decoders. If cxl->hdm_count > 16, decoders 16+ lose preserved guest GPA BASE values across reset.

Suggested fix: allocate based on cxl->hdm_count, or explicitly reject/cap devices above 16 decoders if that is the intended support limit.
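
The first option, sizing the snapshot from the decoder count, could look roughly like this in a user-space model. All names are hypothetical, and calloc()/free() stand in for whatever kernel allocator the series would actually use.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical snapshot of per-decoder BASE_LOW/BASE_HIGH values. */
struct hdm_base_snapshot {
    uint8_t count;
    uint32_t *lo;
    uint32_t *hi;
};

/* Save count BASE pairs; the arrays are sized from count, not a fixed 16. */
static int hdm_snapshot_save(struct hdm_base_snapshot *s,
                             const uint32_t *regs_lo, const uint32_t *regs_hi,
                             uint8_t count)
{
    size_t n = count ? count : 1;  /* avoid a zero-sized allocation */

    s->lo = calloc(n, sizeof(*s->lo));
    s->hi = calloc(n, sizeof(*s->hi));
    if (!s->lo || !s->hi) {
        free(s->lo);
        free(s->hi);
        s->lo = s->hi = NULL;
        s->count = 0;
        return -1;
    }
    for (uint8_t i = 0; i < count; i++) {
        s->lo[i] = regs_lo[i];
        s->hi[i] = regs_hi[i];
    }
    s->count = count;
    return 0;
}

/* Write every saved BASE pair back, including decoders 16 and above. */
static void hdm_snapshot_restore(const struct hdm_base_snapshot *s,
                                 uint32_t *regs_lo, uint32_t *regs_hi)
{
    for (uint8_t i = 0; i < s->count; i++) {
        regs_lo[i] = s->lo[i];
        regs_hi[i] = s->hi[i];
    }
}

static void hdm_snapshot_free(struct hdm_base_snapshot *s)
{
    free(s->lo);
    free(s->hi);
    s->lo = s->hi = NULL;
    s->count = 0;
}
```

The alternative in the suggestion, rejecting or capping devices above 16 decoders, avoids the allocation but has to be an explicit, documented limit rather than a silent truncation.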


5. Post-reset STATUS2 read ignores config access failure

Related commit:
acb85c3 - vfio/cxl: Implement vfio_cxl_reset()

After reset, vfio_cxl_reset() reads hardware STATUS2 in drivers/vfio/pci/cxl/vfio_cxl_config.c:147:

pci_read_config_word(pdev, dvsec + CXL_DVSEC_STATUS2_OFFSET, &hw_status2);

The return value is ignored, and hw_status2 is used immediately afterward to stamp reset result bits into vconfig. If config access fails after reset, the guest may see undefined or misleading reset status.

Suggested fix: check the return value. On failure, initialize hw_status2 deterministically and report/stamp CXL_RESET_ERROR, or return the config error from vfio_cxl_reset().
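
One way to make the stamped value deterministic, sketched as a pure function in user space. The function name and macro names are hypothetical, and the bit positions are the same assumption about the STATUS2 layout as in finding 1; the real code would take the return value of pci_read_config_word() as `read_ret`.

```c
#include <stdint.h>

/* Assumed DVSEC CXL Status2 reset outcome bits; hypothetical macro names. */
#define STATUS2_RESET_COMPLETE (UINT16_C(1) << 1)
#define STATUS2_RESET_ERROR    (UINT16_C(1) << 2)

/*
 * Compute the reset outcome bits to stamp into the shadow.
 * read_ret:   result of the config read (0 on success).
 * hw_status2: value read from hardware; ignored when the read failed.
 */
static uint16_t reset_result_bits_model(int read_ret, uint16_t hw_status2)
{
    if (read_ret != 0)
        return STATUS2_RESET_ERROR;  /* never trust hw_status2 here */

    return (uint16_t)(hw_status2 &
                      (STATUS2_RESET_COMPLETE | STATUS2_RESET_ERROR));
}
```

Reporting an error on a failed read is the conservative choice in this sketch; returning the config error from vfio_cxl_reset() instead, as the suggestion also allows, would surface the failure to the caller rather than to the guest.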


For the patches picked from LKML, I verified that each one either matches the LKML posting exactly or, where it does not, that your backport is correct and matches the context note in the commit message:

| LKML # | Commit | Subject | Backport Fidelity |
| --- | --- | --- | --- |
| 01/20 | fd317b8 | cxl: Add cxl_get_hdm_info() for HDM decoder metadata | Identical patch-id. No context note needed. |
| 02/20 | e02c1b7 | cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header | Non-identical. [jan] note accurately explains use of include/cxl/pci.h, existing cxl_find_regblock(), and added forward declaration. |
| 03/20 | 199d5d2 | cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h | Non-identical. [jan] note accurately explains removing defines from include/cxl/cxl.h instead of drivers/cxl/cxl.h. |
| 04/20 | d0fde98 | cxl: Split cxl_await_range_active() from media-ready wait | Non-identical. [jan] note accurately explains declaration moved to include/cxl/pci.h unconditionally. |
| 05/20 | d314145 | cxl: Record BIR and BAR offset in cxl_register_map | Non-identical. [jan] note accurately explains declaration moved to include/cxl/pci.h unconditionally. |
| 06/20 | 05c1da9 | vfio: UAPI for CXL-capable PCI device assignment | Identical patch-id. No context note needed. |
| 07/20 | de3e1a6 | vfio/pci: Add CXL state to vfio_pci_core_device | Non-identical. [jan] note accurately explains vfio_pci_core.h context and added #include <cxl/pci.h>. |
| 08/20 | cb87876 | vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks | Non-identical. [jan] note accurately explains Kconfig/Makefile/header context from missing upstream xe/dmabuf. |
| 09/20 | 84fbfbc | vfio/cxl: Detect CXL DVSEC and probe HDM block | Identical patch-id. No context note needed. |
| 10/20 | 0fbd7b2 | vfio/pci: Export config access helpers | Identical patch-id. No context note needed. |
| 11/20 | ad39798 | vfio/cxl: Introduce HDM decoder register emulation framework | Non-identical. [jan] note accurately explains Makefile context from missing upstream dmabuf. |
| 12/20 | d64c61c | vfio/cxl: Wait for HDM ranges and create memdev | Identical patch-id. No context note needed. |
| 13/20 | fb580ac | vfio/cxl: CXL region management support | Identical patch-id. No context note needed. |
| 14/20 | 05b9195 | vfio/cxl: DPA VFIO region with demand fault mmap and reset zap | Non-identical. [jan] note accurately explains VFIO context from missing upstream dmabuf. |
| 15/20 | 1447b99 | vfio/cxl: Virtualize CXL DVSEC config writes | Non-identical. [jan] note accurately explains Makefile/header context from missing upstream dmabuf/p2pdma. |
| 16/20 | 24dd667 | vfio/cxl: Register regions with VFIO layer | Identical patch-id. No context note needed. |
| 17/20 | 534faac | vfio/pci: Advertise CXL cap and sparse component BAR to userspace | Identical patch-id. No context note needed. |
| 18/20 | 5bc0b3e | vfio/cxl: Provide opt-out for CXL feature | Non-identical. [jan] note accurately explains probe context from missing upstream pci_ops assignment. |
| 19/20 | 646f12a | docs: vfio-pci: Document CXL Type-2 device passthrough | Identical patch-id. No context note needed. |

Note: By “identical”, I mean identical patch content by git patch-id --verbatim against the LKML v2 patch, not byte-identical commit metadata.


54d50bb vfio: Remove the get_region_info op
2108575 vfio: Move the remaining drivers to get_region_info_caps
c0ad388 vfio/platform: Convert to get_region_info_caps
2bf5a2c vfio/pci: Convert all PCI drivers to get_region_info_caps
bc1c993 vfio/ccw: Convert to get_region_info_caps
0282af0 vfio/gvt: Convert to get_region_info_caps
29e1217 vfio/mbochs: Convert mbochs to use vfio_info_add_capability()
7dd77b8 vfio: Add get_region_info_caps op
e7da106 vfio: Require drivers to implement get_region_info
6c250ce vfio/gvt: Provide a get_region_info op
76b5171 vfio/ccw: Provide a get_region_info op
619333d vfio/cdx: Provide a get_region_info op
8ba94bf vfio/fsl: Provide a get_region_info op
073f13c vfio/platform: Provide a get_region_info op
554dca9 vfio/mbochs: Provide a get_region_info op
0fbfd73 vfio/mdpy: Provide a get_region_info op
4df2081 vfio/mtty: Provide a get_region_info op
e54b8e0 vfio/pci: Fill in the missing get_region_info ops
7026227 vfio/nvgrace: Convert to the get_region_info op
fad0d0d vfio/virtio: Convert to the get_region_info op
6b97c1b vfio/hisi: Convert to the get_region_info op
897cefa vfio: Provide a get_region_info op
449e051 vfio/nvgrace-gpu: fix grammatical error
0c7d382 hisi_acc_vfio_pci: adapt to new migration configuration
38c6eb3 crypto: hisilicon - qm updates BAR configuration
All of these match upstream exactly, no issues.

@JiandiAnNVIDIA
Author

> For these new patches that have not been posted to LKML yet, I reviewed with Codex, specifically focusing on if there are any potential regressions introduced.
>
> Here are the 5 findings with their related commits and the concern.

I think I'll let Manish review these, then. This RFC series is going through internal review, with Vikram, Alex, and others still posting comments. I'll ask them whether they need this merged now, so they have a story for the QS kernel supporting vfio cxl, or whether they are okay waiting until the series is posted to LKML or until all of these AI/Cursor-identified concerns are fixed in their series. Maybe Manish should run Cursor/AI on his patch series while developing the patches.

@JiandiAnNVIDIA
Author

> @JiandiAnNVIDIA
>
> FYI, these 2 commits have already been merged to 6.17-HWE via Richard's PR a week or 2 back and can be removed from this PR:
> b5c0462 fwctl: Fix class init ordering to avoid NULL pointer dereference on device removal
> f7d2825 PCI/NPEM: Set LED_HW_PLUGGABLE for hotplug-capable ports

These two commits are not among the commits I included in this PR, right?
https://github.com/NVIDIA/NV-Kernels/pull/407/commits

My PR commits start after

b5c046256379 (NV-Kernels/24.04_linux-nvidia-6.17-next) fwctl: Fix class init ordering to avoid NULL pointer dereference on device removal
f7d28252cc3c PCI/NPEM: Set LED_HW_PLUGGABLE for hotplug-capable ports
80bac297db8a UBUNTU: Ubuntu-nvidia-6.17-6.17.0-1017.17

Starting at
38c6eb3eed52 crypto: hisilicon - qm updates BAR configuration

jan@jan-dev:~/sb/nv-kernels-cxl-latest/NV-Kernels$ git log --oneline
502020b2b5f2 (HEAD -> cxl-vfio_2026-04-23, origin/cxl-vfio_2026-04-23) NVIDIA: VR: SAUCE: config: Enable CONFIG_VFIO_CXL_CORE for CXL Type-2 passthrough
acb85c3f4802 NVIDIA: VR: SAUCE: vfio/cxl: Implement vfio_cxl_reset()
d86286892ebd NVIDIA: VR: SAUCE: vfio/cxl: virtualize DVSEC STATUS2 register in vconfig shadow
fcc7b7b52ef8 NVIDIA: VR: SAUCE: vfio/cxl: preserve HDM decoder base addresses across reset
f5c8879ca425 NVIDIA: VR: SAUCE: vfio/cxl: Ensure PCI Memory Space is enabled before post-reset BAR access
979f0aad4c42 NVIDIA: VR: SAUCE: vfio/pci: Wire CXL DPA reset handling
9543998beea7 NVIDIA: VR: SAUCE: cxl: Export the CXL reset helpers for VFIO users
646f12a6a8f8 NVIDIA: VR: SAUCE: docs: vfio-pci: Document CXL Type-2 device passthrough
5bc0b3ea82af NVIDIA: VR: SAUCE: vfio/cxl: Provide opt-out for CXL feature
534faac8aa63 NVIDIA: VR: SAUCE: vfio/pci: Advertise CXL cap and sparse component BAR to userspace
24dd6678d476 NVIDIA: VR: SAUCE: vfio/cxl: Register regions with VFIO layer
1447b99bccc3 NVIDIA: VR: SAUCE: vfio/cxl: Virtualize CXL DVSEC config writes
05b9195202cc NVIDIA: VR: SAUCE: vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
fb580ac046d5 NVIDIA: VR: SAUCE: vfio/cxl: CXL region management support
d64c61c2ba91 NVIDIA: VR: SAUCE: vfio/cxl: Wait for HDM ranges and create memdev
ad3979839aba NVIDIA: VR: SAUCE: vfio/cxl: Introduce HDM decoder register emulation framework
0fbd7b2effd7 NVIDIA: VR: SAUCE: vfio/pci: Export config access helpers
84fbfbcead31 NVIDIA: VR: SAUCE: vfio/cxl: Detect CXL DVSEC and probe HDM block
cb87876e8e1d NVIDIA: VR: SAUCE: vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
de3e1a60a995 NVIDIA: VR: SAUCE: vfio/pci: Add CXL state to vfio_pci_core_device
05c1da9d786a NVIDIA: VR: SAUCE: vfio: UAPI for CXL-capable PCI device assignment
d3141453f48b NVIDIA: VR: SAUCE: cxl: Record BIR and BAR offset in cxl_register_map
d0fde9879972 NVIDIA: VR: SAUCE: cxl: Split cxl_await_range_active() from media-ready wait
199d5d2f2ca4 NVIDIA: VR: SAUCE: cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
e02c1b7ac02a NVIDIA: VR: SAUCE: cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header
fd317b86093e NVIDIA: VR: SAUCE: cxl: Add cxl_get_hdm_info() for HDM decoder metadata
54d50bbc6111 vfio: Remove the get_region_info op
21085759fbcd vfio: Move the remaining drivers to get_region_info_caps
c0ad388ba741 vfio/platform: Convert to get_region_info_caps
2bf5a2cbb154 vfio/pci: Convert all PCI drivers to get_region_info_caps
bc1c993e783d vfio/ccw: Convert to get_region_info_caps
0282af066b10 vfio/gvt: Convert to get_region_info_caps
29e1217fd909 vfio/mbochs: Convert mbochs to use vfio_info_add_capability()
7dd77b841190 vfio: Add get_region_info_caps op
e7da10685f7f vfio: Require drivers to implement get_region_info
6c250ce18f9e vfio/gvt: Provide a get_region_info op
76b5171d117d vfio/ccw: Provide a get_region_info op
619333df0ce8 vfio/cdx: Provide a get_region_info op
8ba94bf6a94e vfio/fsl: Provide a get_region_info op
073f13c17982 vfio/platform: Provide a get_region_info op
554dca9a1de1 vfio/mbochs: Provide a get_region_info op
0fbfd736592c vfio/mdpy: Provide a get_region_info op
4df20815cb64 vfio/mtty: Provide a get_region_info op
e54b8e086acd vfio/pci: Fill in the missing get_region_info ops
702622746ce4 vfio/nvgrace: Convert to the get_region_info op
fad0d0d38ca4 vfio/virtio: Convert to the get_region_info op
6b97c1b33bef vfio/hisi: Convert to the get_region_info op
897cefa739f7 vfio: Provide a get_region_info op
449e051b54c2 vfio/nvgrace-gpu: fix grammatical error
0c7d38232410 hisi_acc_vfio_pci: adapt to new migration configuration
38c6eb3eed52 crypto: hisilicon - qm updates BAR configuration
b5c046256379 (NV-Kernels/24.04_linux-nvidia-6.17-next) fwctl: Fix class init ordering to avoid NULL pointer dereference on device removal
f7d28252cc3c PCI/NPEM: Set LED_HW_PLUGGABLE for hotplug-capable ports
80bac297db8a UBUNTU: Ubuntu-nvidia-6.17-6.17.0-1017.17

@nvmochs
Collaborator

nvmochs commented May 6, 2026

@JiandiAnNVIDIA
FYI, these 2 commits have already been merged to 6.17-HWE via Richard's PR a week or 2 back and can be removed from this PR:
b5c0462 fwctl: Fix class init ordering to avoid NULL pointer dereference on device removal
f7d2825 PCI/NPEM: Set LED_HW_PLUGGABLE for hotplug-capable ports

These two commits are not among the commits I included in this PR, right? https://github.com/NVIDIA/NV-Kernels/pull/407/commits

My PR commits start after

b5c046256379 (NV-Kernels/24.04_linux-nvidia-6.17-next) fwctl: Fix class init ordering to avoid NULL pointer dereference on device removal
f7d28252cc3c PCI/NPEM: Set LED_HW_PLUGGABLE for hotplug-capable ports
80bac297db8a UBUNTU: Ubuntu-nvidia-6.17-6.17.0-1017.17

My mistake, you are correct.

mmhonap and others added 7 commits May 9, 2026 01:35
Export two helpers for VFIO:
  - pci_cxl_reset_capable()
  - cxl_dev_reset()

The change does not alter the reset flow itself, the capability checks,
or the sysfs ABI. It only lifts the helper out of the private path so
later VFIO patches can call the same code.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
This change adds and renames vfio-cxl helpers to better suit the
cxl-reset handling mechanism in later patches.

- Rename the CXL DPA region helpers to prepare_reset() and finish_reset()
  so call sites read as a matched pair around pci_try_reset_function().
  Also call prepare_reset()/finish_reset() around
  pci_try_reset_function() in both the PCIe BCR FLR path and the
  Function FLR path, matching the logic already used on the
  VFIO_DEVICE_RESET ioctl path.

- When pci_try_reset_function() fails, finish_reset() consults the
  hardware COMMITTED state before re-enabling the DPA mapping, so it is
  safe on error and avoids leaving the DPA region wedged off after a
  transient reset failure.

- Add vfio_cxl_reset_capable(), a small wrapper over
  pci_cxl_reset_capable().

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
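
The matched-pair behaviour described in the commit message above can be sketched as a small userspace model. This is not the patch's code: `struct mock_dev` and every `mock_*` function are hypothetical stand-ins, and the model assumes that a successful reset leaves the decoder uncommitted while a failed reset leaves hardware untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names come from the patch. */
struct mock_dev {
	bool hw_committed;  /* models the hardware HDM decoder COMMITTED state */
	bool dpa_enabled;   /* models whether the DPA region may be faulted in */
	bool reset_fails;   /* test knob: make the modelled reset fail */
};

/* Models prepare_reset(): block DPA access before the reset runs. */
static void mock_prepare_reset(struct mock_dev *d)
{
	d->dpa_enabled = false;
}

/* Models pci_try_reset_function(): 0 on success, negative on failure. */
static int mock_try_reset(struct mock_dev *d)
{
	if (d->reset_fails)
		return -1;           /* assumed: hardware state is untouched */
	d->hw_committed = false;     /* assumed: reset clears the decoder */
	return 0;
}

/* Models finish_reset(): re-enable DPA only if hardware is still committed. */
static void mock_finish_reset(struct mock_dev *d)
{
	d->dpa_enabled = d->hw_committed;
}

/* The pair brackets the reset and finish runs on success and on failure. */
static int mock_reset_path(struct mock_dev *d)
{
	int ret;

	assert(d != NULL);
	mock_prepare_reset(d);
	ret = mock_try_reset(d);
	mock_finish_reset(d);
	return ret;
}
```

Because the finish step keys off the modelled COMMITTED state rather than the reset's return value, a transient failure leaves the DPA region usable again instead of stuck off.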
…e post-reset BAR access

A reset caller may disable Memory Space to quiesce device DMA before
issuing the reset. pci_try_reset_function() saves and restores
PCI_COMMAND around the FLR. If the memory space was disabled before FLR,
it will be restored in disabled state.

vfio_cxl_finish_reset() reads HDM decoder registers through the
component register BAR immediately after reset. Accessing a BAR with
Memory Space disabled produces an Unsupported Request completion; on
platforms that promote UR to a fatal error this triggers DPC.

Add vfio_cxl_enable_memory_space() and call it at the start of
vfio_cxl_finish_reset() before touching any BAR.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
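
A minimal sketch of the idea in the commit message above, modelled in userspace: set the Memory Space bit in the command register before any BAR access. `PCI_COMMAND` and `PCI_COMMAND_MEMORY` carry the same values as the kernel's `pci_regs.h`; `struct mock_cfg` and the `mock_*` helpers are hypothetical and only stand in for real config-space accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND        0x04  /* command register offset in config space */
#define PCI_COMMAND_MEMORY 0x2   /* Memory Space Enable bit */

/* Hypothetical model of a 256-byte config space. */
struct mock_cfg {
	uint8_t space[256];
};

/* Little-endian 16-bit read from the modelled config space. */
static uint16_t mock_read_word(const struct mock_cfg *c, unsigned int off)
{
	return (uint16_t)(c->space[off] | (c->space[off + 1] << 8));
}

/* Little-endian 16-bit write to the modelled config space. */
static void mock_write_word(struct mock_cfg *c, unsigned int off, uint16_t v)
{
	c->space[off] = (uint8_t)(v & 0xff);
	c->space[off + 1] = (uint8_t)(v >> 8);
}

/* Set Memory Space Enable if it is clear; leave all other bits alone. */
static void mock_enable_memory_space(struct mock_cfg *c)
{
	uint16_t cmd;

	assert(c != NULL);
	cmd = mock_read_word(c, PCI_COMMAND);
	if (!(cmd & PCI_COMMAND_MEMORY))
		mock_write_word(c, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
}
```

The helper is idempotent, so calling it unconditionally at the start of a finish-reset routine is harmless when Memory Space is already enabled.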
…ss reset

reinit_comp_regs() mirrors post-reset hardware state (all-zeros) into
comp_reg_virt[], including HDM decoder BASE registers. For decoders that
the device manager committed with a guest-physical address before the
reset, pci_dev_restore() re-commits the hardware decoders with the
host-physical base. The kernel provides no notification that BASE was
cleared during reinit, so the emulated GPA bases are silently lost.

Add vfio_cxl_reinit_hdm_shadow() which snapshots the GPA decoder bases
before calling reinit_comp_regs() and restores them after, keeping the
emulated decoder consistent with what the device manager set.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
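
The snapshot-and-restore step described above can be illustrated with a small userspace model. Everything here is hypothetical (`struct mock_hdm`, `MOCK_MAX_DECODERS`, and the `mock_*` functions); the real shadow layout and the exact set of registers preserved are not shown in the commit message, so the sketch preserves only a per-decoder base value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_MAX_DECODERS 4  /* hypothetical limit for the model */

/* Hypothetical emulated HDM decoder state. */
struct mock_hdm {
	unsigned int count;                 /* decoders in use */
	uint64_t base[MOCK_MAX_DECODERS];   /* guest-physical bases */
	uint32_t ctrl[MOCK_MAX_DECODERS];   /* modelled control words */
};

/* Models reinit_comp_regs(): mirror post-reset hardware (all zeros). */
static void mock_reinit_comp_regs(struct mock_hdm *h)
{
	memset(h->base, 0, sizeof(h->base));
	memset(h->ctrl, 0, sizeof(h->ctrl));
}

/* Snapshot the GPA bases, reinitialise, then put the bases back. */
static void mock_reinit_hdm_shadow(struct mock_hdm *h)
{
	uint64_t saved[MOCK_MAX_DECODERS];
	unsigned int i;

	assert(h != NULL && h->count <= MOCK_MAX_DECODERS);

	for (i = 0; i < h->count; i++)
		saved[i] = h->base[i];

	mock_reinit_comp_regs(h);

	for (i = 0; i < h->count; i++)
		h->base[i] = saved[i];
}
```

In this model everything except the bases takes the post-reset value, which keeps the emulated decoder consistent with what the device manager programmed while still reflecting the reset elsewhere.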
…nfig shadow

STATUS2 was read directly from hardware while all other DVSEC registers
were served from the vconfig shadow. This created two problems:

1. VOLATILE_HDM_PRES_ERROR (RW1CS, bit 3): guest writes cleared the
   hardware bit but the shadow was not updated, so subsequent reads still
   returned the set bit from hardware (which the hardware had cleared).

2. CXL_RESET_COMPLETE and CXL_RESET_ERROR (bits 1-2): these outcome bits
   will be written by vfio_cxl_reset() into the shadow after a protocol
   reset. Hardware does not update them on its own; serving reads from
   hardware would hide the outcome from the guest.

Add STATUS2 to the read switch so reads come from the shadow, and update
cxl_dvsec_status2_write() to mirror VOLATILE_HDM_PRES_ERROR clears into
the shadow after forwarding to hardware.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
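
As a rough illustration of the shadow handling described above, the following userspace model serves reads from a shadow copy and mirrors write-1-to-clear updates into it. The bit positions follow the commit message (bits 1-2 for the reset outcome, bit 3 for VOLATILE_HDM_PRES_ERROR); the `MOCK_*` macros, `struct mock_status2`, and both functions are hypothetical names, not the patch's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions as stated in the commit message; macro names are made up. */
#define MOCK_ST2_RESET_COMPLETE          (1u << 1)
#define MOCK_ST2_RESET_ERROR             (1u << 2)
#define MOCK_ST2_VOLATILE_HDM_PRES_ERROR (1u << 3)

/* Hypothetical pair of hardware register and vconfig shadow. */
struct mock_status2 {
	uint16_t hw;      /* models the hardware STATUS2 value */
	uint16_t shadow;  /* models the vconfig shadow the guest reads */
};

/* Reads are served from the shadow, never from hardware. */
static uint16_t mock_status2_read(const struct mock_status2 *s)
{
	return s->shadow;
}

/* Forward the RW1C clear to hardware, then mirror it into the shadow. */
static void mock_status2_write(struct mock_status2 *s, uint16_t val)
{
	uint16_t clear;

	assert(s != NULL);
	clear = val & MOCK_ST2_VOLATILE_HDM_PRES_ERROR;
	s->hw = (uint16_t)(s->hw & ~clear);          /* hardware clears on 1 */
	s->shadow = (uint16_t)(s->shadow & ~clear);  /* keep shadow in step */
}
```

Writes to the outcome bits are ignored in this model, so only the reset path can change them in the shadow.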
Add vfio_cxl_reset() to drive a CXL protocol reset on behalf of a guest.

Unlike cxl_do_reset(), this path skips host memory offlining since the
DPA region is guest memory. The function takes memory_lock for the full
sequence, calls vfio_cxl_prepare_reset() to zap DPA region PTEs, drives
the hardware via pci_dev_save_and_disable() + cxl_dev_reset() +
pci_dev_restore(), then calls vfio_cxl_finish_reset() to reinitialise
emulated state.

STATUS2 outcome bits (CXL_RESET_COMPLETE / CXL_RESET_ERROR) are written
back to vconfig after the reset so the guest can poll for result without
reading hardware. pci_dev_restore() overwrites the saved pre-reset state,
so the hardware value is re-read after restore before the outcome is stamped.

When the guest writes INIT_CXL_RST into DVSEC CONTROL2, invoke
vfio_cxl_reset() to perform a CXL protocol reset. The bit is not
forwarded to hardware; cxl_dev_reset() drives the reset sequence
directly. Silently drop writes on devices that do not advertise
RST_CAPABLE to avoid log noise for the reserved-bit case.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
Signed-off-by: Jiandi An <jan@nvidia.com>
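
The CONTROL2 write handling and outcome stamping described above might look roughly like this in a userspace model. All names are hypothetical (`struct mock_vdev`, the `MOCK_*` macros, the `mock_*` functions), the INIT_CXL_RST bit position is assumed for illustration, and forwarding the remaining CONTROL2 bits is also an assumption of the model. Locking, PTE zapping, and the save/restore steps are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_CTL2_INIT_CXL_RST  (1u << 2)  /* bit position assumed */
#define MOCK_ST2_RESET_COMPLETE (1u << 1)
#define MOCK_ST2_RESET_ERROR    (1u << 2)

/* Hypothetical device state for the model. */
struct mock_vdev {
	bool rst_capable;         /* models the RST_CAPABLE advertisement */
	bool reset_fails;         /* test knob: make the modelled reset fail */
	uint16_t hw_control2;     /* what reaches "hardware" */
	uint16_t status2_shadow;  /* outcome bits the guest polls */
	unsigned int resets_run;  /* how many resets were driven */
};

/* Models the reset: run it, then stamp the outcome into the shadow. */
static int mock_cxl_reset(struct mock_vdev *v)
{
	int ret = v->reset_fails ? -1 : 0;

	v->resets_run++;
	v->status2_shadow = (uint16_t)(v->status2_shadow &
		~(MOCK_ST2_RESET_COMPLETE | MOCK_ST2_RESET_ERROR));
	v->status2_shadow |= ret ? MOCK_ST2_RESET_ERROR
				 : MOCK_ST2_RESET_COMPLETE;
	return ret;
}

/* Intercept INIT_CXL_RST; the bit itself is never forwarded. */
static void mock_control2_write(struct mock_vdev *v, uint16_t val)
{
	assert(v != NULL);
	if (val & MOCK_CTL2_INIT_CXL_RST) {
		if (v->rst_capable)
			(void)mock_cxl_reset(v);
		/* Not reset-capable: drop the request without logging. */
		val = (uint16_t)(val & ~MOCK_CTL2_INIT_CXL_RST);
	}
	v->hw_control2 = val;
}
```

A guest in this model would write the reset bit and then poll the shadowed STATUS2 value for either outcome bit.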
… passthrough

Enable VFIO CXL core support on amd64 and arm64 to allow CXL Type-2
device passthrough via vfio-pci.

Signed-off-by: Jiandi An <jan@nvidia.com>
@JiandiAnNVIDIA JiandiAnNVIDIA force-pushed the cxl-vfio_2026-04-23 branch from 502020b to aef7e33 Compare May 9, 2026 06:52
@nirmoy
Collaborator

nirmoy commented May 11, 2026

Latest CI is clean, just missing the LP link: nirmoy#14 (comment)
We have to wait for PR 410 to make CI happy.

Update: @JiandiAnNVIDIA you can edit the PR description to retrigger the github action

@nvmochs
Collaborator

nvmochs commented May 11, 2026

I re-reviewed the latest updates with Codex and it confirmed that the updates address findings 1 and 5 from my previous review. After providing more data from a VR system with CXL devices, I was able to conclude that findings 3 and 4 are unfounded with Strata. Finding 2 remains, but it is scoped to the CXL VFIO devices and reset paths (non-CXL VFIO devices are unaffected because vfio_cxl_finish_reset() returns when vdev->cxl == NULL), so I am not concerned about a regression.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>
