Draft
23 commits
a91ea20
Added slash_compat.h force include and fix from_timer backport issue …
amd-vserbu Jun 8, 2026
ed0f69d
driver: large-page qdma transfers, NoC channel steering, libqdma patches
amd-vserbu Jun 8, 2026
80745c8
libvrtd: hugepage host buffers and granule-aware partial sync
amd-vserbu Jun 8, 2026
91cc4f3
smi: validate bandwidth modes with raw SLASH and stock qdma backends
amd-vserbu Jun 8, 2026
2816573
packaging: ship libqdma patches, add deps, harden install test, ignor…
amd-vserbu Jun 8, 2026
96f7b05
driver: guard libqdma pr_fmt under force-included compat header
amd-vserbu Jun 10, 2026
6a5a3e9
driver: verify qdma host-profile readback and add tunable hugepage de…
amd-vserbu Jun 10, 2026
b2a9e24
smi: add validate buffer placement, channel allocation, and bandwidth…
amd-vserbu Jun 10, 2026
fe8072f
docs: document new validate placement and bandwidth options
amd-vserbu Jun 10, 2026
bbb568f
Added 4k|2M explicit page specification
amd-vserbu Jun 10, 2026
5bd5345
driver: add qdma registered-buffer abi with pinned, pre-dma-mapped tr…
amd-vserbu Jun 12, 2026
b94e319
libslash: add qdma buffer register/unregister/transfer wrappers and mock
amd-vserbu Jun 12, 2026
5dc6e4f
tests: cover qdma registered-buffer kselftest abi
amd-vserbu Jun 12, 2026
23dca65
docs: document qdma registered-buffer abi
amd-vserbu Jun 12, 2026
a8146eb
smi: add --ring-size-index and use registered buffers for raw transfers
amd-vserbu Jun 12, 2026
a037eb1
docs: document validate --ring-size-index option
amd-vserbu Jun 12, 2026
8a74b03
vrtd: plumb mm-channel selection through buffer open
amd-vserbu Jun 12, 2026
c36c43c
driver+libslash: added transfer performance hint
amd-vserbu Jun 12, 2026
3707e38
vrt/vrtd: use new performance buffer ioctl api
amd-vserbu Jun 12, 2026
37cdd75
qdma stack: change policy from dual-channel to the more complex v80 p…
amd-vserbu Jun 15, 2026
5f7fe3d
driver+vrt+smi: drop libqdma sg/channel patches, make transfers 4 KiB…
amd-vserbu Jun 15, 2026
817670d
Changed API to kernel allocated-buffers
amd-vserbu Jun 17, 2026
6599ce7
vrtd: track client ownership on raw buffers and close per-owner
amd-vserbu Jun 17, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -85,3 +85,6 @@ driver/kcompat/.scratch/

# Python test coverage
.coverage

# Project-local scratch space
/tmp/
247 changes: 212 additions & 35 deletions docs/reference/kernel-abi/index.rst

Large diffs are not rendered by default.

154 changes: 152 additions & 2 deletions docs/reference/smi/commands.rst
@@ -151,11 +151,27 @@ validate
--------

Run memory integrity and bandwidth tests against a board's HBM and DDR
subsystems. For each memory path, bandwidth is reported as single-direction
C2H read, single-direction H2C write, and simultaneous bidirectional
throughput (read, write, and total). After the per-memory phases, a final
parallel phase drives HBM and DDR simultaneously, using ``2 * N`` buffers for
single-direction tests and ``4 * N`` buffers for bidirectional tests; this
phase is skipped when ``--ddr-only`` or ``--hbm-only`` is given.

.. code-block:: text

v80-smi validate -d <BDF> [-j|--threads <N>] [-R|--no-reset]
    [--mm-channel <spec>] [--buffer-size <size>] [--offset <size>]
    [--starting-offset <size>] [--raw-transfer-test | --use-qdma-driver]
    [--ddr-only | --hbm-only] [--channel-allocation <auto|paired>]
    [--channel-region-stride <size>] [--ring-size-index <0-15>]
    [--bandwidth-iterations <N>] [--bandwidth-duration <seconds>]

Requirements by mode:

* Default mode uses VRTD buffers, requires a running VRTD daemon, and resets
the board unless ``--no-reset`` is given.
* ``--raw-transfer-test`` bypasses VRTD for transfers and requires the SLASH
QDMA driver device node for the board. It skips reset.
* ``--use-qdma-driver`` bypasses both VRTD and SLASH for transfers and requires
the stock ``qdma-pf`` driver to be bound to the board's QDMA PF. This backend
is built only when ``SMI_ENABLE_QDMA_DRIVER_BACKEND`` is enabled at CMake
configure time.

.. option:: -d, --device <BDF>

@@ -164,6 +180,140 @@ subsystems.
.. option:: -j, --threads <N>

Number of parallel buffers/threads for the validation test (1–64, default 8).
Bidirectional phases use ``2 * N`` logical positions in each enabled memory
space.

.. option:: --buffer-size <size>

Size of each test buffer. Values may be bare bytes or use ``k``/``K`` or
``m``/``M`` suffixes. The default and maximum are ``512M``. Values must be
4 KiB-aligned.

.. option:: --offset <size>

Distance between logical buffer positions. The default is ``512M``. Values
may be bare bytes or use ``k``/``K`` or ``m``/``M`` suffixes, must be
4 KiB-aligned, and must be at least ``--buffer-size`` so buffers do not
overlap.

.. option:: --starting-offset <size>

Offset from each memory-space base for logical position 0. The default is
``0``. Values may be bare bytes or use ``k``/``K`` or ``m``/``M`` suffixes
and must be 4 KiB-aligned.

Buffers are placed at ``memory_base + starting_offset + position * offset``.
Single-direction phases use positions ``0..N-1``. Bidirectional phases use
positions ``0..2N-1`` with reads on even positions and writes on odd positions.
The full range must remain inside the 64 x 512 MB DDR/HBM address space. If any
placement option is specified in default VRTD mode, ``validate`` uses raw VRTD
buffers so the exact addresses are honored; this requires raw memory access
permission.

The largest phase maps up to ``4 * N * buffer-size`` of host buffers when both
HBM and DDR are enabled, or ``2 * N * buffer-size`` with ``--ddr-only`` or
``--hbm-only``; the command fails early if that exceeds currently available
host memory.
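
The placement rule and the host-memory bound above can be sketched in a few lines of Python. This is only an illustration, not the ``v80-smi`` implementation: the helper names (``parse_size``, ``plan_positions``, ``peak_host_bytes``) are hypothetical, and the sketch assumes binary multiples for the ``k``/``m`` suffixes and for the "64 x 512 MB" address space.

```python
KIB = 1024
MIB = 1024 * KIB
ALIGN = 4 * KIB
# ASSUMPTION: "64 x 512 MB" is read as 64 x 512 MiB (32 GiB) per memory space.
SPACE_BYTES = 64 * 512 * MIB


def parse_size(text):
    """Parse bare bytes or a k/K, m/M suffix (binary multiples assumed)."""
    multipliers = {"k": KIB, "m": MIB}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


def plan_positions(memory_base, threads, buffer_size, offset,
                   starting_offset=0, bidirectional=False):
    """Return (position, address, role) tuples for one memory space."""
    for name, value in (("buffer-size", buffer_size), ("offset", offset),
                        ("starting-offset", starting_offset)):
        if value % ALIGN:
            raise ValueError(f"--{name} must be 4 KiB-aligned")
    if offset < buffer_size:
        raise ValueError("--offset must be at least --buffer-size")
    # Single-direction phases use 0..N-1, bidirectional phases use 0..2N-1.
    count = 2 * threads if bidirectional else threads
    end = starting_offset + (count - 1) * offset + buffer_size
    if end > SPACE_BYTES:
        raise ValueError("placement exceeds the memory address space")
    plan = []
    for position in range(count):
        address = memory_base + starting_offset + position * offset
        if bidirectional:
            role = "read" if position % 2 == 0 else "write"
        else:
            role = "single"
        plan.append((position, address, role))
    return plan


def peak_host_bytes(threads, buffer_size, both_memories=True):
    """Upper bound on host buffers mapped by the largest phase."""
    factor = 4 if both_memories else 2
    return factor * threads * buffer_size
```

With the defaults (``-j 8``, ``512M`` buffers, both memories enabled) the bound works out to ``4 * 8 * 512 MiB``, which is 16 GiB of host memory under these assumptions.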

.. option:: -R, --no-reset

Skip the device reset step before running memory tests.

.. option:: --mm-channel <spec>

AXI-MM / NoC channel selection for each buffer's QDMA queue pair, in every
mode. ``spec`` is either a single value applied to all buffers, or a
comma-separated list giving one channel per logical buffer position
(exactly ``2 x --threads`` entries; there is no repeating/wrap, and any
other length is an error):

* ``auto`` (the default) lets the driver stripe queues across both channels
by ``qid & 1``.
* ``0`` / ``1`` pin the queue to that AXI-MM channel (and hence NoC channel).
* A list assigns channels by position: with ``-j 1``, the list ``0,1`` puts
buffer position 0 on channel 0 and position 1 on channel 1. Bidirectional
phases use positions ``0..2N-1``; single-direction phases use the first ``N``
entries.

This is independent of ``--channel-allocation`` (which controls the device
address): ``--mm-channel`` controls the host-side NoC ingress (NMU) per
queue. With ``--use-qdma-driver`` the selection maps to the stock driver's
per-queue MM-channel attribute.
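
As a rough sketch of the ``spec`` rules described for ``--mm-channel``, the Python below expands a spec into one entry per logical position. The function names are hypothetical, and whether ``auto`` may appear inside a comma-separated list is an assumption here; the text only states that a list gives one channel per position.

```python
def parse_mm_channel(spec, threads):
    """Expand a --mm-channel spec into 2 * threads per-position entries."""

    def parse_token(token):
        token = token.strip()
        if token == "auto":
            return "auto"
        if token in ("0", "1"):
            return int(token)
        raise ValueError(f"invalid mm-channel value: {token!r}")

    positions = 2 * threads
    tokens = spec.split(",")
    if len(tokens) == 1:
        # A single value applies to every buffer position.
        return [parse_token(tokens[0])] * positions
    if len(tokens) != positions:
        # No repeating or wrapping: any other length is an error.
        raise ValueError(f"expected {positions} entries, got {len(tokens)}")
    # ASSUMPTION: "auto" is also accepted as a list entry.
    return [parse_token(token) for token in tokens]


def effective_channel(selection, qid):
    """Resolve a selection to a channel; auto stripes by qid & 1."""
    return qid & 1 if selection == "auto" else selection
```

For example, ``parse_mm_channel("0,1", 1)`` yields ``[0, 1]``, while the same list with ``-j 2`` is rejected because four entries would be required.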

.. option:: --raw-transfer-test

Use libslash raw QDMA transfers instead of VRTD buffers. This mode implies
``--no-reset`` and requires the SLASH QDMA driver device to be present.

.. option:: --use-qdma-driver

Run the raw transfer test over the off-the-shelf Xilinx QDMA driver
(``/dev/qdma<idx>-MM-<qid>``) instead of SLASH. smi provisions the queues
itself: it raises the function's ``qmax`` via sysfs if needed, creates and
starts bidirectional AXI-MM queue pairs over generic netlink (the same
``xnl_pf`` interface ``dma-ctl`` uses), then transfers over the per-queue
char devices. Queue pairs are spread round-robin across the function's MM
engine channels (``channel = qid % mm_channel_max``); the CPM5 QDMA on the
V80 exposes two, so the test exercises both. This mode implies
``--no-reset`` and is mutually exclusive with ``--raw-transfer-test``. It
requires the stock ``qdma-pf`` driver to be bound to the board's PF (it
cannot be bound at the same time as the SLASH driver), and typically
requires root to raise ``qmax`` and open the queue devices.

.. option:: --ddr-only

Run only the DDR memory tests and skip the HBM phase. Mutually exclusive
with ``--hbm-only``.

.. option:: --hbm-only

Run only the HBM memory tests and skip the DDR phase. Mutually exclusive
with ``--ddr-only``.

.. option:: --channel-allocation <auto|paired>

Raw-transfer-only (``--raw-transfer-test`` or ``--use-qdma-driver``) control
over how QDMA MM/NoC channels map onto device memory. On CPM5 the host-side
NoC ingress port (NMU) is chosen per queue by the SW-context
mm-channel/host_id (SLASH uses ``qid & 1``), while the memory-side NoC egress
endpoint (NSU / pseudo-channel) is chosen by the device address. Default
``auto`` keeps the historical behaviour: channel ``qid & 1`` with linear
addressing, so both NMUs can converge on a single NSU and bandwidth caps at
one path. ``paired`` couples the two: even positions land in memory region 0
on channel 0, odd positions in region 1 on channel 1 (one
``--channel-region-stride`` apart), giving two independent NMU->NSU paths.
This mirrors the off-the-shelf ``dma-perf`` ``offset_ch0``/``offset_ch1``
knobs and is the placement that lets both NoC ports contribute bandwidth.

.. option:: --channel-region-stride <size>

In ``--channel-allocation paired`` mode, the byte distance between the two
per-channel memory regions (the NSU / pseudo-channel stride). Default ``16G``
(== half the per-memory address space, matching the dma-perf HBM
``offset_ch1 - offset_ch0`` spacing). Must be a non-zero multiple of 4 KiB.
Accepts bare bytes or ``k``/``K``, ``m``/``M``, ``g``/``G`` suffixes.
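
One way to picture ``paired`` placement is the sketch below. It is an assumption-laden illustration rather than the tool's algorithm: the text states only that even positions land in region 0 on channel 0 and odd positions in region 1 on channel 1, one stride apart. How positions are laid out inside a region (here, pair ``k`` sits at ``k * offset``) is a guess, and ``paired_placement`` is a hypothetical name.

```python
ALIGN = 4096
GIB = 1024 ** 3


def paired_placement(memory_base, position, offset,
                     region_stride=16 * GIB, starting_offset=0):
    """Return (channel, device_address) for one logical position."""
    if region_stride == 0 or region_stride % ALIGN:
        raise ValueError("stride must be a non-zero multiple of 4 KiB")
    # Even positions use channel 0 / region 0, odd use channel 1 / region 1.
    channel = position & 1
    # ASSUMPTION: positions 2k and 2k+1 share slot k within their regions,
    # so each pair is exactly one region stride apart.
    slot = position // 2
    address = (memory_base + channel * region_stride
               + starting_offset + slot * offset)
    return channel, address
```

Under these assumptions, positions 0 and 1 map to ``memory_base`` and ``memory_base + 16G``, so each channel targets a different memory region instead of both converging on one.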

.. option:: --ring-size-index <0-15>

Raw-transfer-only (``--raw-transfer-test`` or ``--use-qdma-driver``).
Override the QDMA descriptor-ring size index used when creating SLASH raw
queue pairs or starting stock-driver queues. When omitted, each backend keeps
its existing default. Useful A/B values for 4 KiB descriptor throughput are
``0``, ``11``, ``13``, and ``15``.

.. option:: --bandwidth-iterations <N>

Raw-transfer-only (``--raw-transfer-test`` or ``--use-qdma-driver``). Repeat
each whole-buffer transfer in every bandwidth phase ``N`` times and report
bandwidth over the sustained loop. The default is ``1``, which preserves the
historical one-shot measurement.

.. option:: --bandwidth-duration <seconds>

Raw-transfer-only duration mode. When non-zero, each bandwidth phase repeats
whole-buffer transfers until the requested wall-clock duration has elapsed
and counts only completed transfers. This is useful for comparing SLASH's raw
path against long-running tools such as ``dma-perf``. A value of ``0`` uses
``--bandwidth-iterations`` instead.
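
The difference between iteration mode and duration mode can be sketched as a small measurement loop. This is a simplified model with hypothetical names (``measure_bandwidth``, the ``transfer`` callback, the injectable ``clock``), not the code in ``v80-smi``; it assumes each call to ``transfer`` moves one whole buffer and returns only when that transfer has completed.

```python
import time


def measure_bandwidth(transfer, buffer_size, iterations=1, duration=0.0,
                      clock=time.monotonic):
    """Return bytes per second over a sustained loop of whole-buffer transfers."""
    start = clock()
    completed = 0
    if duration > 0:
        # Duration mode: start new transfers until the deadline passes.
        deadline = start + duration
        while clock() < deadline:
            transfer()
            completed += 1
    else:
        # Iteration mode: a fixed number of whole-buffer transfers.
        for _ in range(iterations):
            transfer()
            completed += 1
    elapsed = clock() - start
    # Only completed transfers contribute bytes to the result.
    return completed * buffer_size / elapsed if elapsed > 0 else 0.0


# Deterministic demo with a fake clock: every transfer "takes" one second.
_now = [0.0]


def _fake_clock():
    return _now[0]


def _fake_transfer():
    _now[0] += 1.0


duration_result = measure_bandwidth(_fake_transfer, 100, duration=2.5,
                                    clock=_fake_clock)
iteration_result = measure_bandwidth(_fake_transfer, 100, iterations=4,
                                     clock=_fake_clock)
```

In this model the elapsed time includes the final transfer even if it overruns the deadline, so the reported rate reflects the time actually spent moving data.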

debug
-----
101 changes: 98 additions & 3 deletions driver/Makefile
@@ -42,15 +42,28 @@ else
LIBQDMA_PATH := $(LIBQDMA_FALLBACK)
endif

# SLASH carries a few local modifications to the pinned QDMA submodule's
# libqdma sources (see $(LIBQDMA_PATCH_DIR)/). The submodule itself stays
# pristine; the patches are applied to whichever libqdma tree is being built
# (the DKMS-local ./libqdma or the in-tree submodule) by the libqdma-patches
# target before the module is compiled. See that target for details.
LIBQDMA_PATCH_DIR := patches

SLASH_QDMA_OP_DEBUG ?= 0

# Per-transfer timing instrumentation. Set to 1 to emit one dmesg line per
# DMA transfer breaking down the kernel phases. Default off (zero overhead).
SLASH_QDMA_TIMING ?= 0

# Kcompat feature flags. Defaults are "n"; the all: recipe runs
# driver/kcompat/probe.sh against $(KDIR) to detect the actual values
# and passes them into the kbuild recursion. Each pair (modern API +
# legacy fallback) is covered by one probe — if the modern form is
# absent, the legacy form is the unconditional fallback in slash_compat.h.
SLASH_HAVE_VM_FLAGS_SET ?= n
SLASH_HAVE_MODULE_IMPORT_NS_TOKEN ?= n
SLASH_HAVE_URING_CMD ?= n
SLASH_HAVE_URING_SQE_CMD ?= n

# Set GCOV=1 to instrument the module for kernel gcov coverage.
# Not set by default — never enable this in production builds.
@@ -72,6 +85,7 @@ ccflags-y += \
\
-DTANDEM_BOOT_SUPPORTED=1 \
-DSLASH_QDMA_OP_DEBUG=$(SLASH_QDMA_OP_DEBUG) \
-DSLASH_QDMA_TIMING=$(SLASH_QDMA_TIMING) \
-DSLASH_VERSION_STR=\"$(SLASH_VERSION)\"

ifeq ($(SLASH_HAVE_VM_FLAGS_SET),y)
@@ -82,6 +96,25 @@ ifeq ($(SLASH_HAVE_MODULE_IMPORT_NS_TOKEN),y)
ccflags-y += -DSLASH_HAVE_MODULE_IMPORT_NS_TOKEN
endif

# Optional io_uring uring_cmd async transfer path. Probed by kcompat; absent on
# kernels without CONFIG_IO_URING or uring_cmd support (e.g. RHEL 9, Ubuntu
# 22.04 GA), where the synchronous transfer ioctl remains the only path.
ifeq ($(SLASH_HAVE_URING_CMD),y)
ccflags-y += -DSLASH_HAVE_URING_CMD
endif

# Selects the io_uring SQE payload accessor: io_uring_sqe_cmd(cmd->sqe) when
# present (newer kernels + distro backports), else cmd->cmd. Only meaningful
# when SLASH_HAVE_URING_CMD is also set.
ifeq ($(SLASH_HAVE_URING_SQE_CMD),y)
ccflags-y += -DSLASH_HAVE_URING_SQE_CMD
endif

# Force-include the compat header into every TU (including the pinned libqdma
# submodule sources we don't modify) so kernel-API shims such as from_timer()
# reach third-party code too. Safe on all kernels: the shims are guarded.
ccflags-y += -include $(src)/slash_compat.h


LIBQDMA_OBJS := \
$(LIBQDMA_PATH)/qdma_mbox.o \
@@ -120,18 +153,80 @@ $(MODULE)-objs += $(LIBQDMA_OBJS) $(QDMA_ACCESS_OBJS)

KCOMPAT := "$(SHELL)" "$(PWD)/kcompat/probe.sh"

all: libqdma-patches
@flags="$$($(KCOMPAT) "$(KDIR)" | tr '\n' ' ')"; \
echo "slash: kcompat: $$flags"; \
$(MAKE) -C "$(KDIR)" M="$(PWD)" $$flags modules

# Apply SLASH's local libqdma patches ($(LIBQDMA_PATCH_DIR)/*.patch) to the
# libqdma source tree in use, in filename order, right before building.
#
# The pinned submodule is not edited directly by commits: patches live in-tree
# and are stamped onto the working copy here. Application is idempotent: each
# patch is first tested for being already applied (reverse dry-run) and skipped
# if so, which makes repeated `make` runs, incremental builds, and DKMS
# rebuilds safe. A patch that neither applies cleanly nor is already present
# aborts the build.
#
# $(PWD) is the driver dir for both `make` (in-tree) and DKMS (MAKE[0] runs
# `make -C driver ...`); ./libqdma is the DKMS-packaged copy, otherwise fall
# back to the in-tree submodule path. Uses patch(1) so it is independent of
# whether the libqdma tree lives inside a git checkout.
libqdma-patches:
@set -e; \
patch_dir="$(PWD)/$(LIBQDMA_PATCH_DIR)"; \
set -- "$$patch_dir"/*.patch; \
if [ ! -e "$$1" ]; then exit 0; fi; \
if [ -d "$(PWD)/libqdma" ]; then lq="$(PWD)/libqdma"; \
else lq="$(PWD)/$(LIBQDMA_FALLBACK)"; fi; \
if [ ! -d "$$lq" ]; then \
echo "slash: ERROR libqdma sources not found at $$lq" >&2; \
echo "slash: run 'git submodule update --init --recursive' first" >&2; \
exit 1; \
fi; \
command -v patch >/dev/null 2>&1 || { \
echo "slash: ERROR patch(1) not found; it is required to apply libqdma patches" >&2; \
exit 1; }; \
for p in "$$@"; do \
name="$$(basename "$$p")"; \
if patch -R -p1 -d "$$lq" --dry-run -f -s -i "$$p" >/dev/null 2>&1; then \
echo "slash: libqdma patch already applied, skipping: $$name"; \
elif patch -p1 -d "$$lq" --dry-run -f -s -i "$$p" >/dev/null 2>&1; then \
echo "slash: applying libqdma patch: $$name"; \
patch -p1 -d "$$lq" -f -s -i "$$p"; \
else \
echo "slash: ERROR libqdma patch does not apply cleanly: $$name" >&2; \
echo "slash: (libqdma tree at $$lq is neither pristine nor already patched)" >&2; \
exit 1; \
fi; \
done

# Best-effort revert of the libqdma patches, restoring the submodule working
# copy to pristine. Useful when editing the patches themselves. Never fails the
# build: patches that are not currently applied are simply skipped.
unpatch-libqdma:
@set -e; \
patch_dir="$(PWD)/$(LIBQDMA_PATCH_DIR)"; \
set -- "$$patch_dir"/*.patch; \
if [ ! -e "$$1" ]; then exit 0; fi; \
if [ -d "$(PWD)/libqdma" ]; then lq="$(PWD)/libqdma"; \
else lq="$(PWD)/$(LIBQDMA_FALLBACK)"; fi; \
[ -d "$$lq" ] || exit 0; \
for p in $$(printf '%s\n' "$$@" | tac); do \
name="$$(basename "$$p")"; \
if patch -R -p1 -d "$$lq" --dry-run -f -s -i "$$p" >/dev/null 2>&1; then \
echo "slash: reverting libqdma patch: $$name"; \
patch -R -p1 -d "$$lq" -f -s -i "$$p"; \
fi; \
done

clean:
-$(MAKE) -C "$(KDIR)" M="$(PWD)" clean
rm -rf "$(PWD)/kcompat/.scratch"
$(MAKE) unpatch-libqdma

install: all
sudo install -d -m 755 /lib/modules/$(shell uname -r)/extra
sudo install -m 644 $(MODULE).ko /lib/modules/$(shell uname -r)/extra
sudo depmod -a

.PHONY: all clean install libqdma-patches unpatch-libqdma
48 changes: 48 additions & 0 deletions driver/README.md
@@ -1,10 +1,58 @@
# SLASH kernel module

## Module parameters

Exposed under `/sys/module/slash/parameters/` (all writable at runtime; see
`modinfo slash.ko`):

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `qdma_num_threads` | uint | 8 | Number of libqdma worker threads. |
| `qdma_debugfs_path` | charp | disabled | debugfs mount path for libqdma. |

### A/B testing NoC channel bandwidth

The AXI-MM / NoC channel is chosen per queue pair when it is added (the
`mm_channel` field of the qpair-add ioctl, `enum slash_qdma_mm_channel`):
`auto` stripes queues across both channels by `qid & 1`, while `0` / `1` pin a
queue to a single channel. Every queue creator carries this setting, so it can
be driven per buffer to check whether both PCIe NMUs (NoC channels) actually
contribute bandwidth. With `v80-smi validate`:

```sh
# All queues on NoC channel 0 (NMU S00)
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel 0

# All queues on NoC channel 1 (NMU S01)
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel 1

# Split across both channels (qid & 1)
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel auto

# Explicit per-buffer split (even positions -> channel 0, odd -> channel 1)
sudo v80-smi validate -d <BDF> --raw-transfer-test --no-reset --mm-channel 0,1
```

Debug builds with `SLASH_QDMA_OP_DEBUG=1` log each queue's selected
`mm_channel` when it is added. If the split run is no faster than a single
forced channel, traffic is not being spread across both NMUs. The per-queue
setting affects every queue created through this driver (both the VRTD buffer
path and `--raw-transfer-test`); the off-the-shelf Xilinx QDMA driver path
(`--use-qdma-driver`) honors `--mm-channel` through its own channel attribute.

## Local libqdma patches

SLASH carries small patches for the pinned `libqdma` submodule under
`driver/patches/`. The driver `Makefile` applies them before building, and
`make clean` attempts to revert them so the submodule working copy returns to
its pristine pinned state. DKMS packages include the same patch directory and
depend on `patch(1)`.

## Testing

The test suite requires a physical V80 to be present and the module to be
loaded into a running kernel.

### Prerequisites

- A kernel built with `CONFIG_GCOV_KERNEL=y` (only needed for coverage runs).