[ROCm] Add AMD GPU support to the Python backend #747

Open: jeffdaily wants to merge 1 commit into CERN:master from jeffdaily:moat-port

@jeffdaily

This adds an AMD GPU build of TIGRE's Python backend with ROCm/HIP, alongside the existing CUDA build. The CUDA path is selected by default and the AMD support is added behind USE_HIP / BUILD_WITH_HIP, so every effort has been made to leave the CUDA build untouched; the ROCm build is opt-in through environment variables. The CUDA sources were compile-checked with nvcc but were not run, as the development hosts have no NVIDIA GPU.

How to review: setup.py first (the build wiring), then Common/CUDA/cuda_to_hip.h (the CUDA->HIP symbol mapping and the texture path), then the per-source changes (texture filter mode, the Siddon ray-loop bound, and the cuRAND -> hipRAND include), then the install-doc update.

setup.py keeps the existing hand-rolled per-.cu compiler driver and adds a BUILD_WITH_HIP branch (both the Unix and the MSVC/Windows paths). With BUILD_WITH_HIP=1 it compiles the .cu sources with hipcc (-x hip --offload-arch=$HIP_ARCH, default gfx90a; comma-separated for multiple targets), links amdhip64 instead of cudart, and skips the nvcc-at-import probing that is absent on a ROCm-only host. On Windows it drives clang++ directly with --hip-path and -fms-runtime-lib=dll to match the MSVC /MD CRT.
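The back-end selection described above can be sketched in a few lines of Python. This is an illustration only, not the code in setup.py: the helper name `gpu_build_config` and its return shape are hypothetical, and emitting one `--offload-arch` flag per target is an assumption about how a comma-separated `HIP_ARCH` might be expanded. The environment variable names, the `-x hip` flag, the `gfx90a` default, and the `amdhip64` / `cudart` library names are taken from the description.

```python
import os


def gpu_build_config(env=None):
    """Return (compiler, compile_flags, runtime_library) for the chosen back end.

    Hypothetical helper for illustration; the real setup.py is organised
    differently and handles many more options.
    """
    env = os.environ if env is None else env

    if env.get("BUILD_WITH_HIP") != "1":
        # Default path: the existing CUDA build is left alone.
        return "nvcc", [], "cudart"

    # HIPCC may point at a specific compiler (e.g. clang++ on Windows).
    compiler = env.get("HIPCC", "hipcc")

    # HIP_ARCH may list several targets separated by commas; default gfx90a.
    archs = [a.strip() for a in env.get("HIP_ARCH", "gfx90a").split(",") if a.strip()]

    # Assumption: one --offload-arch flag per requested target.
    flags = ["-x", "hip"] + [f"--offload-arch={a}" for a in archs]
    return compiler, flags, "amdhip64"
```

Passing an explicit dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.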

The main porting concern is the texture path. The backprojection kernels and the interpolated forward projector sample a 3D float array with trilinear filtering (cudaFilterModeLinear + cudaReadModeElementType). Whether AMD hardware can do that fetch is architecture-dependent: gfx90a (CDNA2) rejects creation of such a texture, while RDNA (gfx1100, gfx1201) accepts it and filters in hardware. Rather than hard-code one path, cuda_to_hip.h decides at runtime: a one-time self-test creates the real texture configuration over a small ramp and confirms a known sample actually interpolates -- not merely that creation succeeded, since some hardware accepts the texture but silently point-samples, which would ship wrong results. When supported, the texture is created Linear and tex3D_TIGRE forwards to the hardware fetch; otherwise it is created Point and tex3D_TIGRE interpolates in software, point-sampling the eight neighbours and lerping with CUDA's unnormalized -0.5 texel-center convention (out-of-array neighbours read 0, matching cudaAddressModeBorder). The texture's creation mode and the sampling path are driven from the single cached verdict so they cannot disagree. On CUDA the runtime helper reports supported without testing and the AMD-only code is excluded by the preprocessor, so the original hardware-filtered textures are used.
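To make the fallback and the self-test concrete, here is a host-side Python model of both. It is a sketch under stated assumptions, not the device code in cuda_to_hip.h: the function names (`fetch_border`, `trilinear_sample`, `point_sample`, `filtering_verdict`), the 4x4x4 ramp, and the tolerance are all hypothetical. What it does take from the description is the convention itself: shift each coordinate by 0.5, take the floor and the fractional part, blend the eight neighbours, and read 0 outside the array.

```python
import math


def fetch_border(vol, i, j, k):
    """Point fetch from vol[z][y][x]; out-of-array reads return 0 (border mode)."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
        return vol[k][j][i]
    return 0.0


def trilinear_sample(vol, x, y, z):
    """Software trilinear fetch with unnormalized, texel-centred coordinates."""
    xb, yb, zb = x - 0.5, y - 0.5, z - 0.5          # -0.5 texel-centre shift
    i, j, k = math.floor(xb), math.floor(yb), math.floor(zb)
    a, b, c = xb - i, yb - j, zb - k                # fractional weights
    acc = 0.0
    for dk, wz in ((0, 1.0 - c), (1, c)):
        for dj, wy in ((0, 1.0 - b), (1, b)):
            for di, wx in ((0, 1.0 - a), (1, a)):
                acc += wx * wy * wz * fetch_border(vol, i + di, j + dj, k + dk)
    return acc


def point_sample(vol, x, y, z):
    """Nearest-texel fetch, standing in for hardware that silently point-samples."""
    return fetch_border(vol, math.floor(x), math.floor(y), math.floor(z))


def filtering_verdict(sampler, tol=1e-3):
    """Self-test idea: sample a ramp midway between two texel centres.

    The ramp stores value i at x index i, so x = 2.0 (between centres 1.5
    and 2.5) must read 1.5 if the sampler really interpolates.
    """
    ramp = [[[float(i) for i in range(4)] for _ in range(4)] for _ in range(4)]
    return abs(sampler(ramp, 2.0, 1.5, 1.5) - 1.5) <= tol
```

In this model `filtering_verdict(trilinear_sample)` passes while `filtering_verdict(point_sample)` fails, which is the distinction the self-test is meant to draw: successful texture creation alone would not separate the two.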

A separate robustness fix bounds the Siddon ray-marching loop. The loop length Np is computed from index bounds chosen by exact float-equality tests on quantities formed with __fdividef, an approximate division whose result can differ slightly between back ends; a flipped equality test could select an out-of-range bound and make Np astronomical (a data-dependent hang). A straight ray crosses at most Nx+Ny+Nz voxel-boundary planes, so Np is now capped at that geometric maximum. This cap is the one change on a shared code path; it is a no-op for every valid ray and on CUDA.
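The cap itself is a one-line clamp. The Python sketch below models it on the host; `cap_ray_steps` is a hypothetical name, and the example step counts are made up to show the two cases (a plausible count that passes through unchanged, and a corrupted count that is clamped).

```python
def cap_ray_steps(np_estimate, nx, ny, nz):
    """Clamp a Siddon step count to the geometric maximum Nx + Ny + Nz.

    Illustrative only: np_estimate stands for the loop length Np computed
    from the index bounds, which may be garbage if a float-equality test
    selected an out-of-range bound.
    """
    return min(np_estimate, nx + ny + nz)


# A plausible count for a 256^3 volume is untouched by the cap.
valid = cap_ray_steps(300, 256, 256, 256)

# A corrupted, astronomically large count is bounded instead of hanging.
bounded = cap_ray_steps(2**31 - 1, 256, 256, 256)
```

Because the clamp only takes effect when the estimate exceeds the bound, it leaves every count at or below Nx+Ny+Nz unchanged.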

RandomNumberGenerator.cu uses device-side cuRAND, mapped to hipRAND in the compat header; the directed-rounding intrinsics __fsqrt_rd / __frcp_rd (absent in HIP) map to the round-to-nearest variants, immaterial for an nRMSE/adjointness-graded reconstruction.

This work was authored with the assistance of Claude, an AI assistant by Anthropic.

Test Plan:

Linux, gfx90a (AMD Instinct MI250X), ROCm 7.2.1 -- software trilinear fallback (hardware linear fp32 textures rejected on CDNA2):

    export BUILD_WITH_HIP=1 ROCM_PATH=/opt/rocm HIP_ARCH=gfx90a
    pip install -e . --no-build-isolation

Linux, gfx1100 (RDNA3), ROCm 7.2.1, and Windows 11, gfx1201 (RDNA4), ROCm 7.14 -- hardware linear texture filtering (the self-test reports supported and the hardware fetch is used). Windows drives clang++ directly:

    BUILD_WITH_HIP=1 HIPCC=<rocm>/lib/llvm/bin/clang++.exe ROCM_PATH=<rocm> HIP_ARCH=gfx1201
    pip install -e . --no-build-isolation

On the 256^3 head phantom (cone geometry): forward project (Siddon and interpolated), back project (matched and FDK), reconstruct with FDK and OS-SART; all outputs finite. Interpolated (trilinear) projection matches Siddon to within 0.5% relative norm; adjointness <Ax(x),y> ~= <x,Atb(y)> relative residual ~1.2e-05; sart/ossart/sirt/cgls/fista and the TV-regularized asd_pocs/awasd_pocs/os_asd_pocs solvers all produce finite reconstructions. Verified with both the software-fallback path (gfx90a) and the hardware-filtering path (gfx1100, gfx1201).
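For readers unfamiliar with the two metrics quoted above, the following Python sketch shows one way they can be computed. It uses a tiny dense matrix as a stand-in for the projector, so `forward`, `backward`, `adjoint_residual`, and `relative_norm_diff` are hypothetical helpers rather than TIGRE functions, and normalising the adjointness residual by |<Ax, y>| is an assumption about how "relative residual" is defined.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(u, v))


def forward(A, x):
    """Apply A to x (stand-in for forward projection)."""
    return [dot(row, x) for row in A]


def backward(A, y):
    """Apply the transpose of A to y (stand-in for matched back projection)."""
    return [sum(A[r][c] * y[r] for r in range(len(A))) for c in range(len(A[0]))]


def adjoint_residual(fwd, bwd, x, y):
    """Relative mismatch between <fwd(x), y> and <x, bwd(y)>."""
    lhs = dot(fwd(x), y)
    rhs = dot(x, bwd(y))
    return abs(lhs - rhs) / abs(lhs)


def relative_norm_diff(a, b):
    """||a - b|| / ||b|| in the Euclidean norm."""
    num = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return num / math.sqrt(dot(b, b))


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
x, y = [1.0, 2.0, 3.0], [4.0, 5.0]

# An exact transpose pair is adjoint up to rounding.
matched = adjoint_residual(lambda v: forward(A, v), lambda v: backward(A, v), x, y)

# A back projector scaled by 1% shows up as a residual of about 1e-2.
scaled = adjoint_residual(
    lambda v: forward(A, v), lambda v: [1.01 * t for t in backward(A, v)], x, y
)
```

With real projectors the vectors would be flattened volumes and sinograms, typically filled with random values, and the residual is nonzero because the forward and back projectors are separate kernels rather than an explicit matrix and its transpose.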

The CUDA build was compile-checked with nvcc (CUDA 13.3, IS_FOR_PYTIGRE, BUILD_WITH_HIP off) -- all changed sources compile; it was not linked or run, as no NVIDIA GPU was available. Other AMD architectures are buildable via HIP_ARCH but were not exercised here.

@AnderBiguri

Member

hi @jeffdaily, spectacular PR, thanks a ton!

I don't have access to an AMD GPU at the moment, but I will try to get one ASAP and test this on my side to understand all the changes better.

@AnderBiguri mentioned this pull request Jun 22, 2026