[ROCm] Add AMD GPU support to the Python backend by jeffdaily · Pull Request #747 · CERN/TIGRE

jeffdaily · 2026-06-19T22:58:22Z

This adds an AMD GPU build of TIGRE's Python backend with ROCm/HIP, alongside the existing CUDA build. The CUDA path is selected by default and the AMD support is added behind USE_HIP / BUILD_WITH_HIP, so every effort has been made to leave the CUDA build untouched; the ROCm build is opt-in through environment variables. The CUDA sources were compile-checked with nvcc but were not run, as the development hosts have no NVIDIA GPU.

How to review: setup.py first (the build wiring), then Common/CUDA/cuda_to_hip.h (the CUDA->HIP symbol mapping and the texture path), then the per-source changes (texture filter mode, the Siddon ray-loop bound, and the cuRAND -> hipRAND include), then the install-doc update.

setup.py keeps the existing hand-rolled per-.cu compiler driver and adds a BUILD_WITH_HIP branch (on both the Unix and the MSVC/Windows paths). With BUILD_WITH_HIP=1 it compiles the .cu sources with hipcc (-x hip --offload-arch=$HIP_ARCH, default gfx90a; comma-separated for multiple targets), links amdhip64 instead of cudart, and skips the import-time nvcc probing, since nvcc is absent on a ROCm-only host. On Windows it drives clang++ directly with --hip-path and -fms-runtime-lib=dll to match the MSVC /MD CRT.
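As a rough illustration of the build wiring described above, the sketch below assembles a per-source HIP compile command from the environment. It is not the code in setup.py: the function names `hip_offload_flags` and `hip_compile_command` are hypothetical, and expanding a comma-separated HIP_ARCH into one `--offload-arch` flag per target is an assumption about how the list might be handled.

```python
import os


def hip_offload_flags(hip_arch="gfx90a"):
    """Turn a comma-separated HIP_ARCH value into --offload-arch flags.

    Assumption: one flag is emitted per target; the real setup.py may
    pass the value through differently.
    """
    archs = [a.strip() for a in hip_arch.split(",") if a.strip()]
    return [f"--offload-arch={a}" for a in archs]


def hip_compile_command(source, output, env=None):
    """Build the argument list for compiling one .cu file as HIP (hypothetical helper)."""
    env = os.environ if env is None else env
    compiler = env.get("HIPCC", "hipcc")   # HIPCC override, as in the Windows test plan
    arch = env.get("HIP_ARCH", "gfx90a")   # default target named in the description
    return [compiler, "-x", "hip", *hip_offload_flags(arch), "-c", source, "-o", output]


if __name__ == "__main__":
    print(" ".join(hip_compile_command("voxel_backprojection.cu", "voxel_backprojection.o")))
```

Returning an argument list rather than a shell string keeps the command safe to hand to `subprocess.run` without quoting concerns.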

The main porting concern is the texture path. The backprojection kernels and the interpolated forward projector sample a 3D float array with trilinear filtering (cudaFilterModeLinear + cudaReadModeElementType). Whether AMD hardware can do that fetch is architecture-dependent: gfx90a (CDNA2) rejects creation of such a texture, while RDNA (gfx1100, gfx1201) accepts it and filters in hardware. Rather than hard-code one path, cuda_to_hip.h decides at runtime: a one-time self-test creates the real texture configuration over a small ramp and confirms a known sample actually interpolates -- not merely that creation succeeded, since some hardware accepts the texture but silently point-samples, which would ship wrong results. When supported, the texture is created Linear and tex3D_TIGRE forwards to the hardware fetch; otherwise it is created Point and tex3D_TIGRE interpolates in software, point-sampling the eight neighbours and lerping with CUDA's unnormalized -0.5 texel-center convention (out-of-array neighbours read 0, matching cudaAddressModeBorder). The texture's creation mode and the sampling path are driven from the single cached verdict so they cannot disagree. On CUDA the runtime helper reports supported without testing and the AMD-only code is excluded by the preprocessor, so the original hardware-filtered textures are used.
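The software fallback and the self-test can be sketched on the host in plain Python. This is a minimal model, not the device code in cuda_to_hip.h: `fetch_border`, `tex3d_software`, and `passes_filter_self_test` are hypothetical names, the volume is a nested list indexed `vol[z][y][x]`, and the reduced-precision interpolation weights used by hardware texture units are ignored.

```python
import math


def fetch_border(vol, i, j, k):
    """Point-sample vol[k][j][i]; indices outside the array read 0 (border addressing)."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
        return vol[k][j][i]
    return 0.0


def tex3d_software(vol, x, y, z):
    """Trilinear fetch with unnormalized coordinates and texel centres at i + 0.5."""
    xb, yb, zb = x - 0.5, y - 0.5, z - 0.5            # shift to the texel-centre frame
    i, j, k = math.floor(xb), math.floor(yb), math.floor(zb)
    a, b, c = xb - i, yb - j, zb - k                  # fractional weights in [0, 1)
    acc = 0.0
    for dk, wz in ((0, 1.0 - c), (1, c)):
        for dj, wy in ((0, 1.0 - b), (1, b)):
            for di, wx in ((0, 1.0 - a), (1, a)):
                acc += wx * wy * wz * fetch_border(vol, i + di, j + dj, k + dk)
    return acc


def passes_filter_self_test(sample, tol=1e-3):
    """Check that `sample` really interpolates on a small ramp (assumed test shape)."""
    n = 4
    ramp = [[[float(i + 1) for i in range(n)] for _ in range(n)] for _ in range(n)]
    # x = 2.0 lies halfway between the centres of texels 1 and 2 (values 2 and 3).
    return abs(sample(ramp, 2.0, 1.5, 1.5) - 2.5) <= tol
```

A sampler that silently point-samples returns 2 or 3 at the probe location instead of 2.5, so it fails the check even though texture creation would have succeeded.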

A separate robustness fix bounds the Siddon ray-marching loop. The loop length Np is computed from index bounds chosen by exact float-equality tests on quantities formed with __fdividef, an approximate division whose result can differ slightly between back ends; a flipped equality test could select an out-of-range bound and make Np astronomical (a data-dependent hang). A straight ray crosses at most Nx+Ny+Nz voxel-boundary planes, so Np is now capped at that geometric maximum. This cap is the one change on a shared code path; it is a no-op for every valid ray and on CUDA.
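The cap itself amounts to a single clamp. The sketch below is a host-side illustration only; `bounded_step_count` is a hypothetical name and the kernel's actual variable handling may differ.

```python
def bounded_step_count(np_computed, nx, ny, nz):
    """Clamp a computed Siddon step count to the geometric maximum Nx + Ny + Nz.

    A valid ray already satisfies the bound, so the clamp changes nothing;
    a corrupted, astronomically large count is cut down to a finite loop.
    """
    return min(np_computed, nx + ny + nz)


if __name__ == "__main__":
    print(bounded_step_count(300, 256, 256, 256))        # valid ray: unchanged
    print(bounded_step_count(2**31 - 1, 256, 256, 256))  # corrupted count: capped
```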

RandomNumberGenerator.cu uses device-side cuRAND, mapped to hipRAND in the compat header; the directed-rounding intrinsics __fsqrt_rd / __frcp_rd (absent in HIP) map to the round-to-nearest variants, immaterial for an nRMSE/adjointness-graded reconstruction.

This work was authored with the assistance of Claude, an AI assistant by Anthropic.

Test Plan:

Linux, gfx90a (AMD Instinct MI250X), ROCm 7.2.1 -- software trilinear fallback (hardware linear fp32 textures rejected on CDNA2):

export BUILD_WITH_HIP=1 ROCM_PATH=/opt/rocm HIP_ARCH=gfx90a
pip install -e . --no-build-isolation

Linux, gfx1100 (RDNA3), ROCm 7.2.1, and Windows 11, gfx1201 (RDNA4), ROCm 7.14 -- hardware linear texture filtering (the self-test reports supported and the hardware fetch is used). Windows drives clang++ directly:

BUILD_WITH_HIP=1 HIPCC=<rocm>/lib/llvm/bin/clang++.exe ROCM_PATH=<rocm> HIP_ARCH=gfx1201
pip install -e . --no-build-isolation

On the 256^3 head phantom (cone geometry): forward project (Siddon and interpolated), back project (matched and FDK), reconstruct with FDK and OS-SART; all outputs finite. Interpolated (trilinear) projection matches Siddon to within 0.5% relative norm; adjointness <Ax(x),y> ~= <x,Atb(y)> relative residual ~1.2e-05; sart/ossart/sirt/cgls/fista and the TV-regularized asd_pocs/awasd_pocs/os_asd_pocs solvers all produce finite reconstructions. Verified with both the software-fallback path (gfx90a) and the hardware-filtering path (gfx1100, gfx1201).
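For readers unfamiliar with the adjointness check, the following sketch shows the idea on a tiny dense matrix standing in for the projector. It does not call TIGRE: `matvec` and `rmatvec` are stand-ins for the forward and back projection operators, and the normalization of the residual is an assumption, since the description does not state which one was used.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Stand-in forward operator: A @ x."""
    return [dot(row, x) for row in A]


def rmatvec(A, y):
    """Stand-in back operator: A^T @ y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def adjoint_residual(forward, backward, x, y):
    """Relative mismatch between <forward(x), y> and <x, backward(y)>."""
    lhs = dot(forward(x), y)
    rhs = dot(x, backward(y))
    return abs(lhs - rhs) / max(abs(lhs), abs(rhs))


if __name__ == "__main__":
    A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
    x, y = [1.0, 2.0, 3.0], [4.0, 5.0]
    print(adjoint_residual(lambda v: matvec(A, v), lambda w: rmatvec(A, w), x, y))
```

An exactly matched pair gives a residual of zero up to rounding; a back operator that is mis-scaled or mis-weighted shows up as a residual well above floating-point noise.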

The CUDA build was compile-checked with nvcc (CUDA 13.3, IS_FOR_PYTIGRE, BUILD_WITH_HIP off) -- all changed sources compile; it was not linked or run, as no NVIDIA GPU was available. Other AMD architectures are buildable via HIP_ARCH but were not exercised here.


AnderBiguri · 2026-06-22T09:08:47Z

hi @jeffdaily, spectacular PR, thanks a ton!

I don't have access to an AMD GPU at the moment, but I will try to get one ASAP and test this on my side to understand all the changes better.

AnderBiguri mentioned this pull request on Jun 22, 2026, in "HIP interface" #646 (closed).

jeffdaily wants to merge 1 commit into CERN:master from jeffdaily:moat-port
