[ROCm] Add AMD GPU support to the Python backend #747
Open
jeffdaily wants to merge 1 commit into
This adds an AMD GPU build of TIGRE's Python backend with ROCm/HIP, alongside the existing CUDA build. The CUDA path is selected by default and the AMD support is added behind USE_HIP / BUILD_WITH_HIP, so every effort has been made to leave the CUDA build untouched; the ROCm build is opt-in through environment variables. The CUDA sources were compile-checked with nvcc but were not run, as the development hosts have no NVIDIA GPU.
How to review: setup.py first (the build wiring), then Common/CUDA/cuda_to_hip.h (the CUDA->HIP symbol mapping and the texture path), then the per-source changes (texture filter mode, the Siddon ray-loop bound, and the cuRAND -> hipRAND include), then the install-doc update.
setup.py keeps the existing hand-rolled per-.cu compiler driver and adds a BUILD_WITH_HIP branch (on both the Unix and the MSVC/Windows paths). With BUILD_WITH_HIP=1 it compiles the .cu sources with hipcc (-x hip --offload-arch=$HIP_ARCH, default gfx90a; comma-separated for multiple targets), links amdhip64 instead of cudart, and skips the import-time probing for nvcc, which is absent on a ROCm-only host. On Windows it drives clang++ directly with --hip-path and -fms-runtime-lib=dll to match the MSVC /MD CRT.
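As a rough illustration of the opt-in wiring described above, the following is a minimal sketch, not the code in this PR's setup.py. The function name `gpu_build_config`, the returned triple, and the choice to emit one `--offload-arch` flag per target are assumptions made for the example; only the environment variable names, the `-x hip` flag, the gfx90a default, and the amdhip64/cudart libraries come from the description.

```python
import os


def gpu_build_config(env=None):
    """Return a hypothetical (compiler, extra_flags, libraries) triple.

    Illustrative only: the real setup.py drives the compiler per .cu file
    and has separate Unix and Windows paths.
    """
    env = os.environ if env is None else env

    if env.get("BUILD_WITH_HIP") != "1":
        # Default path: the existing CUDA build, left unchanged.
        return "nvcc", [], ["cudart"]

    # HIPCC lets the caller point at a specific compiler (e.g. clang++ on Windows).
    compiler = env.get("HIPCC", "hipcc")

    # HIP_ARCH may hold several comma-separated targets; gfx90a is the default.
    archs = [a.strip() for a in env.get("HIP_ARCH", "gfx90a").split(",") if a.strip()]

    # Assumption: one --offload-arch flag is emitted per target.
    flags = ["-x", "hip"] + ["--offload-arch=" + a for a in archs]
    return compiler, flags, ["amdhip64"]
```

Passing a plain dict instead of `os.environ` keeps the selection logic easy to exercise without touching the real environment.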
The main porting concern is the texture path. The backprojection kernels and the interpolated forward projector sample a 3D float array with trilinear filtering (cudaFilterModeLinear + cudaReadModeElementType). Whether AMD hardware can do that fetch is architecture-dependent: gfx90a (CDNA2) rejects creation of such a texture, while RDNA (gfx1100, gfx1201) accepts it and filters in hardware. Rather than hard-code one path, cuda_to_hip.h decides at runtime: a one-time self-test creates the real texture configuration over a small ramp and confirms a known sample actually interpolates -- not merely that creation succeeded, since some hardware accepts the texture but silently point-samples, which would ship wrong results. When supported, the texture is created Linear and tex3D_TIGRE forwards to the hardware fetch; otherwise it is created Point and tex3D_TIGRE interpolates in software, point-sampling the eight neighbours and lerping with CUDA's unnormalized -0.5 texel-center convention (out-of-array neighbours read 0, matching cudaAddressModeBorder). The texture's creation mode and the sampling path are driven from the single cached verdict so they cannot disagree. On CUDA the runtime helper reports supported without testing and the AMD-only code is excluded by the preprocessor, so the original hardware-filtered textures are used.
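The software fallback and the self-test can be sketched in a few lines of host-side Python. This is an illustration of the technique under stated assumptions, not the device code in cuda_to_hip.h: the names `fetch`, `tex3d_software`, and `interpolation_self_test`, the 4x2x2 ramp, and the tolerance are all hypothetical, and the weights here use full floating-point precision, whereas hardware filtering typically uses reduced-precision fixed-point weights.

```python
import math


def fetch(vol, ix, iy, iz):
    """Point-sample vol[iz][iy][ix]; out-of-array indices read 0 (border addressing)."""
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
        return vol[iz][iy][ix]
    return 0.0


def tex3d_software(vol, x, y, z):
    """Trilinear lookup with unnormalized coordinates and texel centres at i + 0.5."""
    xb, yb, zb = x - 0.5, y - 0.5, z - 0.5
    ix, iy, iz = math.floor(xb), math.floor(yb), math.floor(zb)
    fx, fy, fz = xb - ix, yb - iy, zb - iz
    acc = 0.0
    # Blend the eight neighbours with the product of per-axis lerp weights.
    for dz, wz in ((0, 1.0 - fz), (1, fz)):
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            for dx, wx in ((0, 1.0 - fx), (1, fx)):
                acc += wx * wy * wz * fetch(vol, ix + dx, iy + dy, iz + dz)
    return acc


def interpolation_self_test(sample):
    """Check that `sample` really interpolates rather than point-samples.

    The ramp stores the x index in every texel, so a sample halfway between
    the centres of texels 1 and 2 must return 1.5.
    """
    ramp = [[[float(i) for i in range(4)] for _ in range(2)] for _ in range(2)]
    return abs(sample(ramp, 2.0, 1.0, 1.0) - 1.5) < 1e-3
```

A sampler that merely truncates coordinates, such as `lambda v, x, y, z: fetch(v, math.floor(x), math.floor(y), math.floor(z))`, fails this check even though it returns a finite value, which is the case the self-test is meant to catch.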
A separate robustness fix bounds the Siddon ray-marching loop. The loop length Np is computed from index bounds chosen by exact float-equality tests on quantities formed with __fdividef, an approximate division whose result can differ slightly between back ends; a flipped equality test could select an out-of-range bound and make Np astronomical (a data-dependent hang). A straight ray crosses at most Nx+Ny+Nz voxel-boundary planes, so Np is now capped at that geometric maximum. This cap is the one change on a shared code path; it is a no-op for every valid ray and on CUDA.
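The cap itself is a one-line clamp. The sketch below shows the idea in Python with a hypothetical helper name, `capped_step_count`; the actual change lives in the Siddon kernel and may be written differently. The bound Nx+Ny+Nz is taken directly from the description above.

```python
def capped_step_count(np_raw, nx, ny, nz):
    """Clamp the Siddon loop length to the geometric maximum from the description.

    np_raw is the step count derived from the index bounds; it can be
    nonsensically large if a float-equality test picks an out-of-range bound.
    """
    max_crossings = nx + ny + nz
    return min(np_raw, max_crossings)


# A valid ray is unaffected; a corrupted count is bounded.
valid = capped_step_count(300, 256, 256, 256)
runaway = capped_step_count(2**31 - 1, 256, 256, 256)
```

Because the clamp only ever lowers values above the bound, rays whose computed count is already within it take exactly the same number of steps as before.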
RandomNumberGenerator.cu uses device-side cuRAND, mapped to hipRAND in the compat header; the directed-rounding intrinsics __fsqrt_rd / __frcp_rd (absent in HIP) map to the round-to-nearest variants, immaterial for an nRMSE/adjointness-graded reconstruction.
This work was authored with the assistance of Claude, an AI assistant by Anthropic.
Test Plan:
Linux, gfx90a (AMD Instinct MI250X), ROCm 7.2.1 -- software trilinear fallback (hardware linear fp32 textures rejected on CDNA2):
export BUILD_WITH_HIP=1 ROCM_PATH=/opt/rocm HIP_ARCH=gfx90a
pip install -e . --no-build-isolation
Linux, gfx1100 (RDNA3), ROCm 7.2.1, and Windows 11, gfx1201 (RDNA4), ROCm 7.14 -- hardware linear texture filtering (the self-test reports supported and the hardware fetch is used). Windows drives clang++ directly:
BUILD_WITH_HIP=1 HIPCC=<rocm>/lib/llvm/bin/clang++.exe ROCM_PATH=<rocm> HIP_ARCH=gfx1201
pip install -e . --no-build-isolation
On the 256^3 head phantom (cone geometry): forward project (Siddon and interpolated), back project (matched and FDK), reconstruct with FDK and OS-SART; all outputs finite. Interpolated (trilinear) projection matches Siddon to within 0.5% relative norm; adjointness <Ax(x),y> ~= <x,Atb(y)> relative residual ~1.2e-05; sart/ossart/sirt/cgls/fista and the TV-regularized asd_pocs/awasd_pocs/os_asd_pocs solvers all produce finite reconstructions. Verified with both the software-fallback path (gfx90a) and the hardware-filtering path (gfx1100, gfx1201).
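For readers unfamiliar with the adjointness check, the following is a minimal sketch of how such a relative residual can be computed. It uses a small dense matrix in place of the projector pair, and the normalization by the larger of the two inner products is an assumption; the exact definition behind the ~1.2e-05 figure above is not stated in this description. All function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matvec(a, x):
    """Forward operator stand-in: A @ x for a dense list-of-rows matrix."""
    return [dot(row, x) for row in a]


def rmatvec(a, y):
    """Adjoint operator stand-in: A^T @ y."""
    return [dot(col, y) for col in zip(*a)]


def adjointness_residual(forward, adjoint, x, y):
    """Relative mismatch between <forward(x), y> and <x, adjoint(y)>."""
    lhs = dot(forward(x), y)
    rhs = dot(x, adjoint(y))
    scale = max(abs(lhs), abs(rhs))
    # Both inner products being zero counts as a perfect match.
    return 0.0 if scale == 0.0 else abs(lhs - rhs) / scale


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
y = [2.0, 0.0, 1.0]
residual = adjointness_residual(lambda v: matvec(A, v), lambda v: rmatvec(A, v), x, y)
```

With an exact transpose the residual is zero up to rounding; for a GPU projector pair, a small nonzero value reflects interpolation and floating-point differences between the forward and back projection.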
The CUDA build was compile-checked with nvcc (CUDA 13.3, IS_FOR_PYTIGRE, BUILD_WITH_HIP off) -- all changed sources compile; it was not linked or run, as no NVIDIA GPU was available. Other AMD architectures are buildable via HIP_ARCH but were not exercised here.
Member
Hi @jeffdaily, spectacular PR, thanks a ton! I don't have access to an AMD GPU at the moment, but I will try to get one ASAP and test this on my side to understand all the changes better.