Skip to content

Support HIP GiMMiK vector-width metadata#567

Open
tomjen12 wants to merge 1 commit into
PyFR:developfrom
tomjen12:hip-gimmik-width-support
Open

Support HIP GiMMiK vector-width metadata#567
tomjen12 wants to merge 1 commit into
PyFR:developfrom
tomjen12:hip-gimmik-width-support

Conversation

@tomjen12

Copy link
Copy Markdown

Summary

Adds support for HIP GiMMiK kernels with vector-width metadata.
When GiMMiK reports meta['width'], PyFR now uses ceil(n / width) to compute the launch grid. This supports the width variants added in PyFR/GiMMiK#19.

@tomjen12

Copy link
Copy Markdown
Author

Hi @FreddieWitherden , could you please review this PR when you have a chance?
I also noticed I cannot request reviewers from the sidebar. Is reviewer assignment limited to repository members?

@FreddieWitherden FreddieWitherden self-assigned this Jun 18, 2026
@FreddieWitherden

FreddieWitherden commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Overall looks good. Figuring out the best way for GiMMiK to expose to callers what grid/block sizes its kernels want as a function of n has always been a fiddly endeavour. I think the addition of width here is sensible. Previously we've done this by having HIP use the compile time n kernel variants since then GiMMiK can just provide the grid and block sizes up-front. However, this is bad for JIT times and did not appear to provide much of a speedup, hence why currently PyFR always asks GiMMiK for dynamic n kernels. We can change this though if fixed n would allow for appreciably better performance (and this is indeed what CUDA does).

@tomjen12

Copy link
Copy Markdown
Author

Hi @FreddieWitherden ,

I noticed that the 7 cases currently excluded by the PyFR-side GiMMiK suitability check (nuq > 128 and density > 0.15) are the p4 tet m132/m460 cases and all p5 tet cases. When I relax this condition locally and allow GiMMiK to be considered, 4 of these 7 cases perform better with GiMMiK than rocBLAS. Do you know the original motivation for this threshold, and whether it was intended to apply equally to the HIP path?

if nuq > 128 and nnz / arr.size > 0.15:

@FreddieWitherden

Copy link
Copy Markdown
Contributor

Hi @FreddieWitherden ,

I noticed that the 7 cases currently excluded by the PyFR-side GiMMiK suitability check (nuq > 128 and density > 0.15) are the p4 tet m132/m460 cases and all p5 tet cases. When I relax this condition locally and allow GiMMiK to be considered, 4 of these 7 cases perform better with GiMMiK than rocBLAS. Do you know the original motivation for this threshold, and whether it was intended to apply equally to the HIP path?

if nuq > 128 and nnz / arr.size > 0.15:

Originally it was put in place for keeping compilation time down. Historically, for dense operations BLAS libraries have done a better job than GiMMiK and the code sizes for these kernels have resulted in multi-minute compilation times (for kernels which will be rejected by the auto-tuning). The thresholds themselves have not been tuned in a very long time and may not be appropriate for current hardware.

The tet cases, with the exception of M6, are almost entirely dense and so it feels like a BLAS kernel should do better. With that said, recent experience on NVIDIA hardware has suggested that getting dense capabilities into GiMMiK (and being able to specialize on our exact sizes) is worthwhile and so are open to purely dense kernels.

@tomjen12

Copy link
Copy Markdown
Author

Thanks, that makes sense. I will measure the compile/autotune time for the 7 currently excluded tet cases, along with the GiMMiK vs rocBLAS performance, before deciding whether relaxing the threshold is worthwhile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants