Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19
Add tuned HIP GiMMiK preload-C and width variants with non-temporal loads and stores#19tomjen12 wants to merge 14 commits into
Conversation
|
Hi @FreddieWitherden , could you please review this PR when you have a chance? |
| if A[j, kx] != 0: | ||
| nzixs.append((l_idx, kx)) | ||
|
|
||
| has_dotp = any(A[j, kx] != 0 for kx in range(k)) |
There was a problem hiding this comment.
As A is a numpy array this can likely be simplified as A[j].any() (I think!). Same for the above for where we can eliminate the explicit Python loop with something built upon A[j, kbx].
|
|
||
| % if width == 2: | ||
| static inline __device__ ${dtype} | ||
| gimmik_vmul(${dtype[:-1]} a, ${dtype} b) |
There was a problem hiding this comment.
Might be worth moving these up into a shared file which other kernels can include via Mako. Keeps all of the vector stuff in one place.
|
Overall looks good. Few quick things:
|
|
Couple of other notes (mostly for me):
|
|
Thanks for the review and advice.
This enables the tuned variants for validated gfx90a and gfx942 wave64 targets while avoiding other architectures for now.
|
|
Thanks! Going through the PR there appear to be three main improvements.
All in, if I've understood everything correctly I do not think we need any new templates, just extra parameterization on the templates we have. This tends to make things more manageable and easier to understand so it would be preferable. Again, it is possible I've missed something about the new kernels which really does require a different overall structure, so let me know there. |
|
Have just run the PR through my benchmark suite (p = 2 to 5, double precision, ~4 GiB of data so we always asymptote) and find the overall performance to be about the same (geomean ratio PR/base: 1.003x). Did your benchmarking also include the rocBLAS kernels? It could be that the wins are coming from kernels where rocBLAS does better than either the new or old GiMMiK kernels. |
|
Thanks for checking this. Could you share a bit more detail about your benchmark setup?
base_arch = gcn_arch.split(':', 1)[0] if gcn_arch else None
if base_arch not in {'gfx90a', 'gfx942'} or warp_size != 64:
returnCould you confirm whether your test platform reports gfx90a or gfx942 with warp_size == 64? If not, the new tuned variants may not be emitted. As a quick check, could you also try temporarily disabling this arch gate and rerun the benchmark? |
|
Regarding the template structure, I will also take another look and see how much can be merged back into the existing HIP templates. |
It was a caching issue on my end. I can now reproduce your results on an MI300X w/ROCm 7.14. |
Thank you for investigating. Assuming non-temporal loads/stores are widely available (so compiles for older CDNA and RDNA) I think the best course of action is to enable them everywhere all the time for HIP. This simplifies the code and puts us at parity with CUDA where we also ask for quasi-non temporal stores. In terms of preloading I agree with making this an option in the auto-tuner. Extra kernel variations are not as big of a deal for HIP since the kernels are generic on Only other thing to check is that we yield kernels in a sensible order. To keep start-up times down PyFR can be instructed to only consider the first (The long term goal is to make things adaptive; PyFR when it compiles a kernel passes information about the register use and spillage back to GiMMiK. At the moment this is discarded, but could be useful in the future for things like "don't bother with a width = 2 kernel if width = 1 spills".) Finally, while it is not a priority at the moment, can you check there are no major regressions single precision. |
|
Thanks. I've removed the I also checked single precision and do not see any major regressions. For the 53 GiMMiK-accessible cases tested, the geometric mean is about a 4% improvement, with the worst case at 98% of the previous baseline, which is likely within run-to-run noise. For the kernel yield order, I agree this is worth looking at. My colleague is planning to push a follow-up commit with the non-temporal B optimization. Once that lands, I can use the updated tuning data to pick a better order. |
On memory-bound operators the B matrix is read once from HBM and reused
only within a work-group via LDS -- it is never re-read across blocks.
A normal global load still allocates B's line in L2, which is pure
overhead: the line is never reused, it evicts genuinely-reusable data,
and it adds cache-allocate/eviction traffic. This is the read-side mirror
of the non-temporal C store we already use.
NTB loads B with a non-temporal load (load_b -> __builtin_nontemporal
path) so B bypasses L2 instead of polluting it. It moves the same number
of bytes but keeps the cache clean, raising effective bandwidth.
Implemented as a flag on the existing templates rather than new files:
- base.mako: add a load_b() wrapper (non-temporal B load).
- bstream-msplit{,-preload-c}.mako: gate the B read behind an `ntload`
flag (context.get('ntload', False)); renders byte-identically to the
plain kernel when the flag is absent.
- hip.py: emit `*-ntb` variants by passing ntload=True inside the
existing width loop, so NTB combines with width (w1/w2) automatically.
Backward-compatible (plain variants unchanged) and CDNA-gated like the
other tuned variants. On MI300X (gfx942) NTB passes the accuracy check
(~1e-15) and wins the autotune in the majority of memory-bound cases
(~+4.5% bandwidth on those), being chosen over the plain bstream-msplit.
| @@ -0,0 +1,104 @@ | |||
| <%inherit file='base'/> | |||
There was a problem hiding this comment.
Can we merge this into the msplit kernel which preload as an option using % if/else as appropriate to switch between the two?
There was a problem hiding this comment.
Done. I merged the preload-C path into the existing msplit template behind a preload option, and removed the separate preload-C template.
| { return 0; } | ||
| % endif | ||
|
|
||
| % if width == 1: |
There was a problem hiding this comment.
Can we define overloads like the CUDA backend does:
https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/base.mako#L13
Just keeps the code a little cleaner and more consistent.
There was a problem hiding this comment.
Done. I updated the HIP base template to use vector operator overloads in the same style as CUDA. I omitted operator+= since HIP already provides it for vector types and defining it here caused an overload ambiguity.
Reduce the HIP tuned kernel search space from 28 variants to 12 and order the remaining variants to try common winners earlier.
|
I've pushed another round of cleanup based on the review feedback:
|


Summary
Adds tuned HIP GiMMiK preload and vector-width variants for PyFR-style matrix multiplication cases.
The new variants include C-preload kernels for cstream/bstream and aligned width-2 preload variants. Existing and new HIP templates now use shared non-temporal load/store helpers for C accesses.
Results
On MI300X PyFR GiMMiK matmul benchmarks across 53 double-precision p2-p5 cases, bandwidth efficiency versus a 4.4 TB/s target improved from:
Companion PyFR PR: PyFR/PyFR#567
Test plan