I suggest adding a Vulkan compute backend for Windows/Linux GPU inference. Vulkan has proven to be a reliable (just works™) GPU compute API that is often already installed, so it adds almost no dependencies. It is also cross-platform and supports most GPUs, even older ones.
It does tend to perform slightly worse than CUDA/ROCm/Metal in some respects, but I think it is still worth implementing, especially considering how much of a speedup it gives over the native BLAS backend.
Llama.cpp has had great success with it so far.