cuda: generalize router-select for arbitrary expert count (fixes Pro on CUDA, #427)#435
Draft
newjordan wants to merge 1 commit into
Draft
cuda: generalize router-select for arbitrary expert count (fixes Pro on CUDA, #427)#435newjordan wants to merge 1 commit into
newjordan wants to merge 1 commit into
Conversation
The CUDA routed-expert top-k select was hardcoded to the Flash config: n_expert == 256, n_expert_used == 6, routed weight scale == 1.5. Both the batch (prefill) and single-token (decode) dispatchers rejected anything else with a silent `return 0`. DeepSeek-V4 PRO routes 384 experts with a weight scale of 2.5, so prefill failed immediately at the guard, surfacing only as `cuda prefill failed`. - Thread n_expert and the model's expert_weight_scale into router_select_kernel, router_select_parallel_kernel and router_select_warp_topk_kernel; replace the literal 256 (strides, loops, bounds, shared arrays, parallel block width) with n_expert and the literal 1.5 with the scale. - warp-topk: each of the 32 lanes now owns n_expert/32 experts (8 for 256, 12 for 384); per-lane register arrays and shared scratch capped at 512 experts. - Both dispatch guards accept any n_expert that is a non-zero multiple of 32 up to 512 (still require n_expert_used == 6); the buffer-size checks and the router-bias range now use n_expert. Flash (256/6/1.5) is unchanged: per_lane stays 8 and the arithmetic is identical, so output is bit-for-bit the same. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
|
Hey antirez, if you have any issues, or want a different format just let me know. Cheers and TY so much for DS4 work! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On the CUDA backend, DeepSeek-V4 PRO (
DeepSeek-V4-Pro-IQ2XXS-...-Instruct-imatrix.gguf, 384 routed experts) crashes on the very first prompt:This is #427 (DGX Spark GB10, every
--ssd-streaming/ ctx / prefill-chunk combination). Flash runs fine; PRO never starts.Root cause
The CUDA routed-expert top-k select is hardcoded to the Flash configuration. Both dispatchers —
ds4_gpu_router_select_batch_tensor(prefill) andds4_gpu_router_select_tensor(decode) — open with:PRO is
n_expert = 384andexpert_weight_scale = 2.5, so the guard returns0with no CUDA error;metal_graph_encode_layer_ffn_batchthen only reports the genericffn batch encode failed. The three select kernels also bake in 256 (logits + t*256,__shared__ float sprob[256], the warp-topk's 32 lanes × 8 experts,<<<n_tokens, 256>>>) and the1.5scale.Fix
Generalize the router-select to any
n_expertthat is a non-zero multiple of 32 up to 512 (still requiresn_expert_used == 6), using the model's routed weight scale:n_expertandexpert_weight_scaleintorouter_select_kernel,router_select_parallel_kernel, androuter_select_warp_topk_kernel256(strides, loops, bounds, shared scratch, parallel block width) withn_expert, and the literal1.5with the scalen_expert/32experts (8 for 256, 12 for 384); per-lane register arrays and shared scratch are capped at 512 expertsn_expertFlash (256/6/1.5) is unchanged by construction:
per_lanestays 8 and the arithmetic is identical, so output is bit-for-bit the same.One file,
ds4_cuda.cu, +64/−48.Validation
Machine: DGX Spark GB10 (sm_121), CUDA 13.0 toolkit + driver,
--ssd-streaming.ds4_test --logprob-vectorsunchanged. (byte-identical by construction; formal vector run in progress, will post the result)AI usage disclosure
YES — AI assisted with the diagnosis, the diff, and PR preparation; the contributor reviewed the code, the validation, and the submitted content.
Fixes #427.