BUG REPORT
Title:
cuda prefill failed / gpu layer 0 ffn batch encode failed on DGX Spark with
DeepSeek-V4-PRO Q2-imatrix + SSD streaming
Hardware:
- DGX Spark (GB10, SM 12.1, 128 GB unified memory)
- Samsung PM9E1 Gen5 SSD
- Fresh install < 48 hours old
- CUDA backend (make cuda-spark)
Model:
DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf (433 GB)
What fails:
Both the CLI and ds4-server crash immediately on the first prompt with:
ds4: CUDA loading model tensors into device cache: 2.04 GiB
ds4: gpu layer 0 ffn batch encode failed
ds4: prompt processing failed: cuda prefill failed
Commands tried (all failed):
CLI:
./ds4 --cuda --model gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf \
  --ssd-streaming
./ds4 --cuda --model ... --ssd-streaming --ctx 32768
./ds4 --cuda --model ... --ssd-streaming --ctx 65536 --prefill-chunk 2048
./ds4 --cuda --model ... --ssd-streaming --ssd-streaming-cold --ctx 32768 --prefill-chunk 1024
./ds4 --cuda --model ... --ssd-streaming --ctx 200000 --prefill-chunk 4096
Server:
./ds4-server --cuda --model ... --ssd-streaming --ctx 1048576 --host 0.0.0.0 --port 8081
./ds4-server --cuda --model ... --ssd-streaming --ctx 200000 --host 0.0.0.0 --port 8081
Also tried:
- ctx from 32768 to 1048576
- prefill-chunk values: 512, 1024, 2048, 4096
- --ssd-streaming-cold
- Fresh rebuild: git pull + make clean + make cuda-spark multiple times
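For anyone trying to reproduce this, the ctx and prefill-chunk values listed above could be regenerated with a small dry-run script. This is only a sketch: it assumes a full cross product of the values (not every pair was necessarily run), and the `DS4` and `MODEL` variables and the `sweep_commands` function name are illustrative. It only prints the commands, using flags that already appear in this report, and does not run ds4.

```shell
#!/bin/sh
# Dry-run sweep: print one ds4 command per (ctx, prefill-chunk) pair.
# DS4 and MODEL are placeholders; override them from the environment.
DS4=${DS4:-./ds4}
MODEL=${MODEL:-gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf}

sweep_commands() {
    # ctx and prefill-chunk values are the ones listed in this report.
    for ctx in 32768 65536 200000 1048576; do
        for chunk in 512 1024 2048 4096; do
            echo "$DS4 --cuda --model $MODEL --ssd-streaming --ctx $ctx --prefill-chunk $chunk"
        done
    done
}

sweep_commands
```

Each printed line can be pasted into a terminal as-is; add --ssd-streaming-cold by hand for the cold-start variant.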
Note:
- SSD streaming was enabled (new feature added <72h ago)
- No MTP draft used
- Only PRO model was tested
Expected:
PRO Q2 should load and run at low speed with SSD streaming (as announced for DGX Spark).
Actual:
Immediate prefill crash on layer 0 every time.
Grok suggests investigating this approach (I am AFK, so I can't try it):
diff --git a/ds4_cuda.cu b/ds4_cuda.cu
index 8e4f2c1..c3a9d2e 100644
--- a/ds4_cuda.cu
+++ b/ds4_cuda.cu
@@ -XXXX,XX +XXXX,XX @@ extern "C" int ds4_gpu_stream_expert_cache_prepare_selected_batch(
     const uint64_t down_expert_bytes = table->down_expert_bytes;
-    return cuda_stream_selected_cache_begin_compact_load(
-        model_map, model_size, layer,
-        compact_ids.data(), slot_ids.data(),
-        n_total_expert, (uint32_t)compact_ids.size(), slot_count,
-        gate_offset, up_offset, down_offset,
-        gate_expert_bytes, down_expert_bytes,
+
+    // Layer 0 during first prefill with SSD streaming is often a cold start.
+    // Make it non-strict so we can fall back instead of hard-failing the whole prefill.
+    const int strict_failure = (layer == 0 ? 0 : 1);
+    return cuda_stream_selected_cache_begin_compact_load(
+        model_map, model_size, layer,
+        compact_ids.data(), slot_ids.data(),
+        n_total_expert, (uint32_t)compact_ids.size(), slot_count,
+        gate_offset, up_offset, down_offset,
+        gate_expert_bytes, down_expert_bytes,