Immediate prefill crash on layer 0 every time. DGX Spark on fully up-to-date DGX OS and firmware.

BUG REPORT
Title:
cuda prefill failed / gpu layer 0 ffn batch encode failed on DGX Spark with
DeepSeek-V4-PRO Q2-imatrix + SSD streaming

Hardware:
- DGX Spark (GB10, SM 12.1, 128 GB unified memory)
- Samsung PM9E1 Gen5 SSD
- Fresh install < 48 hours old
- CUDA backend (make cuda-spark)

Model:
DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf (433 GB)
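For context on why SSD streaming is required here: the model file is larger than the machine's unified memory, so only part of it can be resident at once. A rough sketch of that arithmetic follows. The function name is hypothetical, and it assumes both sizes are in the same unit and ignores the KV cache, the OS, and other overhead.

```cpp
// Hypothetical helper, not part of ds4: upper bound on the fraction of the
// model that could be resident in memory at one time.
// Assumes model_gb and mem_gb use the same unit; overheads are ignored.
inline double max_resident_fraction(double model_gb, double mem_gb) {
    if (model_gb <= 0.0) return 0.0;      // guard against a bad input
    const double f = mem_gb / model_gb;
    return f > 1.0 ? 1.0 : f;             // the whole model fits: cap at 100%
}
```

With the figures from this report (433 GB model, 128 GB unified memory), `max_resident_fraction(433.0, 128.0)` comes out at a little under 0.30, so more than 70% of the weights have to be streamed from the SSD.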

What fails:
Both CLI and ds4-server crash immediately on first prompt with:

    ds4: CUDA loading model tensors into device cache: 2.04 GiB
    ds4: gpu layer 0 ffn batch encode failed
    ds4: prompt processing failed: cuda prefill failed

Commands tried (all failed):

CLI:
./ds4 --cuda --model gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf \
  --ssd-streaming

./ds4 --cuda --model ... --ssd-streaming --ctx 32768
./ds4 --cuda --model ... --ssd-streaming --ctx 65536 --prefill-chunk 2048
./ds4 --cuda --model ... --ssd-streaming --ssd-streaming-cold --ctx 32768 --prefill-chunk 1024
./ds4 --cuda --model ... --ssd-streaming --ctx 200000 --prefill-chunk 4096

Server:
./ds4-server --cuda --model ... --ssd-streaming --ctx 1048576 --host 0.0.0.0 --port 8081
./ds4-server --cuda --model ... --ssd-streaming --ctx 200000 --host 0.0.0.0 --port 8081

Also tried:
- ctx from 32768 to 1048576
- prefill-chunk values: 512, 1024, 2048, 4096
- --ssd-streaming-cold
- Fresh rebuild: git pull + make clean + make cuda-spark multiple times

Note:
- SSD streaming was enabled (new feature added <72h ago)
- No MTP draft used
- Only PRO model was tested

Expected:
PRO Q2 should load and run at low speed with SSD streaming (as announced for DGX Spark).

Actual:
Immediate prefill crash on layer 0 every time.

Grok suggests investigating this approach (I am AFK, so I can't try it):
diff --git a/ds4_cuda.cu b/ds4_cuda.cu
index 8e4f2c1..c3a9d2e 100644
--- a/ds4_cuda.cu
+++ b/ds4_cuda.cu
@@ -XXXX,XX +XXXX,XX @@ extern "C" int ds4_gpu_stream_expert_cache_prepare_selected_batch(
         const uint64_t down_expert_bytes = table->down_expert_bytes;

-        return cuda_stream_selected_cache_begin_compact_load(
-                model_map, model_size, layer,
-                compact_ids.data(), slot_ids.data(),
-                n_total_expert, (uint32_t)compact_ids.size(), slot_count,
-                gate_offset, up_offset, down_offset,
-                gate_expert_bytes, down_expert_bytes,
-                1,   // strict_failure
-                0);
+        // Layer 0 during first prefill with SSD streaming is often a cold start.
+        // Make it non-strict so we can fall back instead of hard-failing the whole prefill.
+        const int strict_failure = (layer == 0 ? 0 : 1);
+
+        return cuda_stream_selected_cache_begin_compact_load(
+                model_map, model_size, layer,
+                compact_ids.data(), slot_ids.data(),
+                n_total_expert, (uint32_t)compact_ids.size(), slot_count,
+                gate_offset, up_offset, down_offset,
+                gate_expert_bytes, down_expert_bytes,
+                strict_failure,
+                0);
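To make the intent of the suggested change easier to discuss, here is a minimal, self-contained sketch of the strict versus non-strict policy the diff describes. Everything in it is hypothetical: `LoadResult`, `load_experts`, and `strict_for_layer` are illustrative names, not functions from `ds4_cuda.cu`, and it simply assumes that a non-strict load has some slower fallback path available. Whether `cuda_stream_selected_cache_begin_compact_load` really behaves this way when `strict_failure` is 0 has not been checked.

```cpp
// Hypothetical sketch only: none of these names exist in ds4_cuda.cu.
#include <cstdint>
#include <functional>

enum class LoadResult { Ok, FellBack, Failed };

// try_fast: stand-in for the streaming/compact expert load; true on success.
// try_slow: stand-in for an assumed slower fallback path.
// strict:   when true, a fast-path failure is reported immediately.
inline LoadResult load_experts(bool strict,
                               const std::function<bool()>& try_fast,
                               const std::function<bool()>& try_slow) {
    if (try_fast()) return LoadResult::Ok;
    if (strict) return LoadResult::Failed;   // hard failure, no fallback tried
    return try_slow() ? LoadResult::FellBack : LoadResult::Failed;
}

// Mirrors the policy in the diff above: only layer 0 is non-strict.
inline bool strict_for_layer(uint32_t layer) { return layer != 0; }
```

Under these assumptions, `load_experts(strict_for_layer(0), fast, slow)` returns `FellBack` when the fast path fails and the fallback succeeds, whereas any other layer returns `Failed`. If the layer 0 failure has a different root cause, relaxing strictness may only move the crash elsewhere.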



Immediate prefill crash on layer 0 every time. dgx spark on fully up to date DGX OS and firmwares. #427
