Immediate prefill crash on layer 0 every time on DGX Spark with fully up-to-date DGX OS and firmware. #427

@bioseed

BUG REPORT
Title:
cuda prefill failed / gpu layer 0 ffn batch encode failed on DGX Spark with DeepSeek-V4-PRO Q2-imatrix + SSD streaming

Hardware:

  • DGX Spark (GB10, SM 12.1, 128 GB unified memory)
  • Samsung PM9E1 Gen5 SSD
  • Fresh install < 48 hours old
  • CUDA backend (make cuda-spark)

Model:
DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf (433 GB)

What fails:
Both CLI and ds4-server crash immediately on first prompt with:

ds4: CUDA loading model tensors into device cache: 2.04 GiB
ds4: gpu layer 0 ffn batch encode failed
ds4: prompt processing failed: cuda prefill failed

Commands tried (all failed):

CLI:
./ds4 --cuda --model gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf --ssd-streaming

./ds4 --cuda --model ... --ssd-streaming --ctx 32768
./ds4 --cuda --model ... --ssd-streaming --ctx 65536 --prefill-chunk 2048
./ds4 --cuda --model ... --ssd-streaming --ssd-streaming-cold --ctx 32768 --prefill-chunk 1024
./ds4 --cuda --model ... --ssd-streaming --ctx 200000 --prefill-chunk 4096

Server:
./ds4-server --cuda --model ... --ssd-streaming --ctx 1048576 --host 0.0.0.0 --port 8081
./ds4-server --cuda --model ... --ssd-streaming --ctx 200000 --host 0.0.0.0 --port 8081

Also tried:

  • ctx from 32768 to 1048576
  • prefill-chunk values: 512, 1024, 2048, 4096
  • --ssd-streaming-cold
  • Fresh rebuild: git pull + make clean + make cuda-spark multiple times
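
For anyone trying to reproduce this, the combinations above can be swept with a small script that saves one log per run. This is only a sketch: DS4_BIN, MODEL, DRY_RUN and LOG_DIR are hypothetical variable names introduced here, the flags are the ones from the commands in this report, and feeding the first prompt over stdin is an assumption about how the CLI takes input. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: sweep the --ctx / --prefill-chunk values listed in this report.
# DS4_BIN, MODEL, DRY_RUN and LOG_DIR are hypothetical names, not project settings.
DS4_BIN="${DS4_BIN:-./ds4}"
MODEL="${MODEL:-gguf/DeepSeek-V4-Pro-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-Instruct-imatrix.gguf}"
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands
LOG_DIR="${LOG_DIR:-ds4-repro-logs}"

run_one() {
  ctx="$1"; chunk="$2"
  # Build the argument list with the same flags used in the report.
  set -- "$DS4_BIN" --cuda --model "$MODEL" --ssd-streaming \
         --ctx "$ctx" --prefill-chunk "$chunk"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
    return 0
  fi
  mkdir -p "$LOG_DIR"
  log="$LOG_DIR/ctx${ctx}_chunk${chunk}.log"
  # ASSUMPTION: the CLI reads its first prompt from stdin.
  printf 'hello\n' | "$@" >"$log" 2>&1
  echo "ctx=$ctx chunk=$chunk exit=$? log=$log"
}

run_sweep() {
  for c in 32768 65536 200000 1048576; do
    for k in 512 1024 2048 4096; do
      run_one "$c" "$k"
    done
  done
}

run_sweep
```

Setting DRY_RUN=0 runs the sweep for real; the per-run logs can then be attached here so the failing configuration and exit code are visible side by side.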

Note:

  • SSD streaming was enabled (new feature added <72h ago)
  • No MTP draft used
  • Only PRO model was tested

Expected:
PRO Q2 should load and run at low speed with SSD streaming (as announced for DGX Spark).

Actual:
Immediate prefill crash on layer 0 every time.

Grok suggests investigating this approach (I'm AFK, so I can't try it; untested):
diff --git a/ds4_cuda.cu b/ds4_cuda.cu
index 8e4f2c1..c3a9d2e 100644
--- a/ds4_cuda.cu
+++ b/ds4_cuda.cu
@@ -XXXX,XX +XXXX,XX @@ extern "C" int ds4_gpu_stream_expert_cache_prepare_selected_batch(
     const uint64_t down_expert_bytes = table->down_expert_bytes;
 
-    return cuda_stream_selected_cache_begin_compact_load(
-            model_map, model_size, layer,
-            compact_ids.data(), slot_ids.data(),
-            n_total_expert, (uint32_t)compact_ids.size(), slot_count,
-            gate_offset, up_offset, down_offset,
-            gate_expert_bytes, down_expert_bytes,
-            1,   // strict_failure
-            0);
+    // Layer 0 during first prefill with SSD streaming is often a cold start.
+    // Make it non-strict so we can fall back instead of hard-failing the whole prefill.
+    const int strict_failure = (layer == 0 ? 0 : 1);
+    return cuda_stream_selected_cache_begin_compact_load(
+            model_map, model_size, layer,
+            compact_ids.data(), slot_ids.data(),
+            n_total_expert, (uint32_t)compact_ids.size(), slot_count,
+            gate_offset, up_offset, down_offset,
+            gate_expert_bytes, down_expert_bytes,
+            strict_failure,
+            0);
