[WIP] DeepSeek V4 by am17an · Pull Request #24162 · ggml-org/llama.cpp

am17an · 2026-06-05T07:11:47Z

Overview

Still a WIP; there is a lot of work to do before this is usable. At the current stage it passes long-context and tool-calling tests but is quite slow. All the complexity is in the new llama-kv-cache-dsv4 and the deepseekv4 model class; there are no new ggml ops at the moment.

To run the flash version you need at least 100 GB of VRAM (using the q2_k quantized GGUF here); the full flash version needs 160+ GB. Here's how I was running the server on a DGX Spark:

llama-server -m dsv4-q2_k.gguf -fa 0 -c 32768 --jinja --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --fit off

Note that it is extremely slow at the moment (~4-5 tokens/sec).

Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, paired with both codex and claude.

fairydreaming · 2026-06-05T11:19:19Z

@am17an I wonder what's the purpose of f32 casts and conts after mulmats here?

diff --git a/src/models/deepseek-v4.cpp b/src/models/deepseek-v4.cpp
index da3536f37..c8e17ef4e 100644
--- a/src/models/deepseek-v4.cpp
+++ b/src/models/deepseek-v4.cpp
@@ -828,11 +828,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
     ggml_tensor * hca_state_score = nullptr;
     if (ratio == DSV4_HCA_RATIO && inp_dsv4->get_hca().state_idxs) {
         hca_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        hca_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_kv, GGML_TYPE_F32));
         cb(hca_state_kv, "hca_state_kv", il);
 
         hca_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        hca_state_score = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_score, GGML_TYPE_F32));
         cb(hca_state_score, "hca_state_score", il);
 
         ggml_tensor * ape = layer.attn_comp_ape;
@@ -848,11 +846,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
 
     if (ratio == DSV4_CSA_RATIO && inp_dsv4->get_csa().state_idxs) {
         ggml_tensor * csa_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        csa_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_kv, GGML_TYPE_F32));
         cb(csa_state_kv, "csa_state_kv", il);
 
         ggml_tensor * csa_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        csa_state_score = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_score, GGML_TYPE_F32));
         cb(csa_state_score, "csa_state_score", il);
 
         ggml_tensor * csa_ape = layer.attn_comp_ape;
@@ -902,11 +898,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
         ggml_build_forward_expand(gf, csa_state_score);
 
         ggml_tensor * lid_state_kv = build_lora_mm(layer.indexer_comp_wkv, cur);
-        lid_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_kv, GGML_TYPE_F32));
         cb(lid_state_kv, "lid_state_kv", il);
 
         ggml_tensor * lid_state_score = build_lora_mm(layer.indexer_comp_wgate, cur);
-        lid_state_score = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_score, GGML_TYPE_F32));
         cb(lid_state_score, "lid_state_score", il);
 
         ggml_tensor * lid_ape = layer.indexer_comp_ape;

Removed them and got the same logits.
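A check like "same logits" can be made mechanical by comparing the two logit dumps element by element. The following is only a minimal sketch, assuming the logits from each run have already been loaded into `std::vector<float>`; `max_abs_diff` and `same_logits` are hypothetical helper names and are not part of llama.cpp or ggml.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of llama.cpp): largest absolute element-wise
// difference between two logit vectors of equal length.
static float max_abs_diff(const std::vector<float> & a, const std::vector<float> & b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("logit vectors must have the same length");
    }
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}

// Hypothetical helper: true when every logit matches within `tol`.
// The default tol of 0.0f requires the values to compare exactly equal.
static bool same_logits(const std::vector<float> & a, const std::vector<float> & b, float tol = 0.0f) {
    return max_abs_diff(a, b) <= tol;
}
```

Calling `same_logits(before, after)` with the default tolerance asks for exact equality, which is what one would expect if the removed cast and cont nodes were no-ops; a small nonzero `tol` is an option when comparing across backends where rounding may differ.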

am17an · 2026-06-05T12:34:26Z

@fairydreaming it's an artifact of debugging, you can push your changes to this branch (I added you as collaborator)

am17an added 10 commits June 4, 2026 16:23

convert: add dsv4 conversion

da6dc9d

add basic setup

df5506b

add llm_graph_input_dsv4

441b736

add save-load state

2170238

add sinkhorn eps - correction by @fairydreaming

5534b47

add rope fix

4e36bd1

cleanup dead code

20616c1

fix bugs

f9c9734

support pro model: added by @fairydreaming

22676c1

remove redundant V cache

9e00db6

github-actions bot added the model, python, and ggml labels Jun 5, 2026

pwilkin and others added 3 commits June 5, 2026 14:34

Chat template

94e724b

remove debugging leftovers

703e49c

Add mechanism for inlining templates based on architecture

41c9b16

github-actions bot added the script and testing labels Jun 5, 2026

am17an wants to merge 13 commits into ggml-org:master from am17an:dsv4

4 participants
