[WIP] DeepSeek V4 #24162

Draft
am17an wants to merge 13 commits into ggml-org:master from am17an:dsv4

@am17an
Contributor

@am17an am17an commented Jun 5, 2026

Overview

Still a WIP, with lots of work to do before this is usable. At the current stage it passes long-context and tool-calling tests but is quite slow. All the complexity is in the new llama-kv-cache-dsv4 and the deepseekv4 model class; there are no new ggml ops at the moment.

To run the flash version you need at least 100 GB of VRAM (using the q2_k quantized GGUF here); the full flash version needs 160+ GB. Here's how I was running the server on a DGX Spark:

llama-server -m dsv4-q2_k.gguf -fa 0 -c 32768 --jinja --chat-template-file models/templates/deepseek-ai-DeepSeek-V4.jinja --fit off

Note that it is extremely slow at the moment (~4-5 tokens/sec).

Thanks to @pwilkin for the correct chat template + debugging help
Thanks to @fairydreaming for his help in debugging + contributing fixes

Additional information

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, paired with both codex and claude.

@github-actions github-actions Bot added the model (Model specific), python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 5, 2026
@fairydreaming
Collaborator

@am17an I wonder what's the purpose of f32 casts and conts after mulmats here?

diff --git a/src/models/deepseek-v4.cpp b/src/models/deepseek-v4.cpp
index da3536f37..c8e17ef4e 100644
--- a/src/models/deepseek-v4.cpp
+++ b/src/models/deepseek-v4.cpp
@@ -828,11 +828,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
     ggml_tensor * hca_state_score = nullptr;
     if (ratio == DSV4_HCA_RATIO && inp_dsv4->get_hca().state_idxs) {
         hca_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        hca_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_kv, GGML_TYPE_F32));
         cb(hca_state_kv, "hca_state_kv", il);
 
         hca_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        hca_state_score = ggml_cont(ctx0, ggml_cast(ctx0, hca_state_score, GGML_TYPE_F32));
         cb(hca_state_score, "hca_state_score", il);
 
         ggml_tensor * ape = layer.attn_comp_ape;
@@ -848,11 +846,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
 
     if (ratio == DSV4_CSA_RATIO && inp_dsv4->get_csa().state_idxs) {
         ggml_tensor * csa_state_kv = build_lora_mm(layer.attn_comp_wkv, cur);
-        csa_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_kv, GGML_TYPE_F32));
         cb(csa_state_kv, "csa_state_kv", il);
 
         ggml_tensor * csa_state_score = build_lora_mm(layer.attn_comp_wgate, cur);
-        csa_state_score = ggml_cont(ctx0, ggml_cast(ctx0, csa_state_score, GGML_TYPE_F32));
         cb(csa_state_score, "csa_state_score", il);
 
         ggml_tensor * csa_ape = layer.attn_comp_ape;
@@ -902,11 +898,9 @@ ggml_tensor * llama_model_deepseek_v4_flash::graph::build_attention(
         ggml_build_forward_expand(gf, csa_state_score);
 
         ggml_tensor * lid_state_kv = build_lora_mm(layer.indexer_comp_wkv, cur);
-        lid_state_kv = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_kv, GGML_TYPE_F32));
         cb(lid_state_kv, "lid_state_kv", il);
 
         ggml_tensor * lid_state_score = build_lora_mm(layer.indexer_comp_wgate, cur);
-        lid_state_score = ggml_cont(ctx0, ggml_cast(ctx0, lid_state_score, GGML_TYPE_F32));
         cb(lid_state_score, "lid_state_score", il);
 
         ggml_tensor * lid_ape = layer.indexer_comp_ape;

Removed them and got the same logits.
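
For anyone who wants to repeat the "same logits" check, here is a minimal sketch of one way to compare two logit vectors. It assumes the logits from each build have already been dumped into `std::vector<float>`; the helper name `max_abs_logit_diff` is hypothetical and not part of llama.cpp, and this is not necessarily how the comparison above was done.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of llama.cpp): largest absolute element-wise
// difference between two logit vectors.
// Returns +infinity on a size mismatch or if any difference is NaN, so that
// a shape error or a NaN can never be mistaken for a match.
static float max_abs_logit_diff(const std::vector<float> & a,
                                const std::vector<float> & b) {
    const float inf = std::numeric_limits<float>::infinity();
    if (a.size() != b.size()) {
        return inf;
    }
    float max_diff = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (std::isnan(d)) {
            return inf;
        }
        max_diff = std::max(max_diff, d);
    }
    return max_diff;
}
```

A result of exactly `0.0f` means the two runs are bitwise-equivalent in value for that token; a small non-zero result would call for a tolerance chosen by the person running the test.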

@am17an
Contributor Author

am17an commented Jun 5, 2026

@fairydreaming It's an artifact of debugging. You can push your changes to this branch (I added you as a collaborator).

@github-actions github-actions Bot added the script (Script related) and testing (Everything test related) labels Jun 5, 2026
Labels

ggml, model, python, script, testing

4 participants