[MODEL] support qwen3.5 series #19468
Conversation
@JJJYmmm no worries, glad if it helped even a little :)
ngxson
left a comment
I haven't reviewed the actual implementation, but I assume it's mostly qwen3 next + imrope (please correct if I'm wrong)
Don't hesitate to ping if you need help adding the vision support!
It definitely helped a lot 🙏
ggerganov
left a comment
The libllama changes are good. I will follow up with a PR to deduplicate and optimize the delta net stuff across the Qwen family.
```cpp
// if head keys and value keys are different, repeat to force tensors into matching shapes
if (num_k_heads != num_v_heads) {
    GGML_ASSERT(num_v_heads % num_k_heads == 0);
    int64_t repeat_factor = num_v_heads / num_k_heads;

    // repeat interleave: reshape to (repeat part, 1, remaining part), do repeat, then reshape back
    ggml_tensor * q_reshaped = ggml_reshape_3d(ctx0, q_conv, head_k_dim, 1, num_k_heads * n_seq_tokens * n_seqs);
    ggml_tensor * k_reshaped = ggml_reshape_3d(ctx0, k_conv, head_k_dim, 1, num_k_heads * n_seq_tokens * n_seqs);

    // repeat along the second dimension (the new dimension with size 1)
    ggml_tensor * q_repeated =
        ggml_repeat_4d(ctx0, q_reshaped, head_k_dim, repeat_factor, num_k_heads * n_seq_tokens * n_seqs, 1);
    ggml_tensor * k_repeated =
        ggml_repeat_4d(ctx0, k_reshaped, head_k_dim, repeat_factor, num_k_heads * n_seq_tokens * n_seqs, 1);

    // reshape back to merge the head and repeat dimensions:
    // from [head_k_dim, repeat_factor, num_k_heads * n_seq_tokens * n_seqs, 1]
    // to   [head_k_dim, num_k_heads * repeat_factor, n_seq_tokens, n_seqs]
    q_conv = ggml_reshape_4d(ctx0, q_repeated, head_k_dim, num_k_heads * repeat_factor, n_seq_tokens, n_seqs);
    k_conv = ggml_reshape_4d(ctx0, k_repeated, head_k_dim, num_k_heads * repeat_factor, n_seq_tokens, n_seqs);
}
```
Btw, this is an issue inherited from the Qwen3 Next implementation. Basically we are doing redundant repeats here because the V weights are not arranged correctly during conversion. Ideally this whole section should not be needed and we should only rely on broadcasts.
Basically, we want to broadcast Q and K into V. With the current order of the V heads, we need an interleaved broadcast. However the ggml binary ops perform tiled broadcast. Hence the need for this rearrangement.
If it is too difficult to figure out how to fix this for the new Qwen3.5 models, we'll have to do it in a follow-up PR. But it's going to be a breaking change. Or maybe we can try to gate it with some flag.
I would try to fix it in the qwen3.5 gguf conversion since it is not released.
Since it's a new arch we should not need a flag as such then.
Just reordered the related V weights to avoid the repeat_interleave 🫡 4248c93
Nice. Just make sure results are still as expected after this change.
From here, I think we will be able to remove the explicit repeats - this is something for the next PR. Thanks a lot!
Yes, I tested it with the new generated weights, and it works just as before. 🫡
ngxson
left a comment
The changes outside of the cgraph look good to me (sorry, I don't have too much time to look deeper, but I believe the reviews from @ggerganov and @CISC should already be enough)
Looking forward to the vision support PR then!
@ngxson This PR includes the vision part, and I tested it too! (qwen3.5 uses the same ViT as qwen3vl, and deepstack was removed temporarily) 🤗
@JJJYmmm ah yeah right, thanks for the clarification. Yes, I already acknowledged that the deepstack is removed. Looks good to me then, thanks again!
It seems to me that all review comments are resolved, and @JJJYmmm also confirmed that the model is still working fine with the last commit. So, I'll let @ggerganov decide when to merge this.
* support qwen3.5 series
* remove deepstack for now, and some code clean
* code clean
* add FULL_ATTENTION_INTERVAL metadata
* code clean
* reorder v heads for linear attention to avoid expensive interleaved repeat
This PR adds model support for the upcoming Qwen3.5 models, including both dense and MoE variants. It has been verified with preview checkpoints from the Qwen Team in both vision and pure text modes (with or without mmproj file).
As mentioned here, this PR was planned to be a follow-up PR after #19435 or #19456 by @pwilkin. However, it seems there is still a lot to do without the real checkpoints (such as vision support, vocab, weight mapping, and forward logic), so I'm sharing my implementation here for reference. 🫡
Some differences compared to #19435 or #19456:
@pwilkin Sorry for the work overlap, and I really appreciate your work on Qwen3Next! It has been a huge stepping stone for this PR.
Reference HF implementation: huggingface/transformers#43830