
[MODEL] support qwen3.5 series #19468

Merged
ggerganov merged 7 commits into ggml-org:master from JJJYmmm:add_qwen35
Feb 10, 2026

Conversation

@JJJYmmm
Contributor

@JJJYmmm JJJYmmm commented Feb 9, 2026

This PR adds model support for the upcoming Qwen3.5 models, including both dense and MoE variants. It has been verified with preview checkpoints from the Qwen Team in both vision and pure text modes (with or without mmproj file).

As mentioned here, this PR was planned to be a follow-up PR after #19435 or #19456 by @pwilkin. However, it seems there is still a lot to do without the real checkpoints (such as vision support, vocab, weight mapping, and forward logic), so I'm sharing my implementation here for reference. 🫡

Some differences compared to #19435 or #19456:

  • Qwen3.5 uses a different pre-tokenizer
  • Qwen3.5 uses partial IMRoPE
  • Qwen3.5 has fixed the qkvzba order (thanks to @ngxson in model: try to improve Qwen3 Next #18683). I saw it last month, and we decided to simplify it in the official weights! Now they are split into qkv, z, b, and a.
  • Qwen3.5 has dense/MoE variants, allowing us to simplify the forward logic (no need to check the mlp type of the current layer).

@pwilkin Sorry for the work overlap, and I really appreciate your work on Qwen3Next! It has been a huge stepping stone for this PR.

Reference HF implementation: huggingface/transformers#43830

@pwilkin
Collaborator

pwilkin commented Feb 9, 2026

@JJJYmmm no worries, glad if it helped even a little :)

Contributor

@ngxson ngxson left a comment


I haven't reviewed the actual implementation, but I assume it's mostly Qwen3 Next + IMRoPE (please correct me if I'm wrong).

Don't hesitate to ping if you need help adding the vision support!

@JJJYmmm
Contributor Author

JJJYmmm commented Feb 9, 2026

@JJJYmmm no worries, glad if it helped even a little :)

It definitely helped a lot 🙏

Member

@ggerganov ggerganov left a comment


The libllama changes are good. I will follow up with a PR to deduplicate and optimize the delta net stuff across the Qwen family.

Comment on lines +673 to +695

// if head keys and value keys are different, repeat to force tensors into matching shapes
if (num_k_heads != num_v_heads) {
GGML_ASSERT(num_v_heads % num_k_heads == 0);
int64_t repeat_factor = num_v_heads / num_k_heads;

// repeat interleave: reshape to (repeat part, 1, remaining part), do repeat, then reshape back
ggml_tensor * q_reshaped = ggml_reshape_3d(ctx0, q_conv, head_k_dim, 1, num_k_heads * n_seq_tokens * n_seqs);
ggml_tensor * k_reshaped = ggml_reshape_3d(ctx0, k_conv, head_k_dim, 1, num_k_heads * n_seq_tokens * n_seqs);

// Repeat along the third dimension (the new dimension with size 1)
ggml_tensor * q_repeated =
ggml_repeat_4d(ctx0, q_reshaped, head_k_dim, repeat_factor, num_k_heads * n_seq_tokens * n_seqs, 1);
ggml_tensor * k_repeated =
ggml_repeat_4d(ctx0, k_reshaped, head_k_dim, repeat_factor, num_k_heads * n_seq_tokens * n_seqs, 1);

// Reshape back to merge the repeat and head dimensions
// From [head_dim, repeat_factor, num_k_heads * n_seq_tokens * n_seqs, 1]
// back to [head_dim, num_k_heads * repeat_factor, n_seq_tokens, n_seqs]
q_conv = ggml_reshape_4d(ctx0, q_repeated, head_k_dim, num_k_heads * repeat_factor, n_seq_tokens, n_seqs);
k_conv = ggml_reshape_4d(ctx0, k_repeated, head_k_dim, num_k_heads * repeat_factor, n_seq_tokens, n_seqs);
}

Member

Btw, this is an issue inherited from the Qwen3 Next implementation. Basically we are doing redundant repeats here because the V weights are not arranged correctly during conversion. Ideally this whole section should not be needed and we should only rely on broadcasts.

Basically, we want to broadcast Q and K into V. With the current order of the V heads, we need an interleaved broadcast. However the ggml binary ops perform tiled broadcast. Hence the need for this rearrangement.
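The difference between the two broadcast patterns can be sketched outside of ggml. This is a small numpy illustration (not code from the PR; the sizes are made up for the example): a tiled repeat cycles through the whole block of K heads, while the current V head order needs each K head duplicated in place.

```python
import numpy as np

# Hypothetical toy sizes for illustration: 2 K heads, 4 V heads, head_dim 3.
head_dim, num_k_heads, num_v_heads = 3, 2, 4
repeat_factor = num_v_heads // num_k_heads  # 2

# One fake K vector per head, rows grouped per head.
k = np.arange(num_k_heads * head_dim).reshape(num_k_heads, head_dim)

# Tiled repeat (what ggml's binary-op broadcast gives):
# the whole block is repeated, so the head order becomes k0, k1, k0, k1.
tiled = np.tile(k, (repeat_factor, 1))

# Interleaved repeat (what the current V head order requires):
# each head is duplicated in place, so the order becomes k0, k0, k1, k1.
interleaved = np.repeat(k, repeat_factor, axis=0)

print(tiled[:, 0].tolist())        # [0, 3, 0, 3]
print(interleaved[:, 0].tolist())  # [0, 0, 3, 3]
```

The reshape/repeat/reshape dance in the quoted snippet is exactly the `np.repeat` pattern expressed with ggml's tiled-only repeat primitive.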

Member

If it is too difficult to figure out how to fix this for the new Qwen3.5 models, we'll have to do it in a follow-up PR. But it's going to be a breaking change. Or maybe we can try to gate it with some flag.

Contributor Author

I'll try to fix it in the Qwen3.5 GGUF conversion, since the model is not released yet.

Member

Since it's a new arch, we should not need a flag then.

Contributor Author

Just reordered the related V weights to avoid repeat_interleave 🫡 4248c93
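The idea behind a conversion-time fix like this can be sketched in numpy (a hypothetical illustration, not the commit's actual code; sizes and names are made up). Permuting the V head blocks once during conversion means the tiled K pattern (k0, k1, k0, k1) already lines up head-for-head, so no interleaved repeat is needed at inference time:

```python
import numpy as np

# Hypothetical toy sizes: 2 K heads, 4 V heads, head_dim 3.
head_dim, num_k_heads, num_v_heads = 3, 2, 4
repeat_factor = num_v_heads // num_k_heads  # 2

# Fake V weight, one row-block per V head: v0, v1, v2, v3.
# Original (interleaved) mapping: v0,v1 share k0 and v2,v3 share k1.
w_v = np.arange(num_v_heads * head_dim).reshape(num_v_heads, head_dim)

# Reorder heads to v0, v2, v1, v3 so that a plain tiled broadcast of K
# (k0, k1, k0, k1) pairs each K head with the right V head.
perm = np.arange(num_v_heads).reshape(num_k_heads, repeat_factor).T.reshape(-1)
w_v_reordered = w_v[perm]

print(perm.tolist())  # [0, 2, 1, 3]
```

Doing this once in the converter is free at runtime, which is why it only works cleanly here: the Qwen3.5 GGUFs are not released yet, so no existing files break.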

Member

Nice. Just make sure results are still as expected after this change.

From here, I think we will be able to remove the explicit repeats - this is something for the next PR. Thanks a lot!

Contributor Author

Yes, I tested it with the new generated weights, and it works just as before. 🫡

@CISC
Member

CISC commented Feb 10, 2026

@JJJYmmm https://github.com/ggml-org/llama.cpp/actions/runs/21854651667/job/63068757680?pr=19468

Contributor

@ngxson ngxson left a comment


The changes outside of cgraph look good to me (sorry, I don't have much time to look deeper, but I believe the reviews from @ggerganov and @CISC should already be enough).

Looking forward to the vision support PR then!

@JJJYmmm
Contributor Author

JJJYmmm commented Feb 10, 2026

The changes outside of cgraph look good to me (sorry, I don't have much time to look deeper, but I believe the reviews from @ggerganov and @CISC should already be enough).

Looking forward to the vision support PR then!

@ngxson This PR includes the vision part, and I tested it too! (Qwen3.5 uses the same ViT as Qwen3-VL, and DeepStack was removed temporarily.) 🤗

@ngxson
Contributor

ngxson commented Feb 10, 2026

@JJJYmmm Ah yeah, right, thanks for the clarification. Yes, I already acknowledged that DeepStack was removed (confirmed by the transformers implementation).

Looks good to me then, thanks again!

@ngxson
Contributor

ngxson commented Feb 10, 2026

It seems to me that all review comments are resolved, and @JJJYmmm also confirmed that the model still works fine with the last commit. So, I'll let @ggerganov decide when to merge this.

@ggerganov ggerganov merged commit fc0fe40 into ggml-org:master Feb 10, 2026
79 of 82 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* support qwen3.5 series

* remove deepstack for now, and some code clean

* code clean

* add FULL_ATTENTION_INTERVAL metadata

* code clean

* reorder v heads for linear attention to avoid expensive interleaved repeat
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026

Labels

examples, model (Model specific), python (python script changes)

5 participants