[Model] Qwen3.5 dense and MoE support (no vision)#19435

Merged
pwilkin merged 16 commits into ggml-org:master from pwilkin:qwen35 on Feb 8, 2026

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Feb 8, 2026

I've gotten a bit tired of Llama.cpp missing all the zero-day releases, so this time I decided to make (or, more precisely, instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation) a conversion based on the Transformers PR (https://github.com/huggingface/transformers/pull/43830/changes). It's mostly based on Qwen3Next, but rebased on the common-delta-net PR (#19125).

Here are the mock models I generated to test it: https://huggingface.co/ilintar/qwen35_testing/tree/main

Here are the conversion results from causal-verify-logits:

| Model | NMSE     | NMSE (dB) | Result    |
|-------|----------|-----------|-----------|
| Dense | 8.94e-06 | -50.49 dB | Excellent |
| MoE   | 9.36e-05 | -40.29 dB | Excellent |
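For reference, the dB column follows from the usual conversion (assuming causal-verify-logits uses the standard definition NMSE_dB = 10·log10(NMSE), which matches the numbers above):

```python
import math

# Assumed conversion: NMSE in dB = 10 * log10(NMSE).
# This reproduces the dB column of the table above.
def nmse_to_db(nmse: float) -> float:
    return 10 * math.log10(nmse)

print(round(nmse_to_db(8.94e-06), 2))  # -50.49 (Dense)
print(round(nmse_to_db(9.36e-05), 2))  # -40.29 (MoE)
```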

@github-actions bot added the model (Model specific) and python (python script changes) labels on Feb 8, 2026
@ggerganov
Member

instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation

A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Do you have the guidelines that you used somewhere shareable?

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Do you have the guidelines that you used somewhere shareable?

I'm actually using my "adding model architectures" tutorial for this (#16770), plus an extra rules section in my agent MD reminding it about the tensor layout:

## Tensor format

A very important caveat about tensor format: the GGML library uses a different internal format than most Python implementations,
including PyTorch and Transformers. Notably, in the GGML library:

* tensors are restricted to 4 dimensions - whenever you need more dimensions, you have to pack them
* the semantic order of dimensions is reversed from PyTorch/Transformers - the *last two* dimensions are used for `[tokens_per_batch, batches]` - even though the physical layout is the same

So, for example, a `[1, 5, 4096, 1]` tensor in PyTorch will typically become a `[1, 4096, 5, 1]` tensor in GGML. This is especially
important when converting from a reference implementation written in Python, because the tensor dimension ordering will differ
even though the tensors are semantically the same.
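The reversal can be sketched with a tiny helper (the function name is hypothetical, not part of the conversion scripts):

```python
def torch_shape_to_ggml_ne(shape):
    """Reverse a PyTorch shape into GGML's ne order, padding to 4 dims.

    GGML stores dimensions as ne[0..3] with ne[0] innermost, which is the
    reverse of PyTorch's shape tuple; unused dims are padded with 1.
    """
    if len(shape) > 4:
        raise ValueError("GGML tensors are limited to 4 dims; pack extras first")
    ne = list(reversed(shape))
    return ne + [1] * (4 - len(ne))

print(torch_shape_to_ggml_ne((1, 5, 4096, 1)))  # [1, 4096, 5, 1]
```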

The prompt I used this time:

"In the transformers directory, I have included a new version of Transformers that includes support for the new Qwen3.5 series of models (MoE and dense). @transformers/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py @transformers/src/transformers/models/qwen3_5/modeling_qwen3_5.py 

I have also created a script and generated mock models for testing, the script is in @transformers/generate_qwen_models.py and has been already run.

Based on the implementation, which seems to be based strongly on the Qwen3 Next architecture, currently handled in Llama.cpp in @reference/qwen3next.cpp and in @src/models/delta.cpp, please create the implementation and conversion code for the new architectures. There are scripts in @examples/model-conversion/Makefile for testing conversions of new models. 

See @.roo/rules/adding_new_models.md for tips on how to add new model support. Make sure to reuse as much existing code as possible. For now, only add text support, without the multimodal capabilities."

EDIT: Key thing was also interrupting the model when it starts to do something stupid, for example modifying permutation rules in delta_net because it added incorrect extra permutations in conversion. Even with Opus 4.6 I had to do it like 3 or 4 times to stop it from going in the wrong direction.

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've in fact used them for some documentation work), so I might share those too:

deep-code-modifier.md
code-architecture-analyzer.md

@CISC
Member

CISC commented Feb 8, 2026

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:

/me sits back and waits for the automated PR submissions whenever a new model pops up...

@ggerganov
Member

ggerganov commented Feb 8, 2026

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:

/me sits back and waits for the automated PR submissions whenever a new model pops up...

github actions workflow, runs on self-hosted mac mini, accepts vllm/transformers PR as input

@am17an
Contributor

am17an commented Feb 8, 2026

I think it maybe now makes sense to have a dedicated operator for delta_net to eliminate all those cpy_scalars

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

@CISC actually I just made the MoE class inherit from the normal one since most of the code is duplicated :)

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

@CISC BTW reportedly super(class, self).method(...) is the preferred way to call superclasses under multiple inheritance over superclass.method(self, ...)

@CISC
Member

CISC commented Feb 8, 2026

@CISC BTW reportedly super(class, self).method(...) is the preferred way to call superclasses under multiple inheritance over superclass.method(self, ...)

No, absolutely not, it does not do what you think it does.

Edit: As a thought experiment: `super()` is shorthand for `super(MyClass, self)`.

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

No, absolutely not, it does not do what you think it does.

WDYM? I thought `super(AClass, self).method` basically means "call method on the superclass of (self interpreted as an instance of AClass)" - isn't that what it does?

@CISC
Member

CISC commented Feb 8, 2026

Not quite (in that case super() would call its own method).

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

Not quite (in that case super() would call its own method).

But why, if it calls the method of the "superclass of AClass"?

The way I understand it:

  • under single inheritance, if A is a subclass of B, then super(A, self).method(...) is equivalent to B.method(self, ...)
  • under multiple inheritance, if A is a subclass of B and C, then super(A, self).method(...) means calling both B.method and C.method if they exist

Am I misunderstanding something?

@CISC
Member

CISC commented Feb 8, 2026

Am I misunderstanding something?

I'll repeat the example. :)

`super().method` -> `super(MyClass, self).method`, i.e. calling the method of the parent of MyClass.
`super(ParentClass, self).method` calls which method? :)
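The distinction can be seen in a minimal (hypothetical) class hierarchy: `super(SomeClass, self)` starts the MRO lookup *after* SomeClass, so passing the parent class skips the parent's own override entirely:

```python
class Grandparent:
    def who(self):
        return "Grandparent"

class Parent(Grandparent):
    def who(self):
        return "Parent"

class Child(Parent):
    def who(self):
        # super() here is shorthand for super(Child, self): resolves to Parent.who
        parent = super().who()
        # super(Parent, self) skips Parent in the MRO: resolves to Grandparent.who
        grandparent = super(Parent, self).who()
        return parent, grandparent

print(Child().who())  # ('Parent', 'Grandparent')
```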

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

Sorry, got a bit too excited... Gonna fix the Next bug ASAP.

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

So fun fact - packed expert conversion for Qwen2Moe was apparently wrong, which is why all these shenanigans were added. It was likely never caught because no models actually used the packed experts. I moved the fix over to Qwen2Moe instead and now just mostly call super()

Are you sure? There are several models inheriting from Qwen2MoeModel, I'm fairly sure this was added for one of them!

Quite sure - the MoE code didn't change at all between Qwen2 and Qwen3.5, 3.5 actually uses the packed tensors and you have to split on the second dimension, not the last one.

If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead.
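To illustrate the splitting point with made-up sizes (the packing layout here is an assumption for the sketch, not copied from the conversion code): if gate and up projections are packed along the second dimension of a `[n_experts, 2 * n_ff, n_embd]` tensor, the split must target axis 1, not the last axis:

```python
import numpy as np

# Hypothetical packed expert tensor: gate and up projections stacked on dim 1.
n_experts, n_ff, n_embd = 4, 8, 16
packed = np.zeros((n_experts, 2 * n_ff, n_embd), dtype=np.float32)

# Split on the second dimension (axis=1), NOT the last one:
gate, up = np.split(packed, 2, axis=1)
print(gate.shape, up.shape)  # (4, 8, 16) (4, 8, 16)

# Splitting on the last axis instead yields the wrong shapes:
wrong_gate, wrong_up = np.split(packed, 2, axis=-1)
print(wrong_gate.shape)  # (4, 16, 8)
```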

@CISC
Member

CISC commented Feb 9, 2026

If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead.

Sure, but point is some arch may be using it, I'll have a look...

@JJJYmmm
Contributor

JJJYmmm commented Feb 9, 2026

If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead.

For some historical reason, qwen3vlmoe and gptoss have transposed packed expert weights. huggingface/transformers#43307

@CISC
Member

CISC commented Feb 9, 2026

For some historical reason, qwen3vlmoe and gptoss have transposed packed expert weights. huggingface/transformers#43307

@pwilkin Added in #16780, you fix? :)

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov Are you sure Next is broken, BTW? I rebased it on the fixed delta branch, so it should be fine. Just checked perplexity and it seems OK; also tested a 100k-deep query on Next-Coder IQ2_S and it seems to work well too:

> /read ggml/src/ggml-quants.c

Loaded text from 'ggml/src/ggml-quants.c'

> Can you please explain the TQ quants?

Okay, let's break down the **TQ (Ternary Quantization)** methods implemented in the code, specifically `tq1_0` and `tq2_0`, which are inspired by models like **BitNet b1.58** and **TriLMs**.

The core idea of TQ is to represent weights using **ternary values (-1, 0, +1)**, but encoded in a way that is efficient for computation, unlike standard quantization methods that use powers of 2 (like Q4, Q8).
(...)

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@pwilkin Added in #16780, you fix? :)

I break, I fix :)

@ggerganov
Member

Yes, both Metal and CUDA produce higher PPL compared to before the change.

Just checked perplexity and it seems OK

Is it the same as before the PR?

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov I checked the perplexity and it looks fine - can you please verify? I rebased this on the fixed version of the delta_net branch, so it should be correct.

Yes, both Metal and CUDA produce higher PPL compared to before the change.

Just checked perplexity and it seems OK

Is it the same as before the PR?

Seems so, but I will test again.

@ggerganov
Member

It looks fine, but the problem is that it is different and significantly higher.

# new
Final estimate: PPL = 5.7395 +/- 0.35952

# old (e06088da0fa86aa444409f38dff274904931c507)
Final estimate: PPL = 5.1777 +/- 0.31137

It should not change. Also, try to perform the test from #19305 - it fails.

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov Verifying rn.

@ggerganov
Member

@pwilkin If you need more time to locate and fix the problem, maybe it would be better to revert the PR for now and take the time to support this properly. No need to rush it. WDYT?

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov give me half an hour, if I don't find it then we can do that instead, OK?

@ggerganov
Member

I just don't have the confidence that these changes are good. The branch that you rebased on was closed #19125 and it is not clear at all that it is working correctly. IMO the cleanest thing is to go back.

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov Okay, you're right, I'll prepare a new one after you revert.

@ggerganov
Member

Ok, I'll open the revert now.

@mirek190


Have you tried using GPT Codex 5.3 xhigh with codex-cli, since it's better for complex coding?

@pwilkin
Contributor Author

pwilkin commented Feb 10, 2026

@mirek190 I think they're comparable, but I don't have an active Codex sub atm, just Claude.

@mirek190

According to Matt Maher:

https://www.youtube.com/watch?v=hwvyew2iXpU&t=937s

In his real-world usage tests he got 86% with Opus 4.6 high and 95% with Codex 5.3 xhigh, and he claims Codex is smarter on complex code.
That's something I also noticed - Codex seems better for complex tasks than Opus 4.6, and it's absurdly cheap: for 20 USD you can now work a whole week.

I'm just saying...

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* Unified delta net handling

* Remove old methods.

* Refactor and optimize

* Adapt autoregressive version from @ymcki

* Change to decay mask approach

* Fix bad permute

* Qwen 3.5 support

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes

* Use inheritance, remove unneeded conts

* Not like this!

* Remove ggml.h explicit import

* Remove transformers, fix the views

* ACTUALLY fix views, make super calls explicit in conversion.

* Fix conversion again

* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026