[Model] Qwen3.5 dense and MoE support (no vision)#19435

Merged
pwilkin merged 16 commits into ggml-org:master from pwilkin:qwen35 on Feb 8, 2026

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Feb 8, 2026

I've gotten a bit tired of Llama.cpp missing all the zero-day releases, so this time I decided to make (or, more precisely, instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation) a conversion based on the Transformers PR (https://github.com/huggingface/transformers/pull/43830/changes). It's mostly based on Qwen3Next, but rebased on the common-delta-net PR (#19125).

Here are the mock models I generated to test it: https://huggingface.co/ilintar/qwen35_testing/tree/main

Here are the conversion results from causal-verify-logits:

| Model | NMSE     | NMSE (dB) | Result    |
|-------|----------|-----------|-----------|
| Dense | 8.94e-06 | -50.49 dB | Excellent |
| MoE   | 9.36e-05 | -40.29 dB | Excellent |
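For reference, the dB column follows from the usual conversion (assuming causal-verify-logits uses the standard definition NMSE_dB = 10·log10(NMSE), which matches the numbers above):

```python
import math

# Assumed conversion: NMSE in dB = 10 * log10(NMSE).
# This reproduces the dB column of the table above.
def nmse_to_db(nmse: float) -> float:
    return 10 * math.log10(nmse)

print(round(nmse_to_db(8.94e-06), 2))  # -50.49 (Dense)
print(round(nmse_to_db(9.36e-05), 2))  # -40.29 (MoE)
```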

@github-actions bot added the model (Model specific) and python (python script changes) labels on Feb 8, 2026
@ggerganov
Member

instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation

A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Do you have the guidelines that you used somewhere shareable?

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Do you have the guidelines that you used somewhere shareable?

I'm actually using my "adding model architectures" tutorial for this (#16770), plus an extra rules section in my agent MD reminding it about the tensor layout:

## Tensor format

A very important caveat about tensor format: the GGML library uses a different internal format than most Python implementations,
including PyTorch and Transformers. Notably, in the GGML library:

* tensors are restricted to 4 dimensions - whenever you need more dimensions, you have to pack them
* the semantic order of dimensions is reversed from PyTorch/Transformers - the *last two* dimensions are used for `[tokens_per_batch, batches]` - even though the physical layout is the same

So, for example, a `[1, 5, 4096, 1]` tensor in PyTorch will typically become a `[1, 4096, 5, 1]` tensor in GGML. This is especially
important when converting from a reference implementation written in Python, because the tensor dimension ordering will differ
even though the tensors are semantically the same.
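The reversal can be sketched with a tiny helper (the function name is hypothetical, not part of the conversion scripts):

```python
def torch_shape_to_ggml_ne(shape):
    """Reverse a PyTorch shape into GGML's ne order, padding to 4 dims.

    GGML stores dimensions as ne[0..3] with ne[0] innermost, which is the
    reverse of PyTorch's shape tuple; unused dims are padded with 1.
    """
    if len(shape) > 4:
        raise ValueError("GGML tensors are limited to 4 dims; pack extras first")
    ne = list(reversed(shape))
    return ne + [1] * (4 - len(ne))

print(torch_shape_to_ggml_ne((1, 5, 4096, 1)))  # [1, 4096, 5, 1]
```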

The prompt I used this time:

"In the transformers directory, I have included a new version of Transformers that includes support for the new Qwen3.5 series of models (MoE and dense). @transformers/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py @transformers/src/transformers/models/qwen3_5/modeling_qwen3_5.py 

I have also created a script and generated mock models for testing, the script is in @transformers/generate_qwen_models.py and has been already run.

Based on the implementation, which seems to be based strongly on the Qwen3 Next architecture, currently handled in Llama.cpp in @reference/qwen3next.cpp and in @src/models/delta.cpp, please create the implementation and conversion code for the new architectures. There are scripts in @examples/model-conversion/Makefile for testing conversions of new models. 

See @.roo/rules/adding_new_models.md for tips on how to add new model support. Make sure to reuse as much existing code as possible. For now, only add text support, without the multimodal capabilities."

EDIT: Key thing was also interrupting the model when it starts to do something stupid, for example modifying permutation rules in delta_net because it added incorrect extra permutations in conversion. Even with Opus 4.6 I had to do it like 3 or 4 times to stop it from going in the wrong direction.

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've in fact used them for some documentation work), so I might share those too:

deep-code-modifier.md
code-architecture-analyzer.md

@CISC
Member

CISC commented Feb 8, 2026

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:

/me sits back and waits for the automated PR submissions whenever a new model pops up...

@ggerganov
Member

ggerganov commented Feb 8, 2026

Oh yeah, I also made some agents that I feel work well with Llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too:

/me sits back and waits for the automated PR submissions whenever a new model pops up...

github actions workflow, runs on self-hosted mac mini, accepts vllm/transformers PR as input

@am17an
Contributor

am17an commented Feb 8, 2026

I think it maybe now makes sense to have a dedicated operator for delta_net to eliminate all those cpy_scalars

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

@CISC actually I just made the MoE class inherit from the normal one since most of the code is duplicated :)

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

@CISC BTW reportedly super(class, self).method(...) is the preferred way to call superclasses under multiple inheritance over superclass.method(self, ...)

@CISC
Member

CISC commented Feb 8, 2026

@CISC BTW reportedly super(class, self).method(...) is the preferred way to call superclasses under multiple inheritance over superclass.method(self, ...)

No, absolutely not, it does not do what you think it does.

Edit: As a thought experiment: `super()` is shorthand for `super(MyClass, self)`.

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

No, absolutely not, it does not do what you think it does.

WDYM? I thought `super(AClass, self).method` basically means "call method on the superclass of (self interpreted as an instance of AClass)" - isn't that what it does?

@CISC
Member

CISC commented Feb 8, 2026

Not quite (in that case super() would call its own method).

@pwilkin
Contributor Author

pwilkin commented Feb 8, 2026

Not quite (in that case super() would call its own method).

But why, if it calls the method of the "superclass of AClass"?

The way I understand it:

  • under single inheritance, if A is a subclass of B, then super(A, self).method(...) is equivalent to B.method(self, ...)
  • under multiple inheritance, if A is a subclass of B and C, then super(A, self).method(...) means calling both B.method and C.method if they exist

Am I misunderstanding something?

@CISC
Member

CISC commented Feb 8, 2026

Am I misunderstanding something?

I'll repeat the example. :)

`super().method` -> `super(MyClass, self).method`, i.e. calling the method of the parent of MyClass.
`super(ParentClass, self).method` calls which method? :)
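The distinction can be seen in a minimal (hypothetical) class hierarchy: `super(SomeClass, self)` starts the MRO lookup *after* SomeClass, so passing the parent class skips the parent's own override entirely:

```python
class Grandparent:
    def who(self):
        return "Grandparent"

class Parent(Grandparent):
    def who(self):
        return "Parent"

class Child(Parent):
    def who(self):
        # super() here is shorthand for super(Child, self): resolves to Parent.who
        parent = super().who()
        # super(Parent, self) skips Parent in the MRO: resolves to Grandparent.who
        grandparent = super(Parent, self).who()
        return parent, grandparent

print(Child().who())  # ('Parent', 'Grandparent')
```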

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

Sorry, got a bit too excited... Gonna fix the Next bug ASAP.

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

So fun fact - packed expert conversion for Qwen2Moe was apparently wrong, which is why all these shenanigans were added. It was likely never caught because no models actually used the packed experts. I moved the fix over to Qwen2Moe instead and now just mostly call super()

Are you sure? There are several models inheriting from Qwen2MoeModel, I'm fairly sure this was added for one of them!

Quite sure - the MoE code didn't change at all between Qwen2 and Qwen3.5, 3.5 actually uses the packed tensors and you have to split on the second dimension, not the last one.

If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead.
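To illustrate the splitting point with made-up sizes (the packing layout here is an assumption for the sketch, not copied from the conversion code): if gate and up projections are packed along the second dimension of a `[n_experts, 2 * n_ff, n_embd]` tensor, the split must target axis 1, not the last axis:

```python
import numpy as np

# Hypothetical packed expert tensor: gate and up projections stacked on dim 1.
n_experts, n_ff, n_embd = 4, 8, 16
packed = np.zeros((n_experts, 2 * n_ff, n_embd), dtype=np.float32)

# Split on the second dimension (axis=1), NOT the last one:
gate, up = np.split(packed, 2, axis=1)
print(gate.shape, up.shape)  # (4, 8, 16) (4, 8, 16)

# Splitting on the last axis instead yields the wrong shapes:
wrong_gate, wrong_up = np.split(packed, 2, axis=-1)
print(wrong_gate.shape)  # (4, 16, 8)
```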

@CISC
Member

CISC commented Feb 9, 2026

If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead.

Sure, but point is some arch may be using it, I'll have a look...

@JJJYmmm
Contributor

JJJYmmm commented Feb 9, 2026

If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead.

For some historical reason, qwen3vlmoe and gptoss have transposed packed expert weights. huggingface/transformers#43307

@CISC
Member

CISC commented Feb 9, 2026

For some historical reason, qwen3vlmoe and gptoss have transposed packed expert weights. huggingface/transformers#43307

@pwilkin Added in #16780, you fix? :)

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov Are you sure Next is broken, BTW? I rebased it on the fixed delta branch, so it should be fine. Just checked perplexity and it seems OK; also tested a 100k-deep query on Next-Coder IQ2_S and it seems to work well too:

> /read ggml/src/ggml-quants.c

Loaded text from 'ggml/src/ggml-quants.c'

> Can you please explain the TQ quants?

Okay, let's break down the **TQ (Ternary Quantization)** methods implemented in the code, specifically `tq1_0` and `tq2_0`, which are inspired by models like **BitNet b1.58** and **TriLMs**.

The core idea of TQ is to represent weights using **ternary values (-1, 0, +1)**, but encoded in a way that is efficient for computation, unlike standard quantization methods that use powers of 2 (like Q4, Q8).
(...)

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@pwilkin Added in #16780, you fix? :)

I break, I fix :)

@ggerganov
Member

Yes, both Metal and CUDA produce higher PPL compared to before the change.

Just checked perplexity and it seems OK

Is it the same as before the PR?

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov I checked the perplexity and it looks fine - can you please verify? I rebased this on the fixed version of the delta_net branch, so it should be correct.

Yes, both Metal and CUDA produce higher PPL compared to before the change.

Just checked perplexity and it seems OK

Is it the same as before the PR?

Seems so, but I will test again.

@ggerganov
Member

It looks fine, but the problem is that it is different and significantly higher.

# new
Final estimate: PPL = 5.7395 +/- 0.35952

# old (e06088da0fa86aa444409f38dff274904931c507)
Final estimate: PPL = 5.1777 +/- 0.31137

It should not change. Also, try to perform the test from #19305 - it fails.

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov Verifying rn.

@ggerganov
Member

@pwilkin If you need more time to locate and fix the problem, maybe it would be better to revert the PR for now and take the time to support this properly. No need to rush it. WDYT?

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov give me half an hour, if I don't find it then we can do that instead, OK?

@ggerganov
Member

I just don't have the confidence that these changes are good. The branch that you rebased on was closed #19125 and it is not clear at all that it is working correctly. IMO the cleanest thing is to go back.

@pwilkin
Contributor Author

pwilkin commented Feb 9, 2026

@ggerganov Okay, you're right, I'll prepare a new one after you revert.

@ggerganov
Member

Ok, I'll open the revert now.

@mirek190


Have you tried using GPT Codex 5.3 xhigh with codex-cli, since it's better for complex coding?

@pwilkin
Contributor Author

pwilkin commented Feb 10, 2026

@mirek190 I think they're comparable, but I don't have an active Codex sub atm, just Claude.

@mirek190

According to Matt Maher:

https://www.youtube.com/watch?v=hwvyew2iXpU&t=937s

In his real-world usage tests he got 86% with Opus 4.6 high and 95% with Codex 5.3 xhigh, and he claims Codex is smarter on complex code.
That's something I also noticed - Codex seems better for complex tasks than Opus 4.6, and it's absurdly cheap: for 20 USD you can now work a whole week.

I'm just saying...

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* Unified delta net handling

* Remove old methods.

* Refactor and optimize

* Adapt autoregressive version from @ymcki

* Change to decay mask approach

* Fix bad permute

* Qwen 3.5 support

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes

* Use inheritance, remove unneeded conts

* Not like this!

* Remove ggml.h explicit import

* Remove transformers, fix the views

* ACTUALLY fix views, make super calls explicit in conversion.

* Fix conversion again

* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026