[Model] Qwen3.5 dense and MoE support (no vision)#19435
pwilkin merged 16 commits into ggml-org:master
Conversation
A bit off-topic, but I'm curious to try the same task with a local model. Maybe GLM 4.7 Flash + OpenCode. Are the guidelines you used shared anywhere? |
I'm using my "adding model architectures" tutorial for this actually (#16770) + got an extra rules section in my agent MD reminding about the tensor layout: The prompt I used this time: "In the transformers directory, I have included a new version of Transformers that includes support for the new Qwen3.5 series of models (MoE and dense). @transformers/src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py @transformers/src/transformers/models/qwen3_5/modeling_qwen3_5.py
I have also created a script and generated mock models for testing, the script is in @transformers/generate_qwen_models.py and has been already run.
Based on the implementation, which seems to be based strongly on the Qwen3 Next architecture, currently handled in Llama.cpp in @reference/qwen3next.cpp and in @src/models/delta.cpp, please create the implementation and conversion code for the new architectures. There are scripts in @examples/model-conversion/Makefile for testing conversions of new models.
See @.roo/rules/adding_new_models.md for tips on how to add new model support. Make sure to reuse as much existing code as possible. For now, only add text support, without the multimodal capabilities."

EDIT: The key thing was also interrupting the model when it started to do something stupid, for example modifying permutation rules in delta_net because it had added incorrect extra permutations in conversion. Even with Opus 4.6 I had to do that 3 or 4 times to stop it from going in the wrong direction. Oh yeah, I also made some agents that I feel work well with llama.cpp for OpenCode (I've used them in fact for some documentation work), so I might share those too: |
/me sits back and waits for the automated PR submissions whenever a new model pops up... |
github actions workflow, runs on self-hosted mac mini, accepts vllm/transformers PR as input |
I think it may now make sense to have a dedicated operator for delta_net, to eliminate all those cpy_scalars |
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC actually I just made the MoE class inherit from the normal one since most of the code is duplicated :) |
@CISC BTW reportedly |
No, absolutely not, it does not do what you think it does. Edit: As a thought experiment; |
WDYM? I thought |
Not quite (in that case |
But why, if it calls the method of the "superclass of AClass"? The way I understand it:
Am I misunderstanding something? |
I'll repeat the example. :)
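For readers puzzled by the `super()` exchange above, here is a minimal sketch (my own illustration, not the exact example from the thread) of why `super()` does not simply mean "the superclass of the class the method is written in" — it resolves against the method resolution order (MRO) of the *instance's* class:

```python
class AClass:
    def name(self):
        return "AClass"

class BClass(AClass):
    def name(self):
        # super() here does NOT necessarily mean AClass; it means
        # "the next class after BClass in the MRO of type(self)".
        return "BClass via " + super().name()

class CClass(AClass):
    def name(self):
        return "CClass via " + super().name()

class DClass(BClass, CClass):
    pass

# DClass's MRO is D -> B -> C -> A -> object, so the super() call
# written inside BClass dispatches to CClass for a DClass instance:
print(DClass().name())  # BClass via CClass via AClass
```

This is why cooperative multiple inheritance can route a `super()` call to a sibling class the method's author never mentioned.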
Sorry, got a bit too excited... Gonna fix the Next bug ASAP. |
Quite sure - the MoE code didn't change at all between Qwen2 and Qwen3.5; 3.5 actually uses the packed tensors, and you have to split on the second dimension, not the last one. If any other arch uses this conversion scheme, it should go in that arch's dedicated code instead. |
Sure, but point is some arch may be using it, I'll have a look... |
For some historical reason, qwen3vlmoe and gptoss have the transposed packed expert weights. huggingface/transformers#43307 |
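A rough sketch of the two packing conventions being discussed. The tensor names and dimensions here are made up for illustration, not the actual Qwen3.5 checkpoint layout:

```python
import numpy as np

# Hypothetical shapes: 4 experts, FFN dim 16, hidden dim 8.
n_expert, n_ff, n_embd = 4, 16, 8

# Qwen3.5-style packing (per the thread): gate and up projections are
# packed along the second dimension, so the split must use axis 1,
# not the last axis.
gate_up = np.random.rand(n_expert, 2 * n_ff, n_embd).astype(np.float32)
gate, up = np.split(gate_up, 2, axis=1)

# An arch with *transposed* packed expert weights (the thread mentions
# qwen3vlmoe and gpt-oss) would instead be split on the last axis:
gate_up_t = np.random.rand(n_expert, n_embd, 2 * n_ff).astype(np.float32)
gate_t, up_t = np.split(gate_up_t, 2, axis=-1)
```

Splitting on the wrong axis still produces tensors of a plausible-looking size in some cases, which is why this kind of bug tends to surface only as bad logits rather than a shape error.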
@ggerganov Are you sure this broke Next, BTW? I rebased it on the fixed delta branch, so it should be fine. I just checked perplexity and it seems OK, and I tested a 100k-deep query on Next-Coder IQ2_S, which seems to work well too:

> /read ggml/src/ggml-quants.c
Loaded text from 'ggml/src/ggml-quants.c'
> Can you please explain the TQ quants?
Okay, let's break down the **TQ (Ternary Quantization)** methods implemented in the code, specifically `tq1_0` and `tq2_0`, which are inspired by models like **BitNet b1.58** and **TriLMs**.
The core idea of TQ is to represent weights using **ternary values (-1, 0, +1)**, but encoded in a way that is efficient for computation, unlike standard quantization methods that use powers of 2 (like Q4, Q8).
(...) |
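The ternary idea the model is describing can be sketched in a few lines. This is a toy absmean-style quantizer in the spirit of BitNet b1.58, not the actual `tq1_0`/`tq2_0` byte layout from ggml-quants.c:

```python
import numpy as np

# Toy ternary quantization: one scale per block (the mean absolute
# value), then each weight rounded and clipped to {-1, 0, +1}.
w = np.array([0.9, -0.05, -1.1, 0.4], dtype=np.float32)
scale = np.abs(w).mean()
ternary = np.clip(np.round(w / scale), -1, 1).astype(np.int8)

# Dequantization is just a multiply by the shared scale.
dequant = ternary * scale
print(ternary.tolist())  # [1, 0, -1, 1]
```

The real TQ formats then pack those trits densely (several ternary values per byte) instead of storing one int8 each, which is where the actual savings come from.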
Yes, both Metal and CUDA produce higher PPL compared to before the change.
Is it the same as before the PR? |
@ggerganov I checked the perplexity and it looks fine, can you please verify? I rebased this on the fixed version of the delta_net branch, so it should be correct.
Seems so, but I will test again. |
It looks fine, but the problem is that it is different and significantly higher:

# new
Final estimate: PPL = 5.7395 +/- 0.35952

# old (e06088da0fa86aa444409f38dff274904931c507)
Final estimate: PPL = 5.1777 +/- 0.31137

It should not change. Also, try to perform the test from #19305 - it fails. |
@ggerganov Verifying rn. |
@pwilkin If you need more time to locate and fix the problem, maybe it would be better to revert the PR for now and take the time to support this properly. No need to rush it. WDYT? |
@ggerganov give me half an hour, if I don't find it then we can do that instead, OK? |
I just don't have the confidence that these changes are good. The branch that you rebased on (#19125) was closed, and it is not clear at all that it is working correctly. IMO the cleanest thing is to go back. |
@ggerganov Okay, you're right, I'll prepare a new one after you revert. |
Ok, I'll open the revert now. |
Have you tried GPT Codex 5.3 xhigh with codex-cli? It's reportedly better for complex coding. |
@mirek190 I think they're comparable, but I don't have an active Codex sub atm, just Claude. |
According to Matt Maher (https://www.youtube.com/watch?v=hwvyew2iXpU&t=937s), in his real-world usage tests he got 86% with Opus 4.6 high and 95% with Codex 5.3 xhigh, and he claims Codex is smarter on complex code. Just saying... |
* Unified delta net handling
* Remove old methods.
* Refactor and optimize
* Adapt autoregressive version from @ymcki
* Change to decay mask approach
* Fix bad permute
* Qwen 3.5 support
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Further fixes
* Use inheritance, remove unneeded conts
* Not like this!
* Remove ggml.h explicit import
* Remove transformers, fix the views
* ACTUALLY fix views, make super calls explicit in conversion.
* Fix conversion again
* Remove extra ggml.h imports

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…#19435)" (ggml-org#19453) This reverts commit 39bf692.
I've gotten a bit tired of Llama.cpp missing all the zero-day releases, so this time I decided to make (or, more precisely, instructed Opus 4.6 to make, based on reference implementations and my guidelines for model adaptation) a conversion based on the Transformers PR ( https://github.com/huggingface/transformers/pull/43830/changes ). It's mostly based on Qwen3Next, but it's rebased on the common-delta-net PR ( #19125 ).
Here are the mock models I generated to test it: https://huggingface.co/ilintar/qwen35_testing/tree/main
Here are the conversion results from causal-verify-logits:

| Model | NMSE | NMSE (dB) | Result |
|-------|------|-----------|--------|
| Dense | 8.94e-06 | -50.49 dB | Excellent |
| MoE | 9.36e-05 | -40.29 dB | Excellent |
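A sketch of how such an NMSE figure could be computed. This assumes the usual definition — mean squared error normalized by the reference's mean square, reported in decibels as 10·log10(NMSE) — which is my assumption here, not the tool's documented formula:

```python
import numpy as np

def nmse_db(ref, test):
    """Normalized MSE between test and reference logits, plus its dB value."""
    nmse = np.mean((test - ref) ** 2) / np.mean(ref ** 2)
    return nmse, 10 * np.log10(nmse)

# Under this definition, an NMSE of 8.94e-06 corresponds to about -50.49 dB:
print(round(10 * np.log10(8.94e-06), 2))  # -50.49
```

More negative dB values mean the converted model's logits track the reference more closely; around -50 dB the residual error is five orders of magnitude below the signal.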