models : optimizing qwen3next graph #19375
Conversation
|
On my system: so a couple of percent for TG, no benefit for PP. |
|
@ggerganov did you look at the optimizations in #18792? I think there's a ~15% improvement in PP there. (For reference, I'm waiting for #18755 to get finished and merged so I can actually merge the delta-net code.) |
|
I missed this PR, thanks for pointing it out. I'll rebase on top of it then. |
|
No problems on master (tested only Release); assert on Release, assert on Debug |
|
I noticed that this can become a matmul, but my tests show that the sum-rows version is actually faster - likely matmul with |
|
Still crashing on my system. |
|
@ggerganov BTW, I think this is a good moment to ask: back when I first tried to implement the chunking, my idea was like this:
I remember @am17an mentioning he tried something similar. However, that always failed due to scheduler errors - seemed like the view didn't get a backend assigned to it properly. Any idea why? It still might be faster than doing all the CONCATs. |
|
I'll see if we can avoid the concats next. |
|
Using
Preallocating a tensor with |
Interestingly, I have a draft PR #18759 that uses |
|
The current version also improves the perf significantly, especially for larger ubatches where we have more chunks. I think adding support to the missing backends should not be difficult and would be quite useful in the future in similar situations. |
|
One more reason why I considered using it: llama.cpp/src/models/qwen3next.cpp, lines 351 to 353 in f486ce9 |
The above seems like something worth adding to the model-adding tutorial :) |
Feedback for the code owners/authors: since here: #18266. With this said, I appreciate your hard work and hope you'll get your funding round soon. Edit: All good, with the 8068 working flawlessly and damn fast. Qwen3.5, Minimax 2.5, Glm5, all flying. Thank you :) |
|
Very impressive ggml magic here for Qwen3Next. I think most of the changes can also be replicated for Kimi Linear. Is @ggerganov also going to do it for Kimi Linear? Or should I do it in a new PR? Or should I do it within the unified delta_net PR? |
|
@ymcki I think you should try to integrate the Kimi Linear delta net into the new |
Are you talking about this issue? |
|
Here are some benchmarks on M1 Ultra (before / after): |
* models : optimizing qwen3next graph
* cont
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* cont : remove redundant q, g chunking
* minor
* minor
* avoid passing masks around
* avoid concats during chunking
* naming + shapes
* update names and use prefix to disable CUDA graphs
@bartowski1182 I found that all your Qwen Next quants have this problem. Do you have any plan to update them? |
|
Hmm, fair call out. Yeah, I should probably do that; I have some other improvements too that'll apply... |





Reworking the ggml compute graph to avoid unnecessary copies.
M2 Ultra:
DGX Spark:
Related backend optimizations and refactorings:
Notes:
Some GGUFs can incorrectly have 1D BF16 tensors. These can hurt performance in some backends such as Metal and should instead be converted to F32.
Fix: convert : store ffn_gate_inp_shexp as F32 #19606
Next PRs
ggml_build_forward_select() to make the graph constant