
model: qwen3next: no concat-in-loop#18759

Closed
ngxson wants to merge 2 commits into ggml-org:master from ngxson:xsn/qwen3_next_no_concat

Conversation

ngxson (Contributor) commented Jan 11, 2026

Ref: #18725 (comment)

An alternative to calling ggml_concat in a loop is to use ggml_set_rows, but that requires adding quite a lot of code.

This PR is a PoC, just to see if ggml_concat is actually the root cause.
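To illustrate the concern the PR is probing (this is not llama.cpp code, just a sketch of the general pattern): concatenating onto a growing buffer inside a loop re-copies the accumulated prefix at every step, so total copies grow quadratically, whereas writing each chunk into a preallocated buffer, which is roughly what a `ggml_set_rows`-based approach would do, copies each element exactly once. The function names below are hypothetical.

```python
def concat_in_loop(chunks):
    # Each iteration builds a new list, re-copying everything
    # accumulated so far: O(n^2) total element copies.
    out = []
    for c in chunks:
        out = out + c
    return out

def set_rows(chunks):
    # Preallocate the destination once, then write each chunk
    # into place: O(n) total element copies.
    total = sum(len(c) for c in chunks)
    out = [None] * total
    i = 0
    for c in chunks:
        out[i:i + len(c)] = c
        i += len(c)
    return out

chunks = [[1, 2], [3, 4], [5]]
assert concat_in_loop(chunks) == set_rows(chunks) == [1, 2, 3, 4, 5]
```

Whether this asymptotic difference actually dominates on a GPU backend is exactly what the benchmarks below test.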


Testing with a large ubatch: `llama-bench -p 2048 -ub 1024`

master:

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | pp2048 | 1049.92 ± 8.52 |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | tg128 | 28.85 ± 0.03 |

build: 506bb6e (7703)

PR:

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | pp2048 | 1049.65 ± 9.18 |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | tg128 | 28.92 ± 0.12 |

build: 4c7e830 (7705)

ngxson (Contributor, Author) commented Jan 11, 2026

@lemmi please give this a try on vulkan

lemmi commented Jan 11, 2026

Hey, thanks for looking into it. I ran several benchmarks and I'm getting wildly different results on each run (450-490 t/s on master). Within each run, though, the trend is the same: PP performance falls off dramatically at larger ubatch sizes. TG seems to be unaffected, so I left it out of the results:

Master:

```
build/bin/llama-bench -m ../models/model-q8_0.gguf -p 2048 -ub 512,1024,2048 -n 0 -fa 1 -r 3 --mmap 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | pp2048 | 490.41 ± 1.39 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 1024 | 1 | 0 | pp2048 | 432.51 ± 1.75 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 2048 | 1 | 0 | pp2048 | 345.59 ± 4.24 |

build: 506bb6e01 (7703)

PR:

```
pr18759/bin/llama-bench -m ../models/model-q8_0.gguf -p 2048 -ub 512,1024,2048 -n 0 -fa 1 -r 3 --mmap 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | pp2048 | 476.17 ± 1.82 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 1024 | 1 | 0 | pp2048 | 414.57 ± 2.97 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 2048 | 1 | 0 | pp2048 | 347.38 ± 2.06 |

build: 4c7e8303a (7705)

ngxson (Contributor, Author) commented Jan 11, 2026

hmm ok, so it seems like ggml_concat is not as bad as I thought. probably something else here, I think I will need to wait for more info from @jeffbolznv

closing this PR for now as it doesn't improve much


Labels

model Model specific
