
model: qwen3next: no concat-in-loop#18759

Closed
ngxson wants to merge 2 commits into ggml-org:master from ngxson:xsn/qwen3_next_no_concat

Conversation

ngxson (Contributor) commented Jan 11, 2026

Ref: #18725 (comment)

An alternative to calling ggml_concat in a loop is to use ggml_set_rows, but that requires adding quite a lot of code.

This PR is a PoC, just to see if ggml_concat is actually the root cause.
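To illustrate the concern the PR is probing (this is not llama.cpp code, just a sketch of the general pattern): concatenating onto a growing buffer inside a loop re-copies the accumulated prefix at every step, so total copies grow quadratically, whereas writing each chunk into a preallocated buffer, which is roughly what a `ggml_set_rows`-based approach would do, copies each element exactly once. The function names below are hypothetical.

```python
def concat_in_loop(chunks):
    # Each iteration builds a new list, re-copying everything
    # accumulated so far: O(n^2) total element copies.
    out = []
    for c in chunks:
        out = out + c
    return out

def set_rows(chunks):
    # Preallocate the destination once, then write each chunk
    # into place: O(n) total element copies.
    total = sum(len(c) for c in chunks)
    out = [None] * total
    i = 0
    for c in chunks:
        out[i:i + len(c)] = c
        i += len(c)
    return out

chunks = [[1, 2], [3, 4], [5]]
assert concat_in_loop(chunks) == set_rows(chunks) == [1, 2, 3, 4, 5]
```

Whether this asymptotic difference actually dominates on a GPU backend is exactly what the benchmarks below test.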


Testing with a large ubatch: `llama-bench -p 2048 -ub 1024`

master:

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | pp2048 | 1049.92 ± 8.52 |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | tg128 | 28.85 ± 0.03 |

build: 506bb6e (7703)

PR:

| model | size | params | backend | threads | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | pp2048 | 1049.65 ± 9.18 |
| qwen3next 80B.A3B F16 | 148.50 GiB | 79.67 B | Metal,BLAS | 24 | 1024 | tg128 | 28.92 ± 0.12 |

build: 4c7e830 (7705)

ngxson (Contributor, Author) commented Jan 11, 2026

@lemmi please give this a try on vulkan

lemmi commented Jan 11, 2026

Hey, thanks for looking into it. I ran several benchmarks and I'm getting wildly different results on each run (450-490 t/s on master). Within each run, though, the trend is the same: PP performance falls off dramatically at larger ubatch sizes. TG seems to be unaffected, so I left it out of the results:

Master:

```
build/bin/llama-bench -m ../models/model-q8_0.gguf -p 2048 -ub 512,1024,2048 -n 0 -fa 1 -r 3 --mmap 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | pp2048 | 490.41 ± 1.39 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 1024 | 1 | 0 | pp2048 | 432.51 ± 1.75 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 2048 | 1 | 0 | pp2048 | 345.59 ± 4.24 |

build: 506bb6e01 (7703)

PR:

```
pr18759/bin/llama-bench -m ../models/model-q8_0.gguf -p 2048 -ub 512,1024,2048 -n 0 -fa 1 -r 3 --mmap 0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
```

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 512 | 1 | 0 | pp2048 | 476.17 ± 1.82 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 1024 | 1 | 0 | pp2048 | 414.57 ± 2.97 |
| qwen3next 80B.A3B Q8_0 | 78.98 GiB | 79.67 B | Vulkan | 99 | 2048 | 1 | 0 | pp2048 | 347.38 ± 2.06 |

build: 4c7e8303a (7705)

ngxson (Contributor, Author) commented Jan 11, 2026

hmm ok, so it seems like ggml_concat is not as bad as I thought. probably something else here, I think I will need to wait for more info from @jeffbolznv

closing this PR for now as it doesn't improve much


Labels

model Model specific
