Closed
Description
Name and Version
build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7684 (53eb9435d)
built with GNU 14.2.1 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-bench, llama-server
Command line
build/bin/llama-bench -m ../models/unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF/Qwen3-Next-80B-A3B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -p 4096 -ub 256,512,1024,2048,4096 -n 0 -fa 1 -r 3 --mmap 0 -ngl 0,999
Problem description & steps to reproduce
Qwen3-Next behaves a little strangely with respect to ubatch size. Normally I expect performance to rise up to a point, then plateau (or decrease slightly). With this model (and hardware), prompt-processing throughput peaks at a ubatch size of 512, drops at 1024, and is almost halved at 2048 and above.
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 0 | 256 | 1 | 0 | pp4096 | 479.58 ± 0.92 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 0 | 512 | 1 | 0 | pp4096 | 499.74 ± 1.77 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 0 | 1024 | 1 | 0 | pp4096 | 404.50 ± 3.72 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 0 | 2048 | 1 | 0 | pp4096 | 283.88 ± 5.11 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 0 | 4096 | 1 | 0 | pp4096 | 280.88 ± 4.69 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 999 | 256 | 1 | 0 | pp4096 | 388.09 ± 2.15 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 999 | 512 | 1 | 0 | pp4096 | 444.50 ± 0.70 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 999 | 1024 | 1 | 0 | pp4096 | 305.61 ± 8.87 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 999 | 2048 | 1 | 0 | pp4096 | 263.40 ± 13.52 |
| qwen3next 80B.A3B Q8_0 | 79.57 GiB | 79.67 B | Vulkan | 999 | 4096 | 1 | 0 | pp4096 | 261.35 ± 7.57 |
Also, prompt processing (PP) on Vulkan is slower than on the CPU. The AVX-512 path with repacking seems to work very well.
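To put a number on the drop, here is a small sketch that expresses each result as a fraction of the best throughput for the same `ngl` setting. The figures are copied from the table above (mean t/s only, ignoring the ± spread); the `results` dict and the `relative_to_peak` helper are just illustrative names, not anything from llama.cpp.

```python
# pp4096 throughput in t/s, copied from the llama-bench table in this report.
# Outer key: ngl, inner key: n_ubatch. Only the mean values are used.
results = {
    0:   {256: 479.58, 512: 499.74, 1024: 404.50, 2048: 283.88, 4096: 280.88},
    999: {256: 388.09, 512: 444.50, 1024: 305.61, 2048: 263.40, 4096: 261.35},
}


def relative_to_peak(by_ubatch):
    """Return each throughput as a fraction of the best one in the same run."""
    peak = max(by_ubatch.values())
    return {ubatch: tps / peak for ubatch, tps in by_ubatch.items()}


if __name__ == "__main__":
    for ngl, by_ubatch in results.items():
        for ubatch, fraction in relative_to_peak(by_ubatch).items():
            print(f"ngl={ngl:>3}  n_ubatch={ubatch:>4}  {fraction:6.1%} of peak")
```

With these numbers the peak is at n_ubatch 512 for both settings, and at n_ubatch 4096 throughput is roughly 56% of peak with ngl 0 and roughly 59% of peak with ngl 999.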
First Bad Commit
No response
Relevant log output
No response