Misc. bug: Qwen3-Next PP performance loss with larger ubatch-size (Strix Halo, Vulkan) #18725

### Name and Version

```
build/bin/llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
version: 7684 (53eb9435d)
built with GNU 14.2.1 for Linux x86_64
```

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-bench, llama-server

### Command line

```shell
build/bin/llama-bench -m ../models/unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF/Qwen3-Next-80B-A3B-Instruct-UD-Q8_K_XL-00001-of-00002.gguf -p 4096 -ub 256,512,1024,2048,4096 -n 0 -fa 1 -r 3 --mmap 0 -ngl 0,999
```

### Problem description & steps to reproduce

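Qwen3-Next behaves a little strangely with respect to ubatch size. Normally I expect prompt-processing (PP) performance to rise up to a point, then plateau (or decrease slightly). With this model (and hardware), performance peaks at a ubatch size of 512 and then drops sharply: at 2048 and above it is more than 40% below the peak, i.e. it almost halves.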

| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     |   0 |      256 |  1 |    0 |          pp4096 |        479.58 ± 0.92 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     |   0 |      512 |  1 |    0 |          pp4096 |        499.74 ± 1.77 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     |   0 |     1024 |  1 |    0 |          pp4096 |        404.50 ± 3.72 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     |   0 |     2048 |  1 |    0 |          pp4096 |        283.88 ± 5.11 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     |   0 |     4096 |  1 |    0 |          pp4096 |        280.88 ± 4.69 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     | 999 |      256 |  1 |    0 |          pp4096 |        388.09 ± 2.15 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     | 999 |      512 |  1 |    0 |          pp4096 |        444.50 ± 0.70 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     | 999 |     1024 |  1 |    0 |          pp4096 |        305.61 ± 8.87 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     | 999 |     2048 |  1 |    0 |          pp4096 |       263.40 ± 13.52 |
| qwen3next 80B.A3B Q8_0         |  79.57 GiB |    79.67 B | Vulkan     | 999 |     4096 |  1 |    0 |          pp4096 |        261.35 ± 7.57 |

Also, PP with full Vulkan offload (`-ngl 999`) is slower than on CPU (`-ngl 0`) at every ubatch size tested. The AVX-512 path with repacking seems to work *very* well.
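
To put numbers on both observations, here is a small sketch that recomputes the drop from peak and the `-ngl 0` to `-ngl 999` ratio from the table above. The throughput values are copied from the table; the `results` layout and the `drop_from_peak` helper are just illustrative names, not part of llama.cpp or llama-bench.

```python
# pp4096 throughput (t/s) copied from the table above, keyed by -ngl, then by -ub.
results = {
    0:   {256: 479.58, 512: 499.74, 1024: 404.50, 2048: 283.88, 4096: 280.88},
    999: {256: 388.09, 512: 444.50, 1024: 305.61, 2048: 263.40, 4096: 261.35},
}


def drop_from_peak(by_ubatch):
    """Return (peak_ubatch, {ubatch: percent below peak}) for one -ngl setting."""
    peak_ub = max(by_ubatch, key=by_ubatch.get)
    peak = by_ubatch[peak_ub]
    drops = {ub: round(100.0 * (1.0 - tps / peak), 1) for ub, tps in by_ubatch.items()}
    return peak_ub, drops


for ngl, by_ubatch in results.items():
    peak_ub, drops = drop_from_peak(by_ubatch)
    print(f"ngl={ngl}: peak at ub={peak_ub}")
    for ub, pct in drops.items():
        print(f"  ub={ub:>4}: {by_ubatch[ub]:7.2f} t/s ({pct:4.1f}% below peak)")

# Ratio of -ngl 0 to -ngl 999 throughput at each ubatch size (> 1 means -ngl 0 is faster).
ratios = {ub: round(results[0][ub] / results[999][ub], 2) for ub in results[0]}
print("ngl 0 / ngl 999:", ratios)
```

With these numbers, both settings peak at `-ub 512`, and `-ub 4096` ends up about 43.8% (`-ngl 0`) and 41.2% (`-ngl 999`) below the respective peak. The ratio stays above 1 for every ubatch size, which is what I mean by Vulkan being slower than CPU here.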


### First Bad Commit

_No response_

### Relevant log output

_No response_
