
vulkan: For coopmat2 FA, use fp16 accumulators for the final result #19376

Merged
0cc4m merged 1 commit into ggml-org:master from jeffbolznv:fa_fp16_acc on Feb 6, 2026

Conversation

@jeffbolznv (Collaborator)

The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change does the same for Vulkan. It helps particularly with large head sizes, which are very register-limited.
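
Why fp16 is workable for the VKQ accumulator can be seen with a toy numpy sketch (an illustration, not the actual shader code; all names below are made up): the softmax weights are non-negative and sum to 1, so the accumulated output is a convex combination of the V rows and never exceeds V's own value range. The cost of the narrower accumulator is rounding precision, not dynamic range.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
p = rng.random(n)
p /= p.sum()                      # softmax-like weights: non-negative, sum to 1
v = rng.standard_normal(n)        # one "column" of V

# Accumulate p.v entirely in fp16, rounding after every step, the way a
# register-resident fp16 accumulator would.
out16 = np.float16(0.0)
for pi, vi in zip(p.astype(np.float16), v.astype(np.float16)):
    out16 = np.float16(out16 + np.float16(pi * vi))

ref = float((p * v).sum())        # fp64 reference
# The fp16 result stays bounded by max|v| and only loses rounding precision.
print(float(out16), ref)
```

The bound holds because sum(p_i * v_i) is at most max|v| * sum(p_i) = max|v|, so a convex combination cannot overflow fp16 as long as V itself fits.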

I tried this on the coopmat1 path and it slowed down slightly, so the change is limited to coopmat2. I didn't try the scalar path.

I applied the softmax bias that the CUDA backend uses to avoid fp16 overflow, although I was not able to reproduce the original bug without it.
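
The overflow that such a bias guards against is easy to see in isolation: fp16 tops out around 65504, so exp() of even moderate attention logits is infinite unless something keeps the exponents down. Below is a minimal numpy illustration of the general max-subtraction idea behind stable softmax (a sketch of the principle, not the CUDA backend's exact bias):

```python
import numpy as np

logits = np.array([20.0, 21.0, 22.0], dtype=np.float16)

# Naive: exp(20) ~ 4.9e8 blows past fp16's max finite value (~65504).
naive = np.exp(logits)            # every element overflows to inf

# Shifting by the running max keeps every exponent <= 0, so exp() lands
# in (0, 1] and fits comfortably in fp16. The softmax result is unchanged
# because the shift cancels between numerator and denominator.
shifted = np.exp(logits - logits.max())
softmax = shifted / shifted.sum()
print(naive, shifted, softmax)
```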

These performance results were gathered with this change stacked on #19309. The benefits show up mostly in GLM-4.7-Flash (head size 576/512) and Qwen3Next (head size 256):

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8340.27 ± 68.37 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3294.04 ± 16.42 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       2039.99 ± 3.83 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1470.53 ± 3.35 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1140.70 ± 1.43 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     10993.67 ± 85.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9387.68 ± 82.50 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8246.66 ± 76.87 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7243.18 ± 75.79 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6515.95 ± 61.29 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |    10427.49 ± 101.64 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7382.27 ± 68.95 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5631.59 ± 52.80 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4552.49 ± 22.11 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3822.58 ± 20.78 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4543.94 ± 98.19 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4106.37 ± 120.44 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3762.00 ± 116.71 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3476.71 ± 93.13 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3238.74 ± 85.30 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     12704.57 ± 58.66 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9245.79 ± 53.39 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7104.94 ± 250.26 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5758.54 ± 237.91 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4973.68 ± 218.60 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8558.60 ± 66.68 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3821.24 ± 27.20 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      2456.64 ± 10.55 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1807.55 ± 4.29 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1414.43 ± 2.79 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     11051.53 ± 91.93 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9459.82 ± 79.20 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8314.35 ± 61.61 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7395.40 ± 40.93 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6644.62 ± 35.54 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |    10454.31 ± 137.08 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7448.99 ± 74.99 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5690.54 ± 52.98 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4618.70 ± 25.03 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3876.84 ± 24.74 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4583.95 ± 47.99 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4206.75 ± 123.71 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3911.70 ± 129.70 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     3652.85 ± 104.32 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3425.53 ± 92.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     12738.05 ± 68.68 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9203.23 ± 62.93 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7007.06 ± 200.46 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5751.02 ± 172.84 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4952.62 ± 144.95 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner February 5, 2026 21:02
@github-actions bot added labels: Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) on Feb 5, 2026
@0cc4m 0cc4m merged commit 1946e46 into ggml-org:master Feb 6, 2026
72 of 73 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request on Feb 23, 2026.
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 2, 2026.