
vulkan: For coopmat2 FA, use fp16 accumulators for the final result #19376

Merged
0cc4m merged 1 commit into ggml-org:master from jeffbolznv:fa_fp16_acc on Feb 6, 2026

Conversation

@jeffbolznv (Collaborator)

The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change does the same for Vulkan. It helps particularly with large head sizes, which are very register-limited.
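
Why fp16 is workable for the VKQ accumulator can be seen with a toy numpy sketch (an illustration, not the actual shader code; all names below are made up): the softmax weights are non-negative and sum to 1, so the accumulated output is a convex combination of the V rows and never exceeds V's own value range. The cost of the narrower accumulator is rounding precision, not dynamic range.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
p = rng.random(n)
p /= p.sum()                      # softmax-like weights: non-negative, sum to 1
v = rng.standard_normal(n)        # one "column" of V

# Accumulate p.v entirely in fp16, rounding after every step, the way a
# register-resident fp16 accumulator would.
out16 = np.float16(0.0)
for pi, vi in zip(p.astype(np.float16), v.astype(np.float16)):
    out16 = np.float16(out16 + np.float16(pi * vi))

ref = float((p * v).sum())        # fp64 reference
# The fp16 result stays bounded by max|v| and only loses rounding precision.
print(float(out16), ref)
```

The bound holds because sum(p_i * v_i) is at most max|v| * sum(p_i) = max|v|, so a convex combination cannot overflow fp16 as long as V itself fits.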

I tried this on the coopmat1 path and it slowed down slightly, so the change is limited to coopmat2. I didn't try the scalar path.

I applied the softmax bias that the CUDA backend uses to avoid fp16 overflow, although I was not able to reproduce the original bug without it.
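
The overflow that such a bias guards against is easy to see in isolation: fp16 tops out around 65504, so exp() of even moderate attention logits is infinite unless something keeps the exponents down. Below is a minimal numpy illustration of the general max-subtraction idea behind stable softmax (a sketch of the principle, not the CUDA backend's exact bias):

```python
import numpy as np

logits = np.array([20.0, 21.0, 22.0], dtype=np.float16)

# Naive: exp(20) ~ 4.9e8 blows past fp16's max finite value (~65504).
naive = np.exp(logits)            # every element overflows to inf

# Shifting by the running max keeps every exponent <= 0, so exp() lands
# in (0, 1] and fits comfortably in fp16. The softmax result is unchanged
# because the shift cancels between numerator and denominator.
shifted = np.exp(logits - logits.max())
softmax = shifted / shifted.sum()
print(naive, shifted, softmax)
```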

These performance results were gathered with this change stacked on #19309. The benefits show up mostly in GLM-4.7-Flash (head size 576/512) and Qwen3Next (head size 256):

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8340.27 ± 68.37 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3294.04 ± 16.42 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       2039.99 ± 3.83 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1470.53 ± 3.35 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1140.70 ± 1.43 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     10993.67 ± 85.62 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9387.68 ± 82.50 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8246.66 ± 76.87 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7243.18 ± 75.79 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6515.95 ± 61.29 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |    10427.49 ± 101.64 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7382.27 ± 68.95 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5631.59 ± 52.80 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4552.49 ± 22.11 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3822.58 ± 20.78 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4543.94 ± 98.19 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4106.37 ± 120.44 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3762.00 ± 116.71 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3476.71 ± 93.13 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3238.74 ± 85.30 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     12704.57 ± 58.66 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9245.79 ± 53.39 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7104.94 ± 250.26 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5758.54 ± 237.91 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4973.68 ± 218.60 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8558.60 ± 66.68 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3821.24 ± 27.20 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      2456.64 ± 10.55 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1807.55 ± 4.29 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1414.43 ± 2.79 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |     11051.53 ± 91.93 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9459.82 ± 79.20 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8314.35 ± 61.61 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7395.40 ± 40.93 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6644.62 ± 35.54 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |    10454.31 ± 137.08 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7448.99 ± 74.99 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5690.54 ± 52.98 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4618.70 ± 25.03 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3876.84 ± 24.74 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4583.95 ± 47.99 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4206.75 ± 123.71 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3911.70 ± 129.70 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     3652.85 ± 104.32 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3425.53 ± 92.90 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |     12738.05 ± 68.68 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9203.23 ± 62.93 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7007.06 ± 200.46 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5751.02 ± 172.84 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4952.62 ± 144.95 |

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner February 5, 2026 21:02
@github-actions bot added labels: Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning) on Feb 5, 2026
@0cc4m 0cc4m merged commit 1946e46 into ggml-org:master Feb 6, 2026
72 of 73 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request on Feb 23, 2026.
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request on Mar 2, 2026.