
vulkan: make FA mask/softcap enables spec constants #19309

Merged
0cc4m merged 3 commits into ggml-org:master from jeffbolznv:fa_spec_const_flags
Feb 6, 2026

Conversation


jeffbolznv (Collaborator) commented Feb 3, 2026

This is stacked on #19281 (merged).

This allows the compiler to do a bit better at overlapping loads and math (e.g. loading V can start while the Q*K^T computation is still in progress). It's worth a couple percent for coopmat2, less for coopmat1/scalar.
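For context, Vulkan specialization constants are values baked into a pipeline at compile time (via `VkSpecializationInfo` at pipeline creation) rather than read at runtime, so the driver's compiler can dead-code-eliminate the disabled paths and reorder memory accesses across them. A minimal sketch of the idea on the shader side, with hypothetical names rather than the actual llama.cpp shader code:

```glsl
#version 450

// Specialization constants: the defaults below are overridden per-pipeline
// through VkSpecializationInfo when the compute pipeline is created.
layout (constant_id = 0) const bool ENABLE_MASK    = false;
layout (constant_id = 1) const bool ENABLE_SOFTCAP = false;

// ... Q*K^T accumulation elided ...

float apply_softcap_and_mask(float s, float mask_val, float softcap) {
    // Because ENABLE_* are compile-time constants, branches that are
    // disabled for a given pipeline are removed entirely, which frees the
    // compiler to hoist loads (e.g. of V) above this code.
    if (ENABLE_SOFTCAP) {
        s = softcap * tanh(s / softcap);
    }
    if (ENABLE_MASK) {
        s += mask_val;
    }
    return s;
}
```

The trade-off is that each enable combination becomes a distinct pipeline variant, increasing the number of shader compiles, which is what the CI timeout discussion below is about.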

before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8396.02 ± 81.63 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3221.30 ± 20.40 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       1989.70 ± 3.67 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1426.74 ± 1.67 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1102.89 ± 1.15 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |    10954.70 ± 173.33 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9356.51 ± 104.58 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8217.04 ± 75.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7279.99 ± 60.94 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6558.87 ± 43.19 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |     10510.00 ± 90.81 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7342.35 ± 87.09 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5577.50 ± 42.00 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4494.34 ± 18.09 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3753.69 ± 20.90 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4582.17 ± 17.28 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4057.73 ± 150.24 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3684.80 ± 115.32 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      3375.87 ± 90.14 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3115.29 ± 82.11 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |      12876.33 ± 7.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9110.98 ± 340.42 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7155.14 ± 252.22 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5736.91 ± 216.05 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     4896.37 ± 190.76 |

after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -p 512 -n 0 -d 0-32768+8192 -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q4_K_M.gguf -m c:\models\Qwen3-Next-80B-A3B-Instruct-Q2_K_L.gguf -m c:\models\llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8383.60 ± 84.99 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3296.14 ± 17.41 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |       2049.17 ± 2.63 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |       1477.91 ± 2.57 |
| deepseek2 30B.A3B Q4_K - Medium |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |       1147.37 ± 2.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           pp512 |    11072.72 ± 108.27 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9449.75 ± 55.99 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      8265.75 ± 80.46 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      7300.51 ± 42.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      6568.90 ± 36.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |           pp512 |     10492.97 ± 97.52 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7406.26 ± 85.21 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |      5654.28 ± 49.88 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |      4561.42 ± 45.52 |
| qwen3moe 30B.A3B Q4_K - Medium |  17.35 GiB |    30.53 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3835.78 ± 22.02 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |           pp512 |      4587.46 ± 17.63 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     4098.24 ± 139.56 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     3749.66 ± 121.45 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     3472.26 ± 103.67 |
| qwen3next 80B.A3B Q2_K - Medium |  27.23 GiB |    79.67 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |      3232.42 ± 92.36 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           pp512 |    12715.10 ± 144.59 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     9253.01 ± 138.89 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d16384 |     7211.87 ± 238.40 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d24576 |     5783.65 ± 236.97 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |  pp512 @ d32768 |     5019.27 ± 187.56 |

@github-actions github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Feb 4, 2026
jeffbolznv (Collaborator, Author)

The lavapipe CI job timed out due to the increased number of shader compiles, and the compiler is pretty slow. I've disabled specializing for sinks, hopefully that will resolve it.

@jeffbolznv jeffbolznv requested a review from CISC as a code owner February 6, 2026 00:08
@github-actions github-actions bot added the devops improvements to build systems and github actions label Feb 6, 2026
jeffbolznv (Collaborator, Author)

Had to bump the timeout, but #19381 ought to get the runtime back under control.

0cc4m (Contributor) left a comment:

LGTM

@0cc4m 0cc4m merged commit f9bd518 into ggml-org:master Feb 6, 2026
76 of 78 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit