vulkan: skip all-negative-inf blocks in FA by jeffbolznv · Pull Request #17186 · ggml-org/llama.cpp

jeffbolznv · 2025-11-12T03:28:23Z

Overhead for this check generally seems low enough. Perf for just one parallel prompt in llama-batched-bench is a little lower for coopmat2 mode, but I think it's OK. The scalar path is slower due to increased register usage decreasing occupancy. I've filed an internal bug about that, but I'm not too worried about it since I think it only affects Blackwell and the scalar path isn't used for prompt processing on NVIDIA. Would be good to spot check perf on other HW.

5090 before coopmat2:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-batched-bench -c 33792 -npp 8192 -ntg 32 -npl 1,1,2,4 -kvu -tgs -m c:\models\llama-2-7b.Q4_0.gguf
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    0.826 |  9917.77 |    0.209 |   152.82 |    1.035 |  7942.94 |
|  8192 |     32 |    1 |   8224 |    0.809 | 10131.14 |    0.211 |   151.90 |    1.019 |  8068.61 |
|  8192 |     32 |    2 |  16448 |    1.924 |  8517.56 |    0.600 |   106.66 |    2.524 |  6517.75 |
|  8192 |     32 |    4 |  32896 |    5.155 |  6356.68 |    1.922 |    66.61 |    7.076 |  4648.66 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |    11713.01 ± 216.42 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        247.83 ± 2.29 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |     10245.94 ± 83.15 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       190.84 ± 13.32 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |      6818.92 ± 17.11 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        130.91 ± 3.12 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    45748.92 ± 510.80 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       879.14 ± 21.27 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    42941.17 ± 990.40 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       863.41 ± 12.72 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    22169.27 ± 312.51 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       399.25 ± 26.85 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |     8319.70 ± 117.59 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       296.94 ± 11.45 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |    12282.77 ± 191.30 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       261.90 ± 17.92 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |     10212.89 ± 66.02 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        336.76 ± 2.97 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |    19254.03 ± 454.43 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        370.89 ± 2.31 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |     12124.74 ± 54.74 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        277.86 ± 2.31 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    24753.62 ± 582.20 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        333.44 ± 2.63 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |      9501.88 ± 38.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        317.41 ± 4.00 |

5090 after coopmat2:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    0.843 |  9722.47 |    0.210 |   152.15 |    1.053 |  7810.82 |
|  8192 |     32 |    1 |   8224 |    0.831 |  9857.67 |    0.209 |   152.84 |    1.040 |  7904.65 |
|  8192 |     32 |    2 |  16448 |    1.830 |  8952.41 |    0.487 |   131.50 |    2.317 |  7099.41 |
|  8192 |     32 |    4 |  32896 |    4.408 |  7433.93 |    1.416 |    90.43 |    5.823 |  5648.91 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |    11726.87 ± 107.79 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        245.00 ± 3.03 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |     10159.06 ± 31.87 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        206.71 ± 2.26 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |    6175.92 ± 1404.49 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        133.97 ± 2.95 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    46120.87 ± 508.62 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       873.73 ± 10.36 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    44772.80 ± 320.50 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       878.28 ± 15.55 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    21936.68 ± 212.78 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       401.81 ± 24.34 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |      8210.71 ± 66.54 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        303.84 ± 2.19 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |     12316.43 ± 52.11 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       259.34 ± 18.32 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |     10174.14 ± 48.64 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        338.42 ± 4.39 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |     19734.46 ± 92.10 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        362.71 ± 0.58 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |     12013.46 ± 58.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        277.91 ± 2.07 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    23821.65 ± 925.15 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        329.21 ± 2.53 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |      9660.46 ± 34.37 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        322.61 ± 3.32 |

5090 before coopmat1:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    1.892 |  4328.91 |    0.207 |   154.23 |    2.100 |  3916.43 |
|  8192 |     32 |    1 |   8224 |    1.544 |  5305.27 |    0.209 |   153.26 |    1.753 |  4691.60 |
|  8192 |     32 |    2 |  16448 |    4.409 |  3715.78 |    0.599 |   106.88 |    5.008 |  3284.27 |
|  8192 |     32 |    4 |  32896 |   15.162 |  2161.18 |    1.899 |    67.41 |   17.061 |  1928.15 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |      6227.34 ± 20.87 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        236.64 ± 2.48 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |       6420.06 ± 7.87 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        194.95 ± 6.14 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |     3252.99 ± 637.05 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        126.30 ± 4.36 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    33098.32 ± 969.77 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       855.47 ± 17.52 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    31363.47 ± 111.65 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       826.44 ± 14.79 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    12565.52 ± 122.84 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        379.76 ± 1.76 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |     5761.62 ± 752.47 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        282.50 ± 3.72 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |      9013.99 ± 32.64 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       247.41 ± 12.91 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |     4367.86 ± 470.42 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        344.86 ± 2.23 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |    11427.22 ± 114.52 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        375.72 ± 2.63 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |      7684.84 ± 28.68 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        279.58 ± 1.92 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |     15359.25 ± 61.22 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        297.09 ± 3.31 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |    6124.06 ± 1105.60 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        314.31 ± 5.58 |

5090 after coopmat1:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    1.597 |  5128.29 |    0.209 |   152.91 |    1.807 |  4551.98 |
|  8192 |     32 |    1 |   8224 |    1.599 |  5122.66 |    0.211 |   151.68 |    1.810 |  4543.31 |
|  8192 |     32 |    2 |  16448 |    3.459 |  4737.24 |    0.482 |   132.74 |    3.941 |  4173.89 |
|  8192 |     32 |    4 |  32896 |    8.098 |  4046.34 |    1.376 |    92.99 |    9.475 |  3472.00 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |      6207.47 ± 24.88 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        223.58 ± 8.69 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |      6456.57 ± 29.85 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       192.89 ± 14.25 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |     3236.64 ± 576.79 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        115.69 ± 5.45 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    32704.97 ± 851.76 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       868.28 ± 11.19 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    31387.72 ± 153.66 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       857.09 ± 10.03 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |     12649.38 ± 54.76 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        381.65 ± 2.53 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |     5888.24 ± 550.18 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        284.26 ± 1.23 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |      8810.47 ± 43.06 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        250.48 ± 0.46 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |     4862.33 ± 590.61 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        342.24 ± 1.86 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |     11332.36 ± 33.58 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        354.53 ± 1.29 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |       7660.24 ± 9.67 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        282.37 ± 1.57 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |     15468.97 ± 44.77 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        295.88 ± 2.10 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |    6134.28 ± 1145.42 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        314.93 ± 3.67 |

5090 before scalar:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.668 |  2233.22 |    0.212 |   150.80 |    3.880 |  2119.34 |
|  8192 |     32 |    1 |   8224 |    3.320 |  2467.22 |    0.211 |   151.56 |    3.531 |  2328.77 |
|  8192 |     32 |    2 |  16448 |    8.812 |  1859.28 |    0.597 |   107.22 |    9.409 |  1748.13 |
|  8192 |     32 |    4 |  32896 |   27.353 |  1197.96 |    1.928 |    66.39 |   29.281 |  1123.45 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |       3106.11 ± 2.90 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        234.78 ± 2.32 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |     2381.75 ± 914.89 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |       187.27 ± 13.61 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |       1832.14 ± 4.81 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        129.32 ± 2.24 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    14491.90 ± 430.14 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       862.53 ± 10.85 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |     16244.53 ± 57.14 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        843.42 ± 7.81 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |      6741.59 ± 10.88 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        379.44 ± 2.52 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |       2631.63 ± 7.71 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        282.23 ± 1.69 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |       3707.85 ± 9.91 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       246.72 ± 14.49 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |       3454.65 ± 3.52 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        344.46 ± 1.69 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |      6455.76 ± 10.87 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        374.60 ± 1.39 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |       3265.22 ± 4.16 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        280.88 ± 1.78 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |      6777.82 ± 11.08 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        312.62 ± 3.42 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |     4274.63 ± 966.74 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        313.96 ± 3.14 |

5090 after scalar:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    4.295 |  1907.44 |    0.302 |   106.03 |    4.597 |  1789.16 |
|  8192 |     32 |    1 |   8224 |    4.294 |  1907.65 |    0.212 |   150.74 |    4.507 |  1824.89 |
|  8192 |     32 |    2 |  16448 |    9.214 |  1778.22 |    0.482 |   132.74 |    9.696 |  1696.39 |
|  8192 |     32 |    4 |  32896 |   21.068 |  1555.35 |    1.375 |    93.12 |   22.442 |  1465.79 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |       2885.83 ± 2.53 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        225.24 ± 9.67 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |     2339.63 ± 655.47 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        193.55 ± 3.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |       1671.47 ± 2.97 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        128.03 ± 2.38 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |     13351.98 ± 45.68 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |        823.45 ± 5.73 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    14299.19 ± 387.03 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |      729.88 ± 112.83 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |      6031.14 ± 10.79 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        377.33 ± 2.82 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |     1767.83 ± 799.92 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        273.76 ± 1.73 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |       3418.01 ± 6.28 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |        246.94 ± 2.85 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |     3600.95 ± 280.90 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        343.15 ± 3.83 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |       5472.39 ± 7.40 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |        379.15 ± 3.65 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |       2980.30 ± 4.99 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        280.55 ± 3.76 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |      6047.80 ± 15.17 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        307.32 ± 1.33 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |     4069.56 ± 796.52 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        305.23 ± 2.57 |

ggerganov · 2025-11-12T15:55:07Z

Curious if you apply this patch, how do the numbers look:

diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index c1ed1a21c..25916b284 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2220,7 +2220,7 @@ extern "C" {
             struct ggml_tensor  * a,
             int                   k);
 
-#define GGML_KQ_MASK_PAD 64
+#define GGML_KQ_MASK_PAD 1
 
     // q:    [n_embd_k, n_batch,     n_head,    ne3 ]
     // k:    [n_embd_k, n_kv,        n_head_kv, ne3 ]

jeffbolznv · 2025-11-12T16:55:47Z

Curious if you apply this patch, how do the numbers look:

Helps for token gen with the sizes used in llama-batched-bench, looks like noise for the smaller sizes used in llama-bench. I only ran the coopmat2 path:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    0.840 |  9757.28 |    0.202 |   158.35 |    1.042 |  7895.03 |
|  8192 |     32 |    1 |   8224 |    0.839 |  9758.25 |    0.200 |   159.93 |    1.040 |  7910.83 |
|  8192 |     32 |    2 |  16448 |    1.834 |  8934.10 |    0.452 |   141.75 |    2.285 |  7197.06 |
|  8192 |     32 |    4 |  32896 |    4.446 |  7369.56 |    1.262 |   101.44 |    5.708 |  5762.87 |

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 2048 -r 10 --prio 1 -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -m c:\models\DeepSeek-R1-Distill-Llama-8B-Q6_K.gguf -m c:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m c:\models\Llama-3.2-1B.Q2_K.gguf -m c:\models\Llama-3.2-1B.Q3_K_S.gguf -m c:\models\llama-3.2-3b-instruct-q5_k_m.gguf -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\Qwen2.5-7B-Instruct-1M-Q2_K.gguf  -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -m c:\models\Phi-3-mini-4k-instruct-q4.gguf -m c:\models\llama-2-7b.Q4_0.gguf -m c:\models\llama-3.2-3b-instruct-q8_0.gguf -m c:\models\gpt-oss-20b-mxfp4.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |    11742.60 ± 150.29 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        248.29 ± 2.20 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |          pp2048 |     10117.71 ± 90.70 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | Vulkan     |  99 |  1 |           tg128 |        207.24 ± 2.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |          pp2048 |     6626.88 ± 116.31 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |           tg128 |        132.97 ± 2.97 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    45930.32 ± 501.58 |
| llama 1B Q2_K - Medium         | 546.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       887.56 ± 16.68 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |          pp2048 |    43292.71 ± 412.44 |
| llama 1B Q3_K - Small          | 604.50 MiB |     1.24 B | Vulkan     |  99 |  1 |           tg128 |       884.13 ± 13.02 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    21474.01 ± 358.20 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |       397.43 ± 28.07 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |          pp2048 |      8078.10 ± 44.23 |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        295.89 ± 9.86 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |          pp2048 |    12187.26 ± 142.23 |
| qwen2 7B Q2_K - Medium         |   2.80 GiB |     7.62 B | Vulkan     |  99 |  1 |           tg128 |       258.43 ± 17.79 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |          pp2048 |     10135.21 ± 76.11 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       318.64 ± 26.88 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |          pp2048 |    18968.53 ± 213.26 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |           tg128 |       364.75 ± 16.75 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |          pp2048 |    11757.87 ± 141.64 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |  1 |           tg128 |        279.98 ± 3.36 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |          pp2048 |    23181.75 ± 787.68 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |           tg128 |        322.32 ± 5.66 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |          pp2048 |      9590.61 ± 21.77 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     |  99 |  1 |           tg128 |        316.00 ± 4.04 |

0cc4m · 2025-11-15T09:14:06Z

Here are batched bench results from my devices. I used a Llama3 8B Q4_0, because llama-2 KV grows too big (lack of GQA I think). I only did llama-bench when llama-batched-bench looked like a possible regression with b=1. Looks very good.

AMD Radeon 8060S

Before:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	16.390	499.80	0.960	33.35	17.350	474.01
8192	32	1	8224	17.177	476.91	0.962	33.26	18.139	453.38
8192	32	2	16448	44.992	364.15	2.319	27.59	47.311	347.65
8192	32	4	32896	131.786	248.65	6.348	20.17	138.134	238.15

After:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	16.612	493.15	0.950	33.68	17.562	468.29
8192	32	1	8224	17.421	470.23	0.956	33.49	18.377	447.52
8192	32	2	16448	36.297	451.38	2.150	29.76	38.448	427.80
8192	32	4	32896	80.446	407.33	5.538	23.11	85.984	382.58

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q2_K - Medium	2.95 GiB	8.03 B	99	1	pp512	560.10 ± 49.34	596.03 ± 21.69	+6.4%
llama 8B Q2_K - Medium	2.95 GiB	8.03 B	99	1	tg128	52.32 ± 0.60	54.08 ± 0.80	+3.4%
llama 8B Q3_K - Small	3.41 GiB	8.03 B	99	1	pp512	446.40 ± 6.85	471.26 ± 4.79	+5.6%
llama 8B Q3_K - Small	3.41 GiB	8.03 B	99	1	tg128	46.74 ± 0.23	47.70 ± 0.10	+2.1%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	1	pp512	431.56 ± 7.06	455.39 ± 2.08	+5.5%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	1	tg128	42.74 ± 0.11	42.99 ± 0.09	+0.6%
granitehybrid 1B Q4_K - Small	3.75 GiB	6.94 B	99	1	pp512	1181.00 ± 42.51	1201.87 ± 38.25	+1.8%
granitehybrid 1B Q4_K - Small	3.75 GiB	6.94 B	99	1	tg128	90.03 ± 0.53	90.75 ± 1.06	+0.8%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	590.50 ± 19.44	605.10 ± 15.48	+2.5%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	88.30 ± 0.23	88.83 ± 0.37	+0.6%
gpt-oss 20B Q8_0	11.27 GiB	20.91 B	99	1	pp512	880.80 ± 38.55	920.28 ± 39.34	+4.5%
gpt-oss 20B Q8_0	11.27 GiB	20.91 B	99	1	tg128	70.19 ± 0.23	69.27 ± 0.44	-1.3%

AMD Radeon Pro VII

Before:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	27.250	300.62	0.579	55.23	27.830	295.51
8192	32	1	8224	27.527	297.60	0.586	54.60	28.113	292.53
8192	32	2	16448	97.914	167.33	1.602	39.95	99.515	165.28
8192	32	4	32896	389.309	84.17	5.134	24.93	394.443	83.40

After:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	26.646	307.44	0.590	54.19	27.237	301.95
8192	32	1	8224	27.424	298.71	0.590	54.19	28.015	293.56
8192	32	2	16448	59.197	276.77	1.382	46.30	60.580	271.51
8192	32	4	32896	135.309	242.17	3.959	32.33	139.268	236.21

Nvidia RTX 3090 (coopmat2)

Before:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	15.871	516.15	0.339	94.43	16.210	507.34
8192	32	1	8224	2.373	3452.87	0.283	113.14	2.655	3097.13
8192	32	2	16448	5.512	2972.39	0.657	97.40	6.169	2666.18
8192	32	4	32896	14.067	2329.35	1.729	74.02	15.797	2082.47

After:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	16.050	510.40	0.352	90.96	16.402	501.40
8192	32	1	8224	2.387	3432.42	0.286	112.08	2.672	3077.65
8192	32	2	16448	5.168	3170.16	0.648	98.70	5.817	2827.75
8192	32	4	32896	12.083	2711.89	1.636	78.26	13.719	2397.91

Nvidia RTX 3090 (coopmat1)

Before:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	4.589	1784.98	0.361	88.75	4.950	1661.43
8192	32	1	8224	3.493	2345.27	0.298	107.40	3.791	2169.39
8192	32	2	16448	9.283	1764.98	0.712	89.89	9.995	1645.66
8192	32	4	32896	28.171	1163.17	1.890	67.74	30.061	1094.31

After:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	4.909	1668.74	0.360	88.96	5.269	1560.88
8192	32	1	8224	3.826	2141.37	0.298	107.35	4.124	1994.34
8192	32	2	16448	8.491	1929.54	0.671	95.44	9.162	1795.30
8192	32	4	32896	20.535	1595.75	1.717	74.54	22.252	1478.36

model	size	params	ngl	fa	test	t/s (before)	t/s (after)	diff
llama 8B Q2_K - Medium	2.95 GiB	8.03 B	99	1	pp512	2911.46 ± 9.53	2969.60 ± 20.05	+2.0%
llama 8B Q2_K - Medium	2.95 GiB	8.03 B	99	1	tg128	115.90 ± 0.14	116.71 ± 0.23	+0.7%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	1	pp512	2621.95 ± 16.56	2638.37 ± 53.66	+0.6%
llama 8B Q4_K - Small	4.36 GiB	8.03 B	99	1	tg128	120.89 ± 0.55	122.10 ± 0.37	+1.0%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	pp512	3849.82 ± 44.77	3901.82 ± 17.48	+1.4%
llama 8B Q4_0	4.33 GiB	8.03 B	99	1	tg128	144.43 ± 0.26	144.91 ± 0.45	+0.3%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	pp512	3549.37 ± 23.35	3615.08 ± 8.49	+1.9%
llama 8B Q4_1	4.77 GiB	8.03 B	99	1	tg128	136.14 ± 0.24	136.55 ± 0.26	+0.3%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	pp512	3170.79 ± 15.15	3214.66 ± 28.74	+1.4%
llama 8B Q8_0	7.95 GiB	8.03 B	99	1	tg128	93.47 ± 0.06	93.59 ± 0.08	+0.1%
granitehybrid 1B Q4_0	3.73 GiB	6.94 B	99	1	pp512	3890.42 ± 507.68	4144.95 ± 121.58	+6.5%
granitehybrid 1B Q4_0	3.73 GiB	6.94 B	99	1	tg128	148.18 ± 2.10	150.20 ± 2.57	+1.4%
granitehybrid 1B Q8_0	6.88 GiB	6.94 B	99	1	pp512	4142.59 ± 439.96	4363.54 ± 55.45	+5.3%
granitehybrid 1B Q8_0	6.88 GiB	6.94 B	99	1	tg128	136.89 ± 0.43	138.47 ± 0.41	+1.2%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	pp512	2495.65 ± 9.67	2521.34 ± 11.70	+1.0%
qwen3moe 30B.A3B Q2_K - Medium	10.48 GiB	30.53 B	99	1	tg128	160.37 ± 0.47	161.80 ± 0.52	+0.9%
gpt-oss 20B Q8_0	11.27 GiB	20.91 B	99	1	pp512	2960.32 ± 27.46	3007.36 ± 23.10	+1.6%
gpt-oss 20B Q8_0	11.27 GiB	20.91 B	99	1	tg128	162.65 ± 0.26	164.17 ± 0.22	+0.9%

Nvidia RTX 3090 (no coopmat)

Before:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	7.084	1156.45	0.367	87.13	7.451	1103.74
8192	32	1	8224	6.580	1245.00	0.297	107.56	6.877	1195.80
8192	32	2	16448	18.382	891.29	0.697	91.81	19.080	862.08
8192	32	4	32896	57.862	566.31	1.819	70.38	59.681	551.20

After:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	7.045	1162.74	0.365	87.73	7.410	1109.82
8192	32	1	8224	6.681	1226.12	0.302	105.99	6.983	1177.70
8192	32	2	16448	14.722	1112.86	0.693	92.32	15.416	1066.97
8192	32	4	32896	34.462	950.83	1.761	72.67	36.224	908.13

vulkan: skip all-negative-inf blocks in FA

db6ac9d

jeffbolznv requested a review from 0cc4m as a code owner November 12, 2025 03:28

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Nov 12, 2025

DajanaV mentioned this pull request Nov 12, 2025

UPSTREAM PR #17186: vulkan: skip all-negative-inf blocks in FA auroralabs-loci/llama.cpp#177

Closed

0cc4m approved these changes Nov 15, 2025

View reviewed changes

0cc4m merged commit 234ae7d into ggml-org:master Nov 15, 2025
71 of 72 checks passed

alan-l mentioned this pull request Nov 17, 2025

Misc. bug: Vulkan\Llama-server.exe (b7064+) hangs during prompt processing if "--flash-attn on" #17297

Closed

virajwad mentioned this pull request Nov 19, 2025

Eval bug: Vulkan - Gemma3n-E2B-Q4_K_M model crashes llama-cli during evaluation [Intel IGPU] #17389

Closed

ddpasa mentioned this pull request Nov 22, 2025

Eval bug: Vulkan issues starting with 439342ea0be347ff279ec204719794df3b3108f6 #17438

Open

Acly mentioned this pull request Nov 29, 2025

vulkan : fix FA mask load with bounds check (coopmat2) #17606

Merged

ggerganov mentioned this pull request Dec 8, 2025

Misc. bug: kv_unified = true despite not setting --kv-unified #17450

Closed

ggerganov mentioned this pull request Dec 22, 2025

CUDA: skip masked KV slices for all FA kernels #14924

Merged

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

vulkan: skip all-negative-inf blocks in FA (ggml-org#17186)

bb86db8

ggerganov mentioned this pull request Feb 3, 2026

vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. #19281

Merged

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

vulkan: skip all-negative-inf blocks in FA (#17186)

d22092d

ggerganov mentioned this pull request Feb 12, 2026

Misc. bug: llama-server with -kvu and --parallel 4 slows down tg with more inactive slots #19523

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vulkan: skip all-negative-inf blocks in FA#17186

vulkan: skip all-negative-inf blocks in FA#17186
0cc4m merged 1 commit intoggml-org:masterfrom
jeffbolznv:fa_inf_mask

jeffbolznv commented Nov 12, 2025

Uh oh!

ggerganov commented Nov 12, 2025

Uh oh!

jeffbolznv commented Nov 12, 2025

Uh oh!

0cc4m commented Nov 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jeffbolznv commented Nov 12, 2025

Uh oh!

ggerganov commented Nov 12, 2025

Uh oh!

jeffbolznv commented Nov 12, 2025

Uh oh!

0cc4m commented Nov 15, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants