Misc. bug: Vulkan's performance degradation(TG) on A770 from b7194 and FA problem #17628

@savvadesogle

Name and Version

llama-cli -v
load_backend: loaded RPC backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\llm\llama-cpp\VULKAN\b7209\ggml-cpu-haswell.dll
build: 7209 (7f8ef50) with clang version 19.1.5 for x86_64-pc-windows-msvc

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-bench
llama-server

Command line

llama-bench -m T:\models\lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf -ngl 100 -fa 0,1

llama-bench -m T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 0,1

Problem description & steps to reproduce

Hello.

Token generation (TG) performance on the Intel Arc A770 dropped between builds b7189 and b7209. Prompt processing (pp512) is unchanged.

Driver: 8250
CPU: 2x Xeon 2699v3
GPU: 1x A770

Models (tg128, fa 0, b7189 -> b7209):
53.6 -> 42.4 t/s lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
61.0 -> 54.2 t/s lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf

b7209

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           pp512 |        921.00 ± 3.22 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           tg128 |         42.39 ± 0.07 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           pp512 |        280.12 ± 0.44 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           tg128 |         29.39 ± 0.02 |

build: 7f8ef50cc (7209)

With flash attention enabled, the benchmark sometimes crashes: llama-bench just stops without printing an error.

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           pp512 |        885.48 ± 5.59 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           tg128 |         54.20 ± 0.07 |
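
Since the flash attention run stops without printing anything, the process exit code may be the only clue about what happened. Below is a minimal sketch of a wrapper that captures the output and reports the exit code. The helper names (`run_and_report`, `describe_exit`) are hypothetical, and the command is simply the one from this report restricted to `-fa 1`; nothing here is part of llama.cpp.

```python
import subprocess
import sys

# Command from this report, limited to the flash attention case.
CMD = [
    "llama-bench",
    "-m", r"T:\models\lmstudio-community\gpt-oss-20b-GGUF\gpt-oss-20b-MXFP4.gguf",
    "-ngl", "100",
    "-fa", "1",
]


def run_and_report(cmd):
    """Run cmd and return (exit code, combined stdout/stderr text)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def describe_exit(code):
    """Turn a process exit code into a short human-readable note."""
    if code == 0:
        return "exited normally"
    if code < 0:
        # POSIX only: subprocess reports death by signal N as -N.
        return f"killed by signal {-code}"
    # On Windows a crash usually surfaces as a large status code,
    # so the hex form is printed as well.
    return f"exited with code {code} (0x{code & 0xFFFFFFFF:08X})"


def main():
    code, output = run_and_report(CMD)
    sys.stdout.write(output)
    print(describe_exit(code))
```

Calling `main()` a few times in a row and noting which runs end with a non-zero code would show whether the silent stop is an actual process crash or a clean exit.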

b7189

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           pp512 |        918.06 ± 4.28 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  0 |           tg128 |         53.63 ± 0.10 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           pp512 |        280.26 ± 0.70 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     | 100 |  1 |           tg128 |         34.50 ± 0.04 |

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           pp512 |        884.45 ± 5.87 |
| gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | Vulkan     | 100 |  0 |           tg128 |         60.95 ± 0.06 |
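
To put a number on the regression, the tg128 rows from the two builds can be compared directly. The sketch below parses llama-bench markdown output and prints the relative change per model and `fa` setting. The function names are hypothetical, the parser assumes the 8-column layout shown in this report, and the embedded rows are copied from the tables above with the padding trimmed.

```python
def parse_bench_table(text):
    """Parse llama-bench markdown rows into {(model, fa, test): mean t/s}."""
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip blank lines, header rows, and separator rows.
        if len(cells) != 8 or cells[0] == "model" or set(cells[0]) <= {"-", ":"}:
            continue
        model, fa, test = cells[0], int(cells[5]), cells[6]
        mean = float(cells[7].split("±")[0])  # drop the "± stddev" part
        results[(model, fa, test)] = mean
    return results


def percent_change(old, new):
    """Relative change from old to new, in percent (negative = slower)."""
    return (new - old) / old * 100.0


B7189 = """
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --: | --: |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 0 | tg128 | 53.63 ± 0.10 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 1 | tg128 | 34.50 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | tg128 | 60.95 ± 0.06 |
"""

B7209 = """
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --: | --: |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 0 | tg128 | 42.39 ± 0.07 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | Vulkan | 100 | 1 | tg128 | 29.39 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Vulkan | 100 | 0 | tg128 | 54.20 ± 0.07 |
"""

old = parse_bench_table(B7189)
new = parse_bench_table(B7209)

# Compare only the rows present in both builds.
changes = {key: percent_change(old[key], new[key]) for key in old if key in new}

for (model, fa, test), change in sorted(changes.items()):
    print(f"{model:<24} fa={fa} {test}: {change:+.1f}%")
```

On these figures the tg128 drop is about 21% for Llama 3.1 8B and about 11% for gpt-oss 20B with flash attention off, and about 15% for Llama 3.1 8B with flash attention on.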

First Bad Commit

Between b7189 and b7209
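
The range could be narrowed by bisecting over build numbers. This is a rough sketch only: `first_bad_build` and `ask_user` are hypothetical helpers, and it assumes that a binary for each build number in the range can be obtained or built, and that the slowdown is present in every build after the one that introduced it.

```python
def first_bad_build(good, bad, is_bad):
    """Return the lowest build number in (good, bad] for which is_bad is true.

    `good` must be a known-good build and `bad` a known-bad build.
    `is_bad` is any callable taking a build number and returning a bool.
    """
    lo, hi = good, bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression was introduced after mid
    return hi


def ask_user(build):
    """Manual predicate: run llama-bench on this build, then answer y or n."""
    answer = input(f"Is tg128 slow on b{build}? [y/n] ")
    return answer.strip().lower().startswith("y")
```

Calling `first_bad_build(7189, 7209, ask_user)` would ask about at most five builds before naming the first slow one.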

Labels

bug, need feedback, performance, regression
