Name and Version
7719
Operating systems
Windows, Linux
Which llama.cpp modules do you know to be affected?
No response
Command line
Problem description & steps to reproduce
Sorry if the title sounds harsh but this is my experience :( Don't want to be ungrateful in any way, just hoping this could be resolved/improved somehow.
Quick summary
Hardware specs:
- Ryzen 5 5600G
- 64 GB DDR4
- Intel Arc Pro B50 (Battlemage)
Issue:
When I use llama.cpp (tried both Vulkan and SYCL) for agentic coding, prompt processing drops to single-digit t/s and stalls within the agent's first couple of steps, no matter the config (fa, ub, b, different offloading configs, etc.).
Models I've tried: Devstral 2 small 24b, GPT-OSS 120b.
I realize that my hardware is limited and the models are pretty hefty for it, and I do not expect 5090-level performance. The main problem is that I basically can't use it at all, even if I leave it running overnight, because the agent literally hangs.
Details
Again, I'm not aiming for realtime performance, but it would be great if it at least somewhat worked. Once the context is filled even a bit (4k-8k tokens, at least in agentic use), processing almost stops.
What I tried already:
- Windows & Linux
- Updating to the latest devel kernel and Mesa (26.0)
- Different llama.cpp builds, including the latest one, which contains XE2 improvements
- FA on/off; ctk/ctv quantization off/on (q8_0)
- Different combinations of ub and b
- Different offloading techniques (none; only experts; fill as much as I can)
- Mainly Vulkan. I tried SYCL as well, but SYCL is totally broken for me: when the context fills a bit I get a black screen and OOM (I left 8 GB free on the GPU but it didn't help much)
Spotted quite a few observations:
- When benchmarking without FA, Devstral 24B gives around 300 t/s (pp); with FA enabled it drops to ~15-20. I thought I could simply run without FA, but that is not the case: in real agentic use it still stalls, even though the bench tells me to expect around 200 t/s of prompt processing at 4096 context. In reality it starts somewhere around 30 and drops to single digits after just a couple of agent iterations. The context is not even close to full at that point.
- When using GPT-OSS, the CPU is always used during PP, even with FA off. FA does not seem to have a measurable effect on GPT-OSS, but performance still drops to single digits quite early.
- When the context goes past 4k-8k tokens, GPU utilization starts spiking between 0 and 100%; the further it goes, the wider the gaps at zero become. It seems like something is not working for Intel here at all.
Unfortunately, I'm not familiar with shader development, so I won't be able to contribute code. I can do as much testing as needed though, so if anyone has ideas on how to debug this, I'm all yours. Thank you
llama-bench -m ~/LLMs/Models/Devstral-Small-2-24B-Instruct-2512-IQ4_XS.gguf -fa 0,1 -d 0,4096
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) Pro B50 Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 | 256.10 ± 2.01 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 | 5.69 ± 0.00 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 0 | pp512 @ d4096 | 214.98 ± 0.75 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 0 | tg128 @ d4096 | 5.46 ± 0.02 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 | 204.15 ± 1.81 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 | 5.63 ± 0.00 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 1 | pp512 @ d4096 | 15.31 ± 0.00 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan | 99 | 1 | tg128 @ d4096 | 3.68 ± 0.00 |
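To put numbers on the drop, here is a small sketch that computes slowdown factors from the pp512 results in the llama-bench output above. The values are copied from that run; the dictionary layout and the `slowdown` helper are just illustrative names, not part of llama.cpp.

```python
# pp512 throughput in tokens/s, copied from the llama-bench run above.
# Keys are (flash attention setting, context depth); the layout is illustrative.
pp512 = {
    ("fa=0", "d0"): 256.10,
    ("fa=0", "d4096"): 214.98,
    ("fa=1", "d0"): 204.15,
    ("fa=1", "d4096"): 15.31,
}


def slowdown(baseline: float, measured: float) -> float:
    """Return how many times slower `measured` is compared to `baseline`."""
    return baseline / measured


# Cost of a 4096-token context depth with FA disabled (roughly 1.19x).
depth_penalty_no_fa = slowdown(pp512[("fa=0", "d0")], pp512[("fa=0", "d4096")])

# Cost of the same depth with FA enabled (roughly 13.33x).
depth_penalty_fa = slowdown(pp512[("fa=1", "d0")], pp512[("fa=1", "d4096")])

# Cost of enabling FA at depth 4096 (roughly 14.04x).
fa_penalty_at_depth = slowdown(pp512[("fa=0", "d4096")], pp512[("fa=1", "d4096")])

print(f"depth penalty, FA off: {depth_penalty_no_fa:.2f}x")
print(f"depth penalty, FA on:  {depth_penalty_fa:.2f}x")
print(f"FA penalty at d4096:   {fa_penalty_at_depth:.2f}x")
```

So in this bench the FA path loses about 13x going from an empty context to 4096 tokens, while the non-FA path loses only about 1.2x. That isolates the FA regression, but it does not explain the stall I still see without FA in real agentic use.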
First Bad Commit
b7064
Relevant log output
Logs