Misc. bug: Totally broken for agentic use on Intel dGPUs #18808

@andreyzagoruy

Description

Name and Version

7719

Operating systems

Windows, Linux

Which llama.cpp modules do you know to be affected?

No response

Command line

Problem description & steps to reproduce

Sorry if the title sounds harsh, but this is my experience :( I don't want to be ungrateful in any way; I'm just hoping this can be resolved or improved somehow.

Quick summary

Hardware specs:

  • Ryzen 5 5600G
  • 64 GB DDR4
  • Intel Arc Pro B50 (Battlemage)

Issue:
When using llama.cpp for agentic coding (I tried both Vulkan and SYCL), prompt processing drops to single-digit tokens per second and stalls within the agent's first couple of steps. This happens regardless of configuration (FA, ubatch size, batch size, different offloading setups, etc.).

Models I've tried: Devstral Small 2 24B, GPT-OSS 120B.

I realize that my hardware is limited and these models are pretty hefty for it, and I do not expect 5090-level performance. The main problem is that I basically can't use it at all, even if I leave it running overnight, because the agent literally hangs.

Details

Again, I'm not aiming for realtime performance, but it would be great if it at least somewhat worked. Once the context fills even a little (4k-8k tokens, at least for agentic use), processing almost stops.

What I tried already:

  • Windows & Linux
  • Updating to the latest devel kernel and Mesa (26.0)
  • Different llama.cpp builds, including the latest one, which contains the XE2 improvements
  • FA on/off, ctk/ctv quantization off/on (q8_0)
  • Different combinations of ubatch size (ub) and batch size (b)
  • Different offloading techniques (none; only experts; fill as much as I can)
  • Mainly Vulkan. I tried SYCL as well, but SYCL is totally broken for me: when the context fills a bit I get a black screen and an OOM (leaving 8 GB free on the GPU didn't help much)
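For anyone who wants to reproduce the sweep over the settings listed above, here is a minimal sketch that assembles a single `llama-bench` invocation. It relies on `llama-bench` accepting comma-separated values for `-fa`, `-ub`, `-b` and `-d`. The function name `build_bench_command`, the model path `model.gguf` and the specific batch sizes and depths are placeholders chosen for illustration, not the exact values used in this report. KV cache quantization (`-ctk`/`-ctv`) is left out on purpose to keep the matrix small.

```python
import shlex


def build_bench_command(model_path, fa=(0, 1), ubatch=(128, 512),
                        batch=(512, 2048), depths=(0, 4096, 8192)):
    """Build one llama-bench command that sweeps several settings at once.

    All default values are illustrative assumptions, not the reporter's settings.
    """
    def csv(values):
        # llama-bench takes comma-separated lists and benchmarks each combination.
        return ",".join(str(v) for v in values)

    return [
        "llama-bench",
        "-m", model_path,
        "-fa", csv(fa),       # flash attention off/on
        "-ub", csv(ubatch),   # physical (micro) batch size
        "-b", csv(batch),     # logical batch size
        "-d", csv(depths),    # context depth already filled before the test
        "-o", "jsonl",        # machine-readable output, one result per line
    ]


cmd = build_bench_command("model.gguf")
print(shlex.join(cmd))
```

Printing the command instead of running it keeps the sketch safe to execute on a machine without llama.cpp installed; pass `cmd` to `subprocess.run` to actually launch the benchmark.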

Spotted quite a few observations:

  • When benchmarking without FA, Devstral 24B at least gives around 300 t/s prompt processing (pp); with FA enabled it drops to ~15-20 t/s. I thought I could simply run without FA, but that doesn't help: in real agentic use it still stalls, even though the benchmark suggests I should expect around 200 t/s at a 4096-token context. In reality, it starts at around 30 t/s and drops to single digits after just a couple of agent iterations. The context is not even close to full at that point.
  • With GPT-OSS, the CPU is always used during prompt processing, even with FA off. FA does not seem to have a measurable effect on GPT-OSS, but performance still drops to single digits quite early.
  • Once the context goes past 4k-8k tokens, GPU utilization starts oscillating between 0% and 100%, and the further it goes, the longer the idle (0%) gaps become. It seems like something is not working for Intel here at all.
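To put the observations above in perspective, here is a rough sketch of how long prompt processing takes at the throughputs mentioned. It assumes the whole context is processed at a constant rate with no prompt-cache reuse, and it uses 5 t/s as a stand-in for "single digits"; both are simplifying assumptions, not measurements.

```python
def prefill_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to process n_tokens of prompt at a constant throughput."""
    return n_tokens / tokens_per_second


# Rates: ~200 t/s (benchmark at depth 4096, FA off), ~30 t/s (start of a real
# agent run), 5 t/s (assumed representative of "single digits").
for rate in (200.0, 30.0, 5.0):
    for context in (4096, 8192):
        minutes = prefill_seconds(context, rate) / 60
        print(f"{context} tokens at {rate:g} t/s: {minutes:.1f} min")
```

Under these assumptions an 8k-token context takes well under a minute at benchmark speed but roughly 27 minutes at 5 t/s, which is consistent with the agent appearing to hang.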

Unfortunately, I'm not familiar with shader development, so I won't be able to contribute code. I can do as much testing as needed though, so if anyone has ideas on how to debug this, I'm all yours. Thank you.

llama-bench -m ~/LLMs/Models/Devstral-Small-2-24B-Instruct-2512-IQ4_XS.gguf -fa 0,1 -d 0,4096
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(tm) Pro B50 Graphics (BMG G21) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat

| model                          |      size |  params | backend | ngl | fa |          test |           t/s |
| ------------------------------ | --------: | ------: | ------- | --: | -: | ------------: | ------------: |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  0 |         pp512 | 256.10 ± 2.01 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  0 |         tg128 |   5.69 ± 0.00 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  0 | pp512 @ d4096 | 214.98 ± 0.75 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  0 | tg128 @ d4096 |   5.46 ± 0.02 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  1 |         pp512 | 204.15 ± 1.81 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  1 |         tg128 |   5.63 ± 0.00 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  1 | pp512 @ d4096 |  15.31 ± 0.00 |
| mistral3 14B IQ4_XS - 4.25 bpw | 11.89 GiB | 23.57 B | Vulkan  |  99 |  1 | tg128 @ d4096 |   3.68 ± 0.00 |
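The size of the drop is easier to see as ratios. The short sketch below recomputes them from the pp512 rows of the llama-bench output above; the numbers are copied from that output, while the dictionary layout and the helper name `slowdown` are only illustrative.

```python
# Prompt-processing (pp512) throughput in tokens/s, copied from the
# llama-bench output above. Keys are (flash attention flag, context depth).
pp = {
    (0, 0): 256.10,
    (0, 4096): 214.98,
    (1, 0): 204.15,
    (1, 4096): 15.31,
}


def slowdown(before: float, after: float) -> float:
    """How many times slower `after` is compared with `before`."""
    return before / after


depth_penalty_fa_off = slowdown(pp[(0, 0)], pp[(0, 4096)])
depth_penalty_fa_on = slowdown(pp[(1, 0)], pp[(1, 4096)])
fa_penalty_at_depth = slowdown(pp[(0, 4096)], pp[(1, 4096)])

print(f"FA off, depth 0 -> 4096: {depth_penalty_fa_off:.1f}x slower")
print(f"FA on,  depth 0 -> 4096: {depth_penalty_fa_on:.1f}x slower")
print(f"depth 4096, FA off -> on: {fa_penalty_at_depth:.1f}x slower")
```

With FA off, filling 4096 tokens of context costs about 1.2x in prompt-processing speed; with FA on, the same depth costs about 13.3x, which leaves FA on about 14.0x slower than FA off at that depth.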

First Bad Commit

b7064

Relevant log output

Metadata

    Labels

    bug, performance, regression
