vulkan: fix OOB check in flash_attn_mask_opt #20296

Merged
0cc4m merged 1 commit into ggml-org:master from jeffbolznv:fa_mask_opt_oob
Mar 12, 2026
Conversation

@jeffbolznv
Collaborator

Fixes #19955.

I saw a few percent slowdown with pp512 (which is too small to hit the aligned path on my system after this change), so I tweaked the use_mask_opt logic to hide it. I should look into spreading the work across more workgroups, but I don't have time for that today.

@el95149 this is different enough from the test change that it's probably worth retesting.
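To illustrate the two ideas discussed above, here is a minimal C++ sketch (not the actual Vulkan shader code from this PR; the function names, the `-INFINITY` convention for masked positions, and the alignment heuristic are all illustrative assumptions): a guarded mask load that never reads out of bounds, and a `use_mask_opt`-style gate that only takes the optimized path when the workload is large enough to hit the aligned fast path.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical guarded mask load: positions past the end of the mask are
// treated as fully masked (-inf) instead of being read out of bounds.
float load_mask(const std::vector<float> &mask,
                uint32_t row, uint32_t col,
                uint32_t n_rows, uint32_t n_cols) {
    if (row >= n_rows || col >= n_cols) {
        return -INFINITY;  // OOB -> masked out, no buffer access
    }
    return mask[row * n_cols + col];
}

// Hypothetical gate: enable the mask optimization only when the token
// count is a large, aligned workload; small prompts (e.g. pp512 on some
// systems) fall back to the plain path to avoid the slowdown.
bool use_mask_opt(uint32_t n_tokens, uint32_t alignment) {
    return n_tokens >= alignment && n_tokens % alignment == 0;
}
```

The OOB guard trades a branch per load for well-defined behavior at tile edges; the gate hides the optimization's fixed overhead on workloads too small to amortize it.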

@jeffbolznv jeffbolznv requested a review from 0cc4m as a code owner March 9, 2026 15:07
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 9, 2026
@el95149

el95149 commented Mar 9, 2026

@jeffbolznv Copy that, will retest and report back.

@el95149

el95149 commented Mar 9, 2026

@jeffbolznv Just managed to test your branch.

I did the following (leveraging both GPUs):

  • 10 rounds of pp30000/tg128 on Unsloth's Qwen3-Coder-Next-UD-Q4_K_M.
  • 10 rounds of pp30000/tg128 on Unsloth's Qwen3-Coder-Next-UD-Q64_K.

No errors whatsoever, all runs finished correctly.

@0cc4m 0cc4m merged commit aa429cf into ggml-org:master Mar 12, 2026
72 of 78 checks passed
@el95149

el95149 commented Mar 12, 2026

Thank you very much for that fix!

tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
Labels

ggml (changes relating to the ggml tensor library for machine learning), Vulkan (Issues specific to the Vulkan backend)

Development

Successfully merging this pull request may close these issues.

Eval bug: Vulkan llama-server crashes with vk::DeviceLostError randomly (long contexts)
