metal: add opt-in V skip for negligible attention weights by TheTom · Pull Request #21119 · ggml-org/llama.cpp

TheTom · 2026-03-28T13:09:46Z

Summary

Adds an opt-in optimization to the Metal flash attention vec kernel that
skips V dequantization and accumulation for positions where the post-softmax
attention weight is below 1e-6.

Reduces wasted dequant + FMA work when attention is sparse.

Gated by GGML_METAL_FA_SKIP_V=1 environment variable (default: off).
No behavior change unless explicitly enabled.
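
To make the idea concrete, here is a minimal CPU-side sketch of the skip, not the Metal kernel itself. It assumes the comparison is made against fully normalized softmax weights. The PR description does not show the exact kernel expression, and the function name `attend` and the sample data are hypothetical.

```python
import math

SKIP_THRESHOLD = 1e-6  # threshold value stated in the PR description


def attend(scores, values, skip_v=False, threshold=SKIP_THRESHOLD):
    """Single-query attention over a list of V rows (illustrative only).

    Returns (output_vector, number_of_skipped_positions).
    """
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = [0.0] * len(values[0])
    skipped = 0
    for e, v in zip(exps, values):
        w = e / denom                      # post-softmax attention weight
        if skip_v and w < threshold:
            skipped += 1                   # the real kernel would skip dequant + FMA here
            continue
        for i, x in enumerate(v):
            out[i] += w * x
    return out, skipped


# Hypothetical data: three dominant positions and three negligible ones.
scores = [12.0, 11.5, -8.0, -9.0, -10.0, 11.0]
values = [[float(i + j) for j in range(4)] for i in range(len(scores))]

ref, _ = attend(scores, values, skip_v=False)
fast, skipped = attend(scores, values, skip_v=True)
max_err = max(abs(a - b) for a, b in zip(ref, fast))
print(skipped, max_err)
```

In this toy case the three low-score positions are skipped and the output differs from the reference by less than 1e-7 per element.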

Changes

ggml-metal.metal: threshold check in quantized V accumulation loop
ggml-metal-device.m: env var gating + log message
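
The gating logic can be sketched as follows. This is an assumption about the behavior, not the Objective-C code in ggml-metal-device.m: the description only states that `GGML_METAL_FA_SKIP_V=1` enables the path and that it is off by default, so how other values are handled here is a guess, and `fa_skip_v_enabled` is a hypothetical name.

```python
import os


def fa_skip_v_enabled(env=None):
    """Return True only when GGML_METAL_FA_SKIP_V is exactly "1" (assumed parsing)."""
    env = os.environ if env is None else env
    # Unset or any other value leaves the optimization off, matching "default: off".
    return env.get("GGML_METAL_FA_SKIP_V", "0") == "1"


print(fa_skip_v_enabled({}))
print(fa_skip_v_enabled({"GGML_METAL_FA_SKIP_V": "1"}))
```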

Validation (Apple M5 Max, Qwen3.5-35B-A3B Q8_0)

Perplexity (wikitext-103, 32K context, 50 chunks):

SKIP_V=0: 7.0638 ± 0.021
SKIP_V=1: 7.0638 ± 0.021

Delta: 0.000

Decode (llama-bench, q8_0):

SKIP_V=0: 85.15 tok/s (tg128), 1142.93 tok/s (pp32768+tg128)
SKIP_V=1: 85.04 tok/s (tg128), 1144.78 tok/s (pp32768+tg128)

Within noise. No regression observed.
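
For reference, the relative differences behind "within noise" can be computed directly from the figures above. This is plain arithmetic on the reported numbers; the helper name is hypothetical.

```python
def rel_change_pct(off, on):
    """Percentage change of the SKIP_V=1 result relative to SKIP_V=0."""
    return (on - off) / off * 100.0


tg = rel_change_pct(85.15, 85.04)        # tg128
pp = rel_change_pct(1142.93, 1144.78)    # pp32768+tg128

print(f"tg128:          {tg:+.2f}%")     # -0.13%
print(f"pp32768+tg128:  {pp:+.2f}%")     # +0.16%
```

Both differences are under 0.2% and point in opposite directions.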

Scope

Metal only (no CUDA changes)
Applies to quantized V path in FA vec kernel
Default off (opt-in via env var)

Notes

All tested thresholds (1e-4 to 1e-8), including the chosen 1e-6, produced
identical PPL in this setup
Effect depends on V dequant cost; minimal difference observed for q8_0
Also tested on q4_0 KV cache with identical ON/OFF PPL
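
As a rough frame for the threshold sweep, the sketch below computes a loose worst-case bound on how much total attention weight could be dropped for one query. It assumes the threshold is compared against normalized weights that sum to 1, which the description does not confirm. It is an upper bound only, not a measurement; the PPL results above are the evidence for actual behavior.

```python
def worst_case_skipped_mass(n_positions, threshold):
    """Upper bound on the total softmax weight that can fall below the threshold.

    Each skipped weight is < threshold and there are at most n_positions of them;
    the total mass can never exceed 1.
    """
    return min(1.0, n_positions * threshold)


N_CTX = 32768  # 32K context, as in the validation run
for t in (1e-4, 1e-6, 1e-8):
    print(f"threshold={t:g}: skipped mass <= {worst_case_skipped_mass(N_CTX, t):.6f}")
```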

Skip V dequantization and accumulation for positions where post-softmax attention weight < 1e-6. Reduces wasted dequant + FMA work when attention is sparse. Gated by GGML_METAL_FA_SKIP_V env var (default: off). Tested on q8_0 at 32K context (50 chunks, wikitext-103): no measurable PPL change (7.0638 ON/OFF, CI ±0.021).

TheTom requested a review from a team as a code owner March 28, 2026 13:09

github-actions bot added the ggml and Apple Metal labels Mar 28, 2026

TheTom mentioned this pull request Mar 31, 2026 in: ggml : add CPU TurboQuant KV cache types (TBQ3_0 / TBQ4_0) #21089 (Open)


TheTom wants to merge 1 commit into ggml-org:master from TheTom:pr/fa-skip-negligible-v
