metal: add opt-in V skip for negligible attention weights #21119

Open
TheTom wants to merge 1 commit into ggml-org:master from TheTom:pr/fa-skip-negligible-v
TheTom commented Mar 28, 2026

Summary

Adds an opt-in optimization to the Metal flash attention vec kernel that
skips V dequantization and accumulation for positions where the post-softmax
attention weight is below 1e-6.

Reduces wasted dequant + FMA work when attention is sparse.

Gated by GGML_METAL_FA_SKIP_V=1 environment variable (default: off).
No behavior change unless explicitly enabled.
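
For illustration only, the gating could look roughly like the C++ sketch below. The helper name `fa_skip_v_enabled` and the rule that only the exact value `1` enables the feature are assumptions; the actual check lives in ggml-metal-device.m and may parse the variable differently.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not from the PR): returns true only when
// GGML_METAL_FA_SKIP_V is set to exactly "1". Unset or any other
// value leaves the optimization off, matching the default-off intent.
bool fa_skip_v_enabled() {
    const char* val = std::getenv("GGML_METAL_FA_SKIP_V");
    return val != nullptr && std::strcmp(val, "1") == 0;
}
```

Reading the variable once at device initialization, rather than per kernel launch, would keep the check off the hot path.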

Changes

  • ggml-metal.metal: threshold check in quantized V accumulation loop
  • ggml-metal-device.m: env var gating + log message
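
As a rough picture of the threshold check, here is a CPU reference sketch in C++ (not the Metal kernel). The function name, the row-major layout, and the assumption that V is already dequantized are all illustrative; in the real kernel the skipped branch is where the dequantization and FMA work would be avoided.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference; names are illustrative and not taken
// from ggml-metal.metal.
// weights: post-softmax attention weights, one per KV position.
// v:       row-major [n_pos x head_dim] values (assumed dequantized here).
// out:     receives the weighted sum over positions.
// Returns the number of positions that were skipped.
std::size_t accumulate_v(const std::vector<float>& weights,
                         const std::vector<float>& v,
                         std::size_t head_dim,
                         float threshold,
                         bool skip_enabled,
                         std::vector<float>& out) {
    out.assign(head_dim, 0.0f);
    std::size_t skipped = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const float w = weights[i];
        if (skip_enabled && w < threshold) {
            ++skipped;  // the kernel would skip dequant + FMA for this position
            continue;
        }
        for (std::size_t d = 0; d < head_dim; ++d) {
            out[d] += w * v[i * head_dim + d];
        }
    }
    return skipped;
}
```

With `skip_enabled` set to false the loop accumulates every position, which mirrors the default-off behavior described above.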

Validation (Apple M5 Max, Qwen3.5-35B-A3B Q8_0)

Perplexity (wikitext-103, 32K context, 50 chunks):

SKIP_V=0: 7.0638 ± 0.021
SKIP_V=1: 7.0638 ± 0.021

Delta: 0.000

Decode (llama-bench, q8_0):

SKIP_V=0: 85.15 tok/s (tg128), 1142.93 tok/s (pp32768+tg128)
SKIP_V=1: 85.04 tok/s (tg128), 1144.78 tok/s (pp32768+tg128)

Within noise. No regression observed.

Scope

  • Metal only (no CUDA changes)
  • Applies to quantized V path in FA vec kernel
  • Default off (opt-in via env var)

Notes

  • PPL was identical across all tested thresholds (1e-4 to 1e-8) in this
    setup; 1e-6 is the value used
  • Effect depends on V dequant cost; minimal difference observed for q8_0
  • Also tested on q4_0 KV cache with identical ON/OFF PPL

Skip V dequantization and accumulation for positions where post-softmax
attention weight < 1e-6. Reduces wasted dequant + FMA work when attention
is sparse.

Gated by GGML_METAL_FA_SKIP_V env var (default: off).
Tested on q8_0 at 32K context (50 chunks, wikitext-103):
no measurable PPL change (7.0638 ON/OFF, CI ±0.021).
TheTom requested a review from a team as a code owner March 28, 2026 13:09
github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels Mar 28, 2026