Enhance CUDA flash attention kernel selection for DKQ=512 with low gq… by Ooooze · Pull Request #6 · AtomicBot-ai/atomic-llama-cpp-turboquant

Ooooze · 2026-05-08T07:52:57Z

…a_ratio

This update changes the kernel selection logic in the CUDA flash attention implementation. When the K/Q head dimension (DKQ) is 512 and gqa_ratio is less than 3, the code now routes to the TILE kernel instead of falling through to an abort. The change is intended to improve compatibility and performance for configurations that previously hit that abort, particularly models like Gemma 4 E4B.
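To make the gate concrete, here is a minimal sketch of the condition described above. It is not the code from fattn.cu: `Choice` and `select_after_patch` are hypothetical names, and every case outside the new gate is modeled as "unchanged" because this description does not say what the existing logic does there.

```cpp
// Hypothetical stand-ins for illustration; the real selection code lives in fattn.cu.
enum class Choice {
    Tile,       // routed to the TILE kernel by the new gate
    Unchanged   // left to whatever the pre-existing selection logic does
};

// Models only the gate described in this PR:
// DKQ == 512 together with gqa_ratio < 3 selects TILE.
constexpr Choice select_after_patch(int dkq, int gqa_ratio) {
    if (dkq == 512 && gqa_ratio < 3) {
        return Choice::Tile;
    }
    return Choice::Unchanged;
}

// Human-readable label, e.g. for logging while debugging kernel selection.
constexpr const char * choice_name(Choice c) {
    return c == Choice::Tile ? "TILE" : "unchanged path";
}
```

Under these assumptions, `select_after_patch(512, 2)` yields `Choice::Tile`, while `select_after_patch(512, 3)` and any call with a different DKQ stay on the unchanged path.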

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:


danganbenpa · 2026-05-09T13:29:38Z

"Confirming this still reproduces on 31B (Q4_K_XL target, Q8_0 assistant, RTX 3090 sm_8.6)"
"PR #6's gating doesn't help: 31B's global layers have gqa_ratio=8, condition gqa_ratio < 3 doesn't fire, hits the same abort"
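A small sketch of why the gate misses this case, assuming gqa_ratio is the number of query heads divided by the number of KV heads. The helper names and the head counts (32 and 4) are hypothetical, chosen only to produce the ratio of 8 reported above; they are not taken from the 31B model's configuration.

```cpp
// Hypothetical helper: grouped-query attention ratio as query heads per KV head.
constexpr int gqa_ratio_from_heads(int n_head_q, int n_head_kv) {
    return n_head_q / n_head_kv;
}

// The condition as described in PR #6: DKQ == 512 and gqa_ratio < 3.
constexpr bool pr6_gate_fires(int dkq, int gqa_ratio) {
    return dkq == 512 && gqa_ratio < 3;
}

// Illustrative head counts only (assumption): 32 query heads, 4 KV heads -> ratio 8.
constexpr int example_ratio = gqa_ratio_from_heads(32, 4);

// With a ratio of 8 the gate does not fire, so selection continues down the
// pre-existing path, which is where the abort was reported.
constexpr bool example_fires = pr6_gate_fires(512, example_ratio);
```

Here `example_fires` is false, which matches the report that layers with gqa_ratio=8 are not covered by the `gqa_ratio < 3` condition.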

cortexist · 2026-05-31T15:12:04Z

I tested Gemma 4 E4B Q4 on a Jetson Orin NX 16GB with CUDA 12. It works only if flash attention and turboquant are turned off. But the eval tk/s improvement is substantial (from >12 tk/s to >17 tk/s). Prefill takes a hit since I can't use FA and turboquant. MTP also conflicts with MMPROJ; it's a mainline issue, but a dealbreaker for me.

github-actions Bot added ggml Nvidia GPU labels May 8, 2026

Ooooze mentioned this pull request May 8, 2026

Repro: MTP path on CUDA aborts at fattn.cu:109 (DKQ=512) for Gemma 4 — Blackwell sm_120 + Ampere sm_86 #5

Draft

Ooooze wants to merge 1 commit into feature/turboquant-kv-cache from fix/cuda-mma-dkq512-fallback
