Enhance CUDA flash attention kernel selection for DKQ=512 with low gq…#6

Open
Ooooze wants to merge 1 commit into feature/turboquant-kv-cache from fix/cuda-mma-dkq512-fallback

Conversation

Ooooze commented May 8, 2026

…a_ratio

This update modifies the kernel selection logic in the CUDA flash attention implementation. Specifically, when the K/Q head dimension (DKQ) is 512 and gqa_ratio is less than 3, the code now routes to the TILE kernel instead of falling through to an abort. This improves compatibility and performance for the affected configurations, particularly for models like Gemma 4 E4B.
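
The gating described above can be sketched as follows. This is a minimal illustration rather than the actual diff: the enum `FattnKernel`, the helper `gqa_ratio`, and the function `select_kernel_dkq512` are hypothetical names introduced here, and `gqa_ratio` is assumed to be the number of query heads divided by the number of K/V heads. The pre-existing path that other inputs take is not modeled.

```cpp
// Hypothetical labels for this sketch; not the project's actual enum.
enum class FattnKernel {
    Tile,        // the TILE kernel named in the description
    Fallthrough  // placeholder: input is not handled by the new gate
};

// Assumption: gqa_ratio is the number of query heads per K/V head.
constexpr int gqa_ratio(int n_head_q, int n_head_kv) {
    return n_head_q / n_head_kv;
}

// Sketch of the new gate: DKQ == 512 with gqa_ratio < 3 goes to TILE.
// Anything else continues on the pre-existing path, which is not modeled here.
constexpr FattnKernel select_kernel_dkq512(int dkq, int ratio) {
    if (dkq == 512 && ratio < 3) {
        return FattnKernel::Tile;
    }
    return FattnKernel::Fallthrough;
}

// Illustrative head counts only; they are not taken from any specific model.
static_assert(gqa_ratio(8, 4) == 2, "8 query heads over 4 K/V heads");

// A low ratio is caught by the gate ...
static_assert(select_kernel_dkq512(512, 2) == FattnKernel::Tile, "gated to TILE");
// ... while a ratio of 3 or more is left to the pre-existing path.
static_assert(select_kernel_dkq512(512, 8) == FattnKernel::Fallthrough, "not gated");
```

Note that the gate only widens coverage for `gqa_ratio < 3`; a DKQ=512 input with a higher ratio is unaffected by it.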


@danganbenpa

"Confirming this still reproduces on 31B (Q4_K_XL target, Q8_0 assistant, RTX 3090 sm_8.6)"
"PR #6's gating doesn't help: 31B's global layers have gqa_ratio=8, condition gqa_ratio < 3 doesn't fire, hits the same abort"

cortexist commented May 31, 2026

I tested Gemma 4 E4B Q4 on a Jetson Orin NX 16GB with CUDA 12; it works only if flash-attention and turboquant are turned off. Even so, the eval speed improvement is substantial (from >12 tk/s to >17 tk/s). Prefill takes a hit since I can't use FA and turboquant. MTP also conflicts with MMPROJ; it's a mainline issue, but a dealbreaker for me.
