perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT#36

Open
signalnine wants to merge 7 commits into TheTom:feature/turboquant-kv-cache from signalnine:feature/turboquant-kv-cache

Conversation


@signalnine signalnine commented Mar 31, 2026

Summary

CUDA flash attention kernel optimizations, auto-asymmetric KV for
quantized models, and SET_ROWS warp shuffle WHT. Combined effect:
turbo2 now beats q8_0 decode speed at 32K context.

Performance (tg128 @ d32768, Qwen3.5 35B Q4_K_M, RTX 5090)

| Type | Before | After | Improvement | vs q8_0 |
|------|--------|-------|-------------|---------|
| q8_0 | 156 | 181 | +16% | baseline |
| turbo2 | 148 | 185 | +25% | +2.5% faster |
| turbo3 | 127 | 171 | +35% | -5.5% |
| turbo4 | 101 | 117 | +16% | -35% |

Changes

1. Shared-memory LUT for turbo KQ scoring (+15% turbo3, +4% turbo2)

Precompute Q[d] × centroid[c] into shared memory once per decode
step, so the KQ inner loop does a single shmem read per element
instead of a centroid lookup plus multiply. 8-wide processing with a
bank-conflict-free stride. turbo4 is excluded (16 centroids exceed
the shmem budget).

2. General VEC FA optimizations (+16% q8_0 baseline)

  • __launch_bounds__ occupancy 1→3 (3 blocks/SM, better latency hiding)
  • __expf fast-math softmax (~3.7%)
  • L2 prefetch hints for next K+V blocks
  • Aggressive sparse V threshold (5e-3 for turbo3/4, 1e-2 for turbo2)
  • Remove turbo from V_is_unquantized (quantized V dequant path)

3. q8_1 Q path for turbo types

Turbo K types now use q8_1-quantized Q (int8 packed) instead of
float2, reducing Q register footprint 4×. KQ dot product rewritten
to process 4 elements per iteration with integer Q.

4. Auto-asymmetric KV for quantized models

When the user requests symmetric turbo K+V on a quantized-weight
model, K is auto-downgraded to q8_0. This prevents catastrophic PPL
on outlier-sensitive models (Qwen 2.5: 4015→8.85). Skipped for
mismatched K/V head dims. Override: TURBO_SYMMETRIC=1.

5. Warp shuffle WHT (SET_ROWS)

Replace 5 of 7 __syncthreads() with __shfl_xor_sync() for
intra-warp butterfly stages in all turbo SET_ROWS kernels.

Quality

PPL unchanged across all optimizations:

| Model | q8_0 | turbo3 | turbo4 |
|-------|------|--------|--------|
| Qwen 3.5 Q4_K_M | 6.18 | 6.24 (auto-asym) | 6.23 (sym) |
| Qwen 2.5 Q2_K | 8.82 | 8.85 (auto-asym) | — |

Test plan

  • Coherence: 4 models × 4 KV combos (16/16)
  • PPL: Qwen 3.5 + Qwen 2.5 (3 types each)
  • NIAH: turbo3 + turbo2 at 4K + 32K (10-11/11)
  • Server smoke: 3 types × 3 requests (9/9)
  • CPU-only: Mixtral + DeepSeek (2/2)
  • Head-to-head vs Madreag/turbo3-cuda (matched or exceeded)

Credit: FA optimizations based on Madreag/turbo3-cuda.
Warp shuffle WHT based on seanrasch/perf/ftz-and-wht-shuffle.

🤖 Generated with Claude Code

signalnine and others added 3 commits March 31, 2026 20:01
When user requests symmetric turbo K+V on a quantized model, auto-
downgrade K to q8_0 while keeping V as turbo. This prevents catastrophic
PPL on outlier-sensitive models (Qwen 2.5: 4015→8.85) and actually
improves quality on all models tested (Qwen 3.5: 6.31→6.24).

V compression is virtually lossless across all architectures (+0.3%).
K compression is model-sensitive and compounds with weight quantization
error. Asymmetric q8_0-K + turbo-V is the safe default.

Detection: checks tok_embd tensor type. If quantized (Q2-Q6), auto-
switches. F16/F32/BF16 models keep symmetric turbo (no stacking risk).
Override: TURBO_SYMMETRIC=1 forces symmetric.

PPL (wikitext-2, ctx=512):
  Qwen 2.5 Q2_K:   4015 symmetric → 8.85 auto-asymmetric (+0.3% vs q8_0)
  Qwen 3.5 Q4_K_M: 6.31 symmetric → 6.24 auto-asymmetric (+1.0% vs q8_0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace shared-memory butterfly stages h=1,2,4,8,16 with
__shfl_xor_sync() in turbo3, turbo2, and turbo4 SET_ROWS kernels.
Pairs with distance < 32 are always in the same warp — no barrier
needed. Only h=32 and h=64 (cross-warp) retain __syncthreads().

Mathematically identical: __shfl_xor_sync(mask, v, h) gives thread j
the value from j^h, then (j & h) ? (other - v) : (v + other) is the
exact butterfly computation.

Zero PPL regression: turbo3 6.24, turbo4 6.23 (unchanged).

Credit to seanrasch (perf/ftz-and-wht-shuffle branch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Auto-asymmetric (K→q8_0) creates FA dimension mismatch on models
where head_dim_k != head_dim_v (e.g. DeepSeek: K=192, V=128). The
q8_0 K at non-standard D has no CUDA FA kernel, falling to slow
CPU FA.

Fix: when K and V head dims differ, keep symmetric turbo which pads
both K and V consistently for CUDA FA.

Note: turbo4 symmetric on DeepSeek still falls to CPU FA (21 t/s)
because padded K D=256 != V D=128. This is a pre-existing FA D
matching limitation, not a regression. turbo3 on DeepSeek works at
full CUDA speed (172 t/s) because the VEC kernel handles asymmetric
D natively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@signalnine signalnine force-pushed the feature/turboquant-kv-cache branch from b00a40b to cfb57af on April 1, 2026 03:06
PGCRT pushed a commit to PGCRT/llama-cpp-turboquant-cuda that referenced this pull request Apr 1, 2026
Move Q forward rotation from graph-level ggml_turbo_wht op into FA
kernels to eliminate a separate kernel launch per layer during decode:

- Vec kernel (decode): shared memory FWHT with 64-thread parallel
  butterfly, zero extra kernel launches, CUDA graph compatible
- Prefill MMA: separate k_turbo_fwht_forward kernel with persistent
  cudaMalloc buffer (avoids cudaMallocAsync NaN on graph replay)
- V inverse rotation remains at graph level for CUDA graph compat

Results: decode 30.14 tok/s (-0.4%), prefill 1146 tok/s (-0.3%),
PPL identical to baseline (19.7152 on 10-chunk test).

Also adds temporal decay test (experiment TheTom#36) and benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
signalnine and others added 3 commits April 1, 2026 18:35
Port initial optimizations from Madreag/turbo3-cuda:

- Remove turbo types from K_is_unquantized — Q is now q8_1 quantized
  (int8 packed) for turbo types, reducing Q register footprint 4×
- Keep nthreads_KQ=8 for turbo (same ILP as f16) via K_is_turbo flag
- Rewrite turbo3/turbo2/turbo4 vec_dot_KQ to process 4 elements per
  iteration with q8_1 Q (packed int32 + scale)
- Replace 5 expf() with __expf() in softmax (~3.7% at long context)

PPL unchanged (6.31). Speed unchanged on this model — the main
speedup in Madreag's fork comes from a shared-memory LUT approach
(precompute Q×centroid for all positions) which eliminates the
multiply in the hot loop. That optimization needs separate porting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Precompute Q[d] × centroid[c] into shared memory LUT once per decode
step, then the KQ inner loop does a single shmem read per element
instead of centroid lookup + multiply.

- turbo3: 127→146 t/s at 32K depth (+15%)
- turbo2: 148→154 t/s at 32K depth (+4%)
- Only for ncols==1 (decode path, not prefill)
- turbo4 excluded: 16 centroids × D exceeds shmem budget
- LUT stride = n_centroids+1 to avoid bank conflicts
- 8-wide processing (2 qs bytes + 1 signs byte per iteration)
- L2 prefetch hints for next K block

PPL unchanged (6.31). Zero quality impact — the LUT stores the exact
same Q×centroid values, just precomputed.

Based on Madreag/turbo3-cuda (release/cuda-optimized branch).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four general VEC flash attention optimizations:

1. __launch_bounds__ occupancy 1→3: allows 3 blocks per SM, better
   latency hiding. This is the biggest win (~16% on q8_0 baseline).

2. V_is_unquantized: remove turbo types from V unquantized path,
   matching the K change. Turbo V uses quantized dequant path.

3. Aggressive sparse V threshold: 5e-3 for turbo3/4 (was 1e-6),
   1e-2 for turbo2. Validated zero PPL impact per Madreag.

4. L2 prefetch: add V block prefetch alongside existing K prefetch.

Results (tg128 @ d32768, Qwen3.5 35B Q4_K_M, RTX 5090):

| Type | Before | After | vs q8_0 |
|------|--------|-------|---------|
| q8_0 | 156 | 181 | baseline |
| turbo2 | 148 | 185 | +2.5% faster |
| turbo3 | 127 | 171 | -5.5% |
| turbo4 | 101 | 117 | -35% (no LUT) |

turbo2 at 7.5x compression is now FASTER than q8_0 at 32K context.
PPL unchanged (6.31).

Based on Madreag/turbo3-cuda optimizations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@signalnine signalnine force-pushed the feature/turboquant-kv-cache branch from 338990c to f575431 on April 2, 2026 02:07
@signalnine signalnine changed the title from "perf: block-128, warp shuffle WHT, auto-asymmetric KV, mismatched-D fix" to "perf: FA kernel optimizations + auto-asymmetric KV + warp shuffle WHT" on Apr 2, 2026
@signalnine signalnine marked this pull request as draft April 2, 2026 02:25
- ops.cpp: add TURBO3_0/TURBO4_0/TURBO2_0 to clamp switch to fix
  -Werror=switch on GCC/Clang CI
- ggml-rpc.h: bump RPC_PROTO_PATCH_VERSION 1→2 and update
  GGML_OP_COUNT assert 96→97 (GGML_OP_TURBO_WHT added)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@signalnine signalnine marked this pull request as ready for review April 2, 2026 03:30