Commit 7d1bd95
feat: sparse V dequant — +22% decode at 32K on M5, auto-enabled
Skip V dequantization for KV positions with negligible attention
weight (< 1e-6). At long context, most softmax weights are near
zero. Skipping their V dequant saves ~50% of V-side overhead.
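The skip logic can be sketched as follows. This is a minimal NumPy illustration, not the actual kernel: the function names are hypothetical, and a single scalar scale stands in for real q8_0 per-block scales.

```python
import numpy as np

SPARSE_V_EPS = 1e-6  # threshold from the commit message

def dequant_row(q_row, scale):
    # Toy int8 -> float dequant; real q8_0 stores a scale per 32-element block.
    return q_row.astype(np.float32) * scale

def attn_weighted_v(weights, v_quant, scale):
    """out = sum_j w[j] * dequant(V[j]), skipping rows with negligible weight."""
    d = v_quant.shape[1]
    out = np.zeros(d, dtype=np.float32)
    skipped = 0
    for j, w in enumerate(weights):
        if w < SPARSE_V_EPS:
            skipped += 1  # skip the dequant entirely for this KV position
            continue
        out += w * dequant_row(v_quant[j], scale)
    return out, skipped
```

Because the skipped weights are below 1e-6, the sparse result matches the dense sum to well within quantization noise, which is consistent with the unchanged PPL reported above.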
M5 Max results (auto-enabled on M5+ via has_tensor):
Short: 77.6 tok/s (+1.4%, no regression)
16K: 66.5 tok/s (+12.9%, 0.92x q8_0)
32K: 57.7 tok/s (+22.8%, 0.93x q8_0, was 0.76x)
Quality: PPL 6.1756 (identical), NIAH 9/9 (100%, improved from 7/9)
Pre-M5 (M1/M2/M3/M4): disabled by default until verified.
Enable with TURBO_SPARSE_V=1 env var for testing.
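For example, on pre-M5 hardware the opt-in looks like this (the binary name and model path below are placeholders, not from this commit):

```shell
# Opt in to sparse V dequant on pre-M5 hardware:
export TURBO_SPARSE_V=1
# then run the inference binary as usual, e.g.:
# ./main -m model.gguf -p "hello"
```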
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
1 file changed: 8 additions, 3 deletions