Commit 7d1bd95

TheTom and claude committed
feat: sparse V dequant — +22% decode at 32K on M5, auto-enabled
Skip V dequantization for KV positions with negligible attention weight (< 1e-6). At long context, most softmax weights are near zero. Skipping their V dequant saves ~50% of V-side overhead.

M5 Max results (auto-enabled on M5+ via has_tensor):
  Short: 77.6 tok/s (+1.4%, no regression)
  16K:   66.5 tok/s (+12.9%, 0.92x q8_0)
  32K:   57.7 tok/s (+22.8%, 0.93x q8_0, was 0.76x)

Quality: PPL 6.1756 (identical), NIAH 9/9 (100%, improved from 7/9)

Pre-M5 (M1/M2/M3/M4): disabled by default until verified. Enable with TURBO_SPARSE_V=1 env var for testing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
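The idea in the message above can be illustrated on the CPU. What follows is a minimal sketch, not the Metal kernel from this commit: the `quant_row` layout (int8 values with one scale per row) and the function name `attend_sparse_v` are hypothetical stand-ins, since the real turbo3 format is not shown here. Only the 1e-6 threshold comes from the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Threshold quoted in the commit message. */
#define SPARSE_V_THRESHOLD 1e-6f

/* Hypothetical quantized V row: int8 values sharing one scale.
   This is an assumption for illustration, not the turbo3 layout. */
typedef struct {
    float         scale;
    const int8_t *q;
} quant_row;

/* Computes out[d] = sum_i w[i] * dequant(v[i])[d], skipping every KV
   position whose softmax weight is below the threshold, so its V row
   is never dequantized. Returns the number of rows actually used. */
size_t attend_sparse_v(const float *w, const quant_row *v,
                       size_t n_kv, size_t head_dim, float *out) {
    size_t used = 0;
    for (size_t d = 0; d < head_dim; d++) {
        out[d] = 0.0f;
    }
    for (size_t i = 0; i < n_kv; i++) {
        if (w[i] < SPARSE_V_THRESHOLD) {
            continue; /* negligible weight: skip the dequant entirely */
        }
        const float ws = w[i] * v[i].scale; /* fold the scale into the weight */
        for (size_t d = 0; d < head_dim; d++) {
            out[d] += ws * (float) v[i].q[d];
        }
        used++;
    }
    return used;
}
```

The returned count makes the effect easy to observe: at long context, where most weights fall below the threshold, it should be much smaller than `n_kv`.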
1 parent 00a5423 commit 7d1bd95

1 file changed

Lines changed: 8 additions & 3 deletions

ggml/src/ggml-metal/ggml-metal-device.m

@@ -239,10 +239,15 @@ ggml_metal_library_t ggml_metal_library_init(ggml_metal_device_t dev) {
                     force_4mag ? " (forced)" : " (pre-M5 hardware)");
     }
     // Sparse V dequant: skip V for negligible attention weights
-    const char * sparse_v = getenv("TURBO_SPARSE_V");
-    if (sparse_v && sparse_v[0] == '1') {
+    // Enabled by default on M5+ (verified: PPL identical, NIAH 9/9)
+    // Pre-M5: opt-in via TURBO_SPARSE_V=1 until verified on M2
+    const char * sparse_v_env = getenv("TURBO_SPARSE_V");
+    const bool sparse_v_auto = ggml_metal_device_get_props(dev)->has_tensor; // M5+
+    const bool sparse_v_forced = sparse_v_env && sparse_v_env[0] == '1';
+    if (sparse_v_auto || sparse_v_forced) {
         [prep setObject:@"1" forKey:@"TURBO_SPARSE_V"];
-        GGML_LOG_INFO("%s: turbo3 sparse V dequant enabled\n", __func__);
+        GGML_LOG_INFO("%s: turbo3 sparse V dequant enabled%s\n", __func__,
+                sparse_v_forced ? " (forced)" : "");
     }
     // TODO: context-adaptive dispatch — compile both 4-mag and 8-LUT
     // FA kernel instantiations, select based on ne11 (KV cache size)
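
The enable logic in this hunk can be isolated as a pure function, which makes its truth table easy to check. This is a sketch under assumptions: `sparse_v_decision` and `decide_sparse_v` are hypothetical names, `env_value` stands for the result of `getenv("TURBO_SPARSE_V")`, and `has_tensor` stands for the device property the diff reads.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type, for illustration only. */
typedef struct {
    bool enabled; /* sparse V dequant is compiled in */
    bool forced;  /* the env var asked for it, so the log says "(forced)" */
} sparse_v_decision;

/* env_value:  value of TURBO_SPARSE_V, or NULL if unset.
   has_tensor: capability flag that the diff treats as "M5 or newer". */
sparse_v_decision decide_sparse_v(const char *env_value, bool has_tensor) {
    sparse_v_decision d;
    d.forced  = env_value != NULL && env_value[0] == '1';
    d.enabled = has_tensor || d.forced;
    return d;
}
```

Two consequences follow from the OR in the condition, as far as this hunk shows: on hardware with `has_tensor` set, `TURBO_SPARSE_V=0` does not turn the feature off, and setting `TURBO_SPARSE_V=1` there still logs "(forced)" even though the feature would have been enabled anyway.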
