feat: sparse V dequant — +22% decode at 32K on M5, auto-enabled

TheTom · claude · TheTom · commit 7d1bd95d0ddb · 2026-03-27T00:35:40.000-05:00
Skip V dequantization for KV positions with negligible attention
weight (< 1e-6). At long context, most softmax weights are near
zero. Skipping their V dequant saves ~50% of V-side overhead.
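
The skip described above can be illustrated on the CPU with a minimal sketch.
This is not the Metal kernel from this commit: the `quant_row` layout (int8
values with one scale per row) and the function name `sparse_v_accumulate` are
hypothetical stand-ins, and only the 1e-6 cutoff comes from the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cutoff quoted in the commit message. */
#define SPARSE_V_THRESHOLD 1e-6f

/* Hypothetical stand-in for one quantized V row: int8 values, one scale. */
typedef struct {
    float         scale;
    const int8_t *q;
} quant_row;

/*
 * Accumulates out[d] += w[i] * dequant(v[i])[d] over all KV positions,
 * skipping the dequantization of rows whose attention weight is below
 * the threshold. Returns how many rows were actually dequantized.
 */
static size_t sparse_v_accumulate(const float *w, const quant_row *v,
                                  size_t n_kv, size_t head_dim, float *out) {
    assert(head_dim > 0);
    size_t dequantized = 0;
    for (size_t i = 0; i < n_kv; ++i) {
        if (w[i] < SPARSE_V_THRESHOLD) {
            continue; /* negligible weight: no dequant, no accumulate */
        }
        for (size_t d = 0; d < head_dim; ++d) {
            out[d] += w[i] * v[i].scale * (float) v[i].q[d];
        }
        ++dequantized;
    }
    return dequantized;
}
```

The return value is only there to make the saving visible: at long context it
should be much smaller than `n_kv` if most softmax weights are near zero.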

M5 Max results (auto-enabled on M5+ via has_tensor):
  Short: 77.6 tok/s (+1.4%, no regression)
  16K:   66.5 tok/s (+12.9%, 0.92x q8_0)
  32K:   57.7 tok/s (+22.8%, 0.93x q8_0, was 0.76x)

Quality: PPL 6.1756 (identical), NIAH 9/9 (100%, improved from 7/9)

Pre-M5 (M1/M2/M3/M4): disabled by default until verified.
Enable with TURBO_SPARSE_V=1 env var for testing.
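
The enable rule can be restated as a small standalone predicate. This is a
sketch of the condition in the diff, not code from the repository: the function
name `sparse_v_enabled` is hypothetical, and `has_tensor` stands for the device
property that the diff reads via `ggml_metal_device_get_props`.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true when sparse V dequant would be turned on:
 *   - automatically when the device reports has_tensor (M5+), or
 *   - when TURBO_SPARSE_V is set and its first character is '1'.
 * env_value is the result of getenv("TURBO_SPARSE_V"), possibly NULL.
 */
static bool sparse_v_enabled(bool has_tensor, const char *env_value) {
    const bool forced = env_value != NULL && env_value[0] == '1';
    return has_tensor || forced;
}
```

One consequence of this condition, as written, is that the variable can only
turn the feature on: `sparse_v_enabled(true, "0")` is still true, so setting
TURBO_SPARSE_V=0 does not appear to disable it on M5+.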

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
diff --git a/ggml/src/ggml-metal/ggml-metal-device.m b/ggml/src/ggml-metal/ggml-metal-device.m
@@ -239,10 +239,15 @@ ggml_metal_library_t ggml_metal_library_init(ggml_metal_device_t dev) {
                             force_4mag ? " (forced)" : " (pre-M5 hardware)");
                     }
                     // Sparse V dequant: skip V for negligible attention weights
-                    const char * sparse_v = getenv("TURBO_SPARSE_V");
-                    if (sparse_v && sparse_v[0] == '1') {
+                    // Enabled by default on M5+ (verified: PPL identical, NIAH 9/9)
+                    // Pre-M5: opt-in via TURBO_SPARSE_V=1 until verified on M2
+                    const char * sparse_v_env = getenv("TURBO_SPARSE_V");
+                    const bool sparse_v_auto = ggml_metal_device_get_props(dev)->has_tensor;  // M5+
+                    const bool sparse_v_forced = sparse_v_env && sparse_v_env[0] == '1';
+                    if (sparse_v_auto || sparse_v_forced) {
                         [prep setObject:@"1" forKey:@"TURBO_SPARSE_V"];
-                        GGML_LOG_INFO("%s: turbo3 sparse V dequant enabled\n", __func__);
+                        GGML_LOG_INFO("%s: turbo3 sparse V dequant enabled%s\n", __func__,
+                            sparse_v_forced ? " (forced)" : "");
                     }
                     // TODO: context-adaptive dispatch — compile both 4-mag and 8-LUT
                     // FA kernel instantiations, select based on ne11 (KV cache size)