
Commit 965a6ca

TheTom and claude committed
feat: asymmetric K/V support + q8_0 × turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:

- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total), eliminating underscore ambiguity in type names
- 90 turbo × turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 × turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo × turbo and q8_0 × turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)

The q8_0 × turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on low-bit models where symmetric turbo-K degrades.

Validated on Metal (M2 Pro + M5 Max):

- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs

Closes #27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
1 parent 43f7d3d commit 965a6ca

4 files changed

Lines changed: 194 additions & 4 deletions

File tree

ggml/src/ggml-metal/ggml-metal-device.cpp

Lines changed: 2 additions & 0 deletions
@@ -1348,6 +1348,7 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_flash_attn_ext(
             dk,
             dv);
 
+
     snprintf(name, 256, "%s_mask=%d_sinks=%d_bias=%d_scap=%d_kvpad=%d_bcm=%d_ns10=%d_ns20=%d_nsg=%d",
             base,
             has_mask,
@@ -1414,6 +1415,7 @@ ggml_metal_pipeline_with_params ggml_metal_library_get_pipeline_flash_attn_ext_v
             dk,
             dv);
 
+
     snprintf(name, 256, "%s_mask=%d_sink=%d_bias=%d_scap=%d_kvpad=%d_ns10=%d_ns20=%d_nsg=%d_nwg=%d",
             base,
             has_mask,
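
The commit message says pipeline names now use a k{type}_v{type} format to remove underscore ambiguity. As a minimal sketch of why that helps, the hypothetical helper below (`fa_type_pair_name` is not a function from this commit) builds only the type-pair fragment, reading the format literally. How the fragment is embedded in `base` is not shown in the diff, so treat that part as an assumption.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper for illustration; not part of ggml or this commit.
// Type names such as "q8_0" already contain an underscore, so joining the
// K and V names with a bare '_' (e.g. "q8_0_turbo4") leaves the boundary
// between the two fields implicit. Prefixing each side with 'k' / 'v'
// marks where each field starts.
std::string fa_type_pair_name(const std::string & type_k, const std::string & type_v) {
    char buf[256];
    std::snprintf(buf, sizeof(buf), "k%s_v%s", type_k.c_str(), type_v.c_str());
    return std::string(buf);
}
```

With this sketch, `fa_type_pair_name("q8_0", "turbo4")` yields `kq8_0_vturbo4` and the reversed pair yields `kturbo4_vq8_0`, so the two directions of a mixed configuration map to distinct names.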

ggml/src/ggml-metal/ggml-metal-device.m

Lines changed: 9 additions & 2 deletions
@@ -1196,14 +1196,21 @@ bool ggml_metal_device_supports_op(ggml_metal_device_t dev, const struct ggml_te
                 return false;
             }
             if (op->src[1]->type != op->src[2]->type) {
-                // Allow asymmetric K/V for supported turbo quantization pairs
+                // Allow asymmetric K/V for supported mixed pairs:
+                //   - turbo x turbo (any combination)
+                //   - q8_0 x turbo (either direction)
                 const bool k_is_turbo = (op->src[1]->type == GGML_TYPE_TURBO2_0 ||
                                          op->src[1]->type == GGML_TYPE_TURBO3_0 ||
                                          op->src[1]->type == GGML_TYPE_TURBO4_0);
                 const bool v_is_turbo = (op->src[2]->type == GGML_TYPE_TURBO2_0 ||
                                          op->src[2]->type == GGML_TYPE_TURBO3_0 ||
                                          op->src[2]->type == GGML_TYPE_TURBO4_0);
-                if (!k_is_turbo || !v_is_turbo) {
+                const bool k_is_q8 = (op->src[1]->type == GGML_TYPE_Q8_0);
+                const bool v_is_q8 = (op->src[2]->type == GGML_TYPE_Q8_0);
+                const bool supported = (k_is_turbo && v_is_turbo) ||
+                                       (k_is_q8 && v_is_turbo) ||
+                                       (k_is_turbo && v_is_q8);
+                if (!supported) {
                     return false;
                 }
             }
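
The gatekeeper change above can be checked in isolation. The following is a sketch under stated assumptions: `kv_type` is a stand-in enum, not ggml's `GGML_TYPE_*` constants, and `passes_mixed_pair_check` and `count_accepted_mixed_pairs` are hypothetical names. It mirrors only the mixed-pair condition from the diff, not the other checks in `ggml_metal_device_supports_op`.

```cpp
// Stand-in enum for illustration only; the real code compares ggml_type values.
enum class kv_type { f16, q8_0, turbo2, turbo3, turbo4 };

constexpr bool is_turbo(kv_type t) {
    return t == kv_type::turbo2 || t == kv_type::turbo3 || t == kv_type::turbo4;
}

// Mirrors the condition in the diff: when K and V types differ, the pair is
// accepted only for turbo x turbo, or q8_0 paired with turbo in either
// direction. Equal types are not restricted by this particular check.
constexpr bool passes_mixed_pair_check(kv_type k, kv_type v) {
    return k == v ||
           (is_turbo(k) && is_turbo(v)) ||
           (k == kv_type::q8_0 && is_turbo(v)) ||
           (is_turbo(k) && v == kv_type::q8_0);
}

// Counts ordered (K, V) pairs with differing types that the check accepts,
// over the five stand-in types above.
int count_accepted_mixed_pairs() {
    const kv_type all[] = { kv_type::f16, kv_type::q8_0,
                            kv_type::turbo2, kv_type::turbo3, kv_type::turbo4 };
    int n = 0;
    for (kv_type k : all) {
        for (kv_type v : all) {
            if (k != v && passes_mixed_pair_check(k, v)) {
                ++n;
            }
        }
    }
    return n;
}

static_assert(passes_mixed_pair_check(kv_type::q8_0, kv_type::turbo4), "q8_0-K + turbo4-V is allowed");
static_assert(!passes_mixed_pair_check(kv_type::f16, kv_type::turbo3), "f16 x turbo stays rejected");
```

Over these stand-in types the check accepts 12 ordered mixed pairs: 6 turbo × turbo pairs with differing types, plus 3 for each direction of q8_0 × turbo. The same predicate appears again as a `GGML_ASSERT` in `ggml-metal-ops.cpp`, so the two sites have to be kept in sync by hand.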

ggml/src/ggml-metal/ggml-metal-ops.cpp

Lines changed: 7 additions & 2 deletions
@@ -2682,14 +2682,19 @@ int ggml_metal_op_flash_attn_ext(ggml_metal_op_t ctx, int idx) {
 
     GGML_ASSERT(op->src[0]->type == GGML_TYPE_F32);
 
-    // Allow asymmetric K/V quantization for supported turbo pairs
+    // Allow asymmetric K/V quantization for supported mixed pairs
     {
         const ggml_type type_k = op->src[1]->type;
         const ggml_type type_v = op->src[2]->type;
         if (type_k != type_v) {
             const bool k_is_turbo = (type_k == GGML_TYPE_TURBO2_0 || type_k == GGML_TYPE_TURBO3_0 || type_k == GGML_TYPE_TURBO4_0);
             const bool v_is_turbo = (type_v == GGML_TYPE_TURBO2_0 || type_v == GGML_TYPE_TURBO3_0 || type_v == GGML_TYPE_TURBO4_0);
-            GGML_ASSERT(k_is_turbo && v_is_turbo && "asymmetric K/V types only supported for turbo quantization pairs");
+            const bool k_is_q8 = (type_k == GGML_TYPE_Q8_0);
+            const bool v_is_q8 = (type_v == GGML_TYPE_Q8_0);
+            const bool supported = (k_is_turbo && v_is_turbo) ||
+                                   (k_is_q8 && v_is_turbo) ||
+                                   (k_is_turbo && v_is_q8);
+            GGML_ASSERT(supported && "asymmetric K/V types only supported for turbo and q8_0 mixed pairs");
         }
     }
