fix(scheduler): exclude MLA models from TurboQuant KV cache (#1613) by popfido · Pull Request #1626 · jundot/omlx

popfido · 2026-06-03T10:06:06Z

Summary

Multi-head Latent Attention (MLA) models crash the engine loop when TurboQuant is enabled:

omlx.scheduler   - INFO  - TurboQuant: converted 46/47 cache layers to 4.0-bit, skipped last KVCache layer
omlx.engine_core - ERROR - Engine loop error: 'TurboQuantMSEState' object has no attribute 'swapaxes'

This PR makes Scheduler._turboquant_eligible() recognise MLA architectures and keep them on an fp16 KV cache — no crash, no TurboQuant — with a one-time info log. Fixes #1613.

Root cause

MLA attention (DeepSeek-V2/V3, GLM-4-MoE / GLM-4.7-Flash, MiniCPM3, …) does not delegate everything to scaled_dot_product_attention(cache=...). It reads the fetched cache tensors directly:

kv_latent, k_pe = cache.update_and_fetch(kv_latent, k_pe)
pe_scores = (q_pe * self.scale) @ k_pe.swapaxes(-1, -2)   # <-- crash

(mlx_lm/models/glm4_moe_lite.py, deepseek_v3.py, …)

TurboQuant replaces each KV layer's state with a per-head quantized TurboQuantMSEState NamedTuple (norms, indices) that has no array methods, so .swapaxes (and the absorbed-MLA embed_q/unembed_out matmuls) raise AttributeError. MLA also stores the latent (kv_lora_rank, e.g. 512) and rope key (qk_rope_head_dim, e.g. 64) with mismatched head dims and a single head, which the per-head codec was never designed for.
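The failure mode above can be reproduced without MLX. The following is a minimal sketch, assuming hypothetical stand-ins (`FakeQuantState` for `TurboQuantMSEState`, `FakeArray` for an array type); neither is the actual omlx or mlx-lm class. It only shows that a NamedTuple of fields has no array methods, so a direct `.swapaxes` call fails.

```python
from typing import NamedTuple


class FakeQuantState(NamedTuple):
    """Hypothetical stand-in for TurboQuantMSEState: plain fields, no array API."""
    norms: list
    indices: list


class FakeArray:
    """Hypothetical stand-in for an array type that supports swapaxes."""

    def __init__(self, shape):
        self.shape = tuple(shape)

    def swapaxes(self, a, b):
        dims = list(self.shape)
        dims[a], dims[b] = dims[b], dims[a]
        return FakeArray(dims)


def rope_scores_shape(k_pe):
    # Mirrors the MLA pattern quoted above: the model calls an array
    # method directly on whatever the cache handed back.
    return k_pe.swapaxes(-1, -2).shape


# An array-like object works: (batch, heads, seq, rope_dim) has its last two axes swapped.
ok_shape = rope_scores_shape(FakeArray((1, 1, 8, 64)))

# A quantized state tuple does not: it has fields but no swapaxes method.
try:
    rope_scores_shape(FakeQuantState(norms=[1.0], indices=[0]))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

Attention code that hands the cache object to a fused attention call never touches array methods on it, which is why only the MLA path trips over the quantized state.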

This is consistent with the rest of the ecosystem: TurboQuant / KV-cache vector quantization is limited to standard attention (MHA/GQA) in mlx-vlm, vLLM, and SGLang, and MLA uses a separate KV path (FP8 grouped latent + bf16 rope) or none at all (MLX). Excluding MLA therefore matches how those implementations treat it.

The fix

A memoized _model_uses_mla() detector, called at the top of _turboquant_eligible(), checks the following signals:

Config — kv_lora_rank (int), including nested text_config/llm_config and VLM-adapter _language_model/language_model delegation.
Architecture — module walk for kv_a_proj_with_mqa / kv_a_layernorm / int kv_lora_rank.

_turboquant_eligible() returns False for MLA, gating both conversion call sites (empty + post-prefill convert).

Scope / safety — does not affect other functionality

The change is a single early-return in the existing eligibility gate. For non-MLA models the detector returns False immediately and the original cache-type logic runs byte-for-byte identically — TurboQuant behaviour for dense/GQA models is unchanged.
Detection is memoized (model never changes per scheduler), so there is no per-request cost; the module walk only runs once and only if the config signal misses.
No public API, settings, or cache-format changes. MLA models simply keep the fp16 KV cache they already used before TurboQuant was toggled on.
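As a rough illustration of the early-return shape described in these points, the sketch below uses a hypothetical `SchedulerSketch` class with an invented log message and a boolean standing in for the existing cache-type logic. It is an assumption-laden toy, not the scheduler in this repository.

```python
import logging

logger = logging.getLogger("sketch.scheduler")  # hypothetical logger name


class SchedulerSketch:
    """Hypothetical, heavily simplified stand-in for the eligibility gate."""

    def __init__(self, uses_mla, cache_type_ok=True):
        self._uses_mla = uses_mla            # result of a memoized MLA check
        self._cache_type_ok = cache_type_ok  # placeholder for the existing logic
        self._mla_logged = False

    def _turboquant_eligible(self):
        if self._uses_mla:
            if not self._mla_logged:
                # Invented wording; logged once per scheduler instance.
                logger.info("MLA model detected: keeping fp16 KV cache, TurboQuant skipped")
                self._mla_logged = True
            return False
        # For non-MLA models the pre-existing checks run unchanged.
        return self._cache_type_ok


mla_scheduler = SchedulerSketch(uses_mla=True)
dense_scheduler = SchedulerSketch(uses_mla=False)
print(mla_scheduler._turboquant_eligible(), dense_scheduler._turboquant_eligible())
```

Because both conversion call sites consult the same gate, a single `False` here is enough to keep MLA models off the quantized path.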

Coverage

Verified against all MLA families in the installed mlx-lm: deepseek_v2, deepseek_v3, deepseek_v32, glm4_moe_lite, glm_moe_dsa, kimi_linear, kimi_vl, longcat_flash, longcat_flash_ngram, minicpm3, youtu_llm — each caught by the config signal (top-level or nested) and/or the architecture walk. No false positives: only MLA models define kv_lora_rank.

Test plan

New tests/test_scheduler.py::TestTurboQuantMLAGuard (6 cases):

MLA detected by config, by architecture, by nested text_config, and via VLM-adapter _language_model
standard MHA/GQA model stays eligible
detection is memoized (single module walk)

pytest tests/test_scheduler.py::TestTurboQuantMLAGuard tests/test_turboquant.py tests/test_scheduler.py
# 134 passed
pytest            # full default suite: 4892 passed, 37 skipped

(The only failures in the full run are 4 pre-existing test_dflash_engine.py cases failing on a dflash_mlx.runtime.config import in this environment — unrelated to this change.)

jundot · 2026-06-03T14:59:12Z

Thanks for tracking this down. I verified the crash path against #1613: MLA models read fetched cache tensors directly, so keeping them off TurboQuant KV cache is the right guard, and the non-MLA eligibility path stays unchanged.

The focused TurboQuant/MLA tests pass locally. There is only a tests/test_scheduler.py conflict with current main, so I'll resolve that on the maintainer side, rerun the focused tests, and merge with the contributor attribution preserved.

Multi-head Latent Attention models (GLM-4.7-Flash, DeepSeek-V2/V3, MiniCPM3, ...) crash the engine loop when TurboQuant is enabled:

'TurboQuantMSEState' object has no attribute 'swapaxes'

MLA attention reads the fetched cache tensors directly (`kv_latent, k_pe = cache.update_and_fetch(...)` then `k_pe.swapaxes(-1, -2)`, plus embed/unembed on the latent) and stores keys/values with mismatched head dims. TurboQuant replaces the cache state with per-head quantized NamedTuples that have no array methods, so the model's direct tensor ops raise AttributeError.

This mirrors the rest of the ecosystem: TurboQuant is standard (MHA/GQA) attention only in mlx-vlm, vLLM, and SGLang; MLA always uses a separate KV path (or none on MLX).

Detect MLA in `Scheduler._turboquant_eligible()` via a memoized `_model_uses_mla()` (config `kv_lora_rank`, incl. nested `text_config` and VLM-adapter `_language_model` delegation, plus an attention module-walk) and keep those models on fp16 KV -- no crash, no TurboQuant, with a one-time info log.

Non-MLA behaviour is unchanged: the gate returns early only for MLA, otherwise the existing cache-type eligibility logic runs identically.

Fixes jundot#1613

jundot force-pushed the fix/turboquant-exclude-mla-1613 branch from 630c4c6 to b028593 Compare June 3, 2026 15:01

jundot merged commit 2da2194 into jundot:main Jun 3, 2026
4 checks passed

popfido deleted the fix/turboquant-exclude-mla-1613 branch June 3, 2026 15:40

jundot merged 1 commit into jundot:main from popfido:fix/turboquant-exclude-mla-1613