fix(scheduler): exclude MLA models from TurboQuant KV cache (#1613) #1626
Merged
Owner
Thanks for tracking this down. I verified the crash path against #1613: MLA models read fetched cache tensors directly, so keeping them off TurboQuant KV cache is the right guard, and the non-MLA eligibility path stays unchanged. The focused TurboQuant/MLA tests pass locally. There is only a tests/test_scheduler.py conflict with current main, so I'll resolve that on the maintainer side, rerun the focused tests, and merge with the contributor attribution preserved.
Multi-head Latent Attention models (GLM-4.7-Flash, DeepSeek-V2/V3,
MiniCPM3, ...) crash the engine loop when TurboQuant is enabled:
'TurboQuantMSEState' object has no attribute 'swapaxes'
MLA attention reads the fetched cache tensors directly
(`kv_latent, k_pe = cache.update_and_fetch(...)` then
`k_pe.swapaxes(-1, -2)`, plus embed/unembed on the latent) and stores
keys/values with mismatched head dims. TurboQuant replaces the cache
state with per-head quantized NamedTuples that have no array methods,
so the model's direct tensor ops raise AttributeError.
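The failure mode described above can be reproduced without MLX. The following is a minimal sketch, not the project's code: `FakeQuantState` and `FakeQuantCache` are hypothetical stand-ins for `TurboQuantMSEState` and a TurboQuant-converted cache, and only the `norms` / `indices` field names and the `update_and_fetch` / `swapaxes` call pattern are taken from this PR.

```python
from typing import NamedTuple


class FakeQuantState(NamedTuple):
    """Hypothetical stand-in for TurboQuantMSEState (fields named in this PR)."""

    norms: list
    indices: list


class FakeQuantCache:
    """Hypothetical cache whose fetch returns quantized state instead of arrays."""

    def update_and_fetch(self, keys, values):
        # A real cache would quantize keys/values; this only mimics the return type.
        return FakeQuantState([1.0], [0]), FakeQuantState([1.0], [0])


def mla_style_read(cache, kv_latent, k_pe):
    # Same access pattern as MLA attention: fetch, then call an array method directly.
    kv_latent, k_pe = cache.update_and_fetch(kv_latent, k_pe)
    return k_pe.swapaxes(-1, -2)


def reproduce():
    """Return the error text raised by the MLA-style read, or None if it succeeds."""
    try:
        mla_style_read(FakeQuantCache(), [[0.0]], [[0.0]])
    except AttributeError as exc:
        return str(exc)
    return None


message = reproduce()
print(message)
```

A NamedTuple exposes only its fields and tuple methods, so any array method called on the fetched value fails in the same way.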
This mirrors the rest of the ecosystem: TurboQuant is standard
(MHA/GQA) attention only in mlx-vlm, vLLM, and SGLang; MLA always uses
a separate KV path (or none on MLX).
Detect MLA in `Scheduler._turboquant_eligible()` via a memoized
`_model_uses_mla()` (config `kv_lora_rank`, incl. nested `text_config`
and VLM-adapter `_language_model` delegation, plus an attention
module-walk) and keep those models on fp16 KV -- no crash, no
TurboQuant, with a one-time info log. Non-MLA behaviour is unchanged:
the gate returns early only for MLA, otherwise the existing cache-type
eligibility logic runs identically.
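The gate described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the merged implementation: `SchedulerSketch`, the helper functions, the `model.layers[i].self_attn` layout, and the `turboquant_enabled` placeholder for the existing eligibility logic are all hypothetical. Only the `kv_lora_rank`, `text_config`, `_language_model`, `kv_a_proj_with_mqa`, and `kv_a_layernorm` names come from this PR.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("turboquant_sketch")  # hypothetical logger name


def _get(obj, name):
    """Read `name` from a dict-style or attribute-style config (None if absent)."""
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def _is_int(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def config_uses_mla(config):
    """Config signal: kv_lora_rank at the top level or under text_config."""
    if _is_int(_get(config, "kv_lora_rank")):
        return True
    return _is_int(_get(_get(config, "text_config"), "kv_lora_rank"))


def modules_use_mla(model):
    """Module walk. Assumed layout: model.layers[i].self_attn."""
    for layer in getattr(model, "layers", None) or []:
        attn = getattr(layer, "self_attn", None)
        if attn is None:
            continue
        if hasattr(attn, "kv_a_proj_with_mqa") or hasattr(attn, "kv_a_layernorm"):
            return True
        if _is_int(getattr(attn, "kv_lora_rank", None)):
            return True
    return False


class SchedulerSketch:
    def __init__(self, model, turboquant_enabled=True):
        self.model = model
        self.turboquant_enabled = turboquant_enabled
        self._uses_mla = None  # memo: computed once per scheduler

    def _model_uses_mla(self):
        if self._uses_mla is None:
            # Check the model itself and, for VLM adapters, the wrapped language model.
            candidates = (self.model, getattr(self.model, "_language_model", None))
            self._uses_mla = any(
                config_uses_mla(getattr(m, "config", None)) or modules_use_mla(m)
                for m in candidates
            )
            if self._uses_mla:
                # Runs only on the first call because the result is memoized.
                logger.info("MLA model detected; keeping fp16 KV cache")
        return self._uses_mla

    def _turboquant_eligible(self):
        if self._model_uses_mla():
            return False
        return self.turboquant_enabled  # placeholder for the existing logic


dense = SimpleNamespace(config=SimpleNamespace(num_attention_heads=32), layers=[])
nested = SimpleNamespace(config={"text_config": {"kv_lora_rank": 512}}, layers=[])
vlm = SimpleNamespace(
    config=SimpleNamespace(),
    _language_model=SimpleNamespace(config=SimpleNamespace(kv_lora_rank=512)),
)
walked = SimpleNamespace(
    config=None,
    layers=[SimpleNamespace(self_attn=SimpleNamespace(kv_a_layernorm=object()))],
)
```

Memoizing the result keeps the per-step cost of the gate negligible and gives the one-time log for free, since the detection branch runs only on the first call.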
Fixes jundot#1613
Summary
Multi-head Latent Attention (MLA) models crash the engine loop when TurboQuant is enabled:
`'TurboQuantMSEState' object has no attribute 'swapaxes'`
This PR makes `Scheduler._turboquant_eligible()` recognise MLA architectures and keep them on an fp16 KV cache (no crash, no TurboQuant), with a one-time info log. Fixes #1613.
Root cause
MLA attention (DeepSeek-V2/V3, GLM-4-MoE / GLM-4.7-Flash, MiniCPM3, …) does not delegate everything to `scaled_dot_product_attention(cache=...)`. It reads the fetched cache tensors directly, calling `kv_latent, k_pe = cache.update_and_fetch(...)` and then `k_pe.swapaxes(-1, -2)` (`mlx_lm/models/glm4_moe_lite.py`, `deepseek_v3.py`, …).
TurboQuant replaces each KV layer's state with a per-head quantized `TurboQuantMSEState` NamedTuple (`norms`, `indices`) that has no array methods, so `.swapaxes` (and the absorbed-MLA `embed_q` / `unembed_out` matmuls) raise `AttributeError`. MLA also stores the latent (`kv_lora_rank`, e.g. 512) and rope key (`qk_rope_head_dim`, e.g. 64) with mismatched head dims and a single head, which the per-head codec was never designed for.
This is consistent with the rest of the ecosystem: TurboQuant / KV-cache vector quantization is limited to standard attention (MHA/GQA) in `mlx-vlm`, vLLM, and SGLang; MLA always uses a separate KV path (FP8 grouped latent + bf16 rope) or none at all (MLX). Excluding MLA therefore matches how those implementations treat it.
The fix
A memoized `_model_uses_mla()` detector with three signals, called at the top of `_turboquant_eligible()`: a config `kv_lora_rank` (int), including nested `text_config` / `llm_config` and VLM-adapter `_language_model` / `language_model` delegation, and an architecture walk for `kv_a_proj_with_mqa` / `kv_a_layernorm` / int `kv_lora_rank`.
`_turboquant_eligible()` returns `False` for MLA, gating both conversion call sites (empty + post-prefill convert).
Scope / safety: does not affect other functionality
For non-MLA models the MLA check returns `False` immediately and the original cache-type logic runs byte-for-byte identically, so TurboQuant behaviour for dense/GQA models is unchanged.
Coverage
Verified against all MLA families in the installed `mlx-lm`: `deepseek_v2`, `deepseek_v3`, `deepseek_v32`, `glm4_moe_lite`, `glm_moe_dsa`, `kimi_linear`, `kimi_vl`, `longcat_flash`, `longcat_flash_ngram`, `minicpm3`, `youtu_llm`. Each is caught by the config signal (top-level or nested) and/or the architecture walk. No false positives: only MLA models define `kv_lora_rank`.
Test plan
New `tests/test_scheduler.py::TestTurboQuantMLAGuard` (6 cases), including detection via nested `text_config` and via VLM-adapter `_language_model`.
(The only failures in the full run are 4 pre-existing `test_dflash_engine.py` cases failing on a `dflash_mlx.runtime.config` import in this environment, unrelated to this change.)