fix(scheduler): exclude MLA models from TurboQuant KV cache (#1613) #1626

Merged
jundot merged 1 commit into jundot:main from popfido:fix/turboquant-exclude-mla-1613
Jun 3, 2026

popfido commented Jun 3, 2026

Contributor

Summary

Multi-head Latent Attention (MLA) models crash the engine loop when TurboQuant is enabled:

omlx.scheduler   - INFO  - TurboQuant: converted 46/47 cache layers to 4.0-bit, skipped last KVCache layer
omlx.engine_core - ERROR - Engine loop error: 'TurboQuantMSEState' object has no attribute 'swapaxes'

This PR makes Scheduler._turboquant_eligible() recognise MLA architectures and keep them on an fp16 KV cache — no crash, no TurboQuant — with a one-time info log. Fixes #1613.

Root cause

MLA attention (DeepSeek-V2/V3, GLM-4-MoE / GLM-4.7-Flash, MiniCPM3, …) does not delegate everything to scaled_dot_product_attention(cache=...). It reads the fetched cache tensors directly:

kv_latent, k_pe = cache.update_and_fetch(kv_latent, k_pe)
pe_scores = (q_pe * self.scale) @ k_pe.swapaxes(-1, -2)   # <-- crash

(mlx_lm/models/glm4_moe_lite.py, deepseek_v3.py, …)

TurboQuant replaces each KV layer's state with a per-head quantized TurboQuantMSEState NamedTuple (norms, indices) that has no array methods, so .swapaxes (and the absorbed-MLA embed_q/unembed_out matmuls) raise AttributeError. MLA also stores the latent (kv_lora_rank, e.g. 512) and rope key (qk_rope_head_dim, e.g. 64) with mismatched head dims and a single head, which the per-head codec was never designed for.
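
The failure mode can be reproduced without MLX using a small stand-in. This is a sketch only: `QuantizedState` and `rope_scores` are hypothetical names, and the only details taken from the description above are the (norms, indices) field layout and the `.swapaxes` call at the MLA call site.

```python
from typing import NamedTuple


class QuantizedState(NamedTuple):
    """Hypothetical stand-in for a quantized cache state: plain fields, no array methods."""
    norms: list
    indices: list


def rope_scores(k_pe):
    # Mimics the MLA call site: it calls an array method directly on
    # whatever the cache returned, instead of going through an attention helper.
    return k_pe.swapaxes(-1, -2)


state = QuantizedState(norms=[1.0], indices=[0])

try:
    rope_scores(state)
    message = None
except AttributeError as exc:
    # A NamedTuple is a tuple subclass, so it has no `swapaxes` attribute.
    message = str(exc)

print(message)
```

Any code path that treats the fetched cache value as an array fails the same way, which is why the guard lives in the eligibility gate rather than at each call site.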

This is consistent with the rest of the ecosystem: TurboQuant / KV-cache vector quantization is limited to standard attention (MHA/GQA) in mlx-vlm, vLLM, and SGLang, and MLA always uses a separate KV path (FP8 grouped latent + bf16 rope) or none at all (MLX). Excluding MLA therefore matches how these implementations treat it.

The fix

A memoized _model_uses_mla() detector, called at the top of _turboquant_eligible(), checks the following signals:

  1. Config: kv_lora_rank (int), including nested text_config/llm_config and VLM-adapter _language_model/language_model delegation.
  2. Architecture: a module walk for kv_a_proj_with_mqa / kv_a_layernorm / int kv_lora_rank.

_turboquant_eligible() returns False for MLA, gating both conversion call sites (empty + post-prefill convert).
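
The detection flow described above could look roughly like the sketch below. It is not the code from this PR: `MLAGuard`, `config_signals_mla`, and `modules_signal_mla` are hypothetical names, the model is assumed to expose a `config` attribute and a `named_modules()` method yielding (name, module) pairs, and the non-MLA branch is reduced to a placeholder for the existing cache-type checks.

```python
from types import SimpleNamespace

MLA_MODULE_ATTRS = ("kv_a_proj_with_mqa", "kv_a_layernorm")


def _get(obj, name):
    """Read `name` from either a dict-style or attribute-style config."""
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def _is_rank(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def config_signals_mla(config) -> bool:
    """Signal 1: kv_lora_rank at the top level or in a nested text/llm config."""
    if config is None:
        return False
    candidates = [config, _get(config, "text_config"), _get(config, "llm_config")]
    return any(c is not None and _is_rank(_get(c, "kv_lora_rank")) for c in candidates)


def modules_signal_mla(model) -> bool:
    """Signal 2: walk submodules looking for MLA-specific attributes."""
    walk = getattr(model, "named_modules", None)
    if walk is None:
        return False
    for _name, module in walk():
        if any(hasattr(module, attr) for attr in MLA_MODULE_ATTRS):
            return True
        if _is_rank(getattr(module, "kv_lora_rank", None)):
            return True
    return False


class MLAGuard:
    """Hypothetical holder for the memoized check (the PR puts it on Scheduler)."""

    def __init__(self, model):
        self.model = model
        self._uses_mla = None  # None means "not computed yet"

    def model_uses_mla(self) -> bool:
        if self._uses_mla is None:
            self._uses_mla = self._detect()
        return self._uses_mla

    def _detect(self) -> bool:
        # Follow VLM-adapter delegation to the wrapped language model.
        targets = [self.model]
        for attr in ("_language_model", "language_model"):
            inner = getattr(self.model, attr, None)
            if inner is not None:
                targets.append(inner)
        # Cheap config check first; module walk only if the config misses.
        if any(config_signals_mla(getattr(t, "config", None)) for t in targets):
            return True
        return any(modules_signal_mla(t) for t in targets)

    def turboquant_eligible(self) -> bool:
        if self.model_uses_mla():
            return False  # MLA stays on the fp16 KV cache
        return True  # placeholder: the existing cache-type checks would run here


mla = MLAGuard(SimpleNamespace(config=SimpleNamespace(kv_lora_rank=512)))
dense = MLAGuard(SimpleNamespace(config=SimpleNamespace(num_attention_heads=32)))
print(mla.turboquant_eligible(), dense.turboquant_eligible())
```

Storing `None` as the "not computed" marker lets a negative result be cached as well, so a non-MLA model also pays for the module walk only once.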

Scope / safety — does not affect other functionality

  • The change is a single early-return in the existing eligibility gate. For non-MLA models the detector returns False immediately and the original cache-type logic runs byte-for-byte identically — TurboQuant behaviour for dense/GQA models is unchanged.
  • Detection is memoized (model never changes per scheduler), so there is no per-request cost; the module walk only runs once and only if the config signal misses.
  • No public API, settings, or cache-format changes. MLA models simply keep the fp16 KV cache they already used before TurboQuant was toggled on.

Coverage

Verified against all MLA families in the installed mlx-lm: deepseek_v2, deepseek_v3, deepseek_v32, glm4_moe_lite, glm_moe_dsa, kimi_linear, kimi_vl, longcat_flash, longcat_flash_ngram, minicpm3, youtu_llm — each caught by the config signal (top-level or nested) and/or the architecture walk. No false positives: only MLA models define kv_lora_rank.

Test plan

New tests/test_scheduler.py::TestTurboQuantMLAGuard (6 cases):

  • MLA detected by config, by architecture, by nested text_config, and via VLM-adapter _language_model
  • standard MHA/GQA model stays eligible
  • detection is memoized (single module walk)

pytest tests/test_scheduler.py::TestTurboQuantMLAGuard tests/test_turboquant.py tests/test_scheduler.py
# 134 passed
pytest            # full default suite: 4892 passed, 37 skipped

(The only failures in the full run are 4 pre-existing test_dflash_engine.py cases failing on a dflash_mlx.runtime.config import in this environment — unrelated to this change.)

jundot commented Jun 3, 2026

Owner

Thanks for tracking this down. I verified the crash path against #1613: MLA models read fetched cache tensors directly, so keeping them off TurboQuant KV cache is the right guard, and the non-MLA eligibility path stays unchanged.

The focused TurboQuant/MLA tests pass locally. There is only a tests/test_scheduler.py conflict with current main, so I'll resolve that on the maintainer side, rerun the focused tests, and merge with the contributor attribution preserved.

Multi-head Latent Attention models (GLM-4.7-Flash, DeepSeek-V2/V3,
MiniCPM3, ...) crash the engine loop when TurboQuant is enabled:

    'TurboQuantMSEState' object has no attribute 'swapaxes'

MLA attention reads the fetched cache tensors directly
(`kv_latent, k_pe = cache.update_and_fetch(...)` then
`k_pe.swapaxes(-1, -2)`, plus embed/unembed on the latent) and stores
keys/values with mismatched head dims. TurboQuant replaces the cache
state with per-head quantized NamedTuples that have no array methods,
so the model's direct tensor ops raise AttributeError.

This mirrors the rest of the ecosystem: TurboQuant is standard
(MHA/GQA) attention only in mlx-vlm, vLLM, and SGLang; MLA always uses
a separate KV path (or none on MLX).

Detect MLA in `Scheduler._turboquant_eligible()` via a memoized
`_model_uses_mla()` (config `kv_lora_rank`, incl. nested `text_config`
and VLM-adapter `_language_model` delegation, plus an attention
module-walk) and keep those models on fp16 KV -- no crash, no
TurboQuant, with a one-time info log. Non-MLA behaviour is unchanged:
the gate returns early only for MLA, otherwise the existing cache-type
eligibility logic runs identically.

Fixes jundot#1613

jundot force-pushed the fix/turboquant-exclude-mla-1613 branch from 630c4c6 to b028593 on June 3, 2026, 15:01
jundot merged commit 2da2194 into jundot:main on Jun 3, 2026
4 checks passed
popfido deleted the fix/turboquant-exclude-mla-1613 branch on June 3, 2026, 15:40

Development

Successfully merging this pull request may close these issues.

GLM 4.7 Flash crash with TurboQuant
