
[Feature] Xiaomi MiMo-V2.5 day0 support #23811

Merged
ShangmingCai merged 65 commits into sgl-project:main from Abatom:feat/support-mimo-v2-omni on Apr 30, 2026

Conversation

@Abatom (Contributor) commented Apr 27, 2026

Summary

Adds day-0 support for XiaomiMiMo/MiMo-V2.5 in SGLang.

  • Registers MiMoV2ForCausalLM and the MiMoV2MTP draft model while keeping the legacy MiMoV2FlashForCausalLM name loadable.
  • Adds the MiMo-V2 multimodal model components for image, video, and audio, driven by the checkpoint's vision_config / audio_config.
  • Adds the MiMo-V2 multimodal processor for image, video, audio, and video+audio request inputs.
  • Supports the FP8 fused-QKV checkpoint format and skips draft-only MTP weights in the target model; MiMoV2MTP loads the draft weights for multi-layer EAGLE.
  • Enables MiMo parser/model config plumbing needed for reasoning, tool calls, multimodal scheduling, and multi-layer EAGLE.
  • Enforces the effective attention TP size that the checkpoint's fused qkv_proj layout requires for MiMoV2ForCausalLM: XiaomiMiMo/MiMo-V2.5 resolves to effective attention TP 4, and XiaomiMiMo/MiMo-V2.5-Pro resolves to effective attention TP 8. The same derived value is used by the target and MTP qkv_proj loaders (a sketch of the derivation follows below).
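
How the effective attention TP is derived: with --enable-dp-attention, each DP replica shards attention over tp_size / dp_size ranks. A minimal sketch of the check, with illustrative names rather than the exact SGLang internals:

# Hypothetical sketch of the effective-attention-TP check enforced for
# MiMoV2ForCausalLM; the real logic lives in SGLang's model-config plumbing.
def effective_attention_tp(tp_size: int, dp_size: int, enable_dp_attention: bool) -> int:
    # With DP attention, attention is sharded over tp/dp ranks per replica.
    return tp_size // dp_size if enable_dp_attention else tp_size

def check_mimo_v2(tp_size, dp_size, enable_dp_attention, required_attn_tp):
    got = effective_attention_tp(tp_size, dp_size, enable_dp_attention)
    if got != required_attn_tp:
        raise ValueError(
            f"fused qkv_proj layout requires effective attention TP "
            f"{required_attn_tp}, got {got}"
        )

check_mimo_v2(8, 2, True, required_attn_tp=4)   # MiMo-V2.5: 8 // 2 == 4, accepted
check_mimo_v2(8, 1, False, required_attn_tp=8)  # MiMo-V2.5-Pro: plain --tp 8, accepted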

Serving

8-GPU multimodal + MTP example:

sglang serve \
  --trust-remote-code \
  --model-path XiaomiMiMo/MiMo-V2.5 \
  --enable-multimodal \
  --tp 8 \
  --dp 2 \
  --enable-dp-attention \
  --mm-enable-dp-encoder \
  --attention-backend fa3 \
  --mm-attention-backend fa3 \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --moe-dense-tp-size 1 \
  --mem-fraction-static 0.65 \
  --chunked-prefill-size 16384 \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --host 0.0.0.0 \
  --port 30000 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle
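
Once the server is up, a minimal client sketch of what the reasoning parser changes in the response (the endpoint follows the command above; reasoning_content is the field --reasoning-parser mimo populates):

import requests

# Assumes the server started with the command above is listening on port 30000.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "XiaomiMiMo/MiMo-V2.5",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 512,
    },
)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))  # <think> tokens land here
print("answer:", message["content"])                   # clean final answer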

Tests

  • source .venv/bin/activate && pre-commit run -a passed.
  • source .venv/bin/activate && python -m py_compile python/sglang/srt/configs/model_config.py python/sglang/srt/server_args.py python/sglang/srt/models/mimo_v2.py python/sglang/srt/models/mimo_v2_nextn.py passed.
  • Local startup argument matrix passed:
    • XiaomiMiMo/MiMo-V2.5 with --tp 8 is rejected: the effective attention TP is 8, but the checkpoint's fused qkv_proj layout requires 4.
    • XiaomiMiMo/MiMo-V2.5 with --tp 8 --dp 2 --enable-dp-attention is accepted: the effective attention TP is 4.
    • XiaomiMiMo/MiMo-V2.5-Pro with --tp 4 is rejected: the effective attention TP is 4, but the checkpoint's fused qkv_proj layout requires 8.
    • XiaomiMiMo/MiMo-V2.5-Pro with --tp 8 is accepted: the effective attention TP is 8.
  • Local registered GSM8K MTP run passed:
    • PYTHONPATH=python python -m pytest -s test/registered/8-gpu-models/test_mimo_models.py::TestMiMoV2::test_gsm8k
    • score=0.925
    • avg_spec_accept_length=3.3564
  • Added registered 8-GPU H200 coverage for XiaomiMiMo/MiMo-V2.5:
    • TestMiMoV2::test_gsm8k
    • TestMiMoV2::test_mmmu
    • GSM8K and MMMU share one MTP multimodal server.


github-actions bot added the Multi-modal (multi-modal language model) label on Apr 27, 2026
@JustinTong0323 (Collaborator) commented:

/rerun-failed-ci

2 similar comments

JustinTong0323 and others added 2 commits April 29, 2026 07:53
MiMo-V2.5 upstream flipped enable_thinking to true by default, causing
<think> blocks to appear in response content. Without a reasoning parser,
lmms-eval fails to extract answers and accuracy drops to ~0.28 (near
random). Adding --reasoning-parser mimo moves thinking tokens to
reasoning_content so lmms-eval receives clean final answers.
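
A rough sketch of the split the parser performs on a finished generation (illustrative only; SGLang's mimo reasoning parser is stream-aware and handles partial tags):

import re

def split_thinking(text: str) -> dict:
    # Route a leading <think>...</think> block to reasoning_content and
    # keep the remainder as the visible content.
    m = re.match(r"(?s)\s*<think>(.*?)</think>(.*)", text)
    if m:
        return {"reasoning_content": m.group(1).strip(), "content": m.group(2).strip()}
    return {"reasoning_content": None, "content": text}

print(split_thinking("<think>24*17 = 408</think>The answer is 408."))
# {'reasoning_content': '24*17 = 408', 'content': 'The answer is 408.'}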
@JustinTong0323 force-pushed the feat/support-mimo-v2-omni branch from af87123 to 7276d8c on April 29, 2026
@JustinTong0323 (Collaborator) commented:
/rerun-test test/registered/8-gpu-models/test_mimo_models.py

sgl-project deleted a comment from the github-actions bot on Apr 29, 2026
@github-actions (Contributor) commented:
8-gpu-h200 (1 test): View workflow run

cd test/ && python3 registered/8-gpu-models/test_mimo_models.py

@lukealonso commented:

This seems like a lot of specialized code and constraints to work around the fact that the released model checkpoint is in a weird pre-interleaved format that assumes specific TP sizes.

It's going to make it difficult to properly support different quantization methods (e.g. the quant I'm working on here: https://huggingface.co/lukealonso/MiMo-V2.5-NVFP4) if this model doesn't behave in a uniform way.

Wouldn't it be better to fix the checkpoint to be in a more standard form?

lmms-eval is unstable as a CI dependency: when generation hits max_tokens
or EOS before </think>, the reasoning parser routes everything to
reasoning_content and message.content becomes null. lmms-eval then drops
the response in res.extend([r for r in batch_responses if r is not None]),
which leaves trailing instances with empty resps and crashes the
take_first filter with IndexError.

Keep --reasoning-parser mimo because GSM8K still needs clean content under
enable_thinking=true (default for MiMo-V2.5).
@JustinTong0323 (Collaborator) commented:

Dropped the MMMU test on this branch: lmms-eval is unstable when paired with a reasoning parser.

Failure mode (run 25095255371, test_mmmu IndexError):

  1. With --reasoning-parser mimo and MiMo-V2.5's default enable_thinking=true, the model emits <think>...</think>answer.
  2. When generation hits max_tokens / EOS before </think>, sglang's reasoning parser routes the whole text into reasoning_content, leaving message.content = None (serving_chat.py: content = text if text else None).
  3. lmms-eval's OpenAI-compatible client then drops it in res.extend([r for r in batch_responses if r is not None]). Trailing instances end up with req.resps == [], and the take_first filter crashes on r[0] — never writes a result JSON.

lmms-eval doesn't pass chat_template_kwargs / extra_body, and there's no clean server-side switch to force enable_thinking=False per-request, so the only stable fix is to drop MMMU here. GSM8K still works since it uses /v1/completions (no chat template) and benefits from --reasoning-parser mimo, so I kept that flag.
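
A condensed, standalone illustration of that failure chain (the snippets mirror the lines quoted above but are not lmms-eval's actual code):

# Step 2: truncated generation -> the parser moved every token to
# reasoning_content, so the visible text is empty and content becomes None.
text = ""
content = text if text else None  # the serving_chat.py behavior quoted above

# Step 3: lmms-eval's OpenAI-compatible client silently drops the None response.
batch_responses = ["answer A", content, "answer C"]
res = []
res.extend([r for r in batch_responses if r is not None])  # len(res) == 2, not 3

# The trailing instance ends up with resps == [], and take_first crashes:
resps_per_instance = [["answer A"], ["answer C"], []]
answers = [r[0] for r in resps_per_instance]  # IndexError on the empty list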

@JustinTong0323 (Collaborator) commented:

(Quoting @lukealonso's comment above.)

Thanks for your feedback! I will forward it to the MiMo team. cc @Abatom

@Abatom (Contributor, Author) commented Apr 29, 2026

(Quoting @lukealonso's comment above.)

@lukealonso Can NVFP4 quantization be applied only to the experts in MoE, given that these parameters account for the vast majority of the total parameters?
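
To make the question concrete, an experts-only pass could look roughly like this. This is a toy sketch: fake_fp4 only simulates snapping to the FP4 (E2M1) value grid with a per-tensor scale (real NVFP4 uses blockwise scales and packed storage), and the ".experts." name match is an assumption about the checkpoint's module paths:

import torch

_FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
_FP4_GRID = torch.cat([-_FP4_POS.flip(0)[:-1], _FP4_POS])  # full signed E2M1 grid

def fake_fp4(w: torch.Tensor) -> torch.Tensor:
    # Snap each element to the nearest FP4 grid point under a per-tensor scale.
    scale = w.abs().max().clamp(min=1e-8) / 6.0
    idx = ((w / scale).unsqueeze(-1) - _FP4_GRID).abs().argmin(dim=-1)
    return _FP4_GRID[idx] * scale

def quantize_experts_only(state_dict: dict) -> dict:
    # Quantize only MoE expert weights (the bulk of the parameters); leave
    # attention, dense layers, and MTP weights in high precision.
    return {
        name: fake_fp4(w) if ".experts." in name and name.endswith(".weight") else w
        for name, w in state_dict.items()
    }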

@lukealonso commented Apr 30, 2026

@Abatom Sure, but the win is being able to run on fewer GPUs, i.e. not the hardcoded (and baked into the checkpoint) TP=4 and TP=8. For my quant I've already de-interleaved the attention projections so it's a more standard checkpoint and can be loaded as TP=2, but I'll have to fight the constraints in the modeling code in this PR.
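
For readers following along, a rough sketch of what de-interleaving means here. The layout below is an assumption for illustration (per-rank [q|k|v] blocks concatenated along the row dimension); the actual MiMo-V2.5 ordering may differ:

import torch

def deinterleave_qkv(fused: torch.Tensor, attn_tp: int,
                     q_rows: int, k_rows: int, v_rows: int) -> torch.Tensor:
    # Assumed checkpoint layout: rows are [q_0|k_0|v_0|q_1|k_1|v_1|...],
    # one (q, k, v) block per attention-TP rank baked into the checkpoint.
    per_rank = fused.view(attn_tp, q_rows + k_rows + v_rows, -1)
    q = per_rank[:, :q_rows].reshape(attn_tp * q_rows, -1)
    k = per_rank[:, q_rows:q_rows + k_rows].reshape(attn_tp * k_rows, -1)
    v = per_rank[:, q_rows + k_rows:].reshape(attn_tp * v_rows, -1)
    # Standard [Q|K|V] ordering, so the weight can be re-sharded at any TP size.
    return torch.cat([q, k, v], dim=0)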

An attribute on SpecDecodingMixin was renamed upstream; the merge from main
brought the new contract, but TestMiMoV2Flash still set the old name,
breaking test_bs_1_speed with an AttributeError.
# Conflicts:
#	test/registered/8-gpu-models/test_mimo_models.py
@JustinTong0323 (Collaborator) commented:

/rerun-failed-ci

1 similar comment

@ShangmingCai merged commit 651af06 into sgl-project:main on Apr 30, 2026
293 of 330 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
Co-authored-by: 张袁 <zhangyuan36@xiaomi.com>
Co-authored-by: 刘安岐 <liuanqi6@xiaomi.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
@smartssw commented May 8, 2026

When will HiCache support for MiMo-V2.5 be available?
I noticed that UnifiedRadixCache has already been merged into the main branch.
