
[Feature] Xiaomi MiMo-V2.5 day0 support #23811

Merged
ShangmingCai merged 65 commits into sgl-project:main from Abatom:feat/support-mimo-v2-omni on Apr 30, 2026

Conversation

@Abatom (Contributor) commented Apr 27, 2026

Summary

Adds day-0 support for XiaomiMiMo/MiMo-V2.5 in SGLang.

  • Registers MiMoV2ForCausalLM and the MiMoV2MTP draft model while keeping the legacy MiMoV2FlashForCausalLM name loadable.
  • Adds the MiMo-V2 multimodal model components for image, video, and audio, driven by the checkpoint's vision_config / audio_config.
  • Adds the MiMo-V2 multimodal processor for image, video, audio, and video+audio request inputs.
  • Supports the FP8 fused-QKV checkpoint format and skips draft-only MTP weights in the target model; MiMoV2MTP loads the draft weights for multi-layer EAGLE.
  • Enables MiMo parser/model config plumbing needed for reasoning, tool calls, multimodal scheduling, and multi-layer EAGLE.
  • Enforces the effective attention TP size that the checkpoint's fused qkv_proj layout requires for MiMoV2ForCausalLM: XiaomiMiMo/MiMo-V2.5 resolves to effective attention TP 4, and XiaomiMiMo/MiMo-V2.5-Pro resolves to effective attention TP 8. The same derived value is used by the target and MTP qkv_proj loaders (a sketch of the derivation follows below).
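
How the effective attention TP is derived: with --enable-dp-attention, each DP replica shards attention over tp_size / dp_size ranks. A minimal sketch of the check, with illustrative names rather than the exact SGLang internals:

# Hypothetical sketch of the effective-attention-TP check enforced for
# MiMoV2ForCausalLM; the real logic lives in SGLang's model-config plumbing.
def effective_attention_tp(tp_size: int, dp_size: int, enable_dp_attention: bool) -> int:
    # With DP attention, attention is sharded over tp/dp ranks per replica.
    return tp_size // dp_size if enable_dp_attention else tp_size

def check_mimo_v2(tp_size, dp_size, enable_dp_attention, required_attn_tp):
    got = effective_attention_tp(tp_size, dp_size, enable_dp_attention)
    if got != required_attn_tp:
        raise ValueError(
            f"fused qkv_proj layout requires effective attention TP "
            f"{required_attn_tp}, got {got}"
        )

check_mimo_v2(8, 2, True, required_attn_tp=4)   # MiMo-V2.5: 8 // 2 == 4, accepted
check_mimo_v2(8, 1, False, required_attn_tp=8)  # MiMo-V2.5-Pro: plain --tp 8, accepted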

Serving

8-GPU multimodal + MTP example:

sglang serve \
  --trust-remote-code \
  --model-path XiaomiMiMo/MiMo-V2.5 \
  --enable-multimodal \
  --tp 8 \
  --dp 2 \
  --enable-dp-attention \
  --mm-enable-dp-encoder \
  --attention-backend fa3 \
  --mm-attention-backend fa3 \
  --moe-a2a-backend deepep \
  --deepep-mode auto \
  --moe-dense-tp-size 1 \
  --mem-fraction-static 0.65 \
  --chunked-prefill-size 16384 \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --host 0.0.0.0 \
  --port 30000 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle
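
Once the server is up, a minimal client sketch of what the reasoning parser changes in the response (the endpoint follows the command above; reasoning_content is the field --reasoning-parser mimo populates):

import requests

# Assumes the server started with the command above is listening on port 30000.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "XiaomiMiMo/MiMo-V2.5",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 512,
    },
)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning_content"))  # <think> tokens land here
print("answer:", message["content"])                   # clean final answer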

Tests

  • source .venv/bin/activate && pre-commit run -a passed.
  • source .venv/bin/activate && python -m py_compile python/sglang/srt/configs/model_config.py python/sglang/srt/server_args.py python/sglang/srt/models/mimo_v2.py python/sglang/srt/models/mimo_v2_nextn.py passed.
  • Local startup argument matrix passed:
    • XiaomiMiMo/MiMo-V2.5 with --tp 8 is rejected: the effective attention TP is 8, but the checkpoint's fused qkv_proj layout requires 4.
    • XiaomiMiMo/MiMo-V2.5 with --tp 8 --dp 2 --enable-dp-attention is accepted: the effective attention TP is 4.
    • XiaomiMiMo/MiMo-V2.5-Pro with --tp 4 is rejected: the effective attention TP is 4, but the checkpoint's fused qkv_proj layout requires 8.
    • XiaomiMiMo/MiMo-V2.5-Pro with --tp 8 is accepted: the effective attention TP is 8.
  • Local registered GSM8K MTP run passed:
    • PYTHONPATH=python python -m pytest -s test/registered/8-gpu-models/test_mimo_models.py::TestMiMoV2::test_gsm8k
    • score=0.925
    • avg_spec_accept_length=3.3564
  • Added registered 8-GPU H200 coverage for XiaomiMiMo/MiMo-V2.5:
    • TestMiMoV2::test_gsm8k
    • TestMiMoV2::test_mmmu
    • GSM8K and MMMU share one MTP multimodal server.


github-actions bot added the Multi-modal (multi-modal language model) label on Apr 27, 2026
@JustinTong0323 (Collaborator) commented:

/rerun-failed-ci

2 similar comments

JustinTong0323 and others added 2 commits April 29, 2026 07:53
MiMo-V2.5 upstream flipped enable_thinking to true by default, causing
<think> blocks to appear in response content. Without a reasoning parser,
lmms-eval fails to extract answers and accuracy drops to ~0.28 (near
random). Adding --reasoning-parser mimo moves thinking tokens to
reasoning_content so lmms-eval receives clean final answers.
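
A rough sketch of the split the parser performs on a finished generation (illustrative only; SGLang's mimo reasoning parser is stream-aware and handles partial tags):

import re

def split_thinking(text: str) -> dict:
    # Route a leading <think>...</think> block to reasoning_content and
    # keep the remainder as the visible content.
    m = re.match(r"(?s)\s*<think>(.*?)</think>(.*)", text)
    if m:
        return {"reasoning_content": m.group(1).strip(), "content": m.group(2).strip()}
    return {"reasoning_content": None, "content": text}

print(split_thinking("<think>24*17 = 408</think>The answer is 408."))
# {'reasoning_content': '24*17 = 408', 'content': 'The answer is 408.'}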
@JustinTong0323 force-pushed the feat/support-mimo-v2-omni branch from af87123 to 7276d8c on April 29, 2026
@JustinTong0323 (Collaborator) commented:
/rerun-test test/registered/8-gpu-models/test_mimo_models.py

sgl-project deleted a comment from the github-actions bot on Apr 29, 2026
@github-actions (Contributor) commented:
8-gpu-h200 (1 test): View workflow run

cd test/ && python3 registered/8-gpu-models/test_mimo_models.py

@lukealonso commented:

This seems like a lot of specialized code and constraints to work around the fact that the released model checkpoint is in a weird pre-interleaved format that assumes specific TP sizes.

It's going to make it difficult to properly support different quantization methods (e.g. the quant I'm working on here: https://huggingface.co/lukealonso/MiMo-V2.5-NVFP4) if this model doesn't behave in a uniform way.

Wouldn't it be better to fix the checkpoint to be in a more standard form?

lmms-eval is unstable as a CI dependency: when generation hits max_tokens
or EOS before </think>, the reasoning parser routes everything to
reasoning_content and message.content becomes null. lmms-eval then drops
the response in res.extend([r for r in batch_responses if r is not None]),
which leaves trailing instances with empty resps and crashes the
take_first filter with IndexError.

Keep --reasoning-parser mimo because GSM8K still needs clean content under
enable_thinking=true (default for MiMo-V2.5).
@JustinTong0323 (Collaborator) commented:

Dropped the MMMU test on this branch: lmms-eval is unstable when paired with a reasoning parser.

Failure mode (run 25095255371, test_mmmu IndexError):

  1. With --reasoning-parser mimo and MiMo-V2.5's default enable_thinking=true, the model emits <think>...</think>answer.
  2. When generation hits max_tokens / EOS before </think>, sglang's reasoning parser routes the whole text into reasoning_content, leaving message.content = None (serving_chat.py: content = text if text else None).
  3. lmms-eval's OpenAI-compatible client then drops it in res.extend([r for r in batch_responses if r is not None]). Trailing instances end up with req.resps == [], and the take_first filter crashes on r[0] — never writes a result JSON.

lmms-eval doesn't pass chat_template_kwargs / extra_body, and there's no clean server-side switch to force enable_thinking=False per-request, so the only stable fix is to drop MMMU here. GSM8K still works since it uses /v1/completions (no chat template) and benefits from --reasoning-parser mimo, so I kept that flag.
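
A condensed, standalone illustration of that failure chain (the snippets mirror the lines quoted above but are not lmms-eval's actual code):

# Step 2: truncated generation -> the parser moved every token to
# reasoning_content, so the visible text is empty and content becomes None.
text = ""
content = text if text else None  # the serving_chat.py behavior quoted above

# Step 3: lmms-eval's OpenAI-compatible client silently drops the None response.
batch_responses = ["answer A", content, "answer C"]
res = []
res.extend([r for r in batch_responses if r is not None])  # len(res) == 2, not 3

# The trailing instance ends up with resps == [], and take_first crashes:
resps_per_instance = [["answer A"], ["answer C"], []]
answers = [r[0] for r in resps_per_instance]  # IndexError on the empty list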

@JustinTong0323 (Collaborator) commented:

(Quoting @lukealonso's comment above.)

Thanks for your feedback! I will forward it to the MiMo team. cc @Abatom

@Abatom (Contributor, Author) commented Apr 29, 2026

(Quoting @lukealonso's comment above.)

@lukealonso Can NVFP4 quantization be applied only to the experts in MoE, given that these parameters account for the vast majority of the total parameters?
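
To make the question concrete, an experts-only pass could look roughly like this. This is a toy sketch: fake_fp4 only simulates snapping to the FP4 (E2M1) value grid with a per-tensor scale (real NVFP4 uses blockwise scales and packed storage), and the ".experts." name match is an assumption about the checkpoint's module paths:

import torch

_FP4_POS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
_FP4_GRID = torch.cat([-_FP4_POS.flip(0)[:-1], _FP4_POS])  # full signed E2M1 grid

def fake_fp4(w: torch.Tensor) -> torch.Tensor:
    # Snap each element to the nearest FP4 grid point under a per-tensor scale.
    scale = w.abs().max().clamp(min=1e-8) / 6.0
    idx = ((w / scale).unsqueeze(-1) - _FP4_GRID).abs().argmin(dim=-1)
    return _FP4_GRID[idx] * scale

def quantize_experts_only(state_dict: dict) -> dict:
    # Quantize only MoE expert weights (the bulk of the parameters); leave
    # attention, dense layers, and MTP weights in high precision.
    return {
        name: fake_fp4(w) if ".experts." in name and name.endswith(".weight") else w
        for name, w in state_dict.items()
    }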

@lukealonso commented Apr 30, 2026

@Abatom Sure, but the win is being able to run on fewer GPUs, i.e. not the hardcoded (and baked into the checkpoint) TP=4 and TP=8. For my quant I've already de-interleaved the attention projections so it's a more standard checkpoint and can be loaded as TP=2, but I'll have to fight the constraints in the modeling code in this PR.
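
For readers following along, a rough sketch of what de-interleaving means here. The layout below is an assumption for illustration (per-rank [q|k|v] blocks concatenated along the row dimension); the actual MiMo-V2.5 ordering may differ:

import torch

def deinterleave_qkv(fused: torch.Tensor, attn_tp: int,
                     q_rows: int, k_rows: int, v_rows: int) -> torch.Tensor:
    # Assumed checkpoint layout: rows are [q_0|k_0|v_0|q_1|k_1|v_1|...],
    # one (q, k, v) block per attention-TP rank baked into the checkpoint.
    per_rank = fused.view(attn_tp, q_rows + k_rows + v_rows, -1)
    q = per_rank[:, :q_rows].reshape(attn_tp * q_rows, -1)
    k = per_rank[:, q_rows:q_rows + k_rows].reshape(attn_tp * k_rows, -1)
    v = per_rank[:, q_rows + k_rows:].reshape(attn_tp * v_rows, -1)
    # Standard [Q|K|V] ordering, so the weight can be re-sharded at any TP size.
    return torch.cat([q, k, v], dim=0)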

An attribute on SpecDecodingMixin was renamed upstream; the merge from main
brought the new contract, but TestMiMoV2Flash still set the old name,
breaking test_bs_1_speed with an AttributeError.
# Conflicts:
#	test/registered/8-gpu-models/test_mimo_models.py
@JustinTong0323 (Collaborator) commented:

/rerun-failed-ci

1 similar comment

@ShangmingCai merged commit 651af06 into sgl-project:main on Apr 30, 2026
293 of 330 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
Co-authored-by: 张袁 <zhangyuan36@xiaomi.com>
Co-authored-by: 刘安岐 <liuanqi6@xiaomi.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
@smartssw commented May 8, 2026

When will HiCache support for MiMo-V2.5 be available?
I noticed that UnifiedRadixCache has already been merged into the main branch.
