[Feature] Xiaomi MiMo-V2.5 day0 support #23811
ShangmingCai merged 65 commits into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
…mni audio Co-Authored-By: 张袁 <zhangyuan36@xiaomi.com>
/rerun-failed-ci
2 similar comments
/rerun-failed-ci
/rerun-failed-ci
MiMo-V2.5 upstream flipped `enable_thinking` to true by default, causing `<think>` blocks to appear in the response content. Without a reasoning parser, lmms-eval fails to extract answers and accuracy drops to ~0.28 (near random). Adding `--reasoning-parser mimo` moves the thinking tokens to `reasoning_content`, so lmms-eval receives clean final answers.
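To make the routing concrete, here is a minimal sketch of querying a server launched with `--reasoning-parser mimo`; the port and prompt are assumptions for illustration:

```python
# Minimal sketch: where the parser routes <think> tokens.
# Assumes an SGLang server started with --reasoning-parser mimo
# and listening on localhost:30000 (port chosen for illustration).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
msg = resp.choices[0].message
# With the parser enabled, the <think> block lands here ...
print("reasoning:", getattr(msg, "reasoning_content", None))
# ... and content holds only the clean final answer lmms-eval expects.
print("answer:", msg.content)
```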
Force-pushed from af87123 to 7276d8c
/rerun-test test/registered/8-gpu-models/test_mimo_models.py
✅
This seems like a lot of specialized code and constraints to work around the fact that the released model checkpoint is in a weird pre-interleaved format that assumes specific TP sizes. It's going to make it difficult to properly support different quantization methods (e.g. the quant I'm working on here: https://huggingface.co/lukealonso/MiMo-V2.5-NVFP4) if this model doesn't behave in a uniform way. Wouldn't it be better to fix the checkpoint to be in a more standard form?
lmms-eval is unstable as a CI dependency: when generation hits `max_tokens` or EOS before `</think>`, the reasoning parser routes everything to `reasoning_content` and `message.content` becomes null. lmms-eval then drops the response in `res.extend([r for r in batch_responses if r is not None])`, which leaves trailing instances with empty `resps` and crashes the `take_first` filter with `IndexError`. Keep `--reasoning-parser mimo` because GSM8K still needs clean content under `enable_thinking=true` (the default for MiMo-V2.5).
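A toy reproduction of that chain makes the crash mechanics visible; these are hypothetical stand-ins mirroring the quoted lmms-eval lines, not its actual code:

```python
# Toy reproduction: one nulled response shifts every later instance.
batch_responses = ["#### 42", None, "#### 7"]  # None: content nulled when
                                               # </think> never closes

res = []
res.extend([r for r in batch_responses if r is not None])  # the quoted filter
# res now holds 2 answers for 3 instances, so per-instance responses shift
# and the trailing instance ends up with an empty list.
resps_per_instance = [res[i:i + 1] for i in range(len(batch_responses))]

def take_first(resps):
    return resps[0]  # raises IndexError for the empty trailing instance

for resps in resps_per_instance:
    print(take_first(resps))
```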
Dropped the MMMU test on this branch: lmms-eval is unstable when paired with a reasoning parser. Failure mode (run 25095255371,
lmms-eval doesn't pass
Thanks for your feedback! I will forward this to the MiMo team. cc @Abatom
@lukealonso Can NVFP4 quantization be applied only to the experts in the MoE, given that these parameters account for the vast majority of the total parameters?
@Abatom Sure, but the win is being able to run on fewer GPUs, i.e. not the hardcoded (and baked into the checkpoint) TP=4 and TP=8. For my quant I've already de-interleaved the attention projections so it's a more standard checkpoint and can be loaded as TP=2, but I'll have to fight the constraints in the modeling code in this PR.
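For reference, a de-interleaving sketch along the lines described, assuming a rank-major `[q_i; k_i; v_i]` fused layout; the real MiMo-V2.5 checkpoint layout may differ:

```python
# Illustrative sketch of de-interleaving a fused qkv weight stored as
# per-rank [q_i; k_i; v_i] chunks for a fixed TP size. Shapes and layout
# here are assumptions, not MiMo-V2.5's actual checkpoint format.
import torch

def deinterleave_qkv(w: torch.Tensor, tp: int,
                     q_sz: int, k_sz: int, v_sz: int) -> torch.Tensor:
    """w: [(q_sz + k_sz + v_sz) * tp, hidden], rank-major.

    Returns a standard fused [Q; K; V] layout so the weight can be
    re-sharded for any TP degree instead of the baked-in one.
    """
    chunk = q_sz + k_sz + v_sz
    qs, ks, vs = [], [], []
    for rank in range(tp):
        part = w[rank * chunk:(rank + 1) * chunk]
        qs.append(part[:q_sz])                 # this rank's Q rows
        ks.append(part[q_sz:q_sz + k_sz])      # this rank's K rows
        vs.append(part[q_sz + k_sz:])          # this rank's V rows
    return torch.cat(qs + ks + vs, dim=0)
```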
`SpecDecodingMixin` renamed the attribute upstream; the merge from main brought the new contract, but `TestMiMoV2Flash` still set the old name, breaking `test_bs_1_speed` with an `AttributeError`.
# Conflicts:
#	test/registered/8-gpu-models/test_mimo_models.py
/rerun-failed-ci
1 similar comment
/rerun-failed-ci
Co-authored-by: 张袁 <zhangyuan36@xiaomi.com>
Co-authored-by: 刘安岐 <liuanqi6@xiaomi.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
When will HiCache support for MiMo-V2.5 be available?
Summary
Adds day-0 support for `XiaomiMiMo/MiMo-V2.5` in SGLang.

- Registers `MiMoV2ForCausalLM` and the `MiMoV2MTP` draft model while keeping the legacy `MiMoV2FlashForCausalLM` name loadable.
- `vision_config`/`audio_config` handling.
- `MiMoV2MTP` loads the draft weights for multi-layer EAGLE.
- `MiMoV2ForCausalLM` enforces the effective attention TP size required by the checkpoint's fused `qkv_proj` layout.
- `XiaomiMiMo/MiMo-V2.5` resolves to effective attention TP 4, and `XiaomiMiMo/MiMo-V2.5-Pro` resolves to effective attention TP 8; the same derived value is used by the target and MTP `qkv_proj` loaders.

Serving
8-GPU multimodal + MTP example:
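The example command itself is not preserved above; as a stand-in, here is a hedged sketch using SGLang's offline `Engine` API, where the speculative/MTP settings are assumptions rather than the PR's actual flags:

```python
# Hedged sketch of an 8-GPU + MTP launch via SGLang's Python Engine API.
# The speculative values below are placeholders, not the PR's lost command.
import sglang as sgl

llm = sgl.Engine(
    model_path="XiaomiMiMo/MiMo-V2.5-Pro",  # Pro requires effective attention TP 8
    tp_size=8,
    speculative_algorithm="EAGLE",          # MTP draft served via MiMoV2MTP;
    speculative_num_steps=3,                # a separate draft-model path may
    speculative_eagle_topk=1,               # need to be supplied
    speculative_num_draft_tokens=4,
    reasoning_parser="mimo",                # keep <think> out of content
)
print(llm.generate("What is 17 * 24?", {"max_new_tokens": 32}))
llm.shutdown()
```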
Tests
- `source .venv/bin/activate && pre-commit run -a` passed.
- `source .venv/bin/activate && python -m py_compile python/sglang/srt/configs/model_config.py python/sglang/srt/server_args.py python/sglang/srt/models/mimo_v2.py python/sglang/srt/models/mimo_v2_nextn.py` passed.
- `XiaomiMiMo/MiMo-V2.5` with `--tp 8` rejects because effective attention TP is 8 while the checkpoint's fused `qkv_proj` layout requires 4.
- `XiaomiMiMo/MiMo-V2.5` with `--tp 8 --dp 2 --enable-dp-attention` allows because effective attention TP is 4.
- `XiaomiMiMo/MiMo-V2.5-Pro` with `--tp 4` rejects because effective attention TP is 4 while the checkpoint's fused `qkv_proj` layout requires 8.
- `XiaomiMiMo/MiMo-V2.5-Pro` with `--tp 8` allows because effective attention TP is 8.
- `PYTHONPATH=python python -m pytest -s test/registered/8-gpu-models/test_mimo_models.py::TestMiMoV2::test_gsm8k`: score=0.925, avg_spec_accept_length=3.3564.
- Registered for `XiaomiMiMo/MiMo-V2.5`: `TestMiMoV2::test_gsm8k`, `TestMiMoV2::test_mmmu`.
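The accept/reject matrix above reduces to a simple rule; here is a sketch with illustrative names (not SGLang's actual API):

```python
# Sketch of the accept/reject rule exercised by the tests above.
# Function and dict names are illustrative, not SGLang internals.

def effective_attention_tp(tp: int, dp: int, enable_dp_attention: bool) -> int:
    # With DP attention, each attention replica spans tp // dp ranks.
    return tp // dp if enable_dp_attention else tp

def check_checkpoint_tp(model: str, tp: int, dp: int = 1,
                        enable_dp_attention: bool = False) -> None:
    required = {"XiaomiMiMo/MiMo-V2.5": 4, "XiaomiMiMo/MiMo-V2.5-Pro": 8}[model]
    got = effective_attention_tp(tp, dp, enable_dp_attention)
    if got != required:
        raise ValueError(
            f"{model}: fused qkv_proj layout requires effective attention "
            f"TP {required}, got {got}")

check_checkpoint_tp("XiaomiMiMo/MiMo-V2.5", tp=8, dp=2,
                    enable_dp_attention=True)       # allowed: effective TP 4
check_checkpoint_tp("XiaomiMiMo/MiMo-V2.5-Pro", tp=8)  # allowed: effective TP 8
check_checkpoint_tp("XiaomiMiMo/MiMo-V2.5", tp=8)      # raises: effective TP 8 != 4
```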