[Feature] Xiaomi MiMo-V2.5-Pro day0 support #23808
Merged: ispobock merged 3 commits into sgl-project:main on Apr 28, 2026.
Conversation
Contributor: Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Force-pushed from 8a3c83b to 4af6985.
acelyc111 approved these changes on Apr 27, 2026.
acelyc111: I didn't see a 1T-parameter model coming. When is the release?
Collaborator: /rerun-failed-ci
Contributor (Author): /rerun-failed-ci
Force-pushed from 4bae823 to 7cb04d2.
Contributor (Author): /rerun-failed-ci
Force-pushed from 7cb04d2 to bf8db20.
Force-pushed from fb2e0b2 to c6f8c05.
ispobock approved these changes on Apr 27, 2026.
Collaborator: /rerun-test test_mimo_models.py
Contributor: ✅
Force-pushed from c6f8c05 to d11318c.
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026.
Motivation
MiMo-V2.5-Pro is a Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It uses a hybrid attention architecture and the 3-layer Multi-Token Prediction (MTP) scheme described in MiMo-V2-Flash. The context length is up to 1M tokens.
Modifications
Accuracy Tests
Speed Tests and Profiling
Benchmark command
Prefill performance
Test setting: EP16, chunk_size = 32K, output length = 1 token, cache flushed.
For input lengths up to 256K, we tested with the benchmark commands above and checked the final output / logs to confirm that the requests were processed correctly.
For input lengths >= 512K, we sent two requests and ensured that they were routed to two different DP ranks. We then checked the bench_serving output and recorded the single-node prefill throughput.
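The exact benchmark commands are not reproduced here; as a rough sketch only, a prefill-oriented sglang.bench_serving run matching this setting might look like the following. The input length, prompt count, host, and port are illustrative assumptions rather than values from the original report.

```bash
# Hypothetical prefill benchmark sketch (values are assumptions):
# long random inputs, a single output token, and two prompts so that
# they can land on two different DP ranks.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 --port 9001 \
  --dataset-name random \
  --random-input-len 524288 \
  --random-output-len 1 \
  --num-prompts 2
```

Decode performance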
Test setting: fixed 16K input and 1K output. We compared single-node decode throughput with and without 3-layer MTP.
With 3-layer MTP enabled, decode throughput improves significantly:
These results show that the model can run correctly with long-context prefill up to 1M tokens, and 3-layer MTP provides a clear decode-side throughput improvement under the tested 16K-input / 1K-output setting.
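The decode comparison can be sketched the same way; the command below is illustrative, not the exact one used. It fixes a 16K input and 1K output, and the MTP-off baseline corresponds to launching the server without the --speculative-* flags shown in the launch command below.

```bash
# Hypothetical decode benchmark sketch (values are assumptions):
# fixed 16K random input, 1K output. Run against a server launched
# with and without the speculative/MTP flags to reproduce the comparison.
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 0.0.0.0 --port 9001 \
  --dataset-name random \
  --random-input-len 16384 \
  --random-output-len 1024 \
  --num-prompts 128
```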
Launch Command example
```bash
SGLANG_ENABLE_SPEC_V2=1 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 \
python3 -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2.5-Pro \
  --trust-remote-code \
  --pp-size 1 \
  --dp-size 2 \
  --ep-size 16 \
  --tp-size 16 \
  --moe-dense-tp-size 1 \
  --enable-dp-attention \
  --moe-a2a-backend deepep \
  --dist-init-addr ${LWS_LEADER_IP}:20000 \
  --node-rank ${LWS_WORKER_INDEX} \
  --nnodes ${LWS_GROUP_SIZE} \
  --page-size 64 \
  --attention-backend fa3 \
  --quantization fp8 \
  --mem-fraction-static 0.7 \
  --max-running-requests 128 \
  --cuda-graph-max-bs 64 \
  --chunked-prefill-size 32768 \
  --context-length 1048576 \
  --tokenizer-worker-num 64 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle \
  --host 0.0.0.0 \
  --port 9001 \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --watchdog-timeout 3600 \
  --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}'
```

The MiMo-V2.5-Pro FP8 checkpoint uses a fused QKV projection layout exported with attention TP=8.
The FP8 quantization is performed independently on each attention TP shard before the shards are concatenated into the HF checkpoint. Therefore, the fused qkv_proj checkpoint is TP-rank-interleaved rather than a flat [Q_all | K_all | V_all] layout.
For MiMo-V2.5-Pro with attention TP=8, each attention TP shard holds a fused QKV slice with a row dimension of 3392.
Since FP8 block-wise quantization uses 128x128 blocks, each TP shard's row dimension is padded independently for scale generation:
ceil(3392 / 128) = 27
So each shard stores 3392 weight rows with 27 scale rows.
After concatenating the 8 shards along the row dimension, the HF checkpoint stores 8 * 3392 = 27136 fused QKV weight rows plus the per-shard scales.
Note that the scale rows are 8 * ceil(3392 / 128) = 216, not ceil((3392 * 8) / 128) = 212, because quantization is done independently per TP shard.
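As a quick sanity check on this arithmetic, the shell sketch below recomputes both layouts; the 3392-row shard size, 8-way sharding, and 128x128 block size are the figures quoted above.

```bash
# Recompute the scale-row counts for per-shard vs. flat quantization.
rows_per_shard=3392   # fused QKV rows per attention TP shard
num_shards=8          # attention TP size used at export time
block=128             # FP8 block-wise quantization block size

# ceil(a / b) == (a + b - 1) / b in integer arithmetic
per_shard=$(( (rows_per_shard + block - 1) / block ))
echo "per-shard: $(( num_shards * per_shard )) scale rows"                             # 216
echo "flat:      $(( (num_shards * rows_per_shard + block - 1) / block )) scale rows"  # 212
```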
Therefore, this checkpoint requires runtime attention TP=8 for the fused QKV loading path. With DP attention enabled, the derived attention TP size must be 8; e.g. --tp-size 16 --dp-size 2 gives attention TP = 16 / 2 = 8.
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci