
docs: enable MiMo V2.5 MTP cookbook path #23945

Merged
wisclmy0611 merged 1 commit into sgl-project:main from JustinTong0323:docs/mimo-v25-mtp-cookbook
Apr 28, 2026

Conversation

@JustinTong0323
Collaborator

Summary

  • Enable EAGLE MTP for MiMo-V2.5 in the cookbook command generator.
  • Update the MiMo-V2.5 deployment notes to describe the checkpoint MTP path and the required Hopper flags.
  • Replace the MiMo-V2.5 speed benchmark results with runs collected from the EAGLE MTP configuration.

Serving configuration used for the MiMo-V2.5 benchmark

  • Image: lmsysorg/sglang:dev-mimo-v2.5
  • Model: XiaomiMiMo/MiMo-V2.5
  • Parallelism: --tp 8 --dp 2 --enable-dp-attention --enable-dp-lm-head --mm-enable-dp-encoder
  • MTP: SGLANG_ENABLE_SPEC_V2=1 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-multi-layer-eagle
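Assembled into a single command, the configuration above would look roughly like the following. This is a sketch, not the exact command used for the benchmark runs: it assumes the standard `python -m sglang.launch_server` entry point and that the model is referenced by its Hugging Face ID.

```shell
# Sketch of the serving command assembled from the flags listed above.
# Assumes the lmsysorg/sglang:dev-mimo-v2.5 image with 8 GPUs visible;
# adjust --model-path if serving from a local checkpoint directory.
SGLANG_ENABLE_SPEC_V2=1 python -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2.5 \
  --tp 8 --dp 2 \
  --enable-dp-attention --enable-dp-lm-head --mm-enable-dp-encoder \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-multi-layer-eagle
```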

Benchmark results

  • Latency, 10 prompts, concurrency 1: 0.68 req/s, 190.09 output tok/s, accept length 3.08.
  • Throughput, 1000 prompts, concurrency 100: 10.71 req/s, 2095.97 output tok/s, accept length 2.95.
  • Image, 10 prompts, 2 random 720p images per request: 0.39 req/s, 164.03 output tok/s, accept length 2.94.

Validation

  • pre-commit run --files docs_new/src/snippets/autoregressive/mimo-v25-deployment.jsx docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2.5.mdx
  • cd docs_new && mint validate
  • git diff --check

@JustinTong0323 force-pushed the docs/mimo-v25-mtp-cookbook branch from 89226e7 to e701d30 on April 28, 2026 at 15:01.
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request updates the Xiaomi MiMo-V2.5 documentation and deployment logic to reflect that EAGLE speculative decoding is supported on both the base and Pro variants. It also updates benchmark results for the H200 GPU and corrects the speculative algorithm flag name. Feedback was provided regarding the clarity of the DP configuration command and the validity of the multimodal benchmark data, which currently contains zeroed-out metrics.


**MiMo-V2.5 (310B):**

Before:
- The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group **must** be 4. So DP-attention is always required (`--dp = TP / 4`), and total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2Omni fused qkv_proj checkpoint is TP=4-interleaved; got tp_size=8`.

After:
- The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group **must** be 4. Use `--dp = TP / 4`; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8`.
Contributor


medium

The expression --dp = TP / 4 might be misinterpreted as a literal command-line argument including the equals sign and spaces. It is clearer to state "Set --dp to TP / 4" or use a placeholder like --dp <TP/4> to avoid confusion.

- The checkpoint has a TP=4-interleaved fused qkv_proj; attention-TP per DP group must be 4. Set --dp to TP / 4; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare --tp 8 without --dp 2 will fail to load with MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8.
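The arithmetic behind this constraint can be sketched as a small validator. This helper is hypothetical and not part of the SGLang codebase; it only illustrates the rule the note describes (attention-TP per DP group must be 4, so `--dp` is TP divided by 4):

```python
# Hypothetical helper illustrating the TP=4-interleaved qkv_proj constraint
# for MiMo-V2.5; not part of SGLang itself.
def dp_for_mimo_v25(tp_size: int) -> int:
    """Return the --dp value that keeps attention-TP per DP group at 4."""
    if tp_size % 4 != 0:
        # Mirrors the checkpoint-load failure described in the review note.
        raise ValueError(
            f"MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; "
            f"got attention tp_size={tp_size}"
        )
    return tp_size // 4  # e.g. --tp 8 requires --dp 2


print(dp_for_mimo_v25(8))  # -> 2
```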

Comment on lines +601 to +628
```
Total generated tokens (retokenized): 0
Request throughput (req/s): 0.39
Input token throughput (tok/s): 25.69
Output token throughput (tok/s): 164.03
Peak output token throughput (tok/s): 1.00
Peak concurrent requests: 2
Total token throughput (tok/s): 542.76
Total token throughput (tok/s): 189.73
Concurrency: 1.00
Accept length: 2.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4186.79
Median E2E Latency (ms): 3366.20
P90 E2E Latency (ms): 7545.54
P99 E2E Latency (ms): 9180.85
Mean E2E Latency (ms): 2570.74
Median E2E Latency (ms): 2411.92
P90 E2E Latency (ms): 3711.62
P99 E2E Latency (ms): 4949.74
---------------Time to First Token----------------
Mean TTFT (ms): 1284.90
Median TTFT (ms): 622.81
P99 TTFT (ms): 5030.79
Mean TTFT (ms): 0.00
Median TTFT (ms): 0.00
P99 TTFT (ms): 0.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.36
Median TPOT (ms): 8.45
P99 TPOT (ms): 10.94
Mean TPOT (ms): 7.31
Median TPOT (ms): 6.17
P99 TPOT (ms): 17.18
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.54
Median ITL (ms): 9.45
P95 ITL (ms): 9.58
P99 ITL (ms): 11.12
Max ITL (ms): 37.67
Mean ITL (ms): 0.00
Median ITL (ms): 0.00
P95 ITL (ms): 0.00
P99 ITL (ms): 0.00
Max ITL (ms): 0.00
```
Contributor


medium

The benchmark results for the multimodal image run (Section 5.3.3) contain several 0.00 or 0 values for critical metrics such as Total generated tokens (retokenized), Mean TTFT, and Inter-Token Latency. Additionally, the Peak output token throughput is reported as 1.00. These values suggest that the benchmark data was not captured correctly or is incomplete. Please update this section with valid benchmark results.

@wisclmy0611 merged commit e458a92 into sgl-project:main on Apr 28, 2026.
42 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
