docs: enable MiMo V2.5 MTP cookbook path #23945
Code Review
This pull request updates the Xiaomi MiMo-V2.5 documentation and deployment logic to reflect that EAGLE speculative decoding is supported on both the base and Pro variants. It also updates benchmark results for the H200 GPU and corrects the speculative algorithm flag name. Feedback was provided regarding the clarity of the DP configuration command and the validity of the multimodal benchmark data, which currently contains zeroed-out metrics.
| **MiMo-V2.5 (310B):** | ||
| - The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group **must** be 4. So DP-attention is always required (`--dp = TP / 4`), and total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2Omni fused qkv_proj checkpoint is TP=4-interleaved; got tp_size=8`. | ||
| - The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group **must** be 4. Use `--dp = TP / 4`; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8`. |
The expression `--dp = TP / 4` might be misinterpreted as a literal command-line argument including the equals sign and spaces. It is clearer to state "Set `--dp` to TP / 4" or use a placeholder like `--dp <TP/4>` to avoid confusion.
- The checkpoint has a TP=4-interleaved fused `qkv_proj`; attention-TP per DP group **must** be 4. Set `--dp` to TP / 4; for TP > 4 this also requires DP-attention. Total GPUs must be a multiple of 4. A bare `--tp 8` without `--dp 2` will fail to load with `MiMoV2 fused qkv_proj checkpoint is TP=4-interleaved; got attention tp_size=8`.
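The sizing rule discussed in this thread can be sketched as a small helper. This is a hypothetical illustration, not SGLang code: the names `dp_for_tp` and `launch_flags` and the error message are assumptions. Only the constraint itself (attention-TP of 4 per DP group, `--dp` set to TP / 4, DP-attention for TP > 4, GPU count a multiple of 4) comes from the doc text under review.

```python
# Hypothetical helper illustrating the "--dp = TP / 4" rule; not part of SGLang.
ATTENTION_TP = 4  # from the doc: the fused qkv_proj checkpoint is TP=4-interleaved


def dp_for_tp(tp: int) -> int:
    """Return the --dp value implied by a given --tp."""
    if tp <= 0 or tp % ATTENTION_TP != 0:
        raise ValueError(
            f"--tp must be a positive multiple of {ATTENTION_TP}, got {tp}"
        )
    return tp // ATTENTION_TP


def launch_flags(tp: int) -> list[str]:
    """Build the parallelism flags for a given --tp."""
    flags = ["--tp", str(tp), "--dp", str(dp_for_tp(tp))]
    # Per the doc text, TP > 4 additionally requires DP-attention.
    if tp > ATTENTION_TP:
        flags.append("--enable-dp-attention")
    return flags


if __name__ == "__main__":
    print(" ".join(launch_flags(8)))
```

Writing the rule as a function also shows why the placeholder form reads better in prose: the value of `--dp` is derived from `--tp`, not typed literally.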
| Total generated tokens (retokenized): 0 | ||
| Request throughput (req/s): 0.39 | ||
| Input token throughput (tok/s): 25.69 | ||
| Output token throughput (tok/s): 164.03 | ||
| Peak output token throughput (tok/s): 1.00 | ||
| Peak concurrent requests: 2 | ||
| Total token throughput (tok/s): 542.76 | ||
| Total token throughput (tok/s): 189.73 | ||
| Concurrency: 1.00 | ||
| Accept length: 2.94 | ||
| ----------------End-to-End Latency---------------- | ||
| Mean E2E Latency (ms): 4186.79 | ||
| Median E2E Latency (ms): 3366.20 | ||
| P90 E2E Latency (ms): 7545.54 | ||
| P99 E2E Latency (ms): 9180.85 | ||
| Mean E2E Latency (ms): 2570.74 | ||
| Median E2E Latency (ms): 2411.92 | ||
| P90 E2E Latency (ms): 3711.62 | ||
| P99 E2E Latency (ms): 4949.74 | ||
| ---------------Time to First Token---------------- | ||
| Mean TTFT (ms): 1284.90 | ||
| Median TTFT (ms): 622.81 | ||
| P99 TTFT (ms): 5030.79 | ||
| Mean TTFT (ms): 0.00 | ||
| Median TTFT (ms): 0.00 | ||
| P99 TTFT (ms): 0.00 | ||
| -----Time per Output Token (excl. 1st token)------ | ||
| Mean TPOT (ms): 7.36 | ||
| Median TPOT (ms): 8.45 | ||
| P99 TPOT (ms): 10.94 | ||
| Mean TPOT (ms): 7.31 | ||
| Median TPOT (ms): 6.17 | ||
| P99 TPOT (ms): 17.18 | ||
| ---------------Inter-Token Latency---------------- | ||
| Mean ITL (ms): 9.54 | ||
| Median ITL (ms): 9.45 | ||
| P95 ITL (ms): 9.58 | ||
| P99 ITL (ms): 11.12 | ||
| Max ITL (ms): 37.67 | ||
| Mean ITL (ms): 0.00 | ||
| Median ITL (ms): 0.00 | ||
| P95 ITL (ms): 0.00 | ||
| P99 ITL (ms): 0.00 | ||
| Max ITL (ms): 0.00 |
The benchmark results for the multimodal image run (Section 5.3.3) contain several 0.00 or 0 values for critical metrics such as Total generated tokens (retokenized), Mean TTFT, and Inter-Token Latency. Additionally, the Peak output token throughput is reported as 1.00. These values suggest that the benchmark data was not captured correctly or is incomplete. Please update this section with valid benchmark results.
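A check like the one this comment describes could be run before benchmark output is pasted into the docs. The following is a minimal sketch, assuming the output keeps the `Label: value` shape shown in the table above; the function name `zeroed_metrics` and the sample text are illustrative and not part of SGLang.

```python
import re

# Matches lines such as "| Mean TTFT (ms): 0.00 | ||" or "Mean TTFT (ms): 0.00".
# Leading and trailing pipes or whitespace are tolerated; separator rows
# without a colon simply do not match.
METRIC_LINE = re.compile(
    r"^[\s|]*(?P<label>[^:|]+?):\s*(?P<value>-?\d+(?:\.\d+)?)[\s|]*$"
)


def zeroed_metrics(report: str) -> list[str]:
    """Return the labels of all metrics whose reported value is exactly zero."""
    zeroed = []
    for line in report.splitlines():
        match = METRIC_LINE.match(line)
        if match and float(match.group("value")) == 0.0:
            zeroed.append(match.group("label").strip())
    return zeroed


if __name__ == "__main__":
    sample = "\n".join(
        [
            "| Mean TTFT (ms): 0.00 | ||",
            "| Mean TPOT (ms): 7.31 | ||",
            "| Max ITL (ms): 0.00 |",
        ]
    )
    for label in zeroed_metrics(sample):
        print(f"suspicious zero value: {label}")
```

A zero is not always an error, so the result is best treated as a list of values to inspect rather than a hard failure.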
Summary
Serving configuration used for the MiMo-V2.5 benchmark
- `lmsysorg/sglang:dev-mimo-v2.5`
- `XiaomiMiMo/MiMo-V2.5`
- `--tp 8 --dp 2 --enable-dp-attention --enable-dp-lm-head --mm-enable-dp-encoder`
- `SGLANG_ENABLE_SPEC_V2=1 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-multi-layer-eagle`

Benchmark results
Validation
- `pre-commit run --files docs_new/src/snippets/autoregressive/mimo-v25-deployment.jsx docs_new/cookbook/autoregressive/Xiaomi/MiMo-V2.5.mdx`
- `cd docs_new && mint validate`
- `git diff --check`