[Feature] Xiaomi MiMo-V2-Flash Optimization #15208
Status: Open
Commits (titles truncated as scraped):
- …ITE_LONGER_CONTEXT_LEN=1
- feat: moe_fused_gate: support 256/1; update supported cases; add compile param; compatible with other models
- …et_cpu_copy and load_cpu_copy
Pull request overview
This PR adds support for the Xiaomi MiMo V2 Flash model (version 0.5.5), introducing Multi-Token Prediction (MTP) capabilities and various improvements to the speculative decoding infrastructure.
Key Changes:
- Added MTP (Multi-Token Prediction) worker implementation for MiMo V2 Flash model
- Enhanced EAGLE speculative decoding with idle batch support and NPU graph runners
- Expanded CUDA kernel support for larger VPT values (up to 256) in the MoE fused gate (see the gating sketch after this list)
- Added SWA (Sliding Window Attention) offload unit tests and kernel optimizations
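For intuition, here is a minimal pure-PyTorch reference of the grouped top-k gating that a fused MoE gate kernel computes in a single pass. This is an illustrative sketch only: the function name, parameters, and the sigmoid-based scoring are assumptions, not the actual `moe_fused_gate` API, and the real kernel fuses all of these steps on the GPU.

```python
import torch

def grouped_topk_gate_reference(logits, num_expert_group, topk_group, topk):
    # Illustrative reference only; not the real moe_fused_gate signature.
    # logits: [num_tokens, num_experts]; num_experts must divide evenly
    # into num_expert_group contiguous groups (e.g. 256 experts / 1 group).
    scores = logits.sigmoid()
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_expert_group

    # Score each expert group by its best expert.
    group_scores = scores.view(num_tokens, num_expert_group, -1).max(dim=-1).values
    # Keep only the top-scoring groups.
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    score_mask = group_mask.repeat_interleave(experts_per_group, dim=1)

    # Top-k experts within the surviving groups, weights renormalized.
    masked = scores.masked_fill(score_mask == 0, float("-inf"))
    topk_weights, topk_ids = masked.topk(topk, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```

With `num_expert_group=1`, the group-selection stage is a no-op and this degenerates to a plain top-k over all 256 experts, which appears to be the "256/1" case the commit title refers to.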
Reviewed changes
Copilot reviewed 91 out of 91 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| python/sglang/srt/speculative/mtp_worker_v2.py | Core MTP worker implementation for speculative decoding |
| python/sglang/srt/speculative/mtp_worker.py | Legacy MTP worker implementation |
| python/sglang/srt/speculative/eagle_worker_v2.py | EAGLE worker updates with idle batch and NPU support |
| python/sglang/srt/models/mimo_v2_flash_nextn.py | MiMo V2 MTP model architecture |
| sgl-kernel/csrc/moe/moe_fused_gate.cu | MoE gate kernel with increased VPT support |
| sgl-kernel/build.sh | Build script refactored to use Docker buildx |
| python/sglang/srt/nvtx_utils.py | NVTX profiling utilities |
| test/srt/test_swa_offload_unit.py | Unit tests for SWA offload functionality |
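To place the MTP worker files in context, below is a deliberately simplified sketch of the draft-then-verify loop that multi-token prediction speculative decoding performs. All names here are hypothetical, not taken from this PR; the actual `mtp_worker_v2.py` additionally handles batching, KV-cache management, and graph capture, none of which appear in this sketch.

```python
import torch

@torch.no_grad()
def mtp_generate_step(target_model, draft_head, input_ids, num_draft_tokens=3):
    # Hypothetical sketch of MTP speculative decoding (greedy verification,
    # batch size 1); not the actual mtp_worker_v2 implementation.

    # 1) Draft phase: the cheap MTP head proposes candidate tokens
    #    autoregressively.
    draft_ids = input_ids
    for _ in range(num_draft_tokens):
        logits = draft_head(draft_ids)                    # [1, seq, vocab]
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) Verify phase: a single target-model forward pass scores every
    #    drafted position at once.
    target_logits = target_model(draft_ids)               # [1, seq, vocab]
    prefix_len = input_ids.shape[1]
    accepted = input_ids
    for i in range(num_draft_tokens):
        # The target's prediction for the position the i-th draft token fills.
        pred = target_logits[:, prefix_len - 1 + i, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, pred], dim=-1)
        if pred.item() != draft_ids[0, prefix_len + i].item():
            break  # first mismatch: keep the target's token, discard the rest
    return accepted
```

Verification here is greedy and batch-size-1 for brevity; the payoff of the scheme is that every accepted draft token costs a fraction of a target forward pass instead of a full one.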
Comments suppressed due to low confidence (4)
- scripts/ci/npu_ci_install_dependency.sh:1 - The date '20251110' (November 10, 2025) is in the future. Verify this is the intended version tag and not a typo.
- python/sglang/srt/server_args.py:1 - The help text contains a grammatical issue: "the ratio will be calculated automatically if it's less than 0" should clarify that negative values trigger automatic calculation.
- python/sglang/srt/mem_cache/memory_pool.py:1 - Commented-out code contains a typo: 'eturn' should be 'return'. Either fix the typo or remove the commented code.
- python/sglang/srt/model_executor/model_runner.py:1 - The comment 'hack here' is too vague. Explain what the hack is doing and why it's necessary.
What is the status of this PR? It looks solid, but there has been no activity in the last few weeks.
Motivation
MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.
See it on HF: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
LMSys blog: https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
Modifications
Related work:
- MiMo-V2-Flash Day 0 Support and Continuous Optimization #15263
- MiMo-V2-Flash day0 support #15207

Accuracy Tests
Benchmarking and Profiling
MiMo-V2-Flash Prefill Benchmark (Radix Cache Disabled):

Checklist