
[Feature] Xiaomi MiMo-V2-Flash Optimization #15208

Open
acelyc111 wants to merge 86 commits into main from xiaomi-mimo-v2-flash-0.5.5

Conversation

Collaborator

acelyc111 commented Dec 15, 2025

Motivation

MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.

See it on HF: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
LMSys blog: https://lmsys.org/blog/2025-12-16-mimo-v2-flash/

Modifications

Accuracy Tests

Benchmarking and Profiling

MiMo-V2-Flash Prefill Benchmark (Radix Cache Disabled):

Checklist

acelyc111 and others added 30 commits November 29, 2025 20:44
feat: support moe_fused_gate

support 256/1

support 256/1

update supported cases

add compile param

compatible for other models

compatible for other models
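The "support 256/1" commits extend moe_fused_gate to the 256-experts, top-1 case. As a hedged functional reference only (the real moe_fused_gate is a CUDA kernel that fuses score activation and top-k selection in one pass; its actual signature may differ), the computation for a 256/1 case looks roughly like:

```python
import numpy as np

def fused_gate_reference(scores, topk=1):
    """Illustrative reference, not the kernel's API.
    scores: [tokens, num_experts] raw router logits."""
    # numerically stable softmax over experts
    m = scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores - m)
    probs /= probs.sum(axis=-1, keepdims=True)
    # select top-k expert ids and their routing weights
    idx = np.argpartition(probs, -topk, axis=-1)[:, -topk:]
    w = np.take_along_axis(probs, idx, axis=-1)
    w = w / w.sum(axis=-1, keepdims=True)  # renormalize over chosen experts
    return w, idx

rng = np.random.default_rng(1)
# "256/1": 256 experts, top-1 routing, for a batch of 8 tokens
w, idx = fused_gate_reference(rng.standard_normal((8, 256)), topk=1)
```

With topk=1 the renormalized weight per token is exactly 1.0; the kernel's value lies in doing this selection without materializing intermediates in global memory.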

Copilot AI left a comment


Pull request overview

This PR adds support for Xiaomi MiMo V2 Flash model version 0.5.5, introducing Multi-Token Prediction (MTP) capabilities and various improvements to speculative decoding infrastructure.

Key Changes:

  • Added MTP (Multi-Token Prediction) worker implementation for MiMo V2 Flash model
  • Enhanced EAGLE speculative decoding with idle batch support and NPU graph runners
  • Expanded CUDA kernel support for larger VPT values (up to 256) in MoE fused gate
  • Added SWA (Sliding Window Attention) offload unit tests and kernel optimizations
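The MTP worker follows the usual draft-then-verify pattern of speculative decoding. A minimal sketch (greedy acceptance; all names here are illustrative, not SGLang's actual API, and the real worker batches verification into a single target forward pass rather than looping token by token):

```python
def speculative_step(ctx, draft_next, target_next, num_draft=3):
    """ctx: list of token ids. draft_next/target_next: fn(tokens) -> next id."""
    # 1) the draft head proposes num_draft tokens autoregressively
    proposal, cur = [], list(ctx)
    for _ in range(num_draft):
        t = draft_next(cur)
        proposal.append(t)
        cur.append(t)
    # 2) the target model verifies: accept the longest agreeing prefix,
    #    then substitute the target's own token at the first mismatch
    accepted, cur = [], list(ctx)
    for t in proposal:
        v = target_next(cur)
        if v != t:
            accepted.append(v)  # take the target's token and stop
            break
        accepted.append(t)
        cur.append(t)
    return ctx + accepted

# toy models: both draft and target predict x + 1, so all drafts are accepted
nxt = lambda toks: toks[-1] + 1
out = speculative_step([1, 2, 3], nxt, nxt)  # -> [1, 2, 3, 4, 5, 6]
```

When draft and target agree, one step emits num_draft + 1 tokens for a single target pass, which is where the speedup comes from.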

Reviewed changes

Copilot reviewed 91 out of 91 changed files in this pull request and generated no comments.

Summary per file:
  • python/sglang/srt/speculative/mtp_worker_v2.py - Core MTP worker implementation for speculative decoding
  • python/sglang/srt/speculative/mtp_worker.py - Legacy MTP worker implementation
  • python/sglang/srt/speculative/eagle_worker_v2.py - EAGLE worker updates with idle batch and NPU support
  • python/sglang/srt/models/mimo_v2_flash_nextn.py - MiMo V2 MTP model architecture
  • sgl-kernel/csrc/moe/moe_fused_gate.cu - MoE gate kernel with increased VPT support
  • sgl-kernel/build.sh - Build script refactored to use Docker buildx
  • python/sglang/srt/nvtx_utils.py - NVTX profiling utilities
  • test/srt/test_swa_offload_unit.py - Unit tests for SWA offload functionality
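For context on the SWA offload tests: under sliding-window attention each query attends only to the most recent `window` keys, so older KV-cache entries go cold and become candidates for offload. A hedged illustration of the attention pattern only (names are made up; this is not the test's code):

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True iff query position q may attend to key position k:
    causal (k <= q) and within the last `window` positions (q - k < window)."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

m = sliding_window_mask(5, 3)
# row 4 attends to keys 2, 3, 4 only; keys 0 and 1 are outside the window
```

Entries that fall permanently outside every future window are exactly the ones an SWA-aware cache can evict or offload.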
Comments suppressed due to low confidence (4)

scripts/ci/npu_ci_install_dependency.sh:1
  • The date '20251110' (November 10, 2025) is in the future. Verify this is the intended version tag and not a typo.
python/sglang/srt/server_args.py:1
  • The help text contains a grammatical issue. 'the ratio will be calculated automatically if it's less than 0' should clarify that negative values trigger automatic calculation.
python/sglang/srt/mem_cache/memory_pool.py:1
  • Commented code contains a typo: 'eturn' should be 'return'. Either fix the typo or remove the commented code.
python/sglang/srt/model_executor/model_runner.py:1
  • The comment 'hack here' is too vague. Explain what the hack is doing and why it's necessary.


acelyc111 changed the title from "Xiaomi mimo v2 flash 0.5.5" to "[Perf] Xiaomi MiMo-V2-Flash Optimization" on Dec 16, 2025
acelyc111 changed the title from "[Perf] Xiaomi MiMo-V2-Flash Optimization" to "[Track] Xiaomi MiMo-V2-Flash Optimization" on Dec 16, 2025
acelyc111 changed the title from "[Track] Xiaomi MiMo-V2-Flash Optimization" to "[Track] Xiaomi MiMo-V2-Flash Optimization" on Dec 17, 2025
acelyc111 changed the title from "[Track] Xiaomi MiMo-V2-Flash Optimization" to "[Feature] Xiaomi MiMo-V2-Flash Optimization" on Dec 17, 2025

MLKoz2 commented Mar 16, 2026

What is the status of this PR? It looks solid, but there has been no activity in recent weeks.

