[Feature] Xiaomi MiMo-V2-Flash Optimization #15208
Status: Open
Commits (titles truncated as scraped):
- …ITE_LONGER_CONTEXT_LEN=1
- feat: moe_fused_gate: support 256/1; update supported cases; add compile param; compatible with other models
- …et_cpu_copy and load_cpu_copy
Pull request overview
This PR adds support for the Xiaomi MiMo V2 Flash model (version 0.5.5), introducing Multi-Token Prediction (MTP) capabilities and various improvements to the speculative decoding infrastructure.
Key Changes:
- Added MTP (Multi-Token Prediction) worker implementation for MiMo V2 Flash model
- Enhanced EAGLE speculative decoding with idle batch support and NPU graph runners
- Expanded CUDA kernel support for larger VPT values (up to 256) in the MoE fused gate (see the gating sketch after this list)
- Added SWA (Sliding Window Attention) offload unit tests and kernel optimizations
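For intuition, here is a minimal pure-PyTorch reference of the grouped top-k gating that a fused MoE gate kernel computes in a single pass. This is an illustrative sketch only: the function name, parameters, and the sigmoid-based scoring are assumptions, not the actual `moe_fused_gate` API, and the real kernel fuses all of these steps on the GPU.

```python
import torch

def grouped_topk_gate_reference(logits, num_expert_group, topk_group, topk):
    # Illustrative reference only; not the real moe_fused_gate signature.
    # logits: [num_tokens, num_experts]; num_experts must divide evenly
    # into num_expert_group contiguous groups (e.g. 256 experts / 1 group).
    scores = logits.sigmoid()
    num_tokens, num_experts = scores.shape
    experts_per_group = num_experts // num_expert_group

    # Score each expert group by its best expert.
    group_scores = scores.view(num_tokens, num_expert_group, -1).max(dim=-1).values
    # Keep only the top-scoring groups.
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    score_mask = group_mask.repeat_interleave(experts_per_group, dim=1)

    # Top-k experts within the surviving groups, weights renormalized.
    masked = scores.masked_fill(score_mask == 0, float("-inf"))
    topk_weights, topk_ids = masked.topk(topk, dim=-1)
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```

With `num_expert_group=1`, the group-selection stage is a no-op and this degenerates to a plain top-k over all 256 experts, which appears to be the "256/1" case the commit title refers to.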
Reviewed changes
Copilot reviewed 91 out of 91 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| python/sglang/srt/speculative/mtp_worker_v2.py | Core MTP worker implementation for speculative decoding |
| python/sglang/srt/speculative/mtp_worker.py | Legacy MTP worker implementation |
| python/sglang/srt/speculative/eagle_worker_v2.py | EAGLE worker updates with idle batch and NPU support |
| python/sglang/srt/models/mimo_v2_flash_nextn.py | MiMo V2 MTP model architecture |
| sgl-kernel/csrc/moe/moe_fused_gate.cu | MoE gate kernel with increased VPT support |
| sgl-kernel/build.sh | Build script refactored to use Docker buildx |
| python/sglang/srt/nvtx_utils.py | NVTX profiling utilities |
| test/srt/test_swa_offload_unit.py | Unit tests for SWA offload functionality |
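To place the MTP worker files in context, below is a deliberately simplified sketch of the draft-then-verify loop that multi-token prediction speculative decoding performs. All names here are hypothetical, not taken from this PR; the actual `mtp_worker_v2.py` additionally handles batching, KV-cache management, and graph capture, none of which appear in this sketch.

```python
import torch

@torch.no_grad()
def mtp_generate_step(target_model, draft_head, input_ids, num_draft_tokens=3):
    # Hypothetical sketch of MTP speculative decoding (greedy verification,
    # batch size 1); not the actual mtp_worker_v2 implementation.

    # 1) Draft phase: the cheap MTP head proposes candidate tokens
    #    autoregressively.
    draft_ids = input_ids
    for _ in range(num_draft_tokens):
        logits = draft_head(draft_ids)                    # [1, seq, vocab]
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) Verify phase: a single target-model forward pass scores every
    #    drafted position at once.
    target_logits = target_model(draft_ids)               # [1, seq, vocab]
    prefix_len = input_ids.shape[1]
    accepted = input_ids
    for i in range(num_draft_tokens):
        # The target's prediction for the position the i-th draft token fills.
        pred = target_logits[:, prefix_len - 1 + i, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, pred], dim=-1)
        if pred.item() != draft_ids[0, prefix_len + i].item():
            break  # first mismatch: keep the target's token, discard the rest
    return accepted
```

Verification here is greedy and batch-size-1 for brevity; the payoff of the scheme is that every accepted draft token costs a fraction of a target forward pass instead of a full one.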
Comments suppressed due to low confidence (4)
- scripts/ci/npu_ci_install_dependency.sh:1 - The date '20251110' (November 10, 2025) is in the future. Verify this is the intended version tag and not a typo.
- python/sglang/srt/server_args.py:1 - The help text contains a grammatical issue: "the ratio will be calculated automatically if it's less than 0" should clarify that negative values trigger automatic calculation.
- python/sglang/srt/mem_cache/memory_pool.py:1 - Commented-out code contains a typo: 'eturn' should be 'return'. Either fix the typo or remove the commented code.
- python/sglang/srt/model_executor/model_runner.py:1 - The comment 'hack here' is too vague. Explain what the hack is doing and why it's necessary.
What is the status of this PR? It looks solid, but there has been no activity in the last few weeks.
Motivation
MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.
See it on HF: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
LMSys blog: https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
Modifications
Related work:
- MiMo-V2-Flash Day 0 Support and Continuous Optimization #15263
- MiMo-V2-Flash day0 support #15207

Accuracy Tests
Benchmarking and Profiling
MiMo-V2-Flash Prefill Benchmark (Radix Cache Disabled):

Checklist