[Feature] Xiaomi MiMo-V2-Flash day0 support #15207
Conversation
…et_cpu_copy and load_cpu_copy
I tested the latest commit (36ef1e9) locally by running …
```python
# For Multi-Layer MTP
# FIXME: rename -> enable_multi_layer_mtp
enable_mtp: bool = False
```
Delete this argument; if the model is MiMo, turn it on automatically.
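A minimal sketch of what that auto-enable could look like; the helper name and the architecture-string check are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical sketch: infer multi-layer MTP from the model architecture
# instead of a user-facing CLI flag. The "MiMo" substring check is an
# assumption, not the real detection logic.
def should_enable_mtp(hf_config) -> bool:
    architectures = getattr(hf_config, "architectures", None) or []
    return any("MiMo" in arch for arch in architectures)
```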
rename `multi_layer_eagle_worker.py`
While debugging this CI, I found that the K and V buffers may use 180+ GiB. I then found that `profile_max_num_token` was modified and behaves differently when this model is used; I checked the model as well.
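For context, a back-of-the-envelope KV-cache sizing sketch shows how an over-estimated `profile_max_num_token` result reaches that scale; all config numbers below are illustrative placeholders, not MiMo-V2-Flash's actual values:

```python
# Illustrative KV-cache sizing; every number here is a placeholder assumption.
num_layers = 60
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2              # bf16
max_num_tokens = 800_000     # an over-estimated profiling result

# K and V each store num_kv_heads * head_dim values per token per layer,
# hence the factor of 2.
kv_bytes = max_num_tokens * num_layers * 2 * num_kv_heads * head_dim * dtype_bytes
print(f"{kv_bytes / 1024**3:.1f} GiB")  # -> 183.1 GiB
```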
@TZHelloWorld Yes, we have fixed it. It was a DeepSeek bug.
/rerun-failed-ci |
/rerun-stage unit-test-backend-2-gpu |
✅ Triggered. It will not be shown on this page. Check the Actions tab for progress.
/rerun-failed-ci |
/rerun-stage unit-test-deepep-4-gpu |
|
/rerun-stage unit-test-deepep-8-gpu |
Force-pushed from bbffab7 to e53dc8c
Merge branch 'main' of https://github.com/sgl-project/sglang (136 commits):
* fix: unreachable error check in retraction (sgl-project#15433)
* [sgl-kernel] chore: update deepgemm version (sgl-project#13402)
* [diffusion] multi-platform: support diffusion on amd and fix encoder loading on MI325 (sgl-project#13760)
* [amd] Add deterministic all-reduce kernel for AMD (ROCm) (sgl-project#15340)
* [diffusion] refactor: refactor _build_req_from_sampling to use shallow_asdict (sgl-project#13782)
* Add customized sampler registration (sgl-project#15423)
* Update readme (sgl-project#15425)
* Fix Mindspore model import warning (sgl-project#15287)
* [Feature] Xiaomi `MiMo-V2-Flash` day0 support (sgl-project#15207)
* [diffusion] profiling: add bench_serving.py and VBench (sgl-project#15410)
* [DLLM] Fix dLLM regression (sgl-project#15371)
* [Deepseek V3.2] Fix Deepseek MTP in V1 mode (sgl-project#15429)
* chore: update CI_PERMISSIONS (sgl-project#15431)
* [DLLM] Add CI for diffusion LLMs (sgl-project#14723)
* Support using different attention backend for draft decoding. (sgl-project#14843)
* feat(dsv32): better error handling for DeepSeek-v3.2 encoder (sgl-project#14353)
* tiny fix lint on main (sgl-project#15424)
* multimodal: precompute hash for MultimodalDataItem (sgl-project#14354)
* [AMD] Clear pre-built AITER kernels and warmup to prevent segfaults and test timeouts (sgl-project#15318)
* [Performance] optimize NSA backend metadata computation for multi-step speculative decoding (sgl-project#14781)
* ...
Co-authored-by: 谢学扬 <xiexueyang@xiaomi.com>
Co-authored-by: tz <tangzhen3@xiaomi.com>
Co-authored-by: 李家乐 <lijiale10@xiaomi.com>
Co-authored-by: 张晨 <zhangchen50@xiaomi.com>
Co-authored-by: Shaohui Liu <liushaohui3@xiaomi.com>
Co-authored-by: 王晨 <wangchen77@xiaomi.com>
Co-authored-by: jiangzihan <jiangzihan@xiaomi.com>
Co-authored-by: xiexueyang <xyxie_wangyi@163.com>
Co-authored-by: Linghao Zhang <zhanglinghao@xiaomi.com>
Co-authored-by: ispobock <ispobaoke@gmail.com>
Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>
Co-authored-by: JoyFuture <35593546+JoyFuture@users.noreply.github.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: root <root@bj9-ml-g8h20e-k8s-slave106-20251106.alicn.idc.xiaomi.com>
```python
            draft_logits_output.topk_index,
        )
    else:
        draft_logits_output, _ = self.draft_runner_list[step].forward(
```
I'm a bit confused about the logic here. If `self.draft_runner_list[step]` is a `ModelRunner`, shouldn't its return value be an object rather than a tuple?
Unfortunately, I'm encountering an error when launching MiMo-V2 with `--disable-cuda-graph` enabled, so I'm unable to investigate this part firsthand.
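A minimal defensive sketch of how the two return shapes could be reconciled; the variable names come from the quoted diff, but the guard itself (and the `forward_batch` argument) is an assumption for illustration, not the PR's actual fix:

```python
# Hypothetical guard: tolerate both return shapes while the conflict
# with #15400 is resolved. Not the actual fix in this PR.
out = self.draft_runner_list[step].forward(forward_batch)
if isinstance(out, tuple):
    draft_logits_output, _ = out  # (logits_output, extra) tuple shape
else:
    draft_logits_output = out     # plain logits-output object shape
```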
@yhyang201 It seems to be a conflict introduced by another patch, #15400; they were merged into the main branch at about the same time.
I'll fix it later.


Motivation
MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.
See it on HF: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
LMSys blog: https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
Modifications
The remaining improvements are tracked in #15263.
Benchmarking and Profiling
MiMo-V2-Flash Prefill Benchmark (Radix Cache Disabled):

MiMo-V2-Flash Decode Benchmark (DP 2, TP 4, EP 8, MTP Accept Length 3.6, Input Token Length 16k, Varying Batch Size):

MiMo-V2-Flash Decode Benchmark (DP 2, TP 4, EP 8, MTP Accept Length 3.6, Per DP Rank Batch Size 16, Varying Input Token Length):

The full performance numbers can be reproduced with the branch in PR #15208.
We will merge all the performance and accuracy improvements in follow-up patches.
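For reference, a hedged sketch of the kind of `sglang.bench_serving` invocation that matches the decode setup described above (16k input length); the flag values are assumptions, not the PR's recorded benchmark commands:

```bash
# Hypothetical reproduction command; values mirror the stated decode
# setup (16k input tokens) but are not the PR's verified flags.
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-input-len 16000 \
  --random-output-len 1024 \
  --num-prompts 128
```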
Launch Command example
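The command itself was not included here; a hedged sketch matching the benchmark setup above (DP 2, TP 4, EP 8, MTP via speculative decoding) might look like the following. All flag values are assumptions, not the PR's verified command:

```bash
# Hypothetical launch sketch for the benchmark setup (DP 2, TP 4, EP 8,
# i.e. 8 GPUs total); not the PR's verified command. This assumes
# sglang's dp-attention convention, where the tp group spans all 8 GPUs
# and attention is split DP 2 x TP 4 inside it.
python3 -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2-Flash \
  --trust-remote-code \
  --tp-size 8 \
  --dp-size 2 \
  --enable-dp-attention \
  --ep-size 8 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```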
Co-authors
@JoyFuture
@Jumbo0715
@TZHelloWorld
@acelyc111
@hnyls2002
@ispobock
@lshmouse
@ollybbmonster
@sitabulaixizawaluduo
@yetlinghao
@zhannngchen
Checklist