Checklist
Motivation
MiMo-V2-Flash is a Mixture-of-Experts (MoE) language model with 309B total parameters and 15B active parameters. Designed for high-speed reasoning and agentic workflows, it utilizes a novel hybrid attention architecture and Multi-Token Prediction (MTP) to achieve state-of-the-art performance while significantly reducing inference costs.
See it on HF: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash
LMSys blog: https://lmsys.org/blog/2025-12-16-mimo-v2-flash/
Installation
Docker
# Pull the docker image
docker pull lmsysorg/sglang:dev-pr-15207
# Launch the container
docker run -it --gpus all \
--shm-size=32g \
--ipc=host \
--network=host \
lmsysorg/sglang:dev-pr-15207 bash
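# Inside the container, a quick sanity check before launching the server
# (a minimal sketch; the exact version string depends on the image):
nvidia-smi
python3 -c "import sglang; print(sglang.__version__)"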
Pip Installation
# Run this on a machine with SGLang dependencies installed,
# or start from an SGLang nightly container:
docker run -it --gpus all \
--shm-size=32g \
--ipc=host \
--network=host \
lmsysorg/sglang:nightly-dev-20251215-4449c170 bash
# If you already have SGLang installed, uninstall the current SGLang version
pip uninstall sglang -y
# Install the PyPI Package
pip install sglang==0.5.6.post2.dev8005+pr.15207.g39d5bd57a \
--index-url https://sgl-project.github.io/whl/pr/ \
--extra-index-url https://pypi.org/simple
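# Confirm the PR wheel is the version in use
# (a minimal check; the version should match the package pinned above)
pip show sglang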
Launch Command
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
--model-path XiaomiMiMo/MiMo-V2-Flash \
--dp-size 2 \
--enable-dp-attention \
--tp-size 8 \
--trust-remote-code \
--mem-fraction-static 0.75 \
--max-running-requests 128 \
--chunked-prefill-size 16384 \
--reasoning-parser qwen3 \
--tool-call-parser mimo \
--model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' \
--attention-backend fa3 \
--speculative-algorithm EAGLE \
--speculative-num-steps=3 \
--speculative-eagle-topk=1 \
--speculative-num-draft-tokens=4 \
--enable-mtp
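Once the server reports ready (by default on http://127.0.0.1:30000, since no --port is passed above), it exposes an OpenAI-compatible API. A minimal smoke test with curl (the prompt and max_tokens below are placeholders):
# Health check, then a simple chat completion request
curl http://127.0.0.1:30000/health
curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2-Flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'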
Future Plan
MiMo-V2-Flash day0 support #15207
MiMo-V2-Flash Optimization #15208
Related resources
No response