[MiMoV2Flash] [feat]: support two batch overlap #17634
Kangyan-Zhou merged 8 commits into sgl-project:main
Pull request overview
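This PR adds support for the two-batch overlap (TBO) optimization to the MiMoV2Flash model. TBO splits a batch into two micro-batches so that the computation of one can overlap with the communication of the other, which improves throughput. The PR defines separate operation strategies for prefill and decode modes, as used in disaggregated serving.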
Changes:
- Added TBO operation methods to MiMoV2MoE, MiMoV2Attention, and MiMoV2DecoderLayer classes for batch overlap support
- Integrated TBO into MiMoV2Model's forward pass with conditional execution based on the `can_run_tbo` flag
- Added MiMoV2DecoderLayer-specific operation strategies for prefill and decode modes in the batch overlap system
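The conditional execution described in the Changes list can be pictured with a small, self-contained sketch. It is not the code from this PR: `run_layer_stages`, `forward`, and the stage names are hypothetical, and a plain round-robin over Python generators stands in for the real scheduler, which overlaps GPU compute with communication.

```python
from typing import Iterator, List


def run_layer_stages(name: str, tokens: List[int], log: List[str]) -> Iterator[None]:
    """Hypothetical layer split into stages.

    Each `yield` marks a point where the other micro-batch may run,
    for example while this micro-batch waits on communication.
    """
    log.append(f"{name}:attn")
    yield
    log.append(f"{name}:dispatch")
    yield
    log.append(f"{name}:experts")
    yield
    log.append(f"{name}:combine")


def forward(tokens: List[int], can_run_tbo: bool, log: List[str]) -> None:
    """Run the layer once, or as two interleaved micro-batches when TBO is allowed."""
    if not can_run_tbo or len(tokens) < 2:
        # Fallback path: a single batch runs all stages back to back.
        for _ in run_layer_stages("full", tokens, log):
            pass
        return

    # TBO path: split the batch in two and alternate between the halves.
    mid = len(tokens) // 2
    active = [
        run_layer_stages("mb0", tokens[:mid], log),
        run_layer_stages("mb1", tokens[mid:], log),
    ]
    while active:
        for gen in list(active):  # iterate over a copy so removal is safe
            try:
                next(gen)
            except StopIteration:
                active.remove(gen)


if __name__ == "__main__":
    trace: List[str] = []
    forward([1, 2, 3, 4], can_run_tbo=True, log=trace)
    print(trace)
```

With `can_run_tbo=True` the trace alternates between `mb0` and `mb1` at every stage; with `can_run_tbo=False` the same stages run once for the full batch.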
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/sglang/srt/models/mimo_v2_flash.py | Implements TBO operation methods and integrates TBO into model forward pass |
| python/sglang/srt/batch_overlap/operations_strategy.py | Defines operation scheduling strategies for MiMoV2 layers in TBO mode |
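As a rough illustration of what a per-mode operation strategy might look like, the sketch below models a strategy as an ordered list of operation names with yield markers between stages. Everything here is an assumption made for illustration: `Strategy`, `make_strategy`, `split_stages`, the `YIELD` marker, and the operation names are hypothetical and are not taken from `operations_strategy.py`.

```python
from dataclasses import dataclass
from typing import List

YIELD = "<yield>"  # hypothetical marker: switch to the other micro-batch here


@dataclass(frozen=True)
class Strategy:
    """Hypothetical container for one layer's ordered operations."""
    operations: List[str]


def make_strategy(mode: str) -> Strategy:
    """Return an illustrative operation order for the given forward mode."""
    if mode == "prefill":
        return Strategy([
            "attn", "gate", "dispatch_start", YIELD,
            "dispatch_wait", "experts", "combine_start", YIELD,
            "combine_wait", "output",
        ])
    if mode == "decode":
        # Assumed here: decode yields once more, right after attention.
        return Strategy([
            "attn", YIELD,
            "gate", "dispatch_start", YIELD,
            "dispatch_wait", "experts", "combine_start", YIELD,
            "combine_wait", "output",
        ])
    raise ValueError(f"unsupported mode: {mode!r}")


def split_stages(strategy: Strategy) -> List[List[str]]:
    """Group operations into stages, cutting at every yield marker."""
    stages: List[List[str]] = [[]]
    for op in strategy.operations:
        if op == YIELD:
            stages.append([])
        else:
            stages[-1].append(op)
    return stages


if __name__ == "__main__":
    for mode in ("prefill", "decode"):
        print(mode, split_stages(make_strategy(mode)))
```

Keeping the yield points in data rather than in the layer code lets prefill and decode use different stage boundaries without changing the operation methods themselves.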
/tag-and-rerun-ci |
Add a new test to enable TBO in test/srt/models/test_mimo_models.py?
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
@TZHelloWorld could you please resolve conflicts?
Motivation
Support mimo_v2_flash two batch overlap:

Prefill (p):

    python3 -m sglang.launch_server \
      --model-path /mnt/mify-gw-model-alicn3/models/global_step_84-FP8-Block \
      --pp-size 1 --dp-size 2 --tp-size 8 \
      --enable-dp-attention \
      --moe-a2a-backend deepep \
      --deepep-mode normal \
      --disaggregation-mode prefill \
      --page-size 1 \
      --host 0.0.0.0 \
      --port 30010 \
      --trust-remote-code \
      --moe-dense-tp-size 1 \
      --enable-dp-lm-head \
      --mem-fraction-static 0.7 \
      --max-running-requests 32 \
      --reasoning-parser qwen3 \
      --tool-call-parser mimo \
      --context-length 262144 \
      --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' \
      --attention-backend fa3 \
      --allow-auto-truncate \
      --chunked-prefill-size 16384 \
      --enable-two-batch-overlap

Decode (d):

    SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=1024 \
    python3 -m sglang.launch_server \
      --model-path /mnt/mify-gw-model-alicn3/models/global_step_84-FP8-Block \
      --pp-size 1 --dp-size 2 --tp-size 8 \
      --enable-dp-attention \
      --moe-a2a-backend deepep \
      --deepep-mode low_latency \
      --decode-log-interval 1 \
      --page-size 1 \
      --host 0.0.0.0 --port 30020 \
      --trust-remote-code \
      --watchdog-timeout 1000000 \
      --mem-fraction-static 0.7 \
      --max-running-requests 32 \
      --reasoning-parser qwen3 \
      --tool-call-parser mimo \
      --context-length 262144 \
      --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' \
      --attention-backend fa3 \
      --disaggregation-mode decode \
      --moe-dense-tp-size 1 \
      --enable-dp-lm-head \
      --enable-two-batch-overlap

Load balancer (lb):

    python -m sglang_router.launch_router \
      --pd-disaggregation \
      --prefill http://127.x.x.1:30010 \
      --decode http://127.x.x.1:30020 \
      --host 0.0.0.0 \
      --port 30000 \
      --mini-lb

Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci