Using Main Branch
Environment Preparation
Using SGLang and DeepEP from master is sufficient. Also remember to upgrade Mooncake. It is better to create customized expert distribution data for MTP (follow the related instructions in #6017).
xP + 2D, max_running_requests=32, draft_token_num=3
Command for decode
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=10000000 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 MC_TE_METRIC=true python3 -m sglang.launch_server --model-path /mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.0.7.67:5757 --tp-size 16 --dp-size 16 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 64 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --deepep-mode low_latency --mem-fraction-static 0.8 --max-running-requests 32 --context-length 73728 --ep-num-redundant-experts 32 --disable-shared-experts-fusion --cuda-graph-max-bs 2 --init-expert-location /mnt/shared-fs/stats-qiaolin/mtp_213.pt --speculative-algorithm EAGLE --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 --nnodes 2 --node-rank 0
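The command above is for the node with --node-rank 0; the second decode node runs the identical command with --node-rank 1 (and the same --dist-init-addr). As a sketch, the per-rank portion can be factored out like this (the rank_flags helper is ours, not part of SGLang):

```shell
# Hypothetical helper: build the only flags that differ per decode node.
rank_flags() {
  # $1 = total decode nodes, $2 = this node's rank
  printf -- '--dist-init-addr 10.0.7.67:5757 --nnodes %s --node-rank %s' "$1" "$2"
}

# Each node runs the same launch_server command plus its own rank flags.
for RANK in 0 1; do
  echo "node $RANK extra flags: $(rank_flags 2 "$RANK")"
done
```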
Benchmark for decode
# slow down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.0.7.67:30000/slow_down' -d '{"forward_sleep_time": 90.0}'
# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --base-url http://10.5.38.77:8000 --model-path /mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324 --batch-size 128 --input-len 65000 --output-len 4000 --skip-warmup
# after some time (e.g. 10 minutes), the D nodes are saturated; then run the following command
# finish slowing down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.0.7.67:30000/slow_down' -d '{"forward_sleep_time": null}'
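The two curl calls above differ only in the forward_sleep_time value, so they can be wrapped in one helper. A minimal sketch (the slow_down and slow_down_payload names are ours; the /slow_down endpoint and its forward_sleep_time field come from the commands above):

```shell
# Build the JSON body for the /slow_down endpoint.
slow_down_payload() {
  # $1 = forward sleep time in seconds, or "null" to release the slow-down
  printf '{"forward_sleep_time": %s}' "$1"
}

# POST it to a decode bootstrap node.
slow_down() {
  # $1 = decode host:port (e.g. 10.0.7.67:30000), $2 = seconds or "null"
  curl -X POST -H 'Content-Type: application/json' \
    "http://$1/slow_down" -d "$(slow_down_payload "$2")"
}

# Usage: slow_down 10.0.7.67:30000 90.0   # before starting the benchmark
#        slow_down 10.0.7.67:30000 null   # once the D nodes are saturated
```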
xP + 12D, max_running_requests=12288, draft_token_num=2
Command for decode
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 SGLANG_NUM_RESERVED_DECODE_TOKENS=176 MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DUMPER_DIR=/mnt/shared-fs/zbx/tmp SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/mnt/shared-fs/zbx/temp_sglang_server2local SGLANG_TORCH_PROFILER_DIR=/mnt/shared-fs/zbx/temp_sglang_server2local PYTHONUNBUFFERED=1 /home/ql/sglang_ql/bin/python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --disaggregation-mode decode --dist-init-addr 10.5.45.38:5757 --nnodes 12 --node-rank 11 --tp-size 96 --dp-size 96 --enable-dp-attention --host 10.0.47.189 --decode-log-interval 1 --context-length 2176 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device mlx5_1 --disable-overlap-schedule --speculative-algo EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2 --init-expert-location /mnt/shared-fs/configs/ep_statistics/decode_in2000out100.json --deepep-mode low_latency --mem-fraction-static 0.75 --cuda-graph-bs 128 --max-running-requests 12288 --ep-num-redundant-experts 32
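As a quick sanity check of the layout implied by the flags above: --tp-size 96 spread over --nnodes 12 means 8 GPUs per node (assuming homogeneous 8-GPU machines), and with --dp-size 96 under --enable-dp-attention each GPU hosts one attention DP rank:

```shell
# Sanity-check the parallelism arithmetic from the launch flags.
TP=96; DP=96; NNODES=12
GPUS_PER_NODE=$((TP / NNODES))
echo "GPUs per node: $GPUS_PER_NODE"          # expect 8
echo "attention DP ranks per GPU: $((DP / (NNODES * GPUS_PER_NODE)))"  # expect 1
```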
Benchmark for decode
# slow down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.5.45.38:30000/slow_down' -d '{"forward_sleep_time": 90.0}'
# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server --base-url http://10.5.39.245:8000 --model-path /dev/shm/DeepSeek-V3-0324 --batch-size 24576 --input-len 2000 --output-len 100 --skip-warmup
# after some time (e.g. 10 minutes), the D nodes are saturated; then run the following command
# finish slowing down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.5.45.38:30000/slow_down' -d '{"forward_sleep_time": null}'
Note that since MTP doesn't support overlap scheduling yet, performance in this case is still not optimal. We're actively working on it; stay tuned.