Instruction for Running DeepSeek with PD, EP, and MTP #7998

@Qiaolin-Yu

Using Main Branch

Environment Preparation

Using SGLang and DeepEP from master is sufficient. Remember to upgrade Mooncake as well. For MTP, it is better to create customized expert distribution data (follow the related instructions in #6017).

xP + 2D, max_running_requests=32, draft_token_num=3

Command for decode

SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=10000000 SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 MC_TE_METRIC=true python3 -m sglang.launch_server --model-path /mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324 --disaggregation-ib-device mlx5_1 --disaggregation-mode decode --dist-init-addr 10.0.7.67:5757 --tp-size 16 --dp-size 16 --enable-dp-attention --decode-log-interval 1 --enable-deepep-moe --page-size 64 --host 0.0.0.0 --trust-remote-code --moe-dense-tp-size 1 --enable-dp-lm-head --disable-radix-cache --watchdog-timeout 1000000 --deepep-mode low_latency --mem-fraction-static 0.8 --max-running-requests 32 --context-length 73728 --ep-num-redundant-experts 32 --disable-shared-experts-fusion --cuda-graph-max-bs 2 --init-expert-location /mnt/shared-fs/stats-qiaolin/mtp_213.pt --speculative-algorithm EAGLE --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 3 --nnodes 2 --node-rank 0

Benchmark for decode

# slow down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.0.7.67:30000/slow_down' -d '{"forward_sleep_time": 90.0}' 

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server  --base-url http://10.5.38.77:8000 --model-path /mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324 --batch-size 128 --input-len 65000 --output-len 4000 --skip-warmup 

# after some time (e.g. 10 minutes), once the D nodes are saturated, run this command
# to stop slowing down the D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.0.7.67:30000/slow_down' -d '{"forward_sleep_time": null}' 
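
The three steps above (slow down, start the benchmark without waiting, then lift the slowdown) can also be scripted. The following is a minimal sketch, not part of SGLang: the helper names `slow_down_request` and `run_saturated_benchmark` and the `saturate_seconds` parameter are hypothetical, and only the `/slow_down` endpoint and its `forward_sleep_time` payload are taken from the curl commands above.

```python
import json
import subprocess
import time
import urllib.request


def slow_down_request(decode_url, forward_sleep_time):
    """Build the POST request sent to the decode server's /slow_down endpoint.

    forward_sleep_time is a float (seconds) to throttle the D nodes,
    or None to lift the throttle (serialized as JSON null).
    """
    body = json.dumps({"forward_sleep_time": forward_sleep_time}).encode()
    return urllib.request.Request(
        f"{decode_url}/slow_down",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_saturated_benchmark(decode_url, bench_cmd, saturate_seconds=600.0):
    """Hypothetical driver mirroring the manual procedure above."""
    # 1. Slow down the D nodes so requests pile up.
    urllib.request.urlopen(slow_down_request(decode_url, 90.0)).close()
    # 2. Start the benchmark in the background; do not wait for it yet.
    bench = subprocess.Popen(bench_cmd)
    try:
        # 3. Give the D nodes time to saturate (10 minutes by default).
        time.sleep(saturate_seconds)
    finally:
        # 4. Always lift the slowdown, even if interrupted.
        urllib.request.urlopen(slow_down_request(decode_url, None)).close()
    # 5. Now wait for the benchmark to finish and return its exit code.
    return bench.wait()
```

For the 2D setup this would be called as `run_saturated_benchmark("http://10.0.7.67:30000", [...])`, with the list holding the `python3 -m sglang.bench_one_batch_server ...` arguments shown above. The `finally` clause is a design choice so that an interrupted run does not leave the decode nodes throttled.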

xP + 12D, max_running_requests=12288, draft_token_num=2

Command for decode

SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 SGLANG_NUM_RESERVED_DECODE_TOKENS=176 MC_TE_METRIC=true SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 SGLANG_DUMPER_DIR=/mnt/shared-fs/zbx/tmp SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/mnt/shared-fs/zbx/temp_sglang_server2local SGLANG_TORCH_PROFILER_DIR=/mnt/shared-fs/zbx/temp_sglang_server2local PYTHONUNBUFFERED=1 /home/ql/sglang_ql/bin/python3 -m sglang.launch_server --model-path /dev/shm/DeepSeek-V3-0324 --trust-remote-code --disaggregation-mode decode --dist-init-addr 10.5.45.38:5757 --nnodes 12 --node-rank 11 --tp-size 96 --dp-size 96 --enable-dp-attention --host 10.0.47.189 --decode-log-interval 1 --context-length 2176 --disable-radix-cache --enable-deepep-moe --moe-dense-tp-size 1 --enable-dp-lm-head --disable-shared-experts-fusion --watchdog-timeout 1000000 --enable-two-batch-overlap --disaggregation-ib-device mlx5_1 --disable-overlap-schedule --speculative-algo EAGLE --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2  --init-expert-location /mnt/shared-fs/configs/ep_statistics/decode_in2000out100.json --deepep-mode low_latency --mem-fraction-static 0.75 --cuda-graph-bs 128 --max-running-requests 12288 --ep-num-redundant-experts 32
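
Several numbers in the two decode commands line up with each other, and a short calculation makes that visible. The sketch below only restates flag values from this issue. The `DecodeConfig` class and its method names are hypothetical, and the interpretation that `--max-running-requests` is spread evenly over the DP ranks is an assumption based on the values matching, not a statement about SGLang internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeConfig:
    """Flag values copied from the decode commands in this issue."""
    nnodes: int
    tp_size: int
    dp_size: int
    max_running_requests: int
    cuda_graph_bs: int  # --cuda-graph-max-bs (2D) or --cuda-graph-bs (12D)
    num_steps: int
    eagle_topk: int
    num_draft_tokens: int

    def gpus_per_node(self):
        # TP ranks divided evenly over the nodes.
        return self.tp_size // self.nnodes

    def requests_per_dp_rank(self):
        # Assumption: running requests are split evenly across DP ranks.
        return self.max_running_requests // self.dp_size

    def draft_tokens_match_chain(self):
        # Observation from both commands: with topk=1, draft tokens = steps + 1.
        return self.eagle_topk == 1 and self.num_draft_tokens == self.num_steps + 1


two_d = DecodeConfig(2, 16, 16, 32, 2, 2, 1, 3)
twelve_d = DecodeConfig(12, 96, 96, 12288, 128, 1, 1, 2)

for name, cfg in (("2D", two_d), ("12D", twelve_d)):
    print(name, cfg.gpus_per_node(), cfg.requests_per_dp_rank(),
          cfg.cuda_graph_bs, cfg.draft_tokens_match_chain())
```

In both setups the per-rank request count equals the CUDA graph batch size given on the command line (2 and 128), which is a useful sanity check when adapting these commands to a different node count.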

Benchmark for decode

# slow down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.5.45.38:30000/slow_down' -d '{"forward_sleep_time": 90.0}' 

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server  --base-url http://10.5.39.245:8000 --model-path /dev/shm/DeepSeek-V3-0324 --batch-size 24576 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes), once the D nodes are saturated, run this command
# to stop slowing down the D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.5.45.38:30000/slow_down' -d '{"forward_sleep_time": null}' 

Note that MTP does not support overlap scheduling yet, so performance in this case is still not optimal. We're actively working on it, so stay tuned.
