Instructions for Running DeepSeek with PD, EP, and MTP

# Using Main Branch

## Environment Preparation
Using SGLang and DeepEP from their master branches is sufficient. Also remember to upgrade Mooncake. Creating customized expert distribution data for MTP is recommended (follow the related instructions in #6017).


## xP + 2D, max_running_requests=32, draft_token_num=3

### Command for decode
```
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE=10000000 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 \
MC_TE_METRIC=true \
python3 -m sglang.launch_server \
  --model-path /mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324 \
  --disaggregation-ib-device mlx5_1 \
  --disaggregation-mode decode \
  --dist-init-addr 10.0.7.67:5757 \
  --tp-size 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --decode-log-interval 1 \
  --enable-deepep-moe \
  --page-size 64 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-radix-cache \
  --watchdog-timeout 1000000 \
  --deepep-mode low_latency \
  --mem-fraction-static 0.8 \
  --max-running-requests 32 \
  --context-length 73728 \
  --ep-num-redundant-experts 32 \
  --disable-shared-experts-fusion \
  --cuda-graph-max-bs 2 \
  --init-expert-location /mnt/shared-fs/stats-qiaolin/mtp_213.pt \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 3 \
  --nnodes 2 \
  --node-rank 0
```



### Benchmark for decode
```
# slow down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.0.7.67:30000/slow_down' -d '{"forward_sleep_time": 90.0}' 

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server  --base-url http://10.5.38.77:8000 --model-path /mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324 --batch-size 128 --input-len 65000 --output-len 4000 --skip-warmup 

# after some time (e.g. 10 minutes), once the D nodes are saturated, run this command
# to stop slowing down the D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.0.7.67:30000/slow_down' -d '{"forward_sleep_time": null}' 
```
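The three steps above have to overlap in time: the benchmark must be running while the decode nodes are slowed down, and the slow-down is lifted only after the decode nodes are saturated. A minimal sketch of one way to script that sequence is shown below. The URLs, model path, and benchmark flags are copied from the commands above; `SATURATION_WAIT_S`, `DRY_RUN`, and the helper function names are hypothetical and exist only for this example. By default the script only prints the commands; set `DRY_RUN=0` to actually run them.

```shell
#!/bin/sh
# Sketch only: automates the slow-down / benchmark / release sequence.
# SATURATION_WAIT_S and DRY_RUN are hypothetical knobs, not SGLang options.
set -eu

DECODE_URL="${DECODE_URL:-http://10.0.7.67:30000}"
BENCH_URL="${BENCH_URL:-http://10.5.38.77:8000}"
MODEL_PATH="${MODEL_PATH:-/mnt/shared-fs/models/deepseek-ai/DeepSeek-V3-0324}"
SATURATION_WAIT_S="${SATURATION_WAIT_S:-600}"  # "e.g. 10 minutes"
DRY_RUN="${DRY_RUN:-1}"                        # 1 = only print the commands

# Build the JSON body for /slow_down; pass "null" to remove the delay.
slow_down_payload() {
  printf '{"forward_sleep_time": %s}' "$1"
}

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# POST the given sleep time (or null) to the decode node.
set_slow_down() {
  run curl -X POST -H 'Content-Type: application/json' \
    "$DECODE_URL/slow_down" -d "$(slow_down_payload "$1")"
}

main() {
  # 1. Slow down the D nodes so requests pile up.
  set_slow_down 90.0

  # 2. Start the benchmark in the background; do not wait for it here.
  run python3 -m sglang.bench_one_batch_server \
    --base-url "$BENCH_URL" --model-path "$MODEL_PATH" \
    --batch-size 128 --input-len 65000 --output-len 4000 --skip-warmup &
  bench_pid=$!

  # 3. Give the D nodes time to saturate, then lift the slow-down.
  run sleep "$SATURATION_WAIT_S"
  set_slow_down null

  # 4. Now wait for the benchmark to finish.
  wait "$bench_pid"
}

main
```

The fixed sleep is the crudest possible saturation check. In practice it may be better to watch the decode log (printed every step because of `--decode-log-interval 1`) and lift the slow-down manually.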



## xP + 12D, max_running_requests=12288, draft_token_num=2
### Command for decode
```
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=512 \
SGLANG_NUM_RESERVED_DECODE_TOKENS=176 \
MC_TE_METRIC=true \
SGLANG_DISAGGREGATION_HEARTBEAT_INTERVAL=10000000 \
SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=100000 \
SGLANG_DUMPER_DIR=/mnt/shared-fs/zbx/tmp \
SGLANG_EXPERT_DISTRIBUTION_RECORDER_DIR=/mnt/shared-fs/zbx/temp_sglang_server2local \
SGLANG_TORCH_PROFILER_DIR=/mnt/shared-fs/zbx/temp_sglang_server2local \
PYTHONUNBUFFERED=1 \
/home/ql/sglang_ql/bin/python3 -m sglang.launch_server \
  --model-path /dev/shm/DeepSeek-V3-0324 \
  --trust-remote-code \
  --disaggregation-mode decode \
  --dist-init-addr 10.5.45.38:5757 \
  --nnodes 12 \
  --node-rank 11 \
  --tp-size 96 \
  --dp-size 96 \
  --enable-dp-attention \
  --host 10.0.47.189 \
  --decode-log-interval 1 \
  --context-length 2176 \
  --disable-radix-cache \
  --enable-deepep-moe \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --disable-shared-experts-fusion \
  --watchdog-timeout 1000000 \
  --enable-two-batch-overlap \
  --disaggregation-ib-device mlx5_1 \
  --disable-overlap-schedule \
  --speculative-algo EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --init-expert-location /mnt/shared-fs/configs/ep_statistics/decode_in2000out100.json \
  --deepep-mode low_latency \
  --mem-fraction-static 0.75 \
  --cuda-graph-bs 128 \
  --max-running-requests 12288 \
  --ep-num-redundant-experts 32
```


### Benchmark for decode
```
# slow down D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.5.45.38:30000/slow_down' -d '{"forward_sleep_time": 90.0}' 

# start benchmark; do not wait for this to finish before running the next line
python3 -m sglang.bench_one_batch_server  --base-url http://10.5.39.245:8000 --model-path /dev/shm/DeepSeek-V3-0324 --batch-size 24576 --input-len 2000 --output-len 100 --skip-warmup

# after some time (e.g. 10 minutes), once the D nodes are saturated, run this command
# to stop slowing down the D nodes
curl -X POST -H 'Content-Type: application/json' 'http://10.5.45.38:30000/slow_down' -d '{"forward_sleep_time": null}' 
```

Note that because MTP does not yet support overlap scheduling (hence `--disable-overlap-schedule` in the command above), performance in this case is still not optimal. We're actively working on it; stay tuned.
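When adapting these commands to a different cluster size, it can help to check that the numbers still line up. The sketch below recomputes a few relationships that hold in both configurations above. The relationships are observations from this issue's numbers, not documented SGLang rules: it assumes requests spread evenly over DP ranks, that the CUDA graph batch size should cover the per-rank request count, and that the context length must cover input plus output tokens. `check_config` is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Sketch only: arithmetic sanity check for the decode settings.
# Usage: check_config NAME TP NNODES DP MAX_RUNNING GRAPH_BS CTX IN_LEN OUT_LEN
check_config() {
  name=$1; tp=$2; nnodes=$3; dp=$4; max_running=$5
  graph_bs=$6; ctx=$7; in_len=$8; out_len=$9

  gpus_per_node=$(( tp / nnodes ))   # inferred: tp-size split evenly over nodes
  per_rank=$(( max_running / dp ))   # assumed: even spread over DP ranks
  seq_len=$(( in_len + out_len ))    # benchmark input + output length

  echo "$name: gpus_per_node=$gpus_per_node requests_per_dp_rank=$per_rank cuda_graph_bs=$graph_bs seq_len=$seq_len context_length=$ctx"

  status=0
  if [ "$per_rank" -gt "$graph_bs" ]; then
    echo "WARN $name: per-rank requests exceed the CUDA graph batch size" >&2
    status=1
  fi
  if [ "$seq_len" -gt "$ctx" ]; then
    echo "WARN $name: input + output exceeds the context length" >&2
    status=1
  fi
  return "$status"
}

# Values copied from the launch and benchmark commands in this issue.
check_config 2D  16 2  16 32    2   73728 65000 4000
check_config 12D 96 12 96 12288 128 2176  2000  100
```

For the two configurations in this issue, both lines report 8 GPUs per node, and the per-rank request count equals the CUDA graph batch size (2 and 128 respectively).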

