
[PD]: support prefill tp and decode dp+tp for mooncake backend#5887

Closed
ZhengWG wants to merge 9 commits into sgl-project:main from ZhengWG:pd/tp-dp-remapping

Conversation

@ZhengWG
Collaborator

@ZhengWG ZhengWG commented Apr 29, 2025

Motivation

Support Prefill only with TP and Decode with TP/DP/EP

NOTE: only support for MLA model

Modifications

Main design:

  • For each attn-tp rank in Decode, a corresponding set of tp ranks is selected from Prefill; only one is kept as the real rank and the others act as dummy ranks.
  • For each tp rank in Prefill, only the real rank sends kv_cache/aux data.

Example:

  • Decode: TP4/DP2 (so each attn-tp group has 2 ranks); Prefill: TP4.
    Decode engine_rank=0 contacts Prefill engine_rank=0 (real rank) and engine_rank=2 (dummy rank); Decode engine_rank=1 contacts Prefill engine_rank=1 (real rank) and engine_rank=3 (dummy rank).

This design is inspired by PR #5922 and PR #5681.
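The mapping in the example above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not the actual PR code: each decode attn-tp rank claims the prefill tp ranks at the same index modulo the decode attn-tp group size, keeping the first as the real source and the rest as dummies (with MLA, the kv_cache is replicated across tp ranks, so any candidate could serve it).

```python
def map_decode_rank_to_prefill_ranks(decode_attn_tp_rank: int,
                                     decode_attn_tp_size: int,
                                     prefill_tp_size: int):
    # All prefill tp ranks that correspond to this decode attn-tp rank.
    candidates = list(range(decode_attn_tp_rank, prefill_tp_size, decode_attn_tp_size))
    real_rank = candidates[0]      # only this rank actually sends kv_cache/aux data
    dummy_ranks = candidates[1:]   # these complete the handshake but send nothing
    return real_rank, dummy_ranks

# Decode TP4/DP2 (attn-tp group size 2) against Prefill TP4:
print(map_decode_rank_to_prefill_ranks(0, 2, 4))  # → (0, [2])
print(map_decode_rank_to_prefill_ranks(1, 2, 4))  # → (1, [3])
```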

Test

Since the prefill task is compute-bound, the TP-only configuration outperforms DP+TP, particularly in scenarios where prefill performance becomes the bottleneck. Below are the test results for DeepSeek-R1 1P/1D performance on H20 (96GB):

| Prefill | Decode | Input/Output | Mean TTFT (ms) | Mean ITL (ms) | QPS |
|---------|---------|--------------|----------------|----------------|------|
| TP8 | TP8/DP2 | 4096/100 | 5893.11 | 25.51 | 1.90 |
| TP8/DP2 | TP8/DP2 | 4096/100 | 7066.45 | 25.71 | 1.82 |
| TP8 | TP8/DP4 | 4096/100 | 9933.37 | 24.21 | 1.87 |
| TP8/DP4 | TP8/DP4 | 4096/100 | 48255.53 | 25.17 | 1.44 |

Below are test examples:

```bash
# decode tp+dp
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --host $HOST_IP --port $HOST_PORT --trust-remote-code --tp-size ${TP_SIZE} --quantization fp8 \
    --mem-fraction-static 0.95 --enable-mixed-chunk --flashinfer-mla-disable-ragged \
    --enable-dp-attention --dp-size ${DP_SIZE} --attention-backend flashinfer \
    --disaggregation-mode decode --page-size 128 --disaggregation-with-mla True

# prefill-only tp
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --host $HOST_IP --port $HOST_PORT --trust-remote-code --tp-size ${TP_SIZE} --quantization fp8 \
    --mem-fraction-static 0.93 --attention-backend fa3 \
    --disaggregation-mode prefill --page-size 128 --disaggregation-with-mla True

# prefill tp+dp
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --host $HOST_IP --port $HOST_PORT --trust-remote-code --tp-size ${TP_SIZE} --quantization fp8 \
    --mem-fraction-static 0.95 --attention-backend fa3 \
    --enable-dp-attention --dp-size ${DP_SIZE} \
    --disaggregation-mode prefill --page-size 128 --disaggregation-with-mla True

# minilb
python3 -m sglang.srt.disaggregation.mini_lb --prefill ${PREFILL_URL} \
    --decode ${DECODE_URL} --host ${HOST_IP} --port ${SERVER_PORT}

# benchmark
python3 bench_serving.py --model ${MODEL_PATH} \
    --tokenizer ${MODEL_PATH} \
    --num-prompts 500 \
    --random-input-len 4096 \
    --random-output-len 100 \
    --host $SERVER_IP \
    --port $PORT \
    --random-range-ratio 1 \
    --max-concurrency 100 \
    --dataset-name random \
    --request-rate 2
```

CC: @ShangmingCai @whybeyoung @jokerwyt


@jokerwyt
Contributor

I suggest a random choice of real source from the available TP ranks so that we can split the traffic.
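This suggestion can be sketched as follows (illustrative only, with a hypothetical helper name): instead of always taking the first candidate as the real source, pick it at random, so that over many requests the transfer traffic spreads across the available prefill TP ranks.

```python
import random

def pick_real_source(candidate_ranks, rng=random):
    # Randomly promote one candidate prefill rank to be the real kv_cache
    # sender; the remaining ranks stay dummies. Over many requests this
    # spreads transfer traffic across all candidate ranks.
    real_rank = rng.choice(candidate_ranks)
    dummy_ranks = [r for r in candidate_ranks if r != real_rank]
    return real_rank, dummy_ranks

real, dummies = pick_real_source([0, 2])
print(real, dummies)  # e.g. 0 [2] or 2 [0]
```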

@whybeyoung
Collaborator

@ShangmingCai

@whybeyoung
Collaborator

LGTM

@whybeyoung
Collaborator

LGTM

And can you share some benchmark data ?

@ZhengWG
Collaborator Author

ZhengWG commented Apr 30, 2025

> LGTM
>
> And can you share some benchmark data ?

Thank you so much for the review; I will add the benchmark data soon.

@ZhengWG
Collaborator Author

ZhengWG commented Apr 30, 2025

> I suggest a random choice of real source from the available TP ranks so that we can split the traffic.

Good suggestion! I will fix it~

Collaborator

@ShangmingCai ShangmingCai left a comment


Correct me if I am wrong, but I think what this PR really wants to achieve is to support PD with different tp sizes only for MLA models?

In my opinion, we need a more general design to support PD with different tp sizes for all kinds of models, so I am not really comfortable with the dummy design, since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.

I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimal changes to support MLA in advance based on that PR.

@ZhengWG
Collaborator Author

ZhengWG commented Apr 30, 2025

> Correct me if I am wrong, but I think what this PR really wants to achieve is to support PD with different tp sizes only for MLA models?
>
> In my opinion, we need a more general design to support PD with different tp sizes for all kinds of models, so I am not really comfortable with the dummy design, since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.
>
> I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimal changes to support MLA in advance based on that PR.

Thank you for your response!

Yes, this PR currently only works with the MLA model. It would be great if a more general design is released soon.

Could you let me know when the PR you mentioned will be available? I’d like to focus on that one to add MLA model support.

@ZhengWG ZhengWG force-pushed the pd/tp-dp-remapping branch from d791f52 to 76ce628 Compare April 30, 2025 11:26
@ShangmingCai
Collaborator

> > Correct me if I am wrong, but I think what this PR really wants to achieve is to support PD with different tp sizes only for MLA models?
> > In my opinion, we need a more general design to support PD with different tp sizes for all kinds of models, so I am not really comfortable with the dummy design, since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.
> > I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimal changes to support MLA in advance based on that PR.
>
> Thank you for your response!
>
> Yes, this PR currently only works with the MLA model. It would be great if a more general design is released soon.
>
> Could you let me know when the PR you mentioned will be available? I’d like to focus on that one to add MLA model support.

I am on vacation now. Will spend some time tonight drafting a PR when I get back to hotel.

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
ShangmingCai and others added 2 commits May 1, 2025 02:20
@ZhengWG ZhengWG force-pushed the pd/tp-dp-remapping branch from 76ce628 to a4a8745 Compare May 1, 2025 12:44
Collaborator

@ShangmingCai ShangmingCai left a comment


Thanks for the updates! I will find a time to discuss this PR with the SGLang team to see if we have a better solution to detect whether MLA is used. Other parts LGTM. Will run more tests to verify when I have some free time.

@ZhengWG ZhengWG requested a review from ShangmingCai May 6, 2025 06:04
@ShangmingCai
Collaborator

@ZhengWG I am working on implementing the PD CI, will finish different tp for MLA tomorrow.

ShangmingCai pushed a commit to kvcache-ai/sglang that referenced this pull request May 7, 2025
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
@ZhengWG ZhengWG closed this May 27, 2025