
[PD]: support prefill tp and decode dp+tp for mooncake backend#5887

Closed
ZhengWG wants to merge 9 commits into sgl-project:main from ZhengWG:pd/tp-dp-remapping

Conversation

@ZhengWG
Collaborator

@ZhengWG ZhengWG commented Apr 29, 2025

Motivation

Support Prefill only with TP and Decode with TP/DP/EP

NOTE: only support for MLA model

Modifications

Main design:

  • For each attn-tp rank in Decode, a corresponding set of tp ranks is selected from Prefill; only one is kept as the real rank and the others act as dummy ranks.
  • For each tp rank in Prefill, only the real rank sends kv_cache/aux data.

Example:

  • Decode: TP4/DP2 (so each attn-tp group has 2 ranks); Prefill: TP4.
    Decode engine_rank=0 contacts Prefill engine_rank=0 (real rank) and engine_rank=2 (dummy rank); Decode engine_rank=1 contacts Prefill engine_rank=1 (real rank) and engine_rank=3 (dummy rank).

This design is inspired by PR #5922 and PR #5681.
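The mapping in the example above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not the actual PR code: each decode attn-tp rank claims the prefill tp ranks at the same index modulo the decode attn-tp group size, keeping the first as the real source and the rest as dummies (with MLA, the kv_cache is replicated across tp ranks, so any candidate could serve it).

```python
def map_decode_rank_to_prefill_ranks(decode_attn_tp_rank: int,
                                     decode_attn_tp_size: int,
                                     prefill_tp_size: int):
    # All prefill tp ranks that correspond to this decode attn-tp rank.
    candidates = list(range(decode_attn_tp_rank, prefill_tp_size, decode_attn_tp_size))
    real_rank = candidates[0]      # only this rank actually sends kv_cache/aux data
    dummy_ranks = candidates[1:]   # these complete the handshake but send nothing
    return real_rank, dummy_ranks

# Decode TP4/DP2 (attn-tp group size 2) against Prefill TP4:
print(map_decode_rank_to_prefill_ranks(0, 2, 4))  # → (0, [2])
print(map_decode_rank_to_prefill_ranks(1, 2, 4))  # → (1, [3])
```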

Test

Since the prefill task is compute-bound, the TP-only configuration outperforms DP+TP, particularly in scenarios where prefill performance becomes the bottleneck. Below are the test results for DeepSeek-R1 1P/1D performance on H20 (96GB):

| Prefill | Decode | Input/Output | Mean TTFT (ms) | Mean ITL (ms) | QPS |
|---------|---------|--------------|----------------|----------------|------|
| TP8 | TP8/DP2 | 4096/100 | 5893.11 | 25.51 | 1.90 |
| TP8/DP2 | TP8/DP2 | 4096/100 | 7066.45 | 25.71 | 1.82 |
| TP8 | TP8/DP4 | 4096/100 | 9933.37 | 24.21 | 1.87 |
| TP8/DP4 | TP8/DP4 | 4096/100 | 48255.53 | 25.17 | 1.44 |

Below are test examples:

```bash
# decode tp+dp
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --host $HOST_IP --port $HOST_PORT --trust-remote-code --tp-size ${TP_SIZE} --quantization fp8 \
    --mem-fraction-static 0.95 --enable-mixed-chunk --flashinfer-mla-disable-ragged \
    --enable-dp-attention --dp-size ${DP_SIZE} --attention-backend flashinfer \
    --disaggregation-mode decode --page-size 128 --disaggregation-with-mla True

# prefill-only tp
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --host $HOST_IP --port $HOST_PORT --trust-remote-code --tp-size ${TP_SIZE} --quantization fp8 \
    --mem-fraction-static 0.93 --attention-backend fa3 \
    --disaggregation-mode prefill --page-size 128 --disaggregation-with-mla True

# prefill tp+dp
python3 -m sglang.launch_server --model-path ${MODEL_PATH} \
    --host $HOST_IP --port $HOST_PORT --trust-remote-code --tp-size ${TP_SIZE} --quantization fp8 \
    --mem-fraction-static 0.95 --attention-backend fa3 \
    --enable-dp-attention --dp-size ${DP_SIZE} \
    --disaggregation-mode prefill --page-size 128 --disaggregation-with-mla True

# minilb
python3 -m sglang.srt.disaggregation.mini_lb --prefill ${PREFILL_URL} \
    --decode ${DECODE_URL} --host ${HOST_IP} --port ${SERVER_PORT}

# benchmark
python3 bench_serving.py --model ${MODEL_PATH} \
    --tokenizer ${MODEL_PATH} \
    --num-prompts 500 \
    --random-input-len 4096 \
    --random-output-len 100 \
    --host $SERVER_IP \
    --port $PORT \
    --random-range-ratio 1 \
    --max-concurrency 100 \
    --dataset-name random \
    --request-rate 2
```

CC: @ShangmingCai @whybeyoung @jokerwyt


@jokerwyt
Contributor

I suggest a random choice of real source from the available TP ranks so that we can split the traffic.
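This suggestion can be sketched as follows (illustrative only, with a hypothetical helper name): instead of always taking the first candidate as the real source, pick it at random, so that over many requests the transfer traffic spreads across the available prefill TP ranks.

```python
import random

def pick_real_source(candidate_ranks, rng=random):
    # Randomly promote one candidate prefill rank to be the real kv_cache
    # sender; the remaining ranks stay dummies. Over many requests this
    # spreads transfer traffic across all candidate ranks.
    real_rank = rng.choice(candidate_ranks)
    dummy_ranks = [r for r in candidate_ranks if r != real_rank]
    return real_rank, dummy_ranks

real, dummies = pick_real_source([0, 2])
print(real, dummies)  # e.g. 0 [2] or 2 [0]
```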

@whybeyoung
Collaborator

@ShangmingCai

@whybeyoung
Collaborator

LGTM

@whybeyoung
Collaborator

LGTM

And can you share some benchmark data ?

@ZhengWG
Collaborator Author

ZhengWG commented Apr 30, 2025

> LGTM
>
> And can you share some benchmark data ?

Thank you so much for the review; I will add the benchmark data soon.

@ZhengWG
Collaborator Author

ZhengWG commented Apr 30, 2025

> I suggest a random choice of real source from the available TP ranks so that we can split the traffic.

Good suggestion! I will fix it~

Collaborator

@ShangmingCai ShangmingCai left a comment


Correct me if I am wrong, but I think what this PR really wants to achieve is to support PD with different tp sizes only for MLA models?

In my opinion, we need a more general design to support PD with different tp sizes for all kinds of models, so I am not really comfortable with the dummy design, since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.

I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimal changes to support MLA in advance based on that PR.

@ZhengWG
Collaborator Author

ZhengWG commented Apr 30, 2025

> Correct me if I am wrong, but I think what this PR really wants to achieve is to support PD with different tp sizes only for MLA models?
>
> In my opinion, we need a more general design to support PD with different tp sizes for all kinds of models, so I am not really comfortable with the dummy design, since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.
>
> I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimal changes to support MLA in advance based on that PR.

Thank you for your response!

Yes, this PR currently only works with the MLA model. It would be great if a more general design is released soon.

Could you let me know when the PR you mentioned will be available? I’d like to focus on that one to add MLA model support.

@ZhengWG ZhengWG force-pushed the pd/tp-dp-remapping branch from d791f52 to 76ce628 Compare April 30, 2025 11:26
@ShangmingCai
Collaborator

> > Correct me if I am wrong, but I think what this PR really wants to achieve is to support PD with different tp sizes only for MLA models?
> > In my opinion, we need a more general design to support PD with different tp sizes for all kinds of models, so I am not really comfortable with the dummy design, since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.
> > I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimal changes to support MLA in advance based on that PR.
>
> Thank you for your response!
>
> Yes, this PR currently only works with the MLA model. It would be great if a more general design is released soon.
>
> Could you let me know when the PR you mentioned will be available? I’d like to focus on that one to add MLA model support.

I am on vacation now. Will spend some time tonight drafting a PR when I get back to hotel.

Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
ShangmingCai and others added 2 commits May 1, 2025 02:20
@ZhengWG ZhengWG force-pushed the pd/tp-dp-remapping branch from 76ce628 to a4a8745 Compare May 1, 2025 12:44
Collaborator

@ShangmingCai ShangmingCai left a comment


Thanks for the updates! I will find a time to discuss this PR with the SGLang team to see if we have a better solution to detect whether MLA is used. Other parts LGTM. Will run more tests to verify when I have some free time.

@ZhengWG ZhengWG requested a review from ShangmingCai May 6, 2025 06:04
@ShangmingCai
Collaborator

@ZhengWG I am working on implementing the PD CI, will finish different tp for MLA tomorrow.

ShangmingCai pushed a commit to kvcache-ai/sglang that referenced this pull request May 7, 2025
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
@ZhengWG ZhengWG closed this May 27, 2025