[PD]: support prefill tp and decode dp+tp for mooncake backend #5887
ZhengWG wants to merge 9 commits into sgl-project:main
Conversation
I suggest a random choice of real source from the available TP ranks so that we can split the traffic.
LGTM
And can you share some benchmark data?
Thank you so much for the review, I will add the benchmark data soon.
Good suggestion! I will fix it~
Correct me if I am wrong, I think what this PR really wants to achieve is to support PD with different tp size only for MLA models?
In my opinion, we need a more general design to support PD with different tp size for all kinds of models, so I am not really comfortable about the dummy design since it seems like we change a lot of code only to support MLA with prefill_dp_size restricted to 1.
I think I can submit a PR to have the bootstrap mechanism support all kinds of situations with different tp sizes + dp first. Although we still need to figure out how to correctly handle MHA/GQA kvcache data transfer, you can submit a PR with minimum changes to support MLA in advance based on that PR.
Thank you for your response! Yes, this PR currently only works with the MLA model. It would be great if a more general design is released soon. Could you let me know when the PR you mentioned will be available? I'd like to focus on that one to add MLA model support.
d791f52 to 76ce628
I am on vacation now. Will spend some time tonight drafting a PR when I get back to the hotel.
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
…tp_size_per_dp_rank Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
76ce628 to a4a8745
ShangmingCai left a comment
Thanks for the updates! I will find time to discuss this PR with the SGLang team to see if we have a better way to detect whether MLA is used. Other parts LGTM. Will run more tests to verify when I have some free time.
@ZhengWG I am working on implementing the PD CI, will finish different tp for MLA tomorrow.
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Motivation
Support prefill with TP only and decode with TP/DP/EP.
NOTE: only MLA models are supported.
Modifications
main design:
Example:
Decode engine_rank=0 will contact Prefill engine_rank=0 (real rank) and engine_rank=2 (dummy rank); Decode engine_rank=1 will contact Prefill engine_rank=1 (real rank) and engine_rank=3 (dummy rank).
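The rank mapping in this example can be sketched as below. This is only an illustration, not the code in this PR: the function name, the parameter names, and the assumption that a decode rank's position inside its DP group is `engine_rank % decode_tp_size_per_dp_rank` are all hypothetical. The optional `rng` argument sketches the review suggestion of picking the real source at random from the candidate TP ranks so that traffic is split.

```python
import random
from typing import List, Optional, Tuple


def prefill_ranks_for_decode(
    decode_engine_rank: int,
    prefill_tp_size: int,
    decode_tp_size_per_dp_rank: int,
    rng: Optional[random.Random] = None,
) -> Tuple[int, List[int]]:
    """Return (real_rank, dummy_ranks) of prefill ranks for one decode rank.

    Hypothetical helper: assumes the prefill TP size is a multiple of the
    decode TP size per DP rank, and that every prefill rank can serve as
    the real source (the PR is limited to MLA models).
    """
    if prefill_tp_size % decode_tp_size_per_dp_rank != 0:
        raise ValueError("prefill_tp_size must be a multiple of decode_tp_size_per_dp_rank")

    # Position of this decode rank inside its own DP group (assumed layout).
    local_rank = decode_engine_rank % decode_tp_size_per_dp_rank

    # Every prefill rank that lands on the same local position.
    candidates = list(range(local_rank, prefill_tp_size, decode_tp_size_per_dp_rank))

    # Default: the first candidate is the real source. With an rng, choose
    # one at random to spread transfer traffic across prefill ranks.
    real_rank = rng.choice(candidates) if rng is not None else candidates[0]
    dummy_ranks = [r for r in candidates if r != real_rank]
    return real_rank, dummy_ranks


if __name__ == "__main__":
    # Prefill TP=4, decode TP per DP rank=2, as in the example above.
    for rank in (0, 1):
        print(rank, prefill_ranks_for_decode(rank, 4, 2))
```

With prefill TP=4 and a decode TP size of 2 per DP rank, the default branch gives real rank 0 and dummy rank 2 for decode rank 0, and real rank 1 and dummy rank 3 for decode rank 1, matching the example.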
This design is inspired by PR-5922 and PR-5681.
Test
Since the prefill task is compute-bound, the TP-only configuration outperforms DP+TP, particularly in scenarios where prefill performance becomes the bottleneck. Below are the test results for DeepSeek-R1 1P/1D performance on H20 (96GB):
Below are test examples:
CC: @ShangmingCai @whybeyoung @jokerwyt
Checklist