
[1/2] support mla dp: custom alltoall #5000

Closed

xu-yfei wants to merge 1 commit into sgl-project:main from xu-yfei:mla_dp_kernel

Conversation

xu-yfei (Contributor) commented Apr 2, 2025

Motivation

This PR is part of the DP MLA work in #5001.

About DP MLA:
On an 8×H20 (96GB) node, weight memory usage reaches 87.19 GB with --dp-size 4 --enable-dp-attention, leaving insufficient memory.

This optimization is similar to data-parallel attention, but it applies data parallelism to the MLA core rather than to the entire attention module. Unlike data-parallel attention, it does not further increase the memory occupied by weights. It allows a significant reduction in KV cache size and enables larger batch sizes, at the cost of only 1-3 ms of additional decode latency.

On an 8×H20 (96GB) node with data-parallel MLA enabled, we achieved a 1.85× (dp=4) to 2.34× (dp=8) decoding throughput improvement over the previous version, and KV cache capacity increased by 3.3× (dp=4) and 6.6× (dp=8).

Modifications

Implement a custom all_to_all that supports dynamic input/output split sizes (split_sizes), built on sglang's custom-allreduce IPC mechanism; currently limited to single-machine use.
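
For orientation, here is a minimal reference of the exchange semantics, written against torch.distributed; the function name is hypothetical, and the device-to-host sync in it is precisely what the custom kernel avoids (the actual kernel keeps the split sizes on the GPU):

```python
import torch
import torch.distributed as dist

def all_to_all_dynamic_ref(inp: torch.Tensor,
                           input_split_sizes: torch.Tensor,
                           output_split_sizes: torch.Tensor,
                           group=None) -> torch.Tensor:
    """Reference semantics only: exchange variable-sized row chunks
    between ranks. The custom kernel keeps split sizes on the GPU;
    this reference syncs them to the host first, which is exactly
    what CUDA graph capture forbids and the custom kernel removes."""
    in_splits = input_split_sizes.tolist()    # device -> host sync
    out_splits = output_split_sizes.tolist()  # device -> host sync
    out = inp.new_empty((sum(out_splits), *inp.shape[1:]))
    dist.all_to_all_single(out, inp,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits,
                           group=group)
    return out
```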

xu-yfei mentioned this pull request Apr 2, 2025
xu-yfei changed the title several times between Apr 2 and Apr 29, 2025, settling on [1/2] support mla dp: custom alltoall
xu-yfei force-pushed the mla_dp_kernel branch 2 times, most recently from 06dc284 to 26be401, on April 30, 2025
xu-yfei requested a review from ch-wan as a code owner on April 30, 2025
xu-yfei force-pushed the mla_dp_kernel branch 3 times, most recently from c7a5149 to 3ee96d0, on May 12, 2025
zyksir (Collaborator) left a comment


Hi, the implementation of the custom all-to-all is really impressive! Have you compared its performance against an NCCL send/recv-based alltoall? By using ncclGroupStart, ncclGroupEnd, and ncclSend/ncclRecv, we should be able to implement an all-to-all as well.
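
(For reference, a minimal sketch of the send/recv-based alltoall this comment describes, using torch.distributed's batched P2P ops, which under NCCL correspond to ncclSend/ncclRecv wrapped in ncclGroupStart/ncclGroupEnd; the function name and list-of-chunks layout are illustrative:)

```python
import torch
import torch.distributed as dist

def sendrecv_all_to_all(inputs: list, outputs: list) -> None:
    """All-to-all built from point-to-point ops: inputs[i] is the chunk
    for rank i, outputs[i] is preallocated to receive rank i's chunk.
    batch_isend_irecv groups the P2P calls into one NCCL group."""
    rank = dist.get_rank()
    ops = []
    for peer, chunk in enumerate(inputs):
        if peer == rank:
            outputs[peer].copy_(chunk)  # local chunk: plain device copy
            continue
        ops.append(dist.P2POp(dist.isend, chunk, peer))
        ops.append(dist.P2POp(dist.irecv, outputs[peer], peer))
    if ops:
        for work in dist.batch_isend_irecv(ops):
            work.wait()
```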

jokerwyt (Contributor) commented Jun 4, 2025

Hi, your design of dynamically adjusting the warp count is cool! I am curious about the performance comparison with a direct NCCL implementation, especially under skewed send/receive splits. Also, could we use NVSHMEM to get an internode implementation?

xu-yfei (Contributor, Author) commented Jun 9, 2025

> Hi, the implementation of the custom all-to-all is really impressive! Have you compared its performance against an NCCL send/recv-based alltoall? By using ncclGroupStart, ncclGroupEnd, and ncclSend/ncclRecv, we should be able to implement an all-to-all as well.

1. In the CUDA graph scenario, the actual token count of each DP rank is uncertain; only the sum is known. If the full sum were sent to every other DP rank, the traffic would be large. This all-to-all accepts the input/output split sizes as tensors, so the amount of data each DP rank sends to the others need not be fixed under CUDA graph capture; the default all_to_all requires the input/output split sizes to be constants (see the sketch after this list).
2. The default all-to-all uses a large amount of runtime memory, 2 to 3 GB, and we have not yet found the reason.
3. It supports fusing transpose into the input/output paths, because a standalone transpose is time-consuming.
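
(To make point 1 concrete: without tensor split sizes, the CUDA-graph-safe fallback is to pad every per-rank chunk to a worst-case size so the splits become static constants, i.e. always "sending the sum". A minimal sketch; the function name and the world_size * max_tokens row layout are our assumptions:)

```python
import torch
import torch.distributed as dist

def padded_all_to_all(inp_padded: torch.Tensor,
                      world_size: int,
                      max_tokens: int) -> torch.Tensor:
    """CUDA-graph-safe fallback without tensor split sizes: every
    per-rank chunk is padded to max_tokens rows, so the (equal) splits
    are static and capturable -- at the cost of always transferring
    world_size * max_tokens rows, however few tokens are real."""
    assert inp_padded.shape[0] == world_size * max_tokens
    out = torch.empty_like(inp_padded)
    dist.all_to_all_single(out, inp_padded)  # equal, static splits
    return out
```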

xu-yfei (Contributor, Author) commented Jun 9, 2025

> Hi, your design of dynamically adjusting the warp count is cool! I am curious about the performance comparison with a direct NCCL implementation, especially under skewed send/receive splits. Also, could we use NVSHMEM to get an internode implementation?

1. As mentioned in the reply above, the default all-to-all is not used, for the reasons given there.
2. An alltoall within a single node is sufficient. Inter-node communication overhead is relatively large, so we can use TP8 within a node and EP across nodes. With multiple nodes, the benefit is less pronounced compared to DP attention, because the attention linear layers account for much less of the total time, so reducing their cost yields smaller gains. If you have good ideas, you are welcome to discuss.

xu-yfei closed this Jun 26, 2025