
[1/2] support mla dp: custom alltoall #5000

Closed

xu-yfei wants to merge 1 commit into sgl-project:main from xu-yfei:mla_dp_kernel

Conversation

xu-yfei (Contributor) commented Apr 2, 2025

Motivation

This PR is part of the DP MLA work in #5001.

About DP MLA:
On an 8×H20 (96GB) node, weight memory usage reaches 87.19 GB with --dp-size 4 --enable-dp-attention, leaving insufficient memory.

This optimization is similar to data-parallel attention, but it applies data parallelism to the MLA core rather than to the entire attention module. Unlike data-parallel attention, it does not further increase the memory occupied by weights. It allows a significant reduction in KV cache size and enables larger batch sizes, at the cost of only 1-3 ms of additional decode latency.

On an 8×H20 (96GB) node with data-parallel MLA enabled, we achieved a 1.85× (dp=4) to 2.34× (dp=8) decoding throughput improvement over the previous version, and KV cache capacity increased by 3.3× (dp=4) and 6.6× (dp=8).

Modifications

Implement a custom all_to_all that supports dynamic input/output split sizes (split_sizes), built on sglang's custom-allreduce IPC mechanism; currently limited to single-machine use.
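
For orientation, here is a minimal reference of the exchange semantics, written against torch.distributed; the function name is hypothetical, and the device-to-host sync in it is precisely what the custom kernel avoids (the actual kernel keeps the split sizes on the GPU):

```python
import torch
import torch.distributed as dist

def all_to_all_dynamic_ref(inp: torch.Tensor,
                           input_split_sizes: torch.Tensor,
                           output_split_sizes: torch.Tensor,
                           group=None) -> torch.Tensor:
    """Reference semantics only: exchange variable-sized row chunks
    between ranks. The custom kernel keeps split sizes on the GPU;
    this reference syncs them to the host first, which is exactly
    what CUDA graph capture forbids and the custom kernel removes."""
    in_splits = input_split_sizes.tolist()    # device -> host sync
    out_splits = output_split_sizes.tolist()  # device -> host sync
    out = inp.new_empty((sum(out_splits), *inp.shape[1:]))
    dist.all_to_all_single(out, inp,
                           output_split_sizes=out_splits,
                           input_split_sizes=in_splits,
                           group=group)
    return out
```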

xu-yfei mentioned this pull request Apr 2, 2025
xu-yfei changed the title several times between Apr 2 and Apr 29, 2025, settling on [1/2] support mla dp: custom alltoall
xu-yfei force-pushed the mla_dp_kernel branch 2 times, most recently from 06dc284 to 26be401, on April 30, 2025
xu-yfei requested a review from ch-wan as a code owner on April 30, 2025
xu-yfei force-pushed the mla_dp_kernel branch 3 times, most recently from c7a5149 to 3ee96d0, on May 12, 2025
zyksir (Collaborator) left a comment


Hi, the implementation of the custom all-to-all is really impressive! Have you compared its performance against an NCCL send/recv-based alltoall? By using ncclGroupStart, ncclGroupEnd, and ncclSend/ncclRecv, we should be able to implement an all-to-all as well.
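
(For reference, a minimal sketch of the send/recv-based alltoall this comment describes, using torch.distributed's batched P2P ops, which under NCCL correspond to ncclSend/ncclRecv wrapped in ncclGroupStart/ncclGroupEnd; the function name and list-of-chunks layout are illustrative:)

```python
import torch
import torch.distributed as dist

def sendrecv_all_to_all(inputs: list, outputs: list) -> None:
    """All-to-all built from point-to-point ops: inputs[i] is the chunk
    for rank i, outputs[i] is preallocated to receive rank i's chunk.
    batch_isend_irecv groups the P2P calls into one NCCL group."""
    rank = dist.get_rank()
    ops = []
    for peer, chunk in enumerate(inputs):
        if peer == rank:
            outputs[peer].copy_(chunk)  # local chunk: plain device copy
            continue
        ops.append(dist.P2POp(dist.isend, chunk, peer))
        ops.append(dist.P2POp(dist.irecv, outputs[peer], peer))
    if ops:
        for work in dist.batch_isend_irecv(ops):
            work.wait()
```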

jokerwyt (Contributor) commented Jun 4, 2025

Hi, your design of dynamically adjusting the warp count is cool! I am curious about the performance comparison with a direct NCCL implementation, especially under skewed send/receive splits. Also, could we use NVSHMEM to get an internode implementation?

xu-yfei (Contributor, Author) commented Jun 9, 2025

> Hi, the implementation of the custom all-to-all is really impressive! Have you compared its performance against an NCCL send/recv-based alltoall? By using ncclGroupStart, ncclGroupEnd, and ncclSend/ncclRecv, we should be able to implement an all-to-all as well.

1. In the CUDA graph scenario, the actual token count of each DP rank is uncertain; only the sum is known. If the full sum were sent to every other DP rank, the traffic would be large. This all-to-all accepts the input/output split sizes as tensors, so the amount of data each DP rank sends to the others need not be fixed under CUDA graph capture; the default all_to_all requires the input/output split sizes to be constants (see the sketch after this list).
2. The default all-to-all uses a large amount of runtime memory, 2 to 3 GB, and we have not yet found the reason.
3. It supports fusing transpose into the input/output paths, because a standalone transpose is time-consuming.
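
(To make point 1 concrete: without tensor split sizes, the CUDA-graph-safe fallback is to pad every per-rank chunk to a worst-case size so the splits become static constants, i.e. always "sending the sum". A minimal sketch; the function name and the world_size * max_tokens row layout are our assumptions:)

```python
import torch
import torch.distributed as dist

def padded_all_to_all(inp_padded: torch.Tensor,
                      world_size: int,
                      max_tokens: int) -> torch.Tensor:
    """CUDA-graph-safe fallback without tensor split sizes: every
    per-rank chunk is padded to max_tokens rows, so the (equal) splits
    are static and capturable -- at the cost of always transferring
    world_size * max_tokens rows, however few tokens are real."""
    assert inp_padded.shape[0] == world_size * max_tokens
    out = torch.empty_like(inp_padded)
    dist.all_to_all_single(out, inp_padded)  # equal, static splits
    return out
```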

xu-yfei (Contributor, Author) commented Jun 9, 2025

> Hi, your design of dynamically adjusting the warp count is cool! I am curious about the performance comparison with a direct NCCL implementation, especially under skewed send/receive splits. Also, could we use NVSHMEM to get an internode implementation?

1. As mentioned in the reply above, the default all-to-all is not used, for the reasons given there.
2. An alltoall within a single node is sufficient. Inter-node communication overhead is relatively large, so we can use TP8 within a node and EP across nodes. With multiple nodes, the benefit is less pronounced compared to DP attention, because the attention linear layers account for much less of the total time, so reducing their cost yields smaller gains. If you have good ideas, you are welcome to discuss.

xu-yfei closed this Jun 26, 2025