[1/2] support mla dp: custom alltoall #5000
Conversation
zyksir left a comment
Hi, the implementation of the custom all-to-all is really impressive! Have you compared its performance against an NCCL send/recv-based all-to-all? Using ncclGroupStart, ncclGroupEnd, and ncclSend/ncclRecv, we should be able to implement an all-to-all as well.
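For reference, here is a minimal sketch of the send/recv-based all-to-all described above, written with torch.distributed P2P ops, which are issued inside ncclGroupStart/ncclGroupEnd on the NCCL backend. The function name, chunk layout, and the local-copy shortcut are illustrative assumptions, not code from this PR.

```python
import torch
import torch.distributed as dist


def sendrecv_all_to_all(output_chunks, input_chunks):
    """All-to-all built from grouped point-to-point ops.

    input_chunks[i]  is the tensor this rank sends to rank i.
    output_chunks[i] is the buffer that receives data from rank i.
    Chunk sizes may differ per peer (uneven splits).
    """
    rank = dist.get_rank()
    ops = []
    for peer in range(dist.get_world_size()):
        if peer == rank:
            # The rank's own chunk needs no communication.
            output_chunks[peer].copy_(input_chunks[peer])
            continue
        ops.append(dist.P2POp(dist.isend, input_chunks[peer], peer))
        ops.append(dist.P2POp(dist.irecv, output_chunks[peer], peer))
    # On the NCCL backend these ops are grouped, i.e. the
    # ncclGroupStart / ncclSend / ncclRecv / ncclGroupEnd pattern above.
    for req in dist.batch_isend_irecv(ops):
        req.wait()
```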
Hi, your design of dynamically adjusting the warp count is cool! I am curious about the performance comparison with a direct NCCL implementation, especially under a skewed send/receive split. Also, could we use NVSHMEM for an inter-node implementation?
1. As mentioned in the reply above, the default all-to-all is not used.
Motivation
This PR is part of DP MLA #5001.
About DP MLA:
On an 8×H20 (96GB) node, weight memory usage is 87.19 GB with --dp-size 4 --enable-dp-attention, which leaves too little memory. This optimization is similar to data-parallel attention, but it applies to the MLA core instead of the entire attention. Unlike data-parallel attention, it does not additionally increase the memory occupied by weights. It allows a significant reduction in KV cache size and enables larger batch sizes, at the cost of only 1-3 ms of additional decode latency.
On an 8×H20 (96GB) node with data-parallel MLA enabled, we achieve up to 1.85x (dp=4) to 2.34x (dp=8) higher decoding throughput compared with the previous version, and the KV cache capacity increases by 3.3x (dp=4) and 6.6x (dp=8).
Modifications
Implement a custom all_to_all with dynamic input/output splitting (split_sizes), built on sglang's allreduce IPC mechanism; it is currently limited to single-node use. The intended semantics are sketched below.
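A minimal sketch of the dynamic-split all-to-all semantics, expressed here with the stock torch.distributed.all_to_all_single collective rather than the custom IPC kernel from this PR; tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.distributed as dist


def dynamic_all_to_all(x, input_split_sizes, output_split_sizes):
    """x is a concatenation along dim 0 of per-destination-rank chunks;
    the split sizes can change every step (dynamic split_sizes)."""
    out = x.new_empty((sum(output_split_sizes),) + tuple(x.shape[1:]))
    dist.all_to_all_single(
        out,
        x,
        output_split_sizes=output_split_sizes,
        input_split_sizes=input_split_sizes,
    )
    return out
```

As described above, the custom kernel performs the same exchange on a single node through the IPC buffers used by sglang's custom allreduce, instead of going through the NCCL collective.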
Checklist