Use allgatherv for sparse allreduce #22226
Closed
Labels:
- enhancement: Not as big of a feature, but technically not a bug. Should be easy to fix.
- oncall: distributed: Add this issue/PR to distributed oncall triage queue.
- triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.
The current sparse allreduce in ProcessGroupGloo pads the indices and values tensors to the maximum length across all processes and then performs a regular allgather (because the tensors then have equal size across processes). Instead, once #22225 is merged, we can use allgatherv and avoid the padding trick. This is mostly a win for memory usage if there is severe size imbalance between processes. The runtime likely won't change much, due to the nature of the underlying allgather implementation (it takes N steps, where each step takes an amount of time proportional to the size of the largest contribution).
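To illustrate the memory argument, here is a minimal, hypothetical sketch (plain Python, not the actual ProcessGroupGloo code) comparing the gathered-buffer size of the padding-based allgather with an allgatherv-style gather. The function names and the example sizes are made up for illustration.

```python
def allgather_padded(contribs):
    """Padding trick: every process pads its contribution to the maximum
    length, so the gathered buffer holds world_size * max_len elements."""
    max_len = max(len(c) for c in contribs)
    padded = [c + [0] * (max_len - len(c)) for c in contribs]
    gathered = [x for p in padded for x in p]
    return gathered, len(gathered)


def allgatherv_style(contribs):
    """allgatherv concatenates variable-length contributions directly,
    so the gathered buffer holds only sum(len(c)) elements."""
    gathered = [x for c in contribs for x in c]
    return gathered, len(gathered)


# Severe size imbalance between processes: one large contribution.
contribs = [[1] * 1000, [2] * 10, [3] * 10, [4] * 10]
_, padded_size = allgather_padded(contribs)
_, v_size = allgatherv_style(contribs)
print(padded_size, v_size)  # → 4000 1030
```

With balanced contributions the two buffer sizes coincide, which is why the padding trick is mostly harmless in the common case; the gap only matters under imbalance.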