-
Notifications
You must be signed in to change notification settings - Fork 27.7k
Abort all nccl communicators explicitly when destroy process group #32231
Copy link
Copy link
Closed
Labels
enhancementNot as big of a feature, but technically not a bug. Should be easy to fixNot as big of a feature, but technically not a bug. Should be easy to fixoncall: distributedAdd this issue/PR to distributed oncall triage queueAdd this issue/PR to distributed oncall triage queuetriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
Metadata
Metadata
Assignees
Labels
enhancementNot as big of a feature, but technically not a bug. Should be easy to fixNot as big of a feature, but technically not a bug. Should be easy to fixoncall: distributedAdd this issue/PR to distributed oncall triage queueAdd this issue/PR to distributed oncall triage queuetriagedThis issue has been looked at a team member, and triaged and prioritized into an appropriate moduleThis issue has been looked at a team member, and triaged and prioritized into an appropriate module
right now, process group holds a vector of share_ptrs of nccl communicators, when process group is destroyed, the nccl communicators may be still held by pending/stuck processGroup::work.
Ideally all nccl communicators should be aborted explicitly when process group is destroyed, otherwise pending nccl kernels may block any CUDA op to run even after destroying process group, this is not good for failure recovery
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar