
Abort all NCCL communicators explicitly when destroying the process group #32231

@zhaojuanmao

Description

Right now, the process group holds a vector of shared_ptrs to NCCL communicators. When the process group is destroyed, the communicators may still be held by pending or stuck ProcessGroup::Work objects, so destroying the process group does not release them.

Ideally, all NCCL communicators should be aborted explicitly when the process group is destroyed. Otherwise, pending NCCL kernels may block any CUDA op from running even after the process group is destroyed, which is bad for failure recovery.
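
To make the proposal concrete, here is a minimal sketch in plain Python, not the actual c10d implementation. `FakeComm`, `FakeWork`, and `FakeProcessGroup` are hypothetical stand-ins made up for illustration. The point is only that a stuck work item can keep a communicator alive after the group is gone, so teardown should abort every communicator explicitly instead of relying on the last reference being dropped.

```python
class FakeComm:
    """Hypothetical stand-in for an NCCL communicator wrapper."""

    def __init__(self, rank):
        self.rank = rank
        self.aborted = False

    def abort(self):
        # Assumption: a real wrapper would call ncclCommAbort here.
        # Setting a flag keeps the sketch runnable without NCCL.
        self.aborted = True


class FakeWork:
    """Hypothetical pending work item that holds a reference to its communicator."""

    def __init__(self, comm):
        self.comm = comm


class FakeProcessGroup:
    """Hypothetical process group that owns a list of communicators."""

    def __init__(self, num_comms):
        self.comms = [FakeComm(i) for i in range(num_comms)]

    def launch(self, idx):
        # The returned work shares the communicator with the group.
        return FakeWork(self.comms[idx])

    def destroy(self):
        # Abort explicitly: outstanding work may still reference these
        # communicators, so dropping the group's references is not enough.
        for comm in self.comms:
            comm.abort()
        self.comms.clear()


pg = FakeProcessGroup(2)
stuck = pg.launch(0)        # simulates a pending/stuck work item
pg.destroy()
print(stuck.comm.aborted)   # True: the communicator was aborted even though the work still holds it
```

Without the explicit loop in `destroy()`, the communicator held by `stuck` would stay live for as long as the work object does, which is the situation described above.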

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar

Metadata

Labels

enhancement (Not as big of a feature, but technically not a bug. Should be easy to fix)

oncall: distributed (Add this issue/PR to distributed oncall triage queue)

triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
