Notify workers of failure of distributed backward pass #27643

@pritamdamania87

When a distributed backward pass fails on one worker, we should notify all other workers that the backward pass has failed, and each of those workers should try to cancel every task it is executing for that particular backward pass.
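To make the proposal concrete, here is a minimal sketch of the notify-and-cancel pattern in plain Python. It is not PyTorch code and does not use any `torch.distributed` API: the `Worker` class, `notify_failure`, `run_backward`, and `demo` are hypothetical names, threads stand in for RPC workers, and a `threading.Event` per context ID stands in for whatever cancellation state the autograd engine would keep for a backward pass.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Toy stand-in for an RPC worker (hypothetical, not a PyTorch class)."""

    def __init__(self, name):
        self.name = name
        self.peers = []          # other Worker objects, filled in after construction
        self._cancelled = {}     # context_id -> threading.Event
        self._lock = threading.Lock()

    def _event_for(self, context_id):
        # One cancellation flag per backward pass, created lazily.
        with self._lock:
            return self._cancelled.setdefault(context_id, threading.Event())

    def notify_failure(self, context_id):
        # In a real system this would arrive as a message from the failing worker.
        self._event_for(context_id).set()

    def run_backward(self, context_id, num_tasks, fail_at=None):
        cancelled = self._event_for(context_id)
        done = 0
        for i in range(num_tasks):
            if cancelled.is_set():
                return (self.name, "cancelled", done)
            if i == fail_at:
                # Simulated failure: tell every peer, then stop local work.
                for peer in self.peers:
                    peer.notify_failure(context_id)
                cancelled.set()
                return (self.name, "failed", done)
            # Simulate one unit of work; wait() returns True early if cancelled.
            if cancelled.wait(0.01):
                return (self.name, "cancelled", done)
            done += 1
        return (self.name, "completed", done)


def demo():
    workers = [Worker(f"w{i}") for i in range(3)]
    for w in workers:
        w.peers = [p for p in workers if p is not w]
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(
                w.run_backward,
                context_id=1,
                num_tasks=200,
                fail_at=2 if w.name == "w0" else None,
            )
            for w in workers
        ]
        return {name: status for name, status, _ in (f.result() for f in futures)}


if __name__ == "__main__":
    print(demo())
```

In this sketch the cancellation is cooperative: each worker checks the flag between tasks, so a task that is already running finishes (or times out of its wait) before the worker stops. Whether the real engine can interrupt in-flight tasks is a separate design question that this sketch does not address.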

cc @ezyang @ssnl @albanD @zou3519 @gqchen @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @aazzolini @xush6528

Metadata

Labels

better-engineering - Relatively self-contained tasks for better engineering contributors
module: autograd - Related to torch.autograd, and the autograd engine in general
module: rpc - Related to RPC, distributed autograd, RRef, and distributed optimizer
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
