RPC package should provide a backend-agnostic helper to verify names/ids #40048
Closed
Labels
better-engineering (Relatively self-contained tasks for better engineering contributors)
feature (A request for a proper, new feature.)
high priority
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triage review
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
In OSS, the ProcessGroup RPC backend verifies that all worker names are distinct and uses ranks as ids, so a wrong name-to-id mapping is caught at construction time. However, for other RPC backends without collective communication capabilities, it is not as easy to gather information from all workers and check its correctness.

An alternative is to let `torch.distributed.rpc` provide a backend-agnostic helper that checks all worker names using `c10d::Store` and assigns ids using the same `c10d::Store`. This should help prevent applications from introducing unintentional mapping errors.

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @xush6528
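To make the proposal concrete, here is a rough sketch of what a store-based helper could look like. It is not an existing `torch.distributed.rpc` API: `register_worker`, `gather_worker_map`, the key names, and the `InMemoryStore` stand-in are all hypothetical. The stand-in only mimics the `set`/`get`/`add` subset that the Python binding of `c10d::Store` exposes, so the example runs without a real store.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in with a c10d-Store-like set/get/add interface."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value.encode() if isinstance(value, str) else value

    def get(self, key):
        # Unlike a real store, this does not block; it raises KeyError if missing.
        with self._lock:
            return self._data[key]

    def add(self, key, amount):
        # Atomically increment a counter and return its new value.
        with self._lock:
            value = int(self._data.get(key, b"0")) + amount
            self._data[key] = str(value).encode()
            return value


def register_worker(store, name):
    """Claim `name` and return a unique id; raise if the name is already taken."""
    # The first caller to bump the per-name counter sees 1; any later caller
    # sees a larger value and knows the name is a duplicate.
    if store.add(f"rpc/name_claims/{name}", 1) > 1:
        raise ValueError(f"duplicate RPC worker name: {name!r}")
    # A shared counter hands out consecutive ids starting at 0.
    worker_id = store.add("rpc/next_id", 1) - 1
    store.set(f"rpc/id_to_name/{worker_id}", name)
    return worker_id


def gather_worker_map(store, world_size):
    """Read back the full name-to-id mapping once every worker has registered."""
    return {store.get(f"rpc/id_to_name/{i}").decode(): i for i in range(world_size)}


# Simulate three workers registering against one shared store.
store = InMemoryStore()
ids = [register_worker(store, n) for n in ["trainer0", "trainer1", "ps0"]]
worker_map = gather_worker_map(store, world_size=3)
```

The sketch relies only on an atomic `add`, so it needs no collective communication between workers. With a real store, `get` is expected to wait for a missing key up to a timeout rather than fail immediately, which would let `gather_worker_map` double as a barrier; that behavior is an assumption here and is not modeled by the stand-in.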