
RPC package should provide a backend-agnostic helper to verify names/ids #40048

@mrshenli


In OSS, the ProcessGroup RPC backend verifies that all worker names are distinct and uses ranks as ids, so a wrong name-to-id mapping is caught at construction time. For other RPC backends that lack collective communication capabilities, however, it is not as easy to gather information from all workers and check that it is correct.

An alternative is to let torch.distributed.rpc provide a backend-agnostic helper that uses a c10d::Store both to check that all worker names are distinct and to assign ids. This should help prevent applications from introducing unintentional mapping errors.
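
To make the idea concrete, here is a rough sketch of what such a helper could look like. It is not an existing torch.distributed.rpc API: `register_worker`, `lookup_worker_id`, the `rpc/...` key names, and the `InMemoryStore` class are all hypothetical. `InMemoryStore` is a single-process stand-in that only mimics the `add`/`set`/`get` style of a c10d store so the sketch runs without any backend; a real helper would be handed an actual store shared by all workers.

```python
import threading


class InMemoryStore:
    """Hypothetical single-process stand-in for a c10d-style key-value store."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, amount):
        # Atomically increment a counter and return its new value.
        with self._lock:
            value = int(self._data.get(key, b"0")) + amount
            self._data[key] = str(value).encode()
            return value

    def set(self, key, value):
        with self._lock:
            self._data[key] = value.encode() if isinstance(value, str) else value

    def get(self, key):
        # Unlike a real shared store, this does not block until the key exists.
        with self._lock:
            return self._data[key]


def register_worker(store, name):
    """Claim `name` and return a unique id, or raise if the name is taken."""
    # The first caller to bump the claim counter sees 1; any later caller sees > 1.
    if store.add(f"rpc/name_claims/{name}", 1) > 1:
        raise ValueError(f"duplicate RPC worker name: {name!r}")
    # A shared counter hands out ids 0, 1, 2, ... in registration order.
    worker_id = store.add("rpc/next_id", 1) - 1
    store.set(f"rpc/id_of/{name}", str(worker_id))
    return worker_id


def lookup_worker_id(store, name):
    """Return the id previously assigned to `name`."""
    return int(store.get(f"rpc/id_of/{name}"))
```

Because both the duplicate check and the id assignment rely only on an atomic counter increment, this approach would not need collective communication; each worker talks to the store independently. Note that ids assigned this way follow registration order rather than rank, which may or may not be acceptable for a given backend.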

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @xush6528


    Labels

    better-engineering - Relatively self-contained tasks for better engineering contributors
    feature - A request for a proper, new feature.
    high priority
    module: rpc - Related to RPC, distributed autograd, RRef, and distributed optimizer
    oncall: distributed - Add this issue/PR to distributed oncall triage queue
    triage review
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
