
Problems in TensorPipeRpcBackendOptions device mapping documentation? #53501

@rafi-cohen

Description

📚 Documentation

The new PyTorch 1.8 release introduces CUDA support in RPC.
I've gone through the RPC documentation, and the only references to the CUDA support I could find are under TensorPipeRpcBackendOptions and set_device_map.
It seems that setting up CUDA support is simply done by supplying a device mapping in TensorPipeRpcBackendOptions, pretty cool.

However, I find the documentation for device_maps/device_map unclear. It seems that TensorPipeRpcBackendOptions's device_maps is a dictionary whose keys are worker names, but I'm not exactly sure what the structure of the dictionary's values should look like. Supposedly each value should be some sort of dictionary (as indicated by the parameter's type, Dict[str, Dict]), yet the example code provides a set: device_maps={"worker1": {0, 1}}. I don't understand how this "maps worker0's cuda:0 to worker1's cuda:1".
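For what it's worth, my best guess at the intended structure (an assumption based on the Dict[str, Dict] type hint and the prose, not on anything the docs actually show) is a dict of dicts, where the inner dict pairs each local device with a remote device:

```python
# Guessed structure for device_maps: the outer key is the callee worker name;
# the inner dict maps a local device index to a remote device index.
# This would make the prose "map worker0's cuda:0 to worker1's cuda:1" work:
device_maps = {"worker1": {0: 1}}  # worker0's cuda:0 -> worker1's cuda:1

# Compare with the set that the current example shows, which carries no
# source->target pairing at all:
set_from_docs = {"worker1": {0, 1}}
```

If this guess is right, the docs' example is not just unclear but syntactically a different type (a set literal instead of a dict literal).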

The same goes for set_device_map's device_map: the parameter's type also indicates it is a dictionary ((Dict of python:int, str, or torch.device)), but its structure isn't explained either. And again, the example code provides a set: options.set_device_map("worker1", {1, 2}).
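Analogously, my guess (again an assumption, not something the docs state) is that the second argument should be a dict pairing local devices with remote ones:

```python
# Guessed shape of set_device_map's second argument, replacing the set {1, 2}
# from the docs' example with a dict that actually pairs devices:
# local cuda:1 -> remote cuda:2 on "worker1".
device_map = {1: 2}

# Hypothetical corrected call (commented out; requires an initialized
# TensorPipeRpcBackendOptions instance named `options`):
# options.set_device_map("worker1", device_map)
```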

It is also not explained how to define a GPU->CPU mapping (or vice versa).
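Since the type hint mentions str and torch.device as well as int, one might guess that a cross-device-type mapping would be spelled like the following, but nothing in the docs confirms it; this is exactly the gap I'd like clarified:

```python
# Purely speculative sketch of a GPU->CPU mapping; whether string device
# names like "cpu" are accepted here is the open question.
device_map = {"cuda:0": "cpu"}  # caller's cuda:0 -> callee's CPU (unverified)
```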

Apart from this, there are two obvious errors in the example code provided in that documentation:

  1. There is a missing comma in the following part:
>>> rpc.init_rpc(
>>>     "worker0",
>>>     rank=0,
>>>     world_size=2  # <-- missing comma
>>>     backend=rpc.BackendType.TENSORPIPE,
>>>     rpc_backend_options=options
>>> )
  2. I don't see how those two print calls could produce different results. I'm guessing that the second line should read print(rets[1])?
>>> print(rets[0])  # tensor([2., 2.], device='cuda:0')
>>> print(rets[0])  # tensor([2., 2.], device='cuda:1')

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @mrzzd @agolynski @SciPioneer @H-Huang @cbalioglu

Labels

module: rpc · oncall: distributed · triaged
