Enable GPU-to-GPU comm in TensorPipeAgent #44418
mrshenli wants to merge 63 commits into gh/mrshenli/235/base
Conversation
💊 CI failures summary and remediations: as of commit b302be9, 1 failed extra GitHub check (more details on the Dr. CI page).
Land only after TensorPipe CUDA support is in.
streams{std::move(streams)}]() mutable {
  // create guards again as this function runs on a different thread
  auto guards = streamsToGuards(streams);
The streams we receive from pipeRead currently only contain streams for the devices on which the input tensors live. However, the user function may place the result tensors on a different device. I therefore think we should get a stream from the pool for all devices and set them all as current.
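In Python terms, the suggestion amounts to roughly the sketch below. This is a hedged illustration, not the agent's actual C++ code; run_user_function is a hypothetical stand-in for the RPC-invoked user function.

import contextlib
import torch

def run_user_function():
    # hypothetical user function; it may place its result on any device,
    # not only the devices the input tensors arrived on
    return torch.ones(2, device="cuda:0")

# grab a fresh stream for every visible device and make each one current
# on its device, so result tensors on any device land on these streams
streams = [torch.cuda.Stream(device=d) for d in range(torch.cuda.device_count())]
with contextlib.ExitStack() as stack:
    for s in streams:
        stack.enter_context(torch.cuda.stream(s))  # set per-device current stream
    result = run_user_function()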
lw left a comment:
Nice idea, the DeviceContext! :)
mrshenli left a comment:
Lint failure is on a file I didn't touch:
{
path: 'aten/src/ATen/cuda/CUDAEvent.h',
start_line: 30,
end_line: 30,
start_column: 3,
end_column: 3,
annotation_level: 'failure',
message: '[clang-analyzer-optin.cplusplus.UninitializedObject] warning: 1 uninitialized field at the end of the constructor call'
}
virtual std::vector<CUDAStream> getReservedStreams() const {
  throw std::runtime_error(
      "Attempting to access CUDA streams, but torch is not built with CUDA");
}
#endif

virtual CUDAStream getStream(c10::DeviceIndex index) {
  throw std::runtime_error(c10::str(
      "Attempting to access CUDA stream of device ",
      index,
      ", but torch is not built with CUDA"));
}
After re-reading this I'm not sure I follow: we define these methods if USE_CUDA is on, but these methods then claim that CUDA is off? I realize that in the subclass we override them, and I understand that we must gate them because otherwise CUDAStream would be undefined. But doesn't this mean we could just leave them unimplemented (i.e., = 0)?
It was originally intended to provide a clear error message. When I tried to use a pure virtual function, I realized that the following would also need to be gated, or callsites that do not provide a ctx would need to change. Will address this in a follow-up PR.

TORCH_API std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
    Message&& rpcMessage,
    std::vector<c10::DeviceIndex> devices = {},
    const std::shared_ptr<LazyStreamContext>& = std::make_shared<LazyStreamContext>());
s1 = torch.cuda.Stream(device=x.device)
with torch.cuda.stream(s1):
    torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
    z = x + y
Shouldn't there also be a synchronization before the addition? x and y might still be being filled in, and that work happens on the current streams; hence it's only safe to access them from the current stream, or from streams that are explicitly synchronized with it.
Also, we should check that x and y are on the same device, or else we also need to sync with the current stream of y.device.
Ah, yes, good catch! It was probably hidden by the _sleep.
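A minimal sketch of the suggested fix, assuming x and y live on the same device; the tensors and the FIFTY_MIL_CYCLES value here are illustrative stand-ins for the test's:

import torch

FIFTY_MIL_CYCLES = 50000000  # assumed value, matching the constant's name
x = torch.ones(2, 2, device="cuda:0")
y = torch.ones(2, 2, device="cuda:0")

s1 = torch.cuda.Stream(device=x.device)
# make s1 wait for the work that produced x and y on the current stream
s1.wait_stream(torch.cuda.current_stream(x.device))
with torch.cuda.stream(s1):
    torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
    z = x + y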
ci-all test in #50494
Codecov Report

@@                 Coverage Diff                  @@
##    gh/mrshenli/235/base    #44418       +/-   ##
====================================================
- Coverage          81.47%    80.71%     -0.77%
====================================================
  Files               1792      1910       +118
  Lines             186156    207364     +21208
====================================================
+ Hits              151669    167369     +15700
- Misses             34487     39995      +5508
Stack from ghstack:
This commit uses TensorPipe's cuda_ipc channel to conduct cross-process
same-machine GPU-to-GPU communication. On the sender side,
`TensorPipeAgent` grabs a stream for each device used by the message,
lets these streams wait for the current streams, and passes the streams
to TensorPipe's `CudaBuffer`. On the receiver side, it also grabs a
stream for each device used in the message, and uses these streams to
receive tensors and run user functions. These streams are then used for
sending the response back to the sender. When receiving the response,
the sender grabs a new set of streams and uses them for TensorPipe's
`CudaBuffer`.

If device maps are provided, `TensorPipeAgent::send` will return a
derived class of `CUDAFuture`, which is specifically tailored for
RPC Messages.

TODOs:

1. Enable sending CUDA RPC to the same process.
2. Add a custom CUDA stream pool.
3. Once TensorPipe addresses the error for `cudaPointerGetAttributes()`, remove the `cuda:0` context initialization code in `backend_registry.py`.
4. Once TensorPipe can detect availability of peer access, enable all tests on platforms without peer access.
Differential Revision: D23626207
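As a hedged usage sketch of what the device-map path enables from Python (worker names, tensor shapes, and rendezvous setup are illustrative, not taken from this PR):

import torch
import torch.distributed.rpc as rpc

# assumes MASTER_ADDR/MASTER_PORT are set in the environment for rendezvous
options = rpc.TensorPipeRpcBackendOptions()
options.set_device_map("worker1", {0: 1})  # caller cuda:0 -> callee cuda:1
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)

x = torch.ones(2, 2, device="cuda:0")
# torch.add runs on worker1's cuda:1; the result is mapped back to cuda:0
ret = rpc.rpc_sync("worker1", torch.add, args=(x, 1))
rpc.shutdown()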