
Enable GPU-to-GPU comm in TensorPipeAgent#44418

Closed
mrshenli wants to merge 63 commits into gh/mrshenli/235/base from gh/mrshenli/235/head

Conversation

@mrshenli
Contributor

@mrshenli mrshenli commented Sep 9, 2020

Stack from ghstack:

This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process, same-machine GPU-to-GPU communication. On the sender
side, TensorPipeAgent grabs a stream for each device used by the
message, lets these streams wait for the current streams, and passes
the streams to TensorPipe's CudaBuffer. On the receiver side, it
likewise grabs a stream for each device used in the message, and uses
these streams to receive tensors and run user functions. Afterwards,
the same streams are used for sending the response back to the
sender. When receiving the response, the sender grabs a new set
of streams and uses them for TensorPipe's CudaBuffer.

If device maps are provided, TensorPipeAgent::send will return a
derived class of CUDAFuture, which is specifically tailored for
RPC Messages.

TODOs:

  1. Enable sending CUDA RPC to the same process.
  2. Add a custom CUDA stream pool.
  3. When TensorPipe addresses the error for cudaPointerGetAttributes(),
    remove the cuda:0 context initialization code in backend_registry.py.
  4. When TensorPipe can detect the availability of peer access, enable all
    tests on platforms without peer access.

Differential Revision: D23626207
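The device-map behavior mentioned above ("if device maps are provided...") can be sketched in Python. The helper below is a hypothetical, minimal model of how a caller's tensor devices get remapped to callee devices before sending; it is not the actual C++ implementation:

```python
def remap_devices(tensor_devices, device_map):
    """Map each caller-side device index to its callee-side index.

    tensor_devices: device indices of the tensors in the RPC message.
    device_map: caller->callee device index mapping configured for the peer.
    Raises if a tensor lives on a device with no configured mapping,
    loosely mirroring the agent's behavior for incomplete device maps.
    """
    mapped = []
    for d in tensor_devices:
        if d not in device_map:
            raise ValueError(f"No device map entry for caller device {d}")
        mapped.append(device_map[d])
    return mapped

# Example: caller GPUs 0 and 1 map to callee GPUs 1 and 0.
print(remap_devices([0, 1], {0: 1, 1: 0}))  # [1, 0]
```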

mrshenli added a commit that referenced this pull request Sep 9, 2020
ghstack-source-id: 13ac6a4
Pull Request resolved: #44418
@dr-ci

dr-ci Bot commented Sep 9, 2020

💊 CI failures summary and remediations

As of commit b302be9 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed



@mrshenli mrshenli changed the title Use streams from pool on RPC callees [WIP] Use streams from pool on RPC callees Sep 9, 2020
@mrshenli
Contributor Author

mrshenli commented Sep 9, 2020

Land only after TensorPipe CUDA support is in.

Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
mrshenli added a commit that referenced this pull request Sep 10, 2020
ghstack-source-id: 43ad23d
Pull Request resolved: #44418
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
Comment on lines +646 to +648
streams{std::move(streams)}]() mutable {
// create guards again as this function runs on a different thread
auto guards = streamsToGuards(streams);
Contributor


The streams we receive from pipeRead currently only cover the devices on which the input tensors lived. However, the user function may place the result tensors on a different device. I therefore think we should get a stream from the pool for all devices and set them all as current.
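The suggestion above can be modeled abstractly in Python. The classes and names below are illustrative (a toy stream pool where streams are just labels), not the actual CUDA stream-pool API:

```python
import itertools

class StreamPool:
    """Toy per-device stream pool: round-robins over a fixed number of
    pooled streams per device, loosely mirroring a CUDA stream pool."""
    def __init__(self, num_devices, streams_per_device=4):
        self._counters = {d: itertools.cycle(range(streams_per_device))
                          for d in range(num_devices)}

    def lease(self, device):
        return (device, next(self._counters[device]))

def streams_for_all_devices(pool, num_devices):
    # Per the review suggestion: take a stream for *every* device, not only
    # the devices the input tensors happen to live on, so result tensors
    # placed on other devices are still covered.
    return [pool.lease(d) for d in range(num_devices)]

pool = StreamPool(num_devices=2)
print(streams_for_all_devices(pool, 2))  # [(0, 0), (1, 0)]
```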

Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
mrshenli added a commit that referenced this pull request Sep 10, 2020
ghstack-source-id: 400f8d3
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Sep 11, 2020
ghstack-source-id: 66d6bc3
Pull Request resolved: #44418
Contributor

@lw lw left a comment


Nice idea the DeviceContext! :)

Comment thread caffe2/CMakeLists.txt Outdated
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.h Outdated
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
mrshenli added a commit that referenced this pull request Sep 11, 2020
ghstack-source-id: 630d15f
Pull Request resolved: #44418
mrshenli added a commit to mrshenli/pytorch that referenced this pull request Sep 18, 2020
ghstack-source-id: 630d15f
Pull Request resolved: pytorch#44418
mrshenli added a commit that referenced this pull request Jan 12, 2021
ghstack-source-id: 90e3b20
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 12, 2021
ghstack-source-id: 972bba2
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: c9cd280
Pull Request resolved: #44418
Contributor Author

@mrshenli mrshenli left a comment


lint failure is on a file I didn't touch:

  {
    path: 'aten/src/ATen/cuda/CUDAEvent.h',
    start_line: 30,
    end_line: 30,
    start_column: 3,
    end_column: 3,
    annotation_level: 'failure',
    message: '[clang-analyzer-optin.cplusplus.UninitializedObject] warning: 1 uninitialized field at the end of the constructor call'
  }

Comment on lines +40 to +50
virtual std::vector<CUDAStream> getReservedStreams() const {
  throw std::runtime_error(
      "Attempting to access CUDA streams, but torch is not built with CUDA");
}
#endif

virtual CUDAStream getStream(c10::DeviceIndex index) {
  throw std::runtime_error(c10::str(
      "Attempting to access CUDA stream of device ",
      index,
      ", but torch is not built with CUDA"));
}
Contributor


After re-reading this I'm not sure I follow: we define these methods if USE_CUDA is on, but these methods then claim that CUDA is off? I realize that in the subclass we override it, and I understand that we must gate them because otherwise CUDAStream would be undefined. But doesn't this mean we could just leave them unimplemented? (i.e., = 0)

Contributor Author


This was originally intended to provide a clear error message. When I tried to use a pure virtual function, I realized that the following would also need to be gated, or we would need to change callsites that do not provide a ctx. Will address this in a follow-up PR.

TORCH_API std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
    Message&& rpcMessage,
    std::vector<c10::DeviceIndex> devices = {},
    const std::shared_ptr<LazyStreamContext>& = std::make_shared<LazyStreamContext>());
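The shape of that default argument, a base context that CPU-only callsites can rely on implicitly while CUDA queries fail loudly, can be modeled in Python. The class and method names below are illustrative sketches, not the actual C++ API:

```python
class LazyStreamContext:
    """CPU-only default: accepted at every callsite, but any CUDA stream
    query raises an informative error rather than failing silently."""
    def get_stream(self, index):
        raise RuntimeError(
            f"Attempting to access CUDA stream of device {index}, "
            "but torch is not built with CUDA")

class CudaLazyStreamContext(LazyStreamContext):
    """CUDA-enabled subclass: creates one stream per device lazily,
    on first use, caching it for subsequent accesses."""
    def __init__(self, stream_factory):
        self._factory = stream_factory  # e.g. leases from a stream pool
        self._streams = {}

    def get_stream(self, index):
        if index not in self._streams:
            self._streams[index] = self._factory(index)
        return self._streams[index]

ctx = CudaLazyStreamContext(lambda idx: f"stream-for-device-{idx}")
print(ctx.get_stream(0))                       # stream-for-device-0
print(ctx.get_stream(0) is ctx.get_stream(0))  # True: cached after first use
```

The tradeoff discussed above: making `get_stream` abstract would force every callsite to supply a context explicitly, whereas the throwing default keeps CPU-only callsites unchanged.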

Comment thread torch/csrc/distributed/rpc/tensorpipe_utils.cpp
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
Comment thread torch/csrc/distributed/rpc/tensorpipe_agent.cpp Outdated
s1 = torch.cuda.Stream(device=x.device)
with torch.cuda.stream(s1):
torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
z = x + y
Contributor


Shouldn't there also be a synchronization before the addition? x and y might still be being filled in, and that happens on the current streams, hence it's only safe to access them from the current stream, or from streams that are explicitly synchronized with it.

Also, we should check that x and y are on the same device, or else we need to also sync with the current stream of y.device.

Contributor Author


ah, yes, good catch! it was probably hidden by the _sleep

@mrshenli
Contributor Author

ci-all test in #50494

@codecov

codecov Bot commented Jan 14, 2021

Codecov Report

Merging #44418 (28209c4) into gh/mrshenli/235/base (2c55426) will decrease coverage by 0.76%.
The diff coverage is 90.26%.

@@                   Coverage Diff                    @@
##           gh/mrshenli/235/base   #44418      +/-   ##
========================================================
- Coverage                 81.47%   80.71%   -0.77%     
========================================================
  Files                      1792     1910     +118     
  Lines                    186156   207364   +21208     
========================================================
+ Hits                     151669   167369   +15700     
- Misses                    34487    39995    +5508     

@facebook-github-bot
Contributor

@mrshenli merged this pull request in 30e45bb.
