
Use c10::ThreadPool to send messages #23968

Closed
mrshenli wants to merge 9 commits into gh/mrshenli/8/base from gh/mrshenli/8/head

Conversation

@mrshenli (Contributor) commented Aug 7, 2019

Stack from ghstack:

Existing ProcessGroupAgent uses a single thread to send all messages and a
single thread to listen for and process all received messages. This causes
performance issues and also prevents nested RPCs. For example, when running
a nested RPC A->B->A->B, the second recv on B cannot start until the first
recv on B finishes. If the second recv is triggered by a nested RPC issued
from the first recv, it will deadlock. Ideally, we should expose something
like a responder or FutureResult to the Python land to support nested
asynchronous UDFs.

This diff adds a shared ThreadPool for send and recv. Send uses it to send
out messages, and recv uses it to process received messages. There is still
a dedicated thread that listens for incoming messages and adds them to the
task queue. There are two goals: 1) speed up ProcessGroupAgent; 2) use the
ThreadPool as a temporary solution for (a small number of) nested RPCs.

Differential Revision: D16695091
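
For illustration, here is a minimal sketch of the shared-pool pattern this diff moves toward; the class, method, and member names below are assumptions for readability, not the actual ProcessGroupAgent code:

```cpp
// Sketch of a shared send/recv pool (assumed, simplified names). A dedicated
// listener thread keeps receiving; both sends and the processing of received
// messages run on worker threads drawn from one shared c10::ThreadPool.
#include <atomic>
#include <string>
#include <utility>
#include <c10/core/thread_pool.h>

struct Message {
  std::string payload;
  int dst = 0;
};

class AgentSketch {
 public:
  explicit AgentSketch(int numSendRecvThreads)
      : threadPool_(numSendRecvThreads) {}

  // send() returns immediately; the blocking ProcessGroup send happens on a
  // pool worker thread instead of a single dedicated send thread.
  void send(Message message) {
    threadPool_.run([this, m = std::move(message)]() mutable {
      doBlockingSend(std::move(m));
    });
  }

  // One dedicated thread keeps listening; each received message is handed to
  // the same pool, so handling one request (which may itself issue a nested
  // RPC) does not block receiving the next one.
  void listenLoop() {
    while (running_) {
      Message incoming = blockingRecv();
      threadPool_.run([this, m = std::move(incoming)]() mutable {
        processReceived(std::move(m));
      });
    }
  }

 private:
  void doBlockingSend(Message /*m*/) {}
  Message blockingRecv() { return Message{}; }
  void processReceived(Message /*m*/) {}

  std::atomic<bool> running_{true};
  c10::ThreadPool threadPool_;
};
```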

@pytorchbot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label Aug 7, 2019
mrshenli added a commit that referenced this pull request Aug 7, 2019
@aazzolini (Contributor)

Since the single goal of this PR is speedup, I believe it needs to come along with a micro-benchmark. Could you measure the improvement in terms of latency?

@ilia-cher self-requested a review August 7, 2019 20:33
@ilia-cher (Contributor) left a comment

Second that; perf-related PRs should demonstrate an improvement on a (micro-)benchmark.

@ilia-cher (Contributor)

Also, please add more information on what you're trying to achieve here, e.g. I wonder how many new thread pool instances there are going to be in a training process?

@mrshenli (Contributor, Author) commented Aug 7, 2019

@aazzolini @ilia-cher

I intended to try out c10::ThreadPool on sends first, before adding it for recvs, to unblock @xush6528. Will add a micro-benchmark later.

@ilia-cher
There will be ~~one send ThreadPool and one recv ThreadPool~~ one shared ThreadPool for send and receive per ProcessGroupAgent per process. Is there any concern about this?

BTW, is it true that c10::ThreadPool is making a copy of the function and bound args?
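
For context, here is a small, generic illustration of what by-value lambda capture implies when enqueuing work onto a pool; whether c10::ThreadPool additionally copies the wrapping std::function internally is exactly the open question above, and the names below are only for this example:

```cpp
// Generic C++ capture semantics when enqueuing work (illustrative only).
#include <iostream>
#include <memory>
#include <vector>
#include <c10/core/thread_pool.h>

int main() {
  c10::ThreadPool pool(/*pool_size=*/2);

  auto payload = std::make_shared<std::vector<int>>(1000, 7);

  // Capturing the shared_ptr by value copies the pointer (cheap) and bumps
  // the ref count; the 1000-element vector itself is not copied. Capturing
  // the vector by value instead would copy every element into the closure.
  pool.run([payload]() {
    std::cout << "payload size: " << payload->size() << std::endl;
  });

  pool.waitWorkComplete();  // wait for the enqueued task before exiting
  return 0;
}
```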

mrshenli added a commit that referenced this pull request Aug 7, 2019
xush6528 pushed a commit to xush6528/pytorch that referenced this pull request Aug 8, 2019
xush6528 pushed a commit to xush6528/pytorch that referenced this pull request Aug 9, 2019
@gqchen self-requested a review August 9, 2019 22:53
@mrshenli (Contributor, Author)
Hey @pritamdamania87, thanks for reviewing. Shall we hold follow-up reviews on this PR for a bit? I have some updates on this one but cannot export it after rebasing onto #23569. Will update when #23569 has landed.

mrshenli added a commit that referenced this pull request Aug 15, 2019
@mrshenli (Contributor, Author)

Micro-benchmark:

before:

test_stress_heavy_rpc (test_rpc.RpcTest) ... 
Rank 3 finished testing heavy_rpc 20 times in 1.841923713684082 seconds.
Rank 0 finished testing heavy_rpc 20 times in 1.8617804050445557 seconds.
Rank 1 finished testing heavy_rpc 20 times in 1.9007983207702637 seconds.
Rank 2 finished testing heavy_rpc 20 times in 1.9105055332183838 seconds.

test_stress_light_rpc (test_rpc.RpcTest) ... 
Rank 2 finished testing light_rpc 1000 times in 1.124621868133545 seconds.
Rank 0 finished testing light_rpc 1000 times in 1.1577978134155273 seconds.
Rank 3 finished testing light_rpc 1000 times in 1.170255422592163 seconds.
Rank 1 finished testing light_rpc 1000 times in 1.1736280918121338 seconds.

after (num_send_recv_threads = 4, default):

test_stress_heavy_rpc (test_rpc.RpcTest) ... 
Rank 1 finished testing heavy_rpc 20 times in 0.5223677158355713 seconds.
Rank 0 finished testing heavy_rpc 20 times in 0.5939044952392578 seconds.
Rank 2 finished testing heavy_rpc 20 times in 0.6136205196380615 seconds.
Rank 3 finished testing heavy_rpc 20 times in 0.6397840976715088 seconds.

test_stress_light_rpc (test_rpc.RpcTest) ... 
Rank 3 finished testing light_rpc 1000 times in 0.5930540561676025 seconds.
Rank 0 finished testing light_rpc 1000 times in 0.5937092304229736 seconds.
Rank 2 finished testing light_rpc 1000 times in 0.6050472259521484 seconds.
Rank 1 finished testing light_rpc 1000 times in 0.6078634262084961 seconds.

after (num_send_recv_threads = 8):

test_stress_heavy_rpc (test_rpc.RpcTest) ... 
Rank 1 finished testing heavy_rpc 20 times in 0.34965944290161133 seconds.
Rank 3 finished testing heavy_rpc 20 times in 0.3574059009552002 seconds.
Rank 2 finished testing heavy_rpc 20 times in 0.3725595474243164 seconds.
Rank 0 finished testing heavy_rpc 20 times in 0.41245269775390625 seconds.

test_stress_light_rpc (test_rpc.RpcTest) ... 
Rank 3 finished testing light_rpc 1000 times in 0.5265188217163086 seconds.
Rank 2 finished testing light_rpc 1000 times in 0.5312702655792236 seconds.
Rank 0 finished testing light_rpc 1000 times in 0.5387604236602783 seconds.
Rank 1 finished testing light_rpc 1000 times in 0.541905403137207 seconds.

@mrshenli (Contributor, Author) commented Aug 15, 2019

@aazzolini @ilia-cher @pritamdamania87 @pietern I think I addressed all the comments above and added a micro-benchmark; could you please take another look? Thanks!

mrshenli added a commit that referenced this pull request Aug 16, 2019
Review comment on torch/csrc/distributed/rpc/ProcessGroupAgent.cpp:

```cpp
}

void ProcessGroupAgent::listenLoop() {
  google::InitGoogleLogging("distributed.rpc");
```
@mrshenli (Contributor, Author):

This line helped me get rid of the following warning for the LOG(INFO) (Thanks! @pritamdamania87 and @jamarshon):

WARNING: Logging before InitGoogleLogging() is written to STDERR

But it doesn't feel right to call it here. I searched a bit and saw that LOG is used extensively in caffe2/ but almost not at all in torch/ or aten/. @soumith @dzhulgakov @gchanan @ezyang In general, what is the recommended API for logging? Or are we intentionally avoiding it?
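
For reference, here is a minimal standalone glog example of the initialization being discussed; it is not PyTorch-specific and only illustrates why the warning appears when LOG(INFO) runs before InitGoogleLogging():

```cpp
// Minimal glog usage sketch (assumes linking against glog). If the LOG(INFO)
// line runs before InitGoogleLogging(), glog prints:
//   WARNING: Logging before InitGoogleLogging() is written to STDERR
// which is why initialization normally lives at process startup rather than
// inside a library function such as listenLoop().
#include <glog/logging.h>

int main(int /*argc*/, char* argv[]) {
  google::InitGoogleLogging(argv[0]);  // once per process, typically in main()
  LOG(INFO) << "rpc agent listening";
  return 0;
}
```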

Contributor:

I think @dzhulgakov knows what the current state here is

Review comment on torch/csrc/distributed/rpc/ProcessGroupAgent.cpp:

```cpp
      pg_->send(payload, work.to_, work.to_ /* channelTag */));
}
for (auto& pendingSend : pendingSends) {
  pendingSend->wait();
```
@xush6528 (Contributor) commented Aug 16, 2019:

I'm curious about the reason you want to wait for the GLOO device thread.

@mrshenli (Contributor, Author) commented Aug 17, 2019:

Because work will be destructed after the lambda function finishes, and the send function itself does not keep the tensor alive. If the destruction happens before the send finishes, the behavior will be undefined.

@xush6528 (Contributor) commented Aug 20, 2019:

@mrshenli

The change you are making here, waiting for the GLOO device thread to finish the ProcessGroup::Work, is similar to waiting, in network programming, for the OS kernel to send out its internal buffer; it makes your user-land code essentially blocking. This is a performance regression we should be able to avoid.

I wonder whether it is possible to keep alive the heap-allocated ProcessGroup::Work object returned by ProcessGroup::send(..).

The solution is to capture the shared_ptr of ProcessGroup::Work by value in the lambda (adding a ref count to the heap object) to keep it alive. That way, you don't need to call pendingSend->wait();

@mrshenli (Contributor, Author):

@xush6528 could you please elaborate? By the work object do you mean SendWork? If so, why would using a shared_ptr help? The SendWork is already captured by this lambda. Or are you referring to a different lambda than this one?

@mrshenli (Contributor, Author):

Or we could have a separate GC thread for send, and enqueue all async send handles to that thread. The GC thread would then wait on the send handles in order and delete each one when done.

@xush6528 (Contributor):

@mrshenli Sorry, I mean the ProcessGroup::Work you are waiting for. Updated the above comment.

@mrshenli (Contributor, Author):

> Because work will be destructed after the lambda function finishes, and the send function itself does not keep the tensor alive. If the destruction happens before the send finishes, the behavior will be undefined.

Oh, my bad; when I say work above, I actually mean the preamble and payload tensors. We need to keep those tensors alive until the send finishes. And I agree that we can optimize this by capturing the ProcessGroup::Work and those tensors (all as std::shared_ptrs) in a separate lambda and waiting there. Let me file an issue for this later. Thanks!
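
For illustration, here is a self-contained sketch of that keep-alive idea; the types and names are placeholders (FakeWork stands in for ProcessGroup::Work), and this is not the actual change tracked in the follow-up issue:

```cpp
// Sketch: instead of blocking on pendingSend->wait(), capture shared_ptrs by
// value so the async work handle and its buffers outlive the enqueuing scope.
#include <memory>
#include <vector>
#include <c10/core/thread_pool.h>

struct FakeWork {        // stand-in for the handle returned by an async send
  void wait() {}         // blocks until the transport is done with the buffers
};

void enqueueSend(c10::ThreadPool& pool) {
  auto buffers = std::make_shared<std::vector<char>>(1024);
  auto work = std::make_shared<FakeWork>();  // pretend this came from send()

  // The lambda copies both shared_ptrs, bumping their ref counts; the buffers
  // and the work handle stay alive until wait() returns on a pool thread, so
  // the caller never blocks on the device thread itself.
  pool.run([work, buffers]() { work->wait(); });
}
```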

@mrshenli (Contributor, Author):

See #24946

@zou3519 deleted the gh/mrshenli/8/head branch August 17, 2019 00:49
@facebook-github-bot (Contributor)

This pull request has been merged in 99dea08.

xxtEchjovs44 pushed a commit to xxtEchjovs44/pytorch that referenced this pull request Jan 29, 2020
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026

Labels

Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue)
