Use c10 threadpool for GPU to CPU distributed autograd continuations.#42511
Conversation
DistEngine currently has only a single thread to execute GPU to CPU continuations as part of the backward pass. This is a significant performance bottleneck when such continuations exist, since we would like to execute them using all CPU cores. To alleviate this, in this PR the single thread in DistEngine only dequeues work from the global queue and then hands off execution of that work to the c10 threadpool, where we call `execute_graph_task_until_ready_queue_empty`. For more context please see: #40255 (comment). Differential Revision: [D22917579](https://our.internmc.facebook.com/intern/diff/D22917579/)
```cpp
    inputs.add(i, std::move(variables[i]), c10::nullopt, c10::nullopt);
  }
  execute_graph_task_until_ready_queue_empty(
      /*node_task*/ NodeTask(graphTask, graphRoot, std::move(inputs)),
```
Why do you have to "unpack" the given NodeTask that you popped before executing it here?
NodeTask cannot be passed into the lambda since its copy constructor is deleted. Also, at::launch accepts a std::function as a parameter, so anything captured in the lambda needs to be copy constructible.
But it was moved out of the queue right? So it could be moved to the lambda and then into the new queue?
Yes, it can be moved into the lambda. However, the lambda is passed to at::launch as a std::function<void()>; constructing that std::function copies the lambda, which in turn requires copying the captured NodeTask, and that is where it fails. If at::launch took the function by reference, or as a templated forwarding reference, this wouldn't be a problem.
@albanD Thanks for reviewing the PR, I addressed your comments. Could you take another look at the PR?
albanD left a comment
Looks good to me.
Just the small detail about the NodeTask in case we can avoid re-creating it.
This pull request has been merged in 133e9f9.
Merge commit summary (pytorch#42511): ghstack-source-id: 109997718. Test Plan: waitforbuildbot. Reviewed By: albanD. Differential Revision: D22917579. fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155
Stack from ghstack:

Closes: #40255
Differential Revision: D22917579