Use c10 threadpool for GPU to CPU distributed autograd continuations.#42511

Closed
pritamdamania87 wants to merge 3 commits into gh/pritamdamania87/150/base from gh/pritamdamania87/150/head

Conversation

Contributor

@pritamdamania87 pritamdamania87 commented Aug 4, 2020

Stack from ghstack:

DistEngine currently has only a single thread to execute GPU to CPU
continuations as part of the backward pass. This can be a significant
performance bottleneck when such continuations exist and we would like
to execute them using all CPU cores.

To alleviate this, this PR has the single thread in DistEngine only
dequeue work from the global queue, and then hand off execution of that
work to the c10 threadpool, where we call `execute_graph_task_until_ready_queue_empty`.

For more context please see:
#40255 (comment).

Closes: #40255

Differential Revision: D22917579

pritamdamania87 pushed a commit that referenced this pull request Aug 4, 2020
ghstack-source-id: 109128600
Pull Request resolved: #42511
torch/csrc/distributed/autograd/engine/dist_engine.cpp:

```cpp
  inputs.add(i, std::move(variables[i]), c10::nullopt, c10::nullopt);
}
execute_graph_task_until_ready_queue_empty(
    /*node_task*/ NodeTask(graphTask, graphRoot, std::move(inputs)),
```
Collaborator
Why do you have to "unpack" the given NodeTask that you popped before executing it here?

Contributor Author
`NodeTask` cannot be passed into the lambda since its copy constructor is deleted. Also, `at::launch` accepts a `std::function` as a parameter and, as a result, anything captured in the lambda needs to be copy constructible.

Collaborator
But it was moved out of the queue, right? So it could be moved into the lambda and then into the new queue?

Contributor Author
Yes, it can be moved into the lambda. However, the lambda is passed to `at::launch` as `std::function<void()> func`. As a result, when the lambda is passed to `at::launch`, the `std::function` copies the lambda, which in turn needs to copy the captured `NodeTask` (since it is part of the lambda due to the capture), and that's where it fails. If `at::launch` took the function by reference, or as a templated forwarding reference, this wouldn't be a problem.

Comment threads on torch/csrc/distributed/autograd/engine/dist_engine.cpp (Outdated)
pritamdamania87 pushed a commit that referenced this pull request Aug 13, 2020
Pull Request resolved: #42511

ghstack-source-id: 109806943

Differential Revision: [D22917579](https://our.internmc.facebook.com/intern/diff/D22917579/)
@pritamdamania87 pritamdamania87 requested a review from albanD August 13, 2020 02:12
dr-ci Bot commented Aug 13, 2020

💊 CI failures summary and remediations

As of commit b1ed9ef (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@pritamdamania87
Contributor Author

@albanD Thanks for reviewing the PR, I addressed your comments. Could you take another look at the PR?

Collaborator

@albanD albanD left a comment

Looks good to me.
Just the small detail about the NodeTask in case we can avoid re-creating it.

pritamdamania87 pushed a commit that referenced this pull request Aug 15, 2020
Pull Request resolved: #42511

ghstack-source-id: 109997718

Differential Revision: [D22917579](https://our.internmc.facebook.com/intern/diff/D22917579/)
@facebook-github-bot
Contributor

This pull request has been merged in 133e9f9.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/150/head branch August 21, 2020 14:16
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…pytorch#42511)

Summary:
Pull Request resolved: pytorch#42511

ghstack-source-id: 109997718

Test Plan: waitforbuildbot

Reviewed By: albanD

Differential Revision: D22917579

fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155

5 participants