Use c10 threadpool for GPU to CPU distributed autograd continuations.#42511
Conversation
DistEngine currently has only a single thread to execute GPU to CPU continuations as part of the backward pass. This is a significant performance bottleneck when such continuations exist, since we would like to execute them using all CPU cores. To alleviate this, in this PR the single thread in DistEngine only dequeues work from the global queue and then hands off execution of that work to the c10 threadpool, where we call `execute_graph_task_until_ready_queue_empty`. For more context please see: #40255 (comment). Differential Revision: [D22917579](https://our.internmc.facebook.com/intern/diff/D22917579/)
```cpp
    inputs.add(i, std::move(variables[i]), c10::nullopt, c10::nullopt);
  }
  execute_graph_task_until_ready_queue_empty(
      /*node_task*/ NodeTask(graphTask, graphRoot, std::move(inputs)),
```
Why do you have to "unpack" the given NodeTask that you popped before executing it here?
NodeTask cannot be passed into the lambda since its copy constructor is deleted. Also, at::launch accepts a std::function as a parameter, so anything captured in the lambda needs to be copy constructible.
But it was moved out of the queue right? So it could be moved to the lambda and then into the new queue?
Yes, it can be moved into the lambda. However, the lambda is passed to at::launch as a std::function<void()>; constructing that std::function copies the lambda, which in turn requires copying the captured NodeTask, and that is where it fails. If at::launch took the function by reference, or as a templated forwarding reference, this wouldn't be a problem.
@albanD Thanks for reviewing the PR, I addressed your comments. Could you take another look at the PR?
albanD left a comment
Looks good to me.
Just the small detail about the NodeTask in case we can avoid re-creating it.
This pull request has been merged in 133e9f9.
Merge commit summary (pytorch#42511): ghstack-source-id: 109997718. Test Plan: waitforbuildbot. Reviewed By: albanD. Differential Revision: D22917579. fbshipit-source-id: c634b6c97f3051f071fd7b994333e6ecb8c54155
Stack from ghstack:

Closes: #40255
Differential Revision: D22917579