[c10d] DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy#12954
teng-li wants to merge 2 commits into pytorch:master
@pytorchbot retest this please
facebook-github-bot left a comment:
teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
rebased
facebook-github-bot left a comment:
teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot left a comment:
teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot left a comment:
teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…ams for memcpy (pytorch#12954)

Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.
CI should cover both DDP and unittest

Pull Request resolved: pytorch#12954
Differential Revision: D10520069
Pulled By: teng-li
fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65
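
For readers unfamiliar with the "dedicated CUDA stream for memcpy" idea in the summary, here is a minimal Python sketch of the general pattern using the public `torch.cuda` stream and event API. It is not the code from this PR, whose changes live in the C++ c10d layer. The helper names `copy_on_side_stream` and `wait_for_copy` are hypothetical, and the CPU fallback is an assumption added so the sketch also runs without a GPU.

```python
import torch


def copy_on_side_stream(src, dst, copy_stream=None):
    """Copy src into dst on a dedicated CUDA stream (hypothetical helper).

    Returns a CUDA event that marks completion of the copy, or None when
    both tensors live on the CPU and the copy was done synchronously.
    """
    if not (src.is_cuda or dst.is_cuda):
        # Assumption for illustration: plain synchronous copy on CPU.
        dst.copy_(src)
        return None

    device = dst.device if dst.is_cuda else src.device
    if copy_stream is None:
        copy_stream = torch.cuda.Stream(device=device)

    # The side stream must not start before work already queued on the
    # current stream (which may still be producing src) has finished.
    copy_stream.wait_stream(torch.cuda.current_stream(device))

    with torch.cuda.stream(copy_stream):
        dst.copy_(src, non_blocking=True)

    # Tell the caching allocator these tensors are in use on the side
    # stream, so their memory is not reused too early.
    for t in (src, dst):
        if t.is_cuda:
            t.record_stream(copy_stream)

    done = torch.cuda.Event()
    done.record(copy_stream)
    return done


def wait_for_copy(done, device=None):
    """Make the current stream wait for the copy without blocking the host."""
    if done is not None:
        torch.cuda.current_stream(device).wait_event(done)
```

A caller would typically launch the copy, keep queuing unrelated work on the current stream, and call `wait_for_copy` only right before the first kernel that reads `dst`. That ordering is what lets the memcpy overlap with computation.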