[c10d] DDP perf improvement: move sync_reduction to C++, dedicated CUDA streams for memcpy #12954

Closed
teng-li wants to merge 2 commits into pytorch:master from teng-li:ddp_sync_red

teng-li commented Oct 22, 2018

  • Moved sync_reduction to C++
  • Use a dedicated CUDA stream for memcpy
  • Also use a dedicated CUDA stream for memcpy in queue_reduction

Added a test as well.

CI should cover both DDP and the unit test.
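
To illustrate the "dedicated CUDA stream for memcpy" idea in the list above, here is a minimal Python sketch. It is not the C++ code from this PR: the names `flatten_on_side_stream` and `_copy_into` are hypothetical, and a real implementation would presumably reuse one long-lived stream per device rather than create one per call. The sketch copies a bucket of gradients into a flat buffer on a side stream and returns an event the caller can wait on; on CPU tensors it simply copies inline.

```python
import torch


def _copy_into(flat, grads, non_blocking=False):
    """Copy each gradient into its slice of the 1-D buffer `flat`."""
    offset = 0
    for g in grads:
        n = g.numel()
        flat[offset:offset + n].copy_(g.reshape(-1), non_blocking=non_blocking)
        offset += n


def flatten_on_side_stream(grads):
    """Flatten same-device, same-dtype gradients into one buffer.

    Hypothetical helper for illustration only.
    Returns (flat, event); event is None when the tensors live on the CPU.
    """
    numel = sum(g.numel() for g in grads)
    flat = torch.empty(numel, dtype=grads[0].dtype, device=grads[0].device)

    if not flat.is_cuda:
        _copy_into(flat, grads)
        return flat, None

    device = flat.device
    # Assumption: a fresh stream per call keeps the sketch short.
    copy_stream = torch.cuda.Stream(device=device)
    # The copies must not start before the gradients have been produced
    # on the stream that is current for the caller.
    copy_stream.wait_stream(torch.cuda.current_stream(device))
    with torch.cuda.stream(copy_stream):
        _copy_into(flat, grads, non_blocking=True)
        done = torch.cuda.Event()
        done.record(copy_stream)
    # Tell the caching allocator these tensors are in use on the side stream.
    for t in [flat, *grads]:
        t.record_stream(copy_stream)
    return flat, done
```

On a GPU, a consumer would call `torch.cuda.current_stream().wait_event(done)` before reading `flat` on its own stream, so that the copies can overlap with other work until the result is actually needed.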

teng-li added the oncall: distributed label Oct 22, 2018
Comment thread torch/csrc/distributed/c10d/ddp.cpp Outdated

teng-li commented Oct 23, 2018

@pytorchbot retest this please

@facebook-github-bot facebook-github-bot left a comment

teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

teng-li commented Oct 24, 2018

@pytorchbot retest this please

teng-li commented Oct 24, 2018

rebased

@facebook-github-bot facebook-github-bot left a comment

teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot left a comment

teng-li is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot left a comment

teng-li has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang ezyang added the merged label Jun 25, 2019
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ams for memcpy (pytorch#12954)

Summary:
- Moved sync_reduction to C++
- Use a dedicated CUDA stream for memcpy
- Also use a dedicated CUDA stream for memcpy in queue_reduction

Added test as well.

CI should cover both DDP and unittest
Pull Request resolved: pytorch#12954

Differential Revision: D10520069

Pulled By: teng-li

fbshipit-source-id: 64348e4e43c15f9695a4c28b036c232587ecfb65