
add stats that can only be collected at runtime #51386

Closed
zhaojuanmao wants to merge 14 commits into gh/zhaojuanmao/59/base from gh/zhaojuanmao/59/head

Conversation

@zhaojuanmao
Contributor

@zhaojuanmao zhaojuanmao commented Jan 29, 2021

Stack from ghstack:

Add stats such as rebuilt-bucket stats, unused-parameter stats, and performance stats to the DDP logging data.

  1. GPU time stats are not collected for single-process multiple-device setups or multi-device modules in this diff, as that would require events to be created and recorded on multiple devices.
  2. Use the at::cuda::event API for safer calls.
  3. Events may not be created in the autograd hook if the hook is not triggered by the user's code, e.g., when the user runs in non-sync mode for some iterations. So we check whether events were created before synchronizing, and skip invalid results.
  4. Users may not set the device upfront, so we explicitly set the proper device before creating events in our prepare_forward() and prepare_backward() calls.
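The guard described in points 3 and 4 above can be sketched as follows. This is an illustrative pure-Python sketch with hypothetical names (`Timer`, `elapsed_ms`), not the actual reducer code, which lives in C++ in torch/lib/c10d/reducer.cpp and uses CUDA events rather than plain floats:

```python
# Illustrative sketch (hypothetical names): timing events may never be created
# if the autograd hook did not fire, so check for missing events before
# synchronizing and skip invalid measurements instead of reporting them.

class Timer:
    def __init__(self):
        self.forward_start = None   # "event" recorded in prepare_forward()
        self.backward_end = None    # "event" recorded in the autograd hook

    def elapsed_ms(self):
        # Point 3: the hook may not have run (e.g. non-sync mode), so the
        # end event can be missing; return None rather than a bogus value.
        if self.forward_start is None or self.backward_end is None:
            return None
        delta = self.backward_end - self.forward_start
        # Skip invalid results rather than logging them.
        return delta * 1000.0 if delta > 0 else None


t = Timer()
assert t.elapsed_ms() is None          # hook never fired: no stat is logged
t.forward_start, t.backward_end = 1.0, 1.5
assert t.elapsed_ms() == 500.0         # valid measurement, in milliseconds
```

In the real implementation the same check takes the form of verifying that the CUDA events were created and recorded before calling their synchronize/elapsed-time APIs.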

Differential Revision: D26158645

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 29, 2021

💊 CI failures summary and remediations

As of commit f429313 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jan 29, 2021
zhaojuanmao added a commit that referenced this pull request Jan 29, 2021
add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

Differential Revision: [D26158645](https://our.internmc.facebook.com/intern/diff/D26158645/)

ghstack-source-id: 120695335
Pull Request resolved: #51386
Comment thread torch/lib/c10d/reducer.cpp Outdated
Comment thread torch/lib/c10d/reducer.cpp Outdated
zhaojuanmao added a commit that referenced this pull request Feb 2, 2021
Pull Request resolved: #51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data
ghstack-source-id: 120818571

Differential Revision: [D26158645](https://our.internmc.facebook.com/intern/diff/D26158645/)
@zhaojuanmao
Contributor Author

Right now all the timers work for CPU only; GPU timers will be added.
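A CPU-only timer of the kind mentioned here can be as simple as bracketing each phase with `time.perf_counter()`. The names below (`stats`, `timed`) are illustrative, not the real DDP logging fields:

```python
import time

# Minimal CPU timing sketch (hypothetical names). The real DDP code records
# these durations into its logging data structure; here we collect a dict.
stats = {}

def timed(name, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    stats[name] = (time.perf_counter() - start) * 1000.0  # milliseconds
    return result

out = timed("forward_compute_time", sum, range(1000))
assert out == 499500
assert stats["forward_compute_time"] >= 0.0
```

This wall-clock approach is why GPU timers need separate handling: CUDA kernels run asynchronously, so host-side timestamps bracket the launch, not the actual device work.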

zhaojuanmao added a commit that referenced this pull request Feb 4, 2021
Pull Request resolved: #51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data
ghstack-source-id: 121022194

Differential Revision: [D26158645](https://our.internmc.facebook.com/intern/diff/D26158645/)
zhaojuanmao added a commit that referenced this pull request Feb 5, 2021
Pull Request resolved: #51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data
ghstack-source-id: 121100800

Differential Revision: [D26158645](https://our.internmc.facebook.com/intern/diff/D26158645/)
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/logger.cpp
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/logger.cpp
Comment thread torch/lib/c10d/logger.cpp
Comment thread torch/lib/c10d/logger.cpp
Comment thread torch/lib/c10d/reducer.cpp Outdated
@zhaojuanmao
Contributor Author

Although the current diff did not cause a perf regression in our benchmarks, for better control over performance in case some application needs it in the future, I will add sampling for event recording in this diff.
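Sampling event recording every Nth iteration bounds the timing overhead. A minimal sketch of such a control (hypothetical names, not the actual C++ implementation) might look like:

```python
# Sketch of sampling control for event recording (hypothetical names).
# Record timing events only on iterations where should_sample() is true,
# so the overhead is paid on a fraction of iterations.

class SamplingController:
    def __init__(self, sample_rate):
        self.sample_rate = sample_rate  # record 1 out of every `sample_rate` iterations
        self.iteration = 0

    def should_sample(self):
        sample = (self.iteration % self.sample_rate == 0)
        self.iteration += 1
        return sample


ctrl = SamplingController(sample_rate=10)
sampled = [i for i in range(30) if ctrl.should_sample()]
assert sampled == [0, 10, 20]  # events recorded on 1 in 10 iterations
```

A rate of 1 degenerates to recording every iteration, preserving the pre-sampling behavior.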

@zhaojuanmao
Contributor Author

@ilia-cher and @ngimel, would you please help review whether the CUDA time measurement parts are written properly (any pitfalls I did not realize)? Thanks a lot!

Contributor

@rohan-varma rohan-varma left a comment


Looks good from my side! Thank you for working through all of the different timing issues!

@zhaojuanmao
Contributor Author

added sampling control for event recording

Comment thread torch/nn/parallel/distributed.py
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/reducer.cpp Outdated
Comment thread torch/lib/c10d/logger.hpp
zhaojuanmao added a commit that referenced this pull request Feb 17, 2021
Pull Request resolved: #51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events are created and recorded on multiple devices
2. use at::cuda::event API for safer calls
3. events may not be created in autograd hook if hook is not triggered in user's codes, e.g., users runs in non-sync mode in some iterations. So we checked events are created or not before synchronizing, also skipped invalid results.
4. users may not set device upfront, so explicitly set proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121829631

Differential Revision: [D26158645](https://our.internmc.facebook.com/intern/diff/D26158645/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26158645/)!
Comment thread torch/lib/c10d/logger.cpp
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/logger.cpp
Comment thread torch/lib/c10d/logger.cpp Outdated
Comment thread torch/lib/c10d/logger.cpp Outdated
zhaojuanmao added a commit that referenced this pull request Feb 18, 2021
Pull Request resolved: #51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events are created and recorded on multiple devices
2. use at::cuda::event API for safer calls
3. events may not be created in autograd hook if hook is not triggered in user's codes, e.g., users runs in non-sync mode in some iterations. So we checked events are created or not before synchronizing, also skipped invalid results.
4. users may not set device upfront, so explicitly set proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121933566

Differential Revision: [D26158645](https://our.internmc.facebook.com/intern/diff/D26158645/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D26158645/)!
@facebook-github-bot
Contributor

This pull request has been merged in c75fa39.

@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/59/head branch February 22, 2021 15:17
aocsa pushed a commit to Quansight/pytorch that referenced this pull request Mar 15, 2021
Summary:
Pull Request resolved: pytorch#51386

add stats such as rebuilt bucket stats, unused parameter stats and performance stats to ddp logging data

1. gpu time stats are not collected for single process multiple devices in this diff, as that requires events are created and recorded on multiple devices
2. use at::cuda::event API for safer calls
3. events may not be created in autograd hook if hook is not triggered in user's codes, e.g., users runs in non-sync mode in some iterations. So we checked events are created or not before synchronizing, also skipped invalid results.
4. users may not set device upfront, so explicitly set proper device before creating events in our prepare_forward() and prepare_backward() calls

ghstack-source-id: 121933566

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D26158645

fbshipit-source-id: ce5f15187802eba76accb980449be68902c10178
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue


6 participants