[Gradient Compression] Simplify the implementation of warm-start #50981
wayi1 wants to merge 4 commits into gh/SciPioneer/47/base from
Conversation
Since PowerSGD will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
💊 CI failures summary and remediations (as of commit 3d80e4b; more details on the Dr. CI page):
🕵️ 3 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
Since PowerSGD will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) ghstack-source-id: 120257256 Pull Request resolved: #50981
…-start" Since PowerSGD will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
rohan-varma
left a comment
I'm not sure if I understand this change. If PowerSGD will be applied in the first few iterations, shouldn't we take into account bucket rebuilding? Bucket rebuilding occurs in the 2nd iteration, so it seems like powerSGD would have to be aware of this?
Sorry, I mistyped the description. In the first few iterations, vanilla allreduce rather than PowerSGD will be used by default. Therefore, we can simplify the warm-start implementation here. Actually, I plan to do this for all the comm hooks in the future, so that comm hook developers won't be bothered by the internal bucketization details. The ultimate goal is to decouple engineering code from research code and hide the internal implementation details as much as possible -- ideally, a comm hook developer does not need to know how gradients are bucketized in DDP.
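The deferral described above (vanilla allreduce for the first few iterations, compression afterwards) can be sketched as follows. This is a minimal, hypothetical illustration: `WarmStartState` and the string return values are stand-ins for `PowerSGDState` (with its `start_powerSGD_iter` field) and the futures returned by the real hook.

```python
class WarmStartState:
    """Hypothetical stand-in for PowerSGDState: tracks the current
    iteration and the iteration at which compression kicks in
    (analogous to start_powerSGD_iter)."""

    def __init__(self, start_compression_iter=2):
        self.iter = 0
        self.start_compression_iter = start_compression_iter


def comm_hook(state, bucket):
    """Toy comm hook: run vanilla allreduce for the first few
    iterations, then switch to the compressed path. Because DDP
    rebuilds its buckets during these early iterations, any
    per-bucket caches created by the compressed path never see a
    shape change afterwards."""
    use_vanilla = state.iter < state.start_compression_iter
    state.iter += 1  # the real hook advances this once per iteration
    return "allreduce" if use_vanilla else "powerSGD"
```

With `start_compression_iter=2`, the first two calls take the allreduce path and only the third one compresses, which is exactly why the per-bucket caches no longer need a rebuild step.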
…-start" Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
…and warm-start Pull Request resolved: #50981 Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 ghstack-source-id: 120398567 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/)
Looks good, can you clarify in the code that we are now expecting vanilla allreduce to be run in the first few iterations so this change is clearer?
Also, it would be good to update the documentation in the code or anywhere else necessary that we are enforcing vanilla allreduce to be run for the first few iterations. That way, comm hook users won't be surprised that their algorithm doesn't kick in for a few iterations.
Also, please make sure CI is clear before landing.
Thanks for the suggestion! This is already documented in the PowerSGDState comments. Now added more comments to
…-start" Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
This pull request has been merged in b619d37.
…and warm-start (pytorch#50981) Summary: Pull Request resolved: pytorch#50981 Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202 ghstack-source-id: 120617971 Test Plan: real run Reviewed By: rohan-varma Differential Revision: D26034418 fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
Stack from ghstack:
Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors.
Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process.
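The error-feedback caching mentioned above can be illustrated with a toy sketch. Assumptions: numpy, and a stand-in "top-1" compressor instead of PowerSGD's low-rank approximation; the `error_dict` keyed by bucket index mirrors how the hook caches per-bucket residuals.

```python
import numpy as np


def compress_with_error_feedback(grad, error_dict, bucket_index):
    """Toy error-feedback step: the residual left behind by lossy
    compression is cached per bucket and added back to the gradient
    on the next call. If bucket shapes changed after this cache was
    populated, the cached residual would no longer match the input,
    which is why running vanilla allreduce until buckets are rebuilt
    removes the need to rebuild these caches."""
    grad = grad + error_dict.get(bucket_index, 0.0)  # reinject residual
    # Stand-in compressor: keep only the largest-magnitude entry.
    compressed = np.zeros_like(grad)
    idx = np.argmax(np.abs(grad))
    compressed.flat[idx] = grad.flat[idx]
    error_dict[bucket_index] = grad - compressed  # cache the residual
    return compressed
```

Note that the residual for the dropped entries accumulates across calls, so information lost to compression is eventually transmitted rather than silently discarded.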
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
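For context on what PowerSGD computes per gradient matrix, here is a minimal numpy sketch of a single rank-1 power-iteration step. It is illustrative only: the real hook handles multi-column P with proper orthogonalization, and warm-start means Q is carried over from the previous iteration instead of being re-initialized.

```python
import numpy as np


def power_sgd_rank1_step(m, q):
    """One power-iteration step approximating matrix m as the outer
    product p @ q.T (rank 1). Only p and q need to be communicated,
    which is far cheaper than sending m itself."""
    p = m @ q                   # project: (rows, 1)
    p = p / np.linalg.norm(p)   # normalize the single column
    q = m.T @ p                 # back-project: (cols, 1)
    return p, q


# A rank-1 matrix is reconstructed exactly by one step.
m = np.outer([1.0, 2.0], [3.0, 4.0])
p, q = power_sgd_rank1_step(m, np.ones((2, 1)))
approx = p @ q.T
```

For higher-rank gradients the single step is only an approximation, and the leftover `m - approx` is exactly what the error-feedback cache carries into the next iteration.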
Differential Revision: D26034418