
[Gradient Compression] Explicitly specify the dtype of the error tensor #50985

Closed
wayi1 wants to merge 5 commits into gh/SciPioneer/48/base from gh/SciPioneer/48/head

Conversation

@wayi1
Contributor

wayi1 commented Jan 23, 2021

Stack from ghstack:

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.
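
A minimal sketch of the pattern this describes (the variable names are illustrative assumptions, not the exact PR diff):

```python
import torch

# Hypothetical input gradient tensor in half precision.
input_tensor = torch.randn(1024).half()

# Before: torch.zeros defaults to the global default dtype (FP32), so the
# error tensor's dtype did not match the FP16 difference
# (input_tensor_cp - input_tensor) later assigned to it.
error = torch.zeros(input_tensor.shape, device=input_tensor.device)
assert error.dtype == torch.float32

# After: create the error tensor directly in the input tensor's dtype.
error = torch.zeros(
    input_tensor.shape, device=input_tensor.device, dtype=input_tensor.dtype
)
assert error.dtype == torch.float16
```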

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.
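
A similarly hedged sketch for the empty-buffer case (`rank1_tensors` and the concatenation are assumptions for illustration; the actual hook logic lives in powerSGD_hook.py):

```python
import torch

input_tensor = torch.randn(1024).half()
rank1_tensors = []  # may be empty for a given bucket

# Without an explicit dtype, torch.tensor([]) yields an FP32 tensor even
# when the bucket's gradients are FP16; passing dtype keeps it consistent.
rank1_buffer = (
    torch.cat(rank1_tensors)
    if rank1_tensors
    else torch.tensor([], device=input_tensor.device, dtype=input_tensor.dtype)
)
assert rank1_buffer.dtype == torch.float16
```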

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26034988

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
@facebook-github-bot
Contributor

facebook-github-bot commented Jan 23, 2021

💊 CI failures summary and remediations

As of commit 4f79850 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Same add/add merge-conflict log as build (1/3) above; exited with code exit status 1.

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Same add/add merge-conflict log as build (1/3) above; exited with code exit status 1.


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

… error tensor"

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 23, 2021
Pull Request resolved: #50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120259409

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)
… error tensor"


Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 25, 2021
Pull Request resolved: #50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120328964

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)
Comment thread on torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py (Outdated)
… error tensor"


Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 26, 2021
Pull Request resolved: #50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)
wayi1 requested a review from rohan-varma January 26, 2021 07:42
Contributor

rohan-varma left a comment


LGTM now, thanks!

… error tensor"


Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
@facebook-github-bot
Contributor

This pull request has been merged in 9d731e8.

facebook-github-bot deleted the gh/SciPioneer/48/head branch February 1, 2021 15:19
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…or (pytorch#50985)

Summary:
Pull Request resolved: pytorch#50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
ghstack-source-id: 120377786

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26034988

fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da

Labels

cla signed, Merged, oncall: distributed
