
[Gradient Compression] Explicitly specify the dtype of the error tensor #50985

Closed
wayi1 wants to merge 5 commits into gh/SciPioneer/48/base from gh/SciPioneer/48/head

Conversation

@wayi1
Contributor

wayi1 commented Jan 23, 2021

Stack from ghstack:

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.
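
A minimal sketch of the pattern this describes (the variable names are illustrative assumptions, not the exact PR diff):

```python
import torch

# Hypothetical input gradient tensor in half precision.
input_tensor = torch.randn(1024).half()

# Before: torch.zeros defaults to the global default dtype (FP32), so the
# error tensor's dtype did not match the FP16 difference
# (input_tensor_cp - input_tensor) later assigned to it.
error = torch.zeros(input_tensor.shape, device=input_tensor.device)
assert error.dtype == torch.float32

# After: create the error tensor directly in the input tensor's dtype.
error = torch.zeros(
    input_tensor.shape, device=input_tensor.device, dtype=input_tensor.dtype
)
assert error.dtype == torch.float16
```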

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.
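
A similarly hedged sketch for the empty-buffer case (`rank1_tensors` and the concatenation are assumptions for illustration; the actual hook logic lives in powerSGD_hook.py):

```python
import torch

input_tensor = torch.randn(1024).half()
rank1_tensors = []  # may be empty for a given bucket

# Without an explicit dtype, torch.tensor([]) yields an FP32 tensor even
# when the bucket's gradients are FP16; passing dtype keeps it consistent.
rank1_buffer = (
    torch.cat(rank1_tensors)
    if rank1_tensors
    else torch.tensor([], device=input_tensor.device, dtype=input_tensor.dtype)
)
assert rank1_buffer.dtype == torch.float16
```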

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26034988

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
@facebook-github-bot
Contributor

facebook-github-bot commented Jan 23, 2021

💊 CI failures summary and remediations

As of commit 4f79850 (more details on the Dr. CI page):


  • 4/4 failures possibly* introduced in this PR
    • 1/4 non-CircleCI failure(s)

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (2/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Same add/add merge-conflict log as build (1/3) above; exited with code exit status 1.

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Same add/add merge-conflict log as build (1/3) above; exited with code exit status 1.


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

… error tensor"

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 23, 2021
Pull Request resolved: #50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120259409

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)
… error tensor"


Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 25, 2021
Pull Request resolved: #50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120328964

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)
Comment thread on torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py (Outdated)
… error tensor"


Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 26, 2021
Pull Request resolved: #50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120377786

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)
wayi1 requested a review from rohan-varma January 26, 2021 07:42
Contributor

rohan-varma left a comment


LGTM now, thanks!

… error tensor"


Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26034988](https://our.internmc.facebook.com/intern/diff/D26034988/)

[ghstack-poisoned]
@facebook-github-bot
Contributor

This pull request has been merged in 9d731e8.

facebook-github-bot deleted the gh/SciPioneer/48/head branch February 1, 2021 15:19
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…or (pytorch#50985)

Summary:
Pull Request resolved: pytorch#50985

Explicitly specify the dtype of the error tensor when it is initialized with zeros.

Previously, if the dtype of the input tensor was FP16, the error tensor was still created in FP32, even though it would later be assigned an FP16 tensor (`input_tensor_cp` - `input_tensor`).

This change makes the dtype of the error tensor clearer.

Additionally, explicitly specify the dtype when the rank-1 tensor buffer is empty.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
ghstack-source-id: 120377786

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26034988

fbshipit-source-id: e0d323d0b77c6a2478cdbe8b31a1946ffd1a07da

Labels

cla signed, Merged, oncall: distributed
