[Gradient Compression] Implement the original layerwise PowerSGD #49417
wayi1 wants to merge 4 commits into gh/SciPioneer/35/base
Conversation
The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. That hook is now renamed to "batched_powerSGD_hook". This PR implements the original algorithm from the paper, which applies PowerSGD to each per-parameter tensor, i.e., a layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve higher accuracy, especially when the shapes of the per-parameter tensors cannot be aligned. Also add a test in distributed_test.py. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/) [ghstack-poisoned]
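For readers comparing the two hooks, below is a minimal sketch of what one layerwise compress/decompress step does to a single per-parameter gradient. The helper names and the simplified Gram-Schmidt pass are illustrative assumptions, not the hook's actual code, which additionally handles rank-1 fast paths, reuse of the P/Q buffers, and the all-reduce of P and Q across workers:

```python
import torch

def orthogonalize(matrix: torch.Tensor, eps: float = 1e-8) -> None:
    # Gram-Schmidt over the columns of `matrix`, in place (simplified sketch).
    for i in range(matrix.shape[1]):
        col = matrix[:, i : i + 1]
        col /= torch.norm(col) + eps
        if i + 1 < matrix.shape[1]:
            rest = matrix[:, i + 1 :]
            rest -= (col.t() @ rest) * col

def layerwise_powersgd_step(grad: torch.Tensor, q: torch.Tensor):
    """One PowerSGD iteration on a single per-parameter gradient (hypothetical helper).

    In the real hook, P and Q are each all-reduced across workers between these
    matrix multiplications; only the per-tensor math is shown here.
    """
    matrix = grad.view(grad.shape[0], -1)   # treat the gradient as an n x m matrix
    p = matrix @ q                          # n x rank
    orthogonalize(p)
    q = matrix.t() @ p                      # m x rank
    return (p @ q.t()).view_as(grad), q     # decompressed low-rank approximation and updated Q
```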
💊 CI failures summary and remediations: as of commit 9e60718 (more details on the Dr. CI page), 💚 Looks good so far! There are no failures yet. (This comment was automatically generated by Dr. CI.)
Pull Request resolved: #49417. ghstack-source-id: 118643443. Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/)
rohan-varma
left a comment
Thanks for working on this! I left a few comments/questions inline.
q_idx += m * matrix_approximation_rank

# Initialize and then orthogonalize Qs.
with torch.random.fork_rng(devices=[]):
Would it be reasonable to dedupe other use cases of this forking in grad compression to a helper function?
Had the same feeling. However, the first implementation has a loop (`for q in qs:`) between setting the manual seed and filling in the random values, which makes extracting a helper a bit tricky. Let me try to do it in a separate refactoring PR.
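One possible shape of the helper discussed here, purely as an illustration (the name and exact behavior are assumptions, not code from this PR or the planned follow-up):

```python
import torch
from typing import Iterable

def fill_with_seeded_randn(tensors: Iterable[torch.Tensor], seed: int) -> None:
    # Fork only the CPU RNG so the surrounding training randomness is untouched;
    # the fixed seed makes every rank generate identical initial low-rank factors.
    with torch.random.fork_rng(devices=[]):
        torch.manual_seed(seed)
        for t in tensors:
            # Sample on CPU (the RNG that was actually forked), then copy to the
            # tensor's device and dtype.
            t.copy_(torch.randn(t.shape))
```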
rohan-varma
left a comment
Thanks for answering all my questions/comments! LGTM, we can refactor the tests in a follow up PR as you mentioned.
…werSGD" The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook now is renamed as "batched_powerSGD_hook". Now implement the original implementation in the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned. Also add a test in distributed_test.py. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/) [ghstack-poisoned]
…werSGD" The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook now is renamed as "batched_powerSGD_hook". Now implement the original implementation in the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned. Also add a test in distributed_test.py. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/) [ghstack-poisoned]
Pull Request resolved: #49417. ghstack-source-id: 118921275. Differential Revision: [D25511543](https://our.internmc.facebook.com/intern/diff/D25511543/)
This pull request has been merged in 71f3399.
This breaks a multi-GPU test on master; reverting.
This pull request has been reverted by ad9923e.
…erSGD (#49639) Summary: Pull Request resolved: #49639 Resubmit #49417 with a fix for distributed_test. The previous submission broke a multi-GPU test that runs on 4 GPUs; since that test only runs on master, the failure could not be detected before the submission. The real diff is 4ca1014. This time I have verified that the previously failing test `pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test` passes after creating a PR (#49651) from a separate branch: https://app.circleci.com/pipelines/github/pytorch/pytorch/253644/workflows/c1c02b70-0877-40e6-8b4c-61f60f6b70ed/jobs/9768079 ghstack-source-id: 118969912 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook Reviewed By: mrshenli Differential Revision: D25654961 fbshipit-source-id: 2a45c8ceb9bdb54ff7309a8b66ec87e913e0150e
Address the comment on #49417 (comment) Differential Revision: [D25673997](https://our.internmc.facebook.com/intern/diff/D25673997/) [ghstack-poisoned]
Address the comment on #49417 (comment) Differential Revision: [D25673997](https://our.internmc.facebook.com/intern/diff/D25673997/) ghstack-source-id: 119021459 Pull Request resolved: #49715
…use.size()" Address the comment on #49417 (comment) Differential Revision: [D25673997](https://our.internmc.facebook.com/intern/diff/D25673997/) [ghstack-poisoned]
Pull Request resolved: #49715 Address the comment on #49417 (comment) ghstack-source-id: 119049598 Differential Revision: [D25673997](https://our.internmc.facebook.com/intern/diff/D25673997/)
…49715) Summary: Pull Request resolved: #49715 Address the comment on #49417 (comment) ghstack-source-id: 119049598 Test Plan: waitforbuildbot Reviewed By: rohan-varma Differential Revision: D25673997 fbshipit-source-id: 44eb2540e5a77331c34ba503285cbd0bd63c2c0a
…orch#49417) Summary: Pull Request resolved: pytorch#49417 The existing implementation applies PowerSGD to a batch of flattened tensors, which is a coarse-grained compression. This hook now is renamed as "batched_powerSGD_hook". Now implement the original implementation in the paper, which applies PowerSGD to each per-parameter tensor. This is a layerwise fine-grained compression. Although this original implementation is slower, it is expected to achieve a higher accuracy, especially when the shapes of per-param tensors cannot be aligned. Also add a test in distributed_test.py. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202 ghstack-source-id: 118921275 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook Reviewed By: rohan-varma Differential Revision: D25511543 fbshipit-source-id: 19ef188bc2d4c7406443c8fa233c1f2c2f27d93c
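For context on the test plan above, wiring the new layerwise hook into a DDP model looks roughly like the following sketch. `model` and `local_rank` are placeholders, and the exact `PowerSGDState` constructor arguments may vary between versions of the hook:

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes the process group has already been initialized, e.g. via
# dist.init_process_group("nccl", ...), and that `model` lives on the local GPU.
ddp_model = DDP(model, device_ids=[local_rank])

state = powerSGD.PowerSGDState(process_group=None, matrix_approximation_rank=1)

# Layerwise (per-parameter) compression added by this PR; the coarser-grained
# variant is registered the same way via powerSGD.batched_powerSGD_hook.
ddp_model.register_comm_hook(state, powerSGD.powerSGD_hook)
```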