[Gradient Compression] Check start_PowerSGD_iter > 1#51427
wayi1 wants to merge 4 commits into gh/SciPioneer/52/base
A user reported that `start_PowerSGD_iter` fails when it is set to 1. This is because allocating memory for the error tensors somehow overlaps with the bucket rebuilding process at iteration 1. Assert `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26166897](https://our.internmc.facebook.com/intern/diff/D26166897/)
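The tightened bound can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `SketchPowerSGDState` is a hypothetical stand-in rather than the actual PyTorch state class, the default values are invented for the example, and raising `ValueError` (rather than using a bare `assert`) is a choice made here, not something the description confirms.

```python
from dataclasses import dataclass


@dataclass
class SketchPowerSGDState:
    """Hypothetical stand-in for the hook state; NOT the PyTorch class."""

    # Both defaults are assumptions chosen only for this sketch.
    matrix_approximation_rank: int = 1
    start_PowerSGD_iter: int = 10

    def __post_init__(self) -> None:
        # Reject 1 as well as 0 and negatives: per the report, allocating
        # error tensors at iteration 1 overlaps with bucket rebuilding.
        if self.start_PowerSGD_iter <= 1:
            raise ValueError(
                "Expect start_PowerSGD_iter > 1, "
                f"got {self.start_PowerSGD_iter}"
            )


# The smallest accepted value is 2.
state = SketchPowerSGDState(start_PowerSGD_iter=2)
```

With the old `>= 1` bound, a value of 1 would have passed validation and failed later during training; with `> 1` it is rejected at construction time.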
💊 CI failures summary and remediations

As of commit f59f380 (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:
Pull Request resolved: #51427. ghstack-source-id: 120725871.
rohan-varma left a comment:
This diff doesn't seem to modify any of the asserts as the description suggests?
Pull Request resolved: #51427. ghstack-source-id: 120796036.
Updated. The diff was somehow messed up in the stack.
rohan-varma left a comment:
LGTM, though please consider adding the test and ensuring all CI checks pass.
Can we also add appropriate docstrings for these state classes for the hooks, and also ensure that subtleties such as the start iteration are well documented both in this state and in the actual hook?
Added much more documentation, including the invalid value range and guidance on tuning the PowerSGD configs.
Codecov Report
Coverage Diff

| | gh/SciPioneer/52/base | #51427 | +/- |
|---|---|---|---|
| Coverage | 80.86% | 80.52% | -0.34% |
| Files | 1938 | 1938 | |
| Lines | 211259 | 211187 | -72 |
| Hits | 170827 | 170055 | -772 |
| Misses | 40432 | 41132 | +700 |
A user reported that `start_PowerSGD_iter` fails when it is set to 1. This is because allocating memory for the error tensors somehow overlaps with the bucket rebuilding process at iteration 1. Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`. Also added a unit test, `test_invalid_powerSGD_state`, and some guidance on tuning PowerSGD configs.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26166897](https://our.internmc.facebook.com/intern/diff/D26166897/)
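A test along the lines of `test_invalid_powerSGD_state` might look roughly like the sketch below. The actual test body is not shown in this conversation, so everything here is an assumption: `make_state` is a hypothetical helper that applies the `> 1` check, and the use of `ValueError` is illustrative only.

```python
import unittest


def make_state(start_PowerSGD_iter: int) -> dict:
    """Hypothetical constructor that applies the `> 1` check."""
    if start_PowerSGD_iter <= 1:
        raise ValueError(
            f"Expect start_PowerSGD_iter > 1, got {start_PowerSGD_iter}"
        )
    return {"start_PowerSGD_iter": start_PowerSGD_iter}


class InvalidStateSketchTest(unittest.TestCase):
    def test_invalid_powerSGD_state(self):
        # 1 was accepted by the old `>= 1` bound; it must now be rejected,
        # along with 0.
        for bad in (0, 1):
            with self.subTest(start_PowerSGD_iter=bad):
                with self.assertRaises(ValueError):
                    make_state(bad)

    def test_smallest_valid_value(self):
        # 2 is the first value that passes the new check.
        self.assertEqual(make_state(2)["start_PowerSGD_iter"], 2)
```

Covering both the newly rejected value (1) and the smallest accepted value (2) pins the boundary from both sides, so a regression back to `>= 1` would be caught.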
… on tuning PowerSGD configs. Pull Request resolved: #51427. ghstack-source-id: 120834126.
This pull request has been merged in 79e7544.
… on tuning PowerSGD configs. (pytorch#51427)

Summary: Pull Request resolved: pytorch#51427

A user reported that `start_PowerSGD_iter` fails when it is set to 1. This is because allocating memory for the error tensors somehow overlaps with the bucket rebuilding process at iteration 1. Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`. Also add a unit test, `test_invalid_powerSGD_state`, and some guidance on tuning PowerSGD configs.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202

ghstack-source-id: 120834126

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state

Reviewed By: rohan-varma

Differential Revision: D26166897

fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d
Stack from ghstack:
A user reported that `start_PowerSGD_iter` fails when it is set to 1. This is because allocating memory for the error tensors somehow overlaps with the bucket rebuilding process at iteration 1.

Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.

Also added a unit test, `test_invalid_powerSGD_state`, and some guidance on tuning PowerSGD configs.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
Differential Revision: D26166897