[Gradient Compression] Simplify the implementation of warm-start #50981
wayi1 wants to merge 4 commits into gh/SciPioneer/47/base from
Conversation
Since PowerSGD will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
💊 CI failures summary and remediations (as of commit 3d80e4b; more details on the Dr. CI page):
🕵️ 3 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
Since PowerSGD will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) ghstack-source-id: 120257256 Pull Request resolved: #50981
…-start" Since PowerSGD will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
rohan-varma
left a comment
I'm not sure if I understand this change. If PowerSGD will be applied in the first few iterations, shouldn't we take into account bucket rebuilding? Bucket rebuilding occurs in the 2nd iteration, so it seems like powerSGD would have to be aware of this?
Sorry, I mistyped the description. In the first few iterations, vanilla allreduce rather than PowerSGD will be used by default. Therefore, we can simplify the warm-start implementation here. Actually, I plan to do this for all the comm hooks in the future, so that comm hook developers won't be bothered by the internal bucketization details. The ultimate goal is to decouple engineering code from research code and hide the internal implementation details as much as possible -- ideally, a comm hook developer does not need to know how gradients are bucketized in DDP.
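The deferral described above (vanilla allreduce for the first few iterations, compression afterwards) can be sketched as follows. This is a minimal, hypothetical illustration: `WarmStartState` and the string return values are stand-ins for `PowerSGDState` (with its `start_powerSGD_iter` field) and the futures returned by the real hook.

```python
class WarmStartState:
    """Hypothetical stand-in for PowerSGDState: tracks the current
    iteration and the iteration at which compression kicks in
    (analogous to start_powerSGD_iter)."""

    def __init__(self, start_compression_iter=2):
        self.iter = 0
        self.start_compression_iter = start_compression_iter


def comm_hook(state, bucket):
    """Toy comm hook: run vanilla allreduce for the first few
    iterations, then switch to the compressed path. Because DDP
    rebuilds its buckets during these early iterations, any
    per-bucket caches created by the compressed path never see a
    shape change afterwards."""
    use_vanilla = state.iter < state.start_compression_iter
    state.iter += 1  # the real hook advances this once per iteration
    return "allreduce" if use_vanilla else "powerSGD"
```

With `start_compression_iter=2`, the first two calls take the allreduce path and only the third one compresses, which is exactly why the per-bucket caches no longer need a rebuild step.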
…-start" Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
…and warm-start Pull Request resolved: #50981 Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 ghstack-source-id: 120398567 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/)
Looks good, can you clarify in the code that we are now expecting vanilla allreduce to be run in the first few iterations so this change is clearer?
Also, it would be good to update the documentation in the code or anywhere else necessary that we are enforcing vanilla allreduce to be run for the first few iterations. That way, comm hook users won't be surprised that their algorithm doesn't kick in for a few iterations.
Also, please make sure CI is clear before landing.
Thanks for the suggestion! This is already documented in the PowerSGDState comments. Now added more comments to
…-start" Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 Differential Revision: [D26034418](https://our.internmc.facebook.com/intern/diff/D26034418/) [ghstack-poisoned]
This pull request has been merged in b619d37.
…and warm-start (pytorch#50981) Summary: Pull Request resolved: pytorch#50981 Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors. Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202 ghstack-source-id: 120617971 Test Plan: real run Reviewed By: rohan-varma Differential Revision: D26034418 fbshipit-source-id: e8744431c7f3142d75b77b60110e6861c2ff5c14
Stack from ghstack:
Since vanilla allreduce will be applied in the first few iterations, the bucket rebuilding process will not affect caching per-variable tensors.
Previously, the cached tensors used for error feedback and warm-up needed to be rebuilt later, because their corresponding input tensors' shapes would change after the bucket rebuild process.
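The error-feedback caching mentioned above can be illustrated with a toy sketch. Assumptions: numpy, and a stand-in "top-1" compressor instead of PowerSGD's low-rank approximation; the `error_dict` keyed by bucket index mirrors how the hook caches per-bucket residuals.

```python
import numpy as np


def compress_with_error_feedback(grad, error_dict, bucket_index):
    """Toy error-feedback step: the residual left behind by lossy
    compression is cached per bucket and added back to the gradient
    on the next call. If bucket shapes changed after this cache was
    populated, the cached residual would no longer match the input,
    which is why running vanilla allreduce until buckets are rebuilt
    removes the need to rebuild these caches."""
    grad = grad + error_dict.get(bucket_index, 0.0)  # reinject residual
    # Stand-in compressor: keep only the largest-magnitude entry.
    compressed = np.zeros_like(grad)
    idx = np.argmax(np.abs(grad))
    compressed.flat[idx] = grad.flat[idx]
    error_dict[bucket_index] = grad - compressed  # cache the residual
    return compressed
```

Note that the residual for the dropped entries accumulates across calls, so information lost to compression is eventually transmitted rather than silently discarded.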
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
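For context on what PowerSGD computes per gradient matrix, here is a minimal numpy sketch of a single rank-1 power-iteration step. It is illustrative only: the real hook handles multi-column P with proper orthogonalization, and warm-start means Q is carried over from the previous iteration instead of being re-initialized.

```python
import numpy as np


def power_sgd_rank1_step(m, q):
    """One power-iteration step approximating matrix m as the outer
    product p @ q.T (rank 1). Only p and q need to be communicated,
    which is far cheaper than sending m itself."""
    p = m @ q                   # project: (rows, 1)
    p = p / np.linalg.norm(p)   # normalize the single column
    q = m.T @ p                 # back-project: (cols, 1)
    return p, q


# A rank-1 matrix is reconstructed exactly by one step.
m = np.outer([1.0, 2.0], [3.0, 4.0])
p, q = power_sgd_rank1_step(m, np.ones((2, 1)))
approx = p @ q.T
```

For higher-rank gradients the single step is only an approximation, and the leftover `m - approx` is exactly what the error-feedback cache carries into the next iteration.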
Differential Revision: D26034418