[Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations #51270
wayi1 wants to merge 4 commits into gh/SciPioneer/50/base
Conversation
[Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations

Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations. This may be useful for use cases with less strict accuracy requirements, where the batched version can then be applied.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26077709](https://our.internmc.facebook.com/intern/diff/D26077709/)

[ghstack-poisoned]
💊 CI failures summary and remediations

As of commit adde81d (more details on the Dr. CI page):

🕵️ 1 new failure recognized by patterns. The following CI failure does not appear to be due to upstream breakages:

| Job | Step | Action |
|---|---|---|
|  | Run tests | 🔁 rerun |

Extra GitHub checks: 1 failed

- Failed: GitHub Actions clang-tidy
This comment was automatically generated by Dr. CI.
rohan-varma
left a comment
I think the diff below this one was reverted, and this one is also failing tests: https://app.circleci.com/pipelines/github/pytorch/pytorch/266089/workflows/a036f791-01c8-4538-90eb-a4e40234b8c3/jobs/10492151. Do you need to resubmit the previous diff first?
Created #51400 for the resubmission, and will submit that PR first.
> 7) Computes M, which is approximately equal to PQ^T.
> 8) Truncates the input tensor to the original length.
>
> Note that this communication hook enforces vanilla allreduce for the first `state.start_powerSGD_iter` iterations.
Do we have the default value for `state.start_powerSGD_iter` documented in the docs for `PowerSGDState`? Would be nice to ensure we have that, and maybe also specify the default start iteration here.
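For intuition, the quoted docstring steps can be sketched numerically. The following is a minimal NumPy illustration only, not PyTorch's actual hook (which operates on flattened gradient buckets and performs `dist.all_reduce` on the P and Q factors); the helper name `batched_powersgd_step` and all parameter values are hypothetical:

```python
import numpy as np

def batched_powersgd_step(grad, iteration, start_powerSGD_iter=10, n_power_iters=2):
    """Rank-1 batched PowerSGD sketch: pad -> square matrix M -> M ~= P Q^T -> truncate."""
    if iteration < start_powerSGD_iter:
        # Vanilla allreduce path: the gradient is communicated uncompressed.
        return grad.copy(), False

    # Pad the flattened gradient so it can be viewed as a square matrix.
    n = grad.size
    side = int(np.ceil(np.sqrt(n)))
    padded = np.zeros(side * side)
    padded[:n] = grad
    M = padded.reshape(side, side)

    # Power iteration for a rank-1 approximation M ~= P @ Q.T.
    Q = np.random.RandomState(0).randn(side, 1)
    for _ in range(n_power_iters):
        P = M @ Q                        # in the real hook, P is allreduced here
        P /= np.linalg.norm(P) + 1e-12   # normalize P
        Q = M.T @ P                      # in the real hook, Q is allreduced here

    M_approx = P @ Q.T                   # step 7: M is approximately P Q^T
    return M_approx.reshape(-1)[:n], True  # step 8: truncate to the original length
```

For a gradient that is genuinely near rank 1, the reconstruction is close; for higher-rank gradients the rank-1 approximation is lossy, which is why the hook is positioned as a speed/accuracy trade-off.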
This pull request has been merged in c080780.
[Gradient Compression] Allow BatchedPowerSGD to run vanilla allreduce for the first K iterations (pytorch#51270)

Summary:
Pull Request resolved: pytorch#51270

Similar to pytorch#50973, allow the batched version to run vanilla allreduce for the first K iterations. This may be useful for use cases with less strict accuracy requirements, where the batched version can then be applied.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202

ghstack-source-id: 120725858

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

baseline: f248001754
batched PowerSGD: f246960752

The training time was reduced from 54m48s to 30m33s, and the accuracy is approximately the same: 44.21 vs. 44.35.

Reviewed By: rohan-varma
Differential Revision: D26077709
fbshipit-source-id: 6afeefad7a3fbdd7da2cbffb56dfbad855a96cb5
Stack from ghstack:
Similar to #50973, allow the batched version to run vanilla allreduce for the first K iterations.
This may be useful for use cases with less strict accuracy requirements, where the batched version can then be applied.
Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
Differential Revision: D26077709
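For reference, registering the batched hook with a fallback threshold might look like the sketch below. This is an illustration only: the single-process gloo group exists just to make the example self-contained, all hyperparameter values are illustrative, and it assumes a PyTorch build in which `PowerSGDState` accepts `start_powerSGD_iter` and DDP comm hooks are supported on the chosen backend:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Single-process process group, just to make the example runnable.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 4))

# Run vanilla allreduce for the first 1_000 iterations (K), then switch to
# batched PowerSGD compression.
state = powerSGD.PowerSGDState(
    process_group=None,          # use the default process group
    matrix_approximation_rank=1, # the batched hook always uses rank 1
    start_powerSGD_iter=1_000,   # K: uncompressed allreduce before this iteration
)
model.register_comm_hook(state, powerSGD.batched_powerSGD_hook)

# One training step to show the hook running inside backward().
loss = model(torch.randn(2, 8)).sum()
loss.backward()
dist.destroy_process_group()
```

Because `start_powerSGD_iter` is larger than the number of steps taken, this example exercises only the vanilla-allreduce fallback path that this PR adds to the batched hook.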