
[Gradient Compression] Check start_PowerSGD_iter > 1#51427

Closed
wayi1 wants to merge 4 commits intogh/SciPioneer/52/basefrom
gh/SciPioneer/52/head

Conversation

Contributor

@wayi1 wayi1 commented Jan 31, 2021

Stack from ghstack:

A user reported that `start_PowerSGD_iter` failed when it was set to 1. This is because allocating memory for the error tensors somehow overlaps with the bucket rebuilding process at iteration 1.

Check `start_PowerSGD_iter > 1` instead of `start_PowerSGD_iter >= 1`.

Also added a unit test, `test_invalid_powerSGD_state`, and some guidance on tuning PowerSGD configs.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26166897
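The fix can be sketched as follows. This is a minimal, hypothetical sketch of the new validation (assumed names; the real `PowerSGDState` in PyTorch takes more parameters):

```python
# Minimal sketch (assumed names; not the actual PyTorch source) of the check
# this PR introduces: reject start_PowerSGD_iter values <= 1, because error
# tensor allocation at iteration 1 can overlap with DDP's bucket rebuilding.
class PowerSGDState:
    def __init__(self, process_group=None, matrix_approximation_rank=1,
                 start_powerSGD_iter=2):
        if start_powerSGD_iter <= 1:
            raise ValueError(
                "Expect `start_powerSGD_iter` > 1, so that PowerSGD only "
                "starts after DDP has finished rebuilding its buckets."
            )
        self.process_group = process_group
        self.matrix_approximation_rank = matrix_approximation_rank
        self.start_powerSGD_iter = start_powerSGD_iter
        self.iter = 0  # incremented once per training iteration
```

With this check, constructing the state with `start_powerSGD_iter=1` fails fast at setup time instead of producing a confusing failure at iteration 1.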

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 31, 2021

💊 CI failures summary and remediations

As of commit f59f380 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Feb 02 11:57:53 [E request_callback_no_python.cpp:653] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 02 11:57:52 At:
Feb 02 11:57:52   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 02 11:57:52   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 02 11:57:52 
Feb 02 11:57:52 [E request_callback_no_python.cpp:653] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 02 11:57:52 
Feb 02 11:57:52 At:
Feb 02 11:57:52   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 02 11:57:52   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 02 11:57:52 
Feb 02 11:57:53 [E request_callback_no_python.cpp:653] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Feb 02 11:57:53 
Feb 02 11:57:53 At:
Feb 02 11:57:53   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize
Feb 02 11:57:53   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize
Feb 02 11:57:53 
Feb 02 11:57:53 ok (2.447s)
Feb 02 11:57:55   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... ok (2.348s)
Feb 02 11:57:58   test_return_local_rrefs (__main__.TensorPipeRpcTestWithSpawn) ... ok (2.549s)
Feb 02 11:58:01   test_rpc_profiling_async_function (__main__.TensorPipeRpcTestWithSpawn) ... ok (3.348s)
Feb 02 11:58:04   test_rpc_profiling_async_function_single_threaded (__main__.TensorPipeRpcTestWithSpawn) ... ok (3.356s)

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Feb 02 11:14:07 RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Unknown: Could not start gRPC server vs. OK)
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.6-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 314, in _setup_replication
Feb 02 11:14:07     device = xm.xla_device()
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.6-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 231, in xla_device
Feb 02 11:14:07     devkind=devkind if devkind is not None else None)
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.6-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 136, in get_xla_supported_devices
Feb 02 11:14:07     xla_devices = _DEVICES.value
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.6-py3.6-linux-x86_64.egg/torch_xla/utils/utils.py", line 32, in value
Feb 02 11:14:07     self._value = self._gen_fn()
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.6-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 18, in <lambda>
Feb 02 11:14:07     _DEVICES = xu.LazyProperty(lambda: torch_xla._XLAC._xla_get_devices())
Feb 02 11:14:07 RuntimeError: tensorflow/compiler/xla/xla_client/xrt_local_service.cc:56 : Check failed: tensorflow::NewServer(server_def, &server_) == ::tensorflow::Status::OK() (Unknown: Could not start gRPC server vs. OK)
Feb 02 11:14:07 Default device xla:0 is not a TPU device
Feb 02 11:14:07 Traceback (most recent call last):
Feb 02 11:14:07   File "/var/lib/jenkins/workspace/xla/test/test_mp_all_to_all.py", line 34, in <module>
Feb 02 11:14:07     xmp.spawn(_mp_fn, args=())
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.6-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
Feb 02 11:14:07     start_method=start_method)
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
Feb 02 11:14:07     while not context.join():
Feb 02 11:14:07   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
Feb 02 11:14:07     exit_code=exitcode

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

wayi1 pushed a commit that referenced this pull request Jan 31, 2021
Pull Request resolved: #51427

ghstack-source-id: 120725871

Differential Revision: [D26166897](https://our.internmc.facebook.com/intern/diff/D26166897/)
Contributor

@rohan-varma rohan-varma left a comment


This diff doesn't seem to modify any of the asserts as the description suggests?

wayi1 pushed a commit that referenced this pull request Feb 1, 2021
Pull Request resolved: #51427

ghstack-source-id: 120796036

Differential Revision: [D26166897](https://our.internmc.facebook.com/intern/diff/D26166897/)
Contributor Author

wayi1 commented Feb 1, 2021

> This diff doesn't seem to modify any of the asserts as the description suggests?

Updated. The diff was somehow messed up in the stack.

Contributor

@rohan-varma rohan-varma left a comment


LGTM, though please consider adding the test and ensuring all CI checks pass
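A test along these lines would cover the new validation. This is a hypothetical sketch, not the actual `test_invalid_powerSGD_state` that landed (the real test exercises the `PowerSGDState` from `torch.distributed`; the stand-in class here is assumed):

```python
# Hypothetical sketch of a unit test in the spirit of
# `test_invalid_powerSGD_state`. The stand-in PowerSGDState below only
# reproduces the validation under test; it is not the real class.
import unittest

class PowerSGDState:
    """Stand-in for the real state class, with the new validation."""
    def __init__(self, start_powerSGD_iter=2):
        if start_powerSGD_iter <= 1:
            raise ValueError("Expect `start_powerSGD_iter` > 1.")
        self.start_powerSGD_iter = start_powerSGD_iter

class TestInvalidPowerSGDState(unittest.TestCase):
    def test_rejects_start_iter_at_or_below_one(self):
        # Both 0 and 1 should fail fast at construction time.
        for bad in (0, 1):
            with self.assertRaises(ValueError):
                PowerSGDState(start_powerSGD_iter=bad)

    def test_accepts_start_iter_above_one(self):
        state = PowerSGDState(start_powerSGD_iter=2)
        self.assertEqual(state.start_powerSGD_iter, 2)

if __name__ == "__main__":
    unittest.main()
```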

Comment thread on torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py (outdated)
Contributor

@rohan-varma rohan-varma left a comment


Can we also add appropriate docstrings for these state classes for the hooks, and also ensure that subtleties such as the start iteration are well documented both in this state and in the actual hook?

Added much more documentation, including the invalid value range and guidance on tuning the PowerSGD configs.

@codecov

codecov Bot commented Feb 2, 2021

Codecov Report

Merging #51427 (2795da9) into gh/SciPioneer/52/base (6c24296) will decrease coverage by 0.33%.
The diff coverage is 11.11%.

@@                    Coverage Diff                    @@
##           gh/SciPioneer/52/base   #51427      +/-   ##
=========================================================
- Coverage                  80.86%   80.52%   -0.34%     
=========================================================
  Files                       1938     1938              
  Lines                     211259   211187      -72     
=========================================================
- Hits                      170827   170055     -772     
- Misses                     40432    41132     +700     

wayi1 pushed a commit that referenced this pull request Feb 2, 2021
… on tuning PowerSGD configs.

Pull Request resolved: #51427

ghstack-source-id: 120834126

Differential Revision: [D26166897](https://our.internmc.facebook.com/intern/diff/D26166897/)
@facebook-github-bot
Contributor

This pull request has been merged in 79e7544.

@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/52/head branch February 5, 2021 15:21
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
… on tuning PowerSGD configs. (pytorch#51427)

Summary:
Pull Request resolved: pytorch#51427

ghstack-source-id: 120834126

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_invalid_powerSGD_state

Reviewed By: rohan-varma

Differential Revision: D26166897

fbshipit-source-id: 34d5b64bb3dd43acb61d792626c70e6c8bb44a5d

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

3 participants