
[Gradient Compression] Refactor default_hooks.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future#51094

Closed
wayi1 wants to merge 4 commits into gh/SciPioneer/49/base from gh/SciPioneer/49/head

wayi1 commented Jan 26, 2021

Stack from ghstack:

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: D26070147
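
For readers skimming the thread, the refactor named in the title comes down to one pattern: start an asynchronous allreduce, chain a follow-up step onto its future, and return the chained future. The sketch below is a torch-free analogue of that pattern using only the standard library. It assumes, as is usual in data-parallel training, that the summed gradients are averaged over the group size. `then`, `allreduce_avg_fut`, and the in-process list of per-rank gradients are hypothetical stand-ins, not the helper this PR adds.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


def then(fut: Future, fn: Callable[[Future], object]) -> Future:
    """Return a new future that holds fn(fut) once fut completes (hypothetical helper)."""
    chained: Future = Future()

    def _run(done: Future) -> None:
        try:
            chained.set_result(fn(done))
        except Exception as exc:  # propagate failures to the chained future
            chained.set_exception(exc)

    fut.add_done_callback(_run)
    return chained


def allreduce_avg_fut(executor: ThreadPoolExecutor,
                      per_rank_grads: List[List[float]]) -> Future:
    """Simulate an async allreduce (element-wise sum) followed by averaging."""
    world_size = len(per_rank_grads)

    def _sum() -> List[float]:
        # Stand-in for the collective: element-wise sum across "ranks".
        return [sum(vals) for vals in zip(*per_rank_grads)]

    summed = executor.submit(_sum)
    # Chain the division so callers receive a single future of averaged values.
    return then(summed, lambda f: [v / world_size for v in f.result()])


with ThreadPoolExecutor(max_workers=1) as pool:
    averaged = allreduce_avg_fut(pool, [[1.0, 2.0], [3.0, 6.0]]).result()
```

In the real hooks the sum would come from a collective across processes rather than a thread pool; only the shape of the chaining carries over.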


facebook-github-bot commented Jan 26, 2021

💊 CI failures summary and remediations

As of commit 2ca21e5 (more details on the Dr. CI page):


  • 9/9 failures introduced in this PR

🕵️ 9 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (1/9)

Step: "Run tests"

Jan 29 00:53:48 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in
Jan 29 00:53:48     #7 0x560ac09b870b in PyEval_EvalCode /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:731
Jan 29 00:53:48     #8 0x560ac0a38573 in run_mod /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:1025
Jan 29 00:53:48     #9 0x560ac0a3860c in PyRun_StringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:949
Jan 29 00:53:48     #10 0x560ac0a3866e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:445
Jan 29 00:53:48     #11 0x560ac0a3c472 in run_command /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:301
Jan 29 00:53:48     #12 0x560ac0a3c472 in Py_Main /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:749
Jan 29 00:53:48     #13 0x560ac090643d in main /tmp/build/80754af9/python_1599604603603/work/Programs/python.c:69
Jan 29 00:53:48     #14 0x7ffa1d8e283f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291
Jan 29 00:53:48     #15 0x560ac09e5d0a in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jan 29 00:53:48 
Jan 29 00:53:48 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in 
Jan 29 00:53:48 + retcode=1
Jan 29 00:53:48 + set -e
Jan 29 00:53:48 + return 1
Jan 29 00:53:48 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]]
Jan 29 00:53:48 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]]
Jan 29 00:53:48 + '[' -n https://github.com/pytorch/pytorch/pull/51094 ']'
Jan 29 00:53:48 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 != *coverage* ]]
Jan 29 00:53:48 ++ mktemp
Jan 29 00:53:48 + DETERMINE_FROM=/tmp/tmp.fpKEWaHwLo
Jan 29 00:53:48 + file_diff_from_base /tmp/tmp.fpKEWaHwLo

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (2/9)

Step: "Run tests"

Jan 29 00:52:06 sccache: error: couldn't connect to server
Jan 29 00:52:06 +++ eval 'extract_trap_cmd '
Jan 29 00:52:06 ++++ extract_trap_cmd
Jan 29 00:52:06 ++++ printf '%s\n' ''
Jan 29 00:52:06 +++ printf '%s\n' cleanup
Jan 29 00:52:06 ++ trap -- '
Jan 29 00:52:06 cleanup' EXIT
Jan 29 00:52:06 ++ [[ pytorch-linux-bionic-py3.6-clang9-test != *pytorch-win-* ]]
Jan 29 00:52:06 ++ which sccache
Jan 29 00:52:06 ++ sccache --stop-server
Jan 29 00:52:06 Stopping sccache server...
Jan 29 00:52:06 sccache: error: couldn't connect to server
Jan 29 00:52:06 sccache: caused by: Connection refused (os error 111)
Jan 29 00:52:06 ++ true
Jan 29 00:52:06 ++ rm /var/lib/jenkins/sccache_error.log
Jan 29 00:52:06 ++ [[ pytorch-linux-bionic-py3.6-clang9-test == *rocm* ]]
Jan 29 00:52:06 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jan 29 00:52:06 ++ SCCACHE_IDLE_TIMEOUT=1200
Jan 29 00:52:06 ++ RUST_LOG=sccache::server=error
Jan 29 00:52:06 ++ sccache --start-server
Jan 29 00:52:06 sccache: Starting the server...
Jan 29 00:52:06 ++ sccache --zero-stats

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (3/9)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (4/9)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (5/9)

Step: "Run tests"

Jan 29 01:12:31 sccache: error: couldn't connect to server
Jan 29 01:12:31 +++ eval 'extract_trap_cmd '
Jan 29 01:12:31 ++++ extract_trap_cmd
Jan 29 01:12:31 ++++ printf '%s\n' ''
Jan 29 01:12:31 +++ printf '%s\n' cleanup
Jan 29 01:12:31 ++ trap -- '
Jan 29 01:12:31 cleanup' EXIT
Jan 29 01:12:31 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test2 != *pytorch-win-* ]]
Jan 29 01:12:31 ++ which sccache
Jan 29 01:12:31 ++ sccache --stop-server
Jan 29 01:12:31 Stopping sccache server...
Jan 29 01:12:31 sccache: error: couldn't connect to server
Jan 29 01:12:31 sccache: caused by: Connection refused (os error 111)
Jan 29 01:12:31 ++ true
Jan 29 01:12:31 ++ rm /var/lib/jenkins/sccache_error.log
Jan 29 01:12:31 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test2 == *rocm* ]]
Jan 29 01:12:31 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jan 29 01:12:31 ++ SCCACHE_IDLE_TIMEOUT=1200
Jan 29 01:12:31 ++ RUST_LOG=sccache::server=error
Jan 29 01:12:31 ++ sccache --start-server
Jan 29 01:12:31 sccache: Starting the server...
Jan 29 01:12:31 ++ sccache --zero-stats

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test1 (6/9)

Step: "Run tests"

Jan 29 00:53:45 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in
Jan 29 00:53:45     #7 0x56041871170b in PyEval_EvalCode /tmp/build/80754af9/python_1599604603603/work/Python/ceval.c:731
Jan 29 00:53:45     #8 0x560418791573 in run_mod /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:1025
Jan 29 00:53:45     #9 0x56041879160c in PyRun_StringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:949
Jan 29 00:53:45     #10 0x56041879166e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1599604603603/work/Python/pythonrun.c:445
Jan 29 00:53:45     #11 0x560418795472 in run_command /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:301
Jan 29 00:53:45     #12 0x560418795472 in Py_Main /tmp/build/80754af9/python_1599604603603/work/Modules/main.c:749
Jan 29 00:53:45     #13 0x56041865f43d in main /tmp/build/80754af9/python_1599604603603/work/Programs/python.c:69
Jan 29 00:53:45     #14 0x7fa9a753183f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291
Jan 29 00:53:45     #15 0x56041873ed0a in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103
Jan 29 00:53:45 
Jan 29 00:53:45 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:15:3 in 
Jan 29 00:53:45 + retcode=1
Jan 29 00:53:45 + set -e
Jan 29 00:53:45 + return 1
Jan 29 00:53:45 + [[ pytorch-linux-xenial-py3-clang5-asan-test1 == *-NO_AVX-* ]]
Jan 29 00:53:45 + [[ pytorch-linux-xenial-py3-clang5-asan-test1 == *-NO_AVX2-* ]]
Jan 29 00:53:45 + '[' -n https://github.com/pytorch/pytorch/pull/51094 ']'
Jan 29 00:53:45 + [[ pytorch-linux-xenial-py3-clang5-asan-test1 != *coverage* ]]
Jan 29 00:53:45 ++ mktemp
Jan 29 00:53:45 + DETERMINE_FROM=/tmp/tmp.q1TOnBC06b
Jan 29 00:53:45 + file_diff_from_base /tmp/tmp.q1TOnBC06b

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (7/9)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .jenkins/caffe2/test.sh
Auto-merging .jenkins/caffe2/test.sh
CONFLICT (add/add): Merge conflict in .gitmodules
Auto-merging .gitmodules
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/scripts/python_doc_push_script.sh
Auto-merging .circleci/scripts/python_doc_push_script.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (8/9)

Step: "Test"

RuntimeError: distributed/test_c10d failed!
  File "C:\Users\circleci\project\build\win_tmp\build\torch\distributed\algorithms\ddp_comm_hooks\__init__.py", line 7, in <module>
    from . import (
  File "C:\Users\circleci\project\build\win_tmp\build\torch\distributed\algorithms\ddp_comm_hooks\powerSGD_hook.py", line 7, in <module>
    import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
AttributeError: module 'torch.distributed.algorithms' has no attribute 'ddp_comm_hooks'
Traceback (most recent call last):
  File "run_test.py", line 922, in <module>
    main()
  File "run_test.py", line 901, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_c10d failed!

(base) C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1 
+ cleanup
+ retcode=1
+ set +x


Exited with code exit status 1

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (9/9)

Step: "Run tests"

Jan 29 01:12:12 sccache: error: couldn't connect to server
Jan 29 01:12:12 +++ eval 'extract_trap_cmd '
Jan 29 01:12:12 ++++ extract_trap_cmd
Jan 29 01:12:12 ++++ printf '%s\n' ''
Jan 29 01:12:12 +++ printf '%s\n' cleanup
Jan 29 01:12:12 ++ trap -- '
Jan 29 01:12:12 cleanup' EXIT
Jan 29 01:12:12 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test1 != *pytorch-win-* ]]
Jan 29 01:12:12 ++ which sccache
Jan 29 01:12:12 ++ sccache --stop-server
Jan 29 01:12:12 Stopping sccache server...
Jan 29 01:12:12 sccache: error: couldn't connect to server
Jan 29 01:12:12 sccache: caused by: Connection refused (os error 111)
Jan 29 01:12:12 ++ true
Jan 29 01:12:12 ++ rm /var/lib/jenkins/sccache_error.log
Jan 29 01:12:12 ++ [[ pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test1 == *rocm* ]]
Jan 29 01:12:12 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jan 29 01:12:12 ++ SCCACHE_IDLE_TIMEOUT=1200
Jan 29 01:12:12 ++ RUST_LOG=sccache::server=error
Jan 29 01:12:12 ++ sccache --start-server
Jan 29 01:12:12 sccache: Starting the server...
Jan 29 01:12:12 ++ sccache --zero-stats

This comment was automatically generated by Dr. CI.

wayi1 pushed a commit that referenced this pull request Jan 26, 2021
… by creating a util function that make a vanilla allreduce future

Pull Request resolved: #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120376248

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)
Comment thread torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py
Comment thread torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py Outdated
wayi added 2 commits January 28, 2021 16:02
…SGD_hook.py by creating a util function that make a vanilla allreduce future"


Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)

[ghstack-poisoned]
…SGD_hook.py by creating a util function that make a vanilla allreduce future"


Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 29, 2021
… by creating a util function that make a vanilla allreduce future

Pull Request resolved: #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120619680

Differential Revision: [D26070147](https://our.internmc.facebook.com/intern/diff/D26070147/)

facebook-github-bot commented

This pull request has been merged in e7b3496.

facebook-github-bot commented

This pull request has been reverted by 5a406c0.

izdeby commented Jan 29, 2021

Reverting due to a broken CI

rohan-varma commented

@SciPioneer Looks like the failures on this PR were legit:

Jan 29 01:19:21 Traceback (most recent call last):
Jan 29 01:19:21   File "distributed/test_c10d.py", line 21, in <module>
Jan 29 01:19:21     import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
Jan 29 01:19:21   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/algorithms/ddp_comm_hooks/__init__.py", line 7, in <module>
Jan 29 01:19:21     from . import (
Jan 29 01:19:21   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py", line 7, in <module>
Jan 29 01:19:21     import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default
Jan 29 01:19:21 AttributeError: module 'torch.distributed.algorithms' has no attribute 'ddp_comm_hooks'
Jan 29 01:19:21 Traceback (most recent call last):
Jan 29 01:19:21   File "test/run_test.py", line 922, in <module>
Jan 29 01:19:21     main()
Jan 29 01:19:21   File "test/run_test.py", line 901, in main
Jan 29 01:19:21     raise RuntimeError(err_message)
Jan 29 01:19:21 RuntimeError: distributed/test_c10d failed!
Jan 29 01:19:22 + cleanup
Jan 29 01:19:22 + retcode=1
Jan 29 01:19:22 + set +x
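
The traceback above is consistent with a circular import: `ddp_comm_hooks/__init__.py` is still executing when `powerSGD_hook.py` runs `import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default`. On Python 3.6, that form looks up `ddp_comm_hooks` as an attribute of the parent package, and the attribute is only set once the subpackage has finished importing. This reading is an inference from the log, not something stated in the thread. Below is a minimal sketch of one way to avoid that lookup, using a relative import inside a throwaway package. `demo_hooks`, `compress_hook`, and `average` are hypothetical names, and whether the resubmission took this route is not shown here.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny package whose __init__ imports two sibling modules."""
    pkg = root / "demo_hooks"
    pkg.mkdir()
    # The package __init__ pulls in both submodules, mirroring the traceback.
    (pkg / "__init__.py").write_text("from . import default_hooks, compress_hook\n")
    (pkg / "default_hooks.py").write_text(
        "def average(values):\n"
        "    return sum(values) / len(values)\n"
    )
    # Relative import: resolved from the package that is currently being
    # imported, without walking attributes down from the top-level package.
    (pkg / "compress_hook.py").write_text(
        "from . import default_hooks as default\n"
        "\n"
        "def hook(values):\n"
        "    return default.average(values)\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    try:
        demo_hooks = importlib.import_module("demo_hooks")
        result = demo_hooks.compress_hook.hook([2.0, 4.0])
    finally:
        sys.path.remove(tmp)
```

The demo only shows that the relative form imports cleanly while the package is still initializing; reproducing the original `AttributeError` would require a Python 3.6 interpreter and a nested package layout.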

wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…s.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future"

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

[ghstack-poisoned]
wayi1 pushed a commit that referenced this pull request Jan 30, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future

Pull Request resolved: #51400

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120715333

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)
wayi1 pushed a commit that referenced this pull request Jan 31, 2021
…s.py and powerSGD_hook.py by creating a util function that make a vanilla allreduce future"

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202

Differential Revision: [D26162333](https://our.internmc.facebook.com/intern/diff/D26162333/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot deleted the gh/SciPioneer/49/head branch February 1, 2021 15:19
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
…werSGD_hook.py by creating a util function that make a vanilla allreduce future (#51400)

Summary:
Pull Request resolved: #51400

Resubmission of #51094

Address #50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
… by creating a util function that make a vanilla allreduce future (pytorch#51094)

Summary:
Pull Request resolved: pytorch#51094

Address pytorch#50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
ghstack-source-id: 120619680

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26070147

fbshipit-source-id: 8c9339f1511e8f24cc906b9411cfe4850a5a6d81
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…werSGD_hook.py by creating a util function that make a vanilla allreduce future (pytorch#51400)

Summary:
Pull Request resolved: pytorch#51400

Resubmission of pytorch#51094

Address pytorch#50973 (comment)

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression pytorch#47202
ghstack-source-id: 120725690

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl

Reviewed By: rohan-varma

Differential Revision: D26162333

fbshipit-source-id: ccc2eae5383a23673e00d61cb5570fb8bf749cd0

Labels

cla signed, Merged, oncall: distributed, Reverted

4 participants