[redo] Fix SyncBatchNorm forward pass for non-default process group#43861
[redo] Fix SyncBatchNorm forward pass for non-default process group#43861vkuzo wants to merge 1 commit intogh/vkuzo/134/basefrom
Conversation
Codecov Report
@@ Coverage Diff @@
## gh/vkuzo/134/base #43861 +/- ##
==================================================
Coverage 69.31% 69.32%
==================================================
Files 378 378
Lines 46745 46745
==================================================
+ Hits 32403 32404 +1
+ Misses 14342 14341 -1
Continue to review full report at Codecov.
|
💊 CI failures summary and remediationsAs of commit bc60649 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
codecov.io: 1 failed
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 1 time. |
ngimel
left a comment
There was a problem hiding this comment.
would be good to add test?
@ngimel would you be ok with a manual test for this instance (100% ok to say yes)? There are some issues with my devgpu which are preventing me from running the OSS distributed test suite properly, so just hoping to plan whether I need to block a few hours to fix my env for this PR. |
|
Yeah, manual test is ok, our OSS CI probably just has 2 gpus and won't allow you to create non-default process group anyway. |
great, thanks, added manual test to test plan |
|
This pull request has been merged in b167402. |
…ytorch#43861) Summary: Pull Request resolved: pytorch#43861 This is a redo of pytorch#38874, and fixing my original bug from pytorch#38246. Test Plan: CI Imported from OSS Reviewed By: supriyar Differential Revision: D23418816 fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb
Stack from ghstack:
Summary:
This is a redo of #38874, and
fixing my original bug from
#38246.
Test Plan:
run DDP + SyncBN on two process groups:
https://gist.github.com/vkuzo/8501121d0d25261b70110aae4465ae0a
the script fails before this PR and works after this PR
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D23418816