Skip to content

Fix SyncBatchNorm forward pass for non-default process group#38874

Closed
linziyi96 wants to merge 1 commit intopytorch:masterfrom
linziyi96:syncbn-pg
Closed

Fix SyncBatchNorm forward pass for non-default process group#38874
linziyi96 wants to merge 1 commit intopytorch:masterfrom
linziyi96:syncbn-pg

Conversation

@linziyi96
Copy link
Copy Markdown
Contributor

In forward pass of SyncBN, the all_gather call ignores process_group, leading to failure if any non-default process group is used,

Seems to be introduced quite recently by #38246.

@linziyi96 linziyi96 requested a review from apaszke as a code owner May 21, 2020 15:59
@ailzhang ailzhang requested a review from vkuzo May 21, 2020 16:30
@ailzhang ailzhang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 21, 2020
Copy link
Copy Markdown
Contributor

@vkuzo vkuzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! sorry for missing this case

@ashaw596
Copy link
Copy Markdown

Is there a reason this hasn't been merged yet. This is a pretty nasty bug when you don't run the same operations on every device and slows down training speed.

@vkuzo
Copy link
Copy Markdown
Contributor

vkuzo commented Aug 28, 2020

@linziyi96 , any chance you can rebase so we can run recent CI? Or, happy to put up a new PR and link this one.

vkuzo added a commit that referenced this pull request Aug 30, 2020
Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Aug 30, 2020
Summary:

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: fb76ed7
Pull Request resolved: #43861
facebook-github-bot pushed a commit that referenced this pull request Sep 2, 2020
…43861)

Summary:
Pull Request resolved: #43861

This is a redo of #38874, and
fixing my original bug from
#38246.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb
@vkuzo
Copy link
Copy Markdown
Contributor

vkuzo commented Sep 2, 2020

#43861 which is a redo of this PR is landing now, sorry for the delay!

@vkuzo vkuzo closed this Sep 2, 2020
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ytorch#43861)

Summary:
Pull Request resolved: pytorch#43861

This is a redo of pytorch#38874, and
fixing my original bug from
pytorch#38246.

Test Plan:
CI

Imported from OSS

Reviewed By: supriyar

Differential Revision: D23418816

fbshipit-source-id: 2a3a3d67fc2d03bb0bf30a87cce4e805ac8839fb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants