
Disable TF32 on DDP tests#52941

Closed
zasdfgbnm wants to merge 1 commit into master from ddp-notf32

Conversation

@zasdfgbnm
Collaborator

When a system has both an Ampere and a non-Ampere card, many tests fail, because results on the different cards are different.

@zasdfgbnm zasdfgbnm requested a review from ngimel February 26, 2021 20:43
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Feb 26, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Feb 26, 2021

💊 CI failures summary and remediations

As of commit 6c0052a (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@codecov

codecov Bot commented Feb 27, 2021

Codecov Report

Merging #52941 (6c0052a) into master (6514a47) will decrease coverage by 0.32%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #52941      +/-   ##
==========================================
- Coverage   80.76%   80.44%   -0.33%     
==========================================
  Files        1975     1975              
  Lines      216701   216701              
==========================================
- Hits       175025   174321     -704     
- Misses      41676    42380     +704     

@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 2, 2021
DistributedTest, TestDistBackend
)

torch.backends.cuda.matmul.allow_tf32 = False
Contributor


I am not familiar with the impact of this configuration, but it looks generally OK to me, and the tests suggest this is fine, at least in our CI.

@ngimel please correct me if I was wrong. Thanks!

Contributor

@facebook-github-bot facebook-github-bot left a comment


@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in dfb5f02.

@zasdfgbnm zasdfgbnm deleted the ddp-notf32 branch March 12, 2021 02:46
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
When a system has both an Ampere and a non-Ampere card, many tests fail, because results on the different cards are different.

Pull Request resolved: pytorch#52941

Reviewed By: albanD

Differential Revision: D26994287

Pulled By: mrshenli

fbshipit-source-id: 287537495fc13361104a4460f5bcd79a208b5d8d
facebook-github-bot pushed a commit that referenced this pull request May 20, 2021
Summary:
This PR does several things to relax test tolerances:

- Do not use TF32 in cuda matmul in test_c10d. See #52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See #50453
    The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: #56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026

Labels

cla signed · Merged · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants