
Disable TF32 on DDP tests#52941

Closed
zasdfgbnm wants to merge 1 commit into master from ddp-notf32

Conversation

@zasdfgbnm
Collaborator

When a system has both an Ampere and a non-Ampere card, many tests fail, because results on the different cards are different.

@zasdfgbnm zasdfgbnm requested a review from ngimel February 26, 2021 20:43
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Feb 26, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Feb 26, 2021

💊 CI failures summary and remediations

As of commit 6c0052a (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@codecov

codecov Bot commented Feb 27, 2021

Codecov Report

Merging #52941 (6c0052a) into master (6514a47) will decrease coverage by 0.32%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #52941      +/-   ##
==========================================
- Coverage   80.76%   80.44%   -0.33%     
==========================================
  Files        1975     1975              
  Lines      216701   216701              
==========================================
- Hits       175025   174321     -704     
- Misses      41676    42380     +704     

@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 2, 2021
DistributedTest, TestDistBackend
)

torch.backends.cuda.matmul.allow_tf32 = False
Contributor


I am not familiar with the impact of this configuration, but it looks generally OK to me, and the tests suggest this is fine, at least in our CI.

@ngimel please correct me if I was wrong. Thanks!

Contributor

@facebook-github-bot facebook-github-bot left a comment


@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mrshenli merged this pull request in dfb5f02.

@zasdfgbnm zasdfgbnm deleted the ddp-notf32 branch March 12, 2021 02:46
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
When a system has both an Ampere and a non-Ampere card, many tests fail, because results on the different cards are different.

Pull Request resolved: pytorch#52941

Reviewed By: albanD

Differential Revision: D26994287

Pulled By: mrshenli

fbshipit-source-id: 287537495fc13361104a4460f5bcd79a208b5d8d
facebook-github-bot pushed a commit that referenced this pull request May 20, 2021
Summary:
This PR does several things to relax test tolerances:

- Do not use TF32 in cuda matmul in test_c10d. See #52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See #50453
    The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: #56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026

Labels

cla signed · Merged · oncall: distributed (Add this issue/PR to distributed oncall triage queue) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants