Do not use TF32 matmul in linalg and DDP tests #56114

Closed
xwang233 wants to merge 10 commits into pytorch:master from xwang233:relax-some-test-tolerance

Conversation

xwang233 (Collaborator) commented Apr 15, 2021

This PR does several things to relax test tolerances.
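For background: on Ampere-class GPUs, PyTorch can execute float32 CUDA matmuls in TF32, which is faster but keeps only about 10 bits of mantissa instead of float32's 23. A minimal sketch of the global switch these tests flip, the same flag that appears in the review thread below:

```python
import torch

# Disable TF32 for CUDA matmuls so float32 tests see true float32
# precision and tight tolerances hold. (A separate flag,
# torch.backends.cudnn.allow_tf32, controls cuDNN convolutions.)
torch.backends.cuda.matmul.allow_tf32 = False
```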

facebook-github-bot added labels oncall: distributed and cla signed on Apr 15, 2021
facebook-github-bot (Contributor) commented Apr 15, 2021

💊 CI failures summary and remediations

As of commit a83b61c (more details on the Dr. CI page):


  • 2/3 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)
  • 1/3 broken upstream at merge base 6c70cbe on May 19 from 8:24am to 2:12pm

1 failure not recognized by patterns:

Job: CircleCI pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test2
Step: Run tests

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

This comment was automatically generated by Dr. CI.


codecov bot commented Apr 15, 2021

Codecov Report

Merging #56114 (6c018dc) into master (ccd7141) will increase coverage by 0.56%.
The diff coverage is 100.00%.

❗ Current head 6c018dc differs from pull request most recent head a83b61c. Consider uploading reports for the commit a83b61c to get more accurate results.

```
@@            Coverage Diff             @@
##           master   #56114      +/-   ##
==========================================
+ Coverage   76.44%   77.00%   +0.56%
==========================================
  Files        1990     1912      -78
  Lines      199690   189561   -10129
==========================================
- Hits       152651   145980    -6671
+ Misses      47039    43581    -3458
```

xwang233 (Collaborator, Author) commented:

ping @ngimel 😄

Review thread on test/test_linalg.py (outdated diff excerpt):

```python
super(self.__class__, self).setUp()
torch.backends.cuda.matmul.allow_tf32 = False
self.precision_overrides = {
    torch.float: 1e-4,
```
A collaborator asked:
Does it mean that regular fp32 needs expanded tolerance? How are tests passing currently?

xwang233 (Collaborator, Author) replied:
Ohh, yes you're correct. I forgot to delete them. Let me modify this.
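
Based on this exchange, the fp32/cfloat precision overrides were dropped and only the TF32 toggle remained. A minimal sketch of what the revised setUp plausibly looks like (class name and tearDown restoration are assumptions; the final diff is not reproduced on this page):

```python
import unittest

import torch

class TestLinalg(unittest.TestCase):  # hypothetical stand-in for the real test class
    def setUp(self):
        super().setUp()
        # Save and disable TF32 so float32 matmuls run at full precision.
        self._prev_allow_tf32 = torch.backends.cuda.matmul.allow_tf32
        torch.backends.cuda.matmul.allow_tf32 = False

    def tearDown(self):
        # Restore the global flag so other tests are unaffected
        # (restoration is assumed, not shown in the excerpt above).
        torch.backends.cuda.matmul.allow_tf32 = self._prev_allow_tf32
        super().tearDown()
```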

xwang233 changed the title from "Relax some TF32 test tolerance" to "Do not use TF32 matmul in linalg and DDP tests" on May 19, 2021
facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 691c139.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Summary:
This PR does several things to relax test tolerances:

- Do not use TF32 for CUDA matmul in test_c10d. See pytorch#52941.
- Do not use TF32 for CUDA matmul in test_linalg, and increase atol for float and cfloat. See pytorch#50453. The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: pytorch#56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
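
To make the tolerance rationale concrete, here is a minimal sketch (assuming an Ampere-or-newer GPU, where the flag actually changes the matmul path) comparing matmul error against a float64 reference with TF32 on versus off:

```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ref = (a.double() @ b.double()).float()  # float64 reference, cast back

torch.backends.cuda.matmul.allow_tf32 = True
err_tf32 = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = False
err_fp32 = (a @ b - ref).abs().max().item()

# On Ampere, err_tf32 is typically orders of magnitude larger than
# err_fp32, which is why numeric tests disable TF32 or loosen atol.
print(f"max abs error: tf32={err_tf32:.2e}, fp32={err_fp32:.2e}")
```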

Labels

cla signed, Merged, oncall: distributed, open source

Projects

None yet


4 participants