Do not use TF32 matmul in linalg and DDP tests #56114

Closed
xwang233 wants to merge 10 commits into pytorch:master from xwang233:relax-some-test-tolerance

Conversation

xwang233 (Collaborator) commented Apr 15, 2021

This PR does several things to relax test tolerances.
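For background: on Ampere-class GPUs, PyTorch can execute float32 CUDA matmuls in TF32, which is faster but keeps only about 10 bits of mantissa instead of float32's 23. A minimal sketch of the global switch these tests flip, the same flag that appears in the review thread below:

```python
import torch

# Disable TF32 for CUDA matmuls so float32 tests see true float32
# precision and tight tolerances hold. (A separate flag,
# torch.backends.cudnn.allow_tf32, controls cuDNN convolutions.)
torch.backends.cuda.matmul.allow_tf32 = False
```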

facebook-github-bot added labels oncall: distributed and cla signed on Apr 15, 2021
facebook-github-bot (Contributor) commented Apr 15, 2021

💊 CI failures summary and remediations

As of commit a83b61c (more details on the Dr. CI page):


  • 2/3 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)
  • 1/3 broken upstream at merge base 6c70cbe on May 19 from 8:24am to 2:12pm

1 failure not recognized by patterns:

Job: CircleCI pytorch_linux_bionic_cuda10_2_cudnn7_py3_9_gcc7_test2
Step: Run tests

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

This comment was automatically generated by Dr. CI.


codecov bot commented Apr 15, 2021

Codecov Report

Merging #56114 (6c018dc) into master (ccd7141) will increase coverage by 0.56%.
The diff coverage is 100.00%.

❗ Current head 6c018dc differs from pull request most recent head a83b61c. Consider uploading reports for the commit a83b61c to get more accurate results.

```
@@            Coverage Diff             @@
##           master   #56114      +/-   ##
==========================================
+ Coverage   76.44%   77.00%   +0.56%
==========================================
  Files        1990     1912      -78
  Lines      199690   189561   -10129
==========================================
- Hits       152651   145980    -6671
+ Misses      47039    43581    -3458
```

xwang233 (Collaborator, Author) commented:

ping @ngimel 😄

Review thread on test/test_linalg.py (outdated diff excerpt):

```python
super(self.__class__, self).setUp()
torch.backends.cuda.matmul.allow_tf32 = False
self.precision_overrides = {
    torch.float: 1e-4,
```
A collaborator asked:
Does it mean that regular fp32 needs expanded tolerance? How are tests passing currently?

xwang233 (Collaborator, Author) replied:
Ohh, yes you're correct. I forgot to delete them. Let me modify this.
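
Based on this exchange, the fp32/cfloat precision overrides were dropped and only the TF32 toggle remained. A minimal sketch of what the revised setUp plausibly looks like (class name and tearDown restoration are assumptions; the final diff is not reproduced on this page):

```python
import unittest

import torch

class TestLinalg(unittest.TestCase):  # hypothetical stand-in for the real test class
    def setUp(self):
        super().setUp()
        # Save and disable TF32 so float32 matmuls run at full precision.
        self._prev_allow_tf32 = torch.backends.cuda.matmul.allow_tf32
        torch.backends.cuda.matmul.allow_tf32 = False

    def tearDown(self):
        # Restore the global flag so other tests are unaffected
        # (restoration is assumed, not shown in the excerpt above).
        torch.backends.cuda.matmul.allow_tf32 = self._prev_allow_tf32
        super().tearDown()
```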

xwang233 changed the title from "Relax some TF32 test tolerance" to "Do not use TF32 matmul in linalg and DDP tests" on May 19, 2021
facebook-github-bot (Contributor) commented:

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 691c139.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Summary:
This PR does several things to relax test tolerances:

- Do not use TF32 for CUDA matmul in test_c10d. See pytorch#52941.
- Do not use TF32 for CUDA matmul in test_linalg, and increase atol for float and cfloat. See pytorch#50453. The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: pytorch#56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
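
To make the tolerance rationale concrete, here is a minimal sketch (assuming an Ampere-or-newer GPU, where the flag actually changes the matmul path) comparing matmul error against a float64 reference with TF32 on versus off:

```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ref = (a.double() @ b.double()).float()  # float64 reference, cast back

torch.backends.cuda.matmul.allow_tf32 = True
err_tf32 = (a @ b - ref).abs().max().item()

torch.backends.cuda.matmul.allow_tf32 = False
err_fp32 = (a @ b - ref).abs().max().item()

# On Ampere, err_tf32 is typically orders of magnitude larger than
# err_fp32, which is why numeric tests disable TF32 or loosen atol.
print(f"max abs error: tf32={err_tf32:.2e}, fp32={err_fp32:.2e}")
```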

Labels

cla signed, Merged, oncall: distributed, open source

Projects

None yet


4 participants