
Fix TF32 failures in test_linalg.py #50453

Closed

zasdfgbnm wants to merge 9 commits into master from tf32-failures

Conversation

@zasdfgbnm
Collaborator

On Ampere GPUs, matmuls are computed with TF32 by default when the dtype is torch.float (see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices), which reduces the precision of the results. Linear algebra, however, usually needs higher precision, so many tests in test_linalg.py fail on Ampere GPUs because of precision issues.

To fix this issue:

  • Most linear algebra methods, except for matmuls, should add NoTF32Guard
  • Expected results in unit tests should be computed with NumPy matmuls instead of PyTorch CUDA matmuls (see the sketch below).
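
For example (a minimal sketch, not taken from this PR; the helper name, shapes, and tolerances are made up for illustration), a test can build its expected value with NumPy so that a possibly-TF32 CUDA matmul is checked against a full-precision reference:

import numpy as np
import torch

def expected_matmul_via_numpy(a, b):
    # Hypothetical helper: compute the reference in float64 NumPy,
    # then move it back to the input's device and dtype for comparison.
    res = np.matmul(a.cpu().numpy().astype(np.float64),
                    b.cpu().numpy().astype(np.float64))
    return torch.from_numpy(res).to(device=a.device, dtype=a.dtype)

# Assumes a CUDA (Ampere) device is available.
a = torch.randn(64, 64, device='cuda')
b = torch.randn(64, 64, device='cuda')
actual = a @ b                              # may run in TF32 on Ampere
expected = expected_matmul_via_numpy(a, b)  # full-precision reference
torch.testing.assert_allclose(actual, expected, atol=1e-2, rtol=1e-2)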

@facebook-github-bot
Contributor

facebook-github-bot commented Jan 13, 2021

💊 CI failures summary and remediations

As of commit 4b8f704 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

1 failure not recognized by patterns:

Job: CircleCI docker-pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7
Step: Check if image should be built
Action: 🔁 rerun

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@ngimel
Collaborator

ngimel commented Jan 13, 2021

Linear algebra methods for the most part call MAGMA, and we either create a cuBLAS handle for MAGMA with TF32 disabled, or MAGMA itself creates a default cuBLAS handle ( :-( ) that does not have TF32 enabled. So as far as the implementations are concerned we should be fine? (Problems with cuSOLVER are separate, cc @xwang233.)
Re: tests, I agree: for the "expected" values we should disable TF32, either by setting a switch or by calling CPU matmuls.
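
For reference, a minimal sketch of the two options mentioned above (the switch is the documented torch.backends.cuda.matmul.allow_tf32 flag; the shapes are arbitrary and this is not code from the PR):

import torch

a = torch.randn(128, 128, device='cuda')
b = torch.randn(128, 128, device='cuda')

# Option 1: flip the global switch while computing the expected value.
old = torch.backends.cuda.matmul.allow_tf32
torch.backends.cuda.matmul.allow_tf32 = False
expected_fp32 = a @ b  # full FP32 matmul
torch.backends.cuda.matmul.allow_tf32 = old

# Option 2: compute the reference on the CPU, where TF32 never applies.
expected_cpu = (a.cpu() @ b.cpu()).to(a.device)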

@codecov

codecov Bot commented Jan 13, 2021

Codecov Report

Merging #50453 (6a170d8) into master (5834438) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #50453      +/-   ##
==========================================
- Coverage   80.70%   80.70%   -0.01%     
==========================================
  Files        1905     1905              
  Lines      206813   206814       +1     
==========================================
- Hits       166916   166913       -3     
- Misses      39897    39901       +4     

@zasdfgbnm zasdfgbnm marked this pull request as ready for review January 13, 2021 17:59
@zasdfgbnm
Collaborator Author

This should be ready

@mrshenli mrshenli added the module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul) and triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Jan 15, 2021
@mrshenli mrshenli requested a review from mruberry January 15, 2021 03:08
Comment thread test/test_linalg.py

@dtypes(torch.float32, torch.float64, torch.complex64, torch.complex128)
@tf32_on_and_off(1e-3)
def test_kron(self, device, dtype):
Collaborator


This is interesting. @zasdfgbnm, can you elaborate on the impact of TF32 on the Kronecker product? @ngimel is surprised that cuBLAS tries to run a TF32 kernel for its inputs.

Collaborator Author


kron internally uses tensordot, which internally uses matmul, which runs in TF32.
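
A hedged illustration of why this still affects precision even though kron has no reduction: TF32 rounds the inputs of a matmul to a 10-bit mantissa, so even an outer-product-shaped matmul (k = 1) can lose accuracy if cuBLAS picks a TF32 kernel for it. The shapes below are arbitrary and chosen only for illustration:

import torch

a = torch.randn(256, 1, device='cuda', dtype=torch.float32)
b = torch.randn(1, 256, device='cuda', dtype=torch.float32)

torch.backends.cuda.matmul.allow_tf32 = True
outer_tf32 = a @ b   # no summation, but inputs may be rounded to TF32

torch.backends.cuda.matmul.allow_tf32 = False
outer_fp32 = a @ b

# Any difference here comes purely from rounding the inputs to TF32
# (whether cuBLAS actually selects a TF32 kernel for this shape depends
# on its heuristics).
print((outer_tf32 - outer_fp32).abs().max())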

Collaborator Author


I have removed all kron changes in this PR. kron will be fixed in #50927, and this PR can be landed independently.

Comment thread test/test_linalg.py Outdated
def test_pinv(self, device, dtype):
    from torch.testing._internal.common_utils import random_hermitian_pd_matrix

    def _(tensor):
Collaborator


Instead of this function, can you just do np_A and np_A_pinv below? That would prevent excessive copying if they're on CUDA, too.
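
A rough sketch of what that suggestion could look like (np_A and np_A_pinv are the names from the comment above; the matrix size and tolerances are assumed for illustration):

import numpy as np
import torch

A = torch.randn(5, 5, device='cuda', dtype=torch.float32)

# Compute the reference pseudo-inverse entirely in NumPy so it never
# goes through a (possibly TF32) CUDA matmul.
np_A = A.cpu().numpy()
np_A_pinv = np.linalg.pinv(np_A)

actual = torch.pinverse(A)
expected = torch.from_numpy(np_A_pinv).to(A.device)
torch.testing.assert_allclose(actual, expected, atol=1e-4, rtol=1e-4)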

Collaborator Author


fixed


Collaborator

@mruberry mruberry left a comment


@ngimel and I took a look. Looks good. We have one suggestion and one question about kron.

We're wondering if we should guard kron against TF32 as well.

@zasdfgbnm
Collaborator Author

I don't know. It is quite strange to me that the Kronecker product internally uses matmul. From the definition, there is no sum at all...

@zasdfgbnm
Collaborator Author

Why can't kron be implemented as below? It looks simple and fast:

import torch

def mykron(a, b):
    assert a.dim() == b.dim()
    a_view_shape = []
    b_view_shape = []
    ab_view_shape = []
    for i in range(a.dim()):
        # Interleave the sizes so that broadcasting an (..., a_i, 1, ...) view
        # against a (..., 1, b_i, ...) view yields every pairwise product.
        a_view_shape.append(a.size(i))
        a_view_shape.append(1)
        b_view_shape.append(1)
        b_view_shape.append(b.size(i))
        ab_view_shape.append(a.size(i) * b.size(i))
    # Elementwise multiply of the broadcast views, then collapse each
    # (a_i, b_i) pair back into a single dimension of size a_i * b_i.
    return (a.reshape(a_view_shape) * b.reshape(b_view_shape)).reshape(ab_view_shape)

a = torch.randn(5, 5, 5)
b = torch.randn(6, 6, 6)

r1 = torch.kron(a, b)
r2 = mykron(a, b)

diff = (r1 - r2).abs().max()
print(diff)

@zasdfgbnm
Collaborator Author

zasdfgbnm commented Jan 22, 2021

Well, based on my benchmark, it seems that we should use my implementation for kron: https://github.com/zasdfgbnm/things/blob/master/2021Q1/kron.ipynb

I will open a PR

Edit:
PR is #50927
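
For anyone who wants to reproduce the comparison locally, here is a hedged sketch using torch.utils.benchmark (this is not the linked notebook; the sizes are arbitrary):

import torch
import torch.utils.benchmark as benchmark

def mykron(a, b):
    # The reshape-and-broadcast kron from the earlier comment (no matmul involved).
    assert a.dim() == b.dim()
    a_shape, b_shape, ab_shape = [], [], []
    for i in range(a.dim()):
        a_shape += [a.size(i), 1]
        b_shape += [1, b.size(i)]
        ab_shape.append(a.size(i) * b.size(i))
    return (a.reshape(a_shape) * b.reshape(b_shape)).reshape(ab_shape)

a = torch.randn(64, 64, device='cuda')
b = torch.randn(64, 64, device='cuda')

for stmt in ("torch.kron(a, b)", "mykron(a, b)"):
    timer = benchmark.Timer(
        stmt=stmt,
        globals={"torch": torch, "a": a, "b": b, "mykron": mykron})
    print(timer.blocked_autorange())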

@zasdfgbnm
Collaborator Author

I have resolved all review comments

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in ba316a7.

@zasdfgbnm zasdfgbnm deleted the tf32-failures branch January 27, 2021 05:21
facebook-github-bot pushed a commit that referenced this pull request May 20, 2021
Summary:
This PR does several things to relax test tolerance

- Do not use TF32 in cuda matmul in test_c10d. See #52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See #50453
    The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: #56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
facebook-github-bot pushed a commit that referenced this pull request Feb 28, 2022
Summary:
Disable TF32 in some linalg functions

See also #67948 #50453 #44240

Pull Request resolved: #73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2022
Summary:
Disable TF32 in some linalg functions

See also #67948 #50453 #44240

Pull Request resolved: #73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
(cherry picked from commit cd75ec6)
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
On Ampere GPUs, matmuls are computed with TF32 by default when the dtype is `torch.float` (see https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices), which reduces the precision of the results. Linear algebra, however, usually needs higher precision, so many tests in `test_linalg.py` fail on Ampere GPUs because of precision issues.

To fix this issue:
- Most linear algebra methods, except for matmuls, should add `NoTF32Guard`
- Expected results in unit tests should compute matmuls using numpy instead of pytorch cuda.

Pull Request resolved: pytorch#50453

Reviewed By: glaringlee

Differential Revision: D26023005

Pulled By: ngimel

fbshipit-source-id: f0ea533494fee322b07925565b57e3b0db2570c5
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Summary:
This PR does several things to relax test tolerance

- Do not use TF32 in cuda matmul in test_c10d. See pytorch#52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See pytorch#50453
    The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: pytorch#56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Summary:
Disable TF32 in some linalg functions

See also pytorch#67948 pytorch#50453 pytorch#44240

Pull Request resolved: pytorch#73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
(cherry picked from commit cd75ec6)

Labels

cla signed · Merged · module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

6 participants