Disable TF32 in some linalg functions #73460

Closed
xwang233 wants to merge 2 commits into pytorch:master from xwang233:disable-tf32-in-some-linalg-bwd-functions

Conversation

@xwang233
Collaborator

Disable TF32 in some linalg functions

See also #67948 #50453 #44240

@pytorch-bot

pytorch-bot Bot commented Feb 25, 2022

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/xwang233/pytorch/blob/e7008b5b74bc8da72e306c5d12f3e2f2491e19b8/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default
Add ciflow labels to this PR to trigger more builds:

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
linux-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
linux-binary-manywheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
linux-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/trunk ✅ triggered
linux-bionic-rocm4.5-py3.7 ciflow/all, ciflow/default, ciflow/linux, ciflow/rocm, ciflow/trunk ✅ triggered
linux-docs ciflow/all, ciflow/cpu, ciflow/default, ciflow/docs, ciflow/linux, ciflow/trunk ✅ triggered
linux-vulkan-bionic-py3.7-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-cuda11.3-py3.7-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers, ciflow/trunk ✅ triggered
linux-xenial-py3.7-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
linux-xenial-py3.7-gcc7-no-ops ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
macos-arm64-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-arm64-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
macos-binary-conda ciflow/binaries, ciflow/binaries_conda, ciflow/default ✅ triggered
macos-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
macos-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/trunk ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/trunk, ciflow/win ✅ triggered
windows-binary-libtorch-cxx11-abi ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-libtorch-pre-cxx11 ciflow/binaries, ciflow/binaries_libtorch, ciflow/default ✅ triggered
windows-binary-wheel ciflow/binaries, ciflow/binaries_wheel, ciflow/default ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
docker-builds ciflow/all, ciflow/trunk 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos, ciflow/scheduled 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/trunk 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow, ciflow/trunk 🚫 skipped
linux-docs-push ciflow/all, ciflow/cpu, ciflow/linux, ciflow/scheduled 🚫 skipped
linux-xenial-cuda11.3-py3.7-gcc7-no-ops ciflow/all, ciflow/cuda, ciflow/linux, ciflow/trunk 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos, ciflow/trunk 🚫 skipped
parallelnative-linux-xenial-py3.7-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
periodic-libtorch-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-bionic-cuda11.5-py3.7-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.7-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
periodic-win-vs2019-cuda11.5-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-build ciflow/all, ciflow/android, ciflow/cpu, ciflow/linux, ciflow/trunk 🚫 skipped
pytorch-xla-linux-bionic-py3.7-clang8 ciflow/all, ciflow/cpu, ciflow/linux, ciflow/trunk, ciflow/xla 🚫 skipped

@facebook-github-bot
Contributor

facebook-github-bot commented Feb 25, 2022

💊 CI failures summary and remediations

As of commit e7008b5 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@ngimel
Collaborator

ngimel commented Feb 25, 2022

See also #68020. Do we need to adjust some tests now to reflect better accuracy? We won't see it in our CI, but there should be tests that would catch bad linalg computations if someone were to run them on Ampere cards.

@xwang233
Collaborator Author

Thanks @ngimel. We're seeing these linalg backward test failures on A100 cards. Disabling TF32 in the backward functions fixes those tests for now.

The only exceptions are svd_lowrank and pca_lowrank (which relies on the former). I've filed an issue here and we may need to discuss it.
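
For readers unfamiliar with the idea of disabling TF32 only around selected code, here is a minimal Python sketch of a scoped toggle. It is not the change made in this PR, which touches the C++ backward functions and is not shown in this thread. The helper name `tf32_disabled` and the stand-in `matmul_flags` object are assumptions made for illustration; in PyTorch the corresponding user-facing flag is `torch.backends.cuda.matmul.allow_tf32`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def tf32_disabled(flags):
    """Temporarily set flags.allow_tf32 to False and restore it on exit.

    `flags` is any object with an `allow_tf32` attribute (hypothetical helper).
    """
    previous = flags.allow_tf32
    flags.allow_tf32 = False
    try:
        yield
    finally:
        # Restore the caller's setting even if the guarded code raises.
        flags.allow_tf32 = previous


# Stand-in for torch.backends.cuda.matmul so the sketch runs without PyTorch.
matmul_flags = SimpleNamespace(allow_tf32=True)

with tf32_disabled(matmul_flags):
    inside = matmul_flags.allow_tf32  # value seen by the guarded code
after = matmul_flags.allow_tf32       # value once the block has exited

print(inside, after)
```

With PyTorch installed, passing `torch.backends.cuda.matmul` as `flags` should scope the same toggle around a precision-sensitive computation while leaving TF32 enabled elsewhere.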

@ngimel
Collaborator

ngimel commented Feb 26, 2022

Why did it start happening only now?

@xwang233
Collaborator Author

xwang233 commented Feb 26, 2022

Some linalg backward test failures, like TestJitCUDA.test_variant_consistency_jit_linalg_svdvals_cuda_complex64, only started on 02/03/22, around the same time as PR #72181. svd_backward and related methods were implemented in the past month, e.g. #70253.

@lezcano
Collaborator

lezcano commented Feb 26, 2022

For reference, the formula for svdvals backwards is equivalent to the previous one we had, only with a small optimisation for wide matrices. What that PR adds is forward AD support. Could it be that that's making this test fail, or is it just the standard flakiness from TF32?

@xwang233
Collaborator Author

xwang233 commented Feb 26, 2022

It's just a TF32 precision issue on A100 and 3090. The test passes on V100 (which doesn't have TF32). I think your implementation is correct. Relax. 😄
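
To give a rough sense of the size of the precision gap being discussed, here is a small standard-library sketch that emulates TF32 input rounding for a dot product. TF32 keeps float32's 8 exponent bits but only 10 mantissa bits. The sketch makes several assumptions: the low 13 mantissa bits are simply truncated (real hardware may round instead), products are accumulated in Python's float64 to isolate the effect of input rounding, and the input vectors are made up for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x):
    """Emulate TF32 by clearing the low 13 of float32's 23 mantissa bits.

    Assumption: plain truncation; actual hardware rounding may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def dot(xs, ys, cast=float):
    """Dot product whose inputs are first cast to the given format."""
    return sum(cast(x) * cast(y) for x, y in zip(xs, ys))


# Illustrative inputs only; all values lie in [1, 2).
xs = [1.0 + i / 1000.0 for i in range(1, 257)]
ys = [2.0 - i / 1000.0 for i in range(1, 257)]

exact = dot(xs, ys)
err_f32 = abs(dot(xs, ys, to_f32) - exact) / exact
err_tf32 = abs(dot(xs, ys, to_tf32) - exact) / exact

print(f"float32 inputs: relative error {err_f32:.2e}")
print(f"TF32 inputs:    relative error {err_tf32:.2e}")
```

Under these assumptions the TF32 error is several orders of magnitude larger than the float32 error, which is consistent with a test passing on a V100 and failing on an A100 without any change to the formula being tested.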

@ngimel
Collaborator

ngimel commented Feb 26, 2022

What about the other functions? This PR fixes a handful of backward functions, not just svd. Were they failing before?

@xwang233
Collaborator Author

Some other tests that rely on linalg_eigh_backward or svd_backward also failed. Those failures are mainly raised by two checks that were implemented recently (not sure if they're too strict?)

// Check in svd_backward
TORCH_CHECK(at::allclose(imdiag_UhgU, -imdiag_VhgV, /*rtol=*/1e-2, /*atol=*/1e-2),
            "svd_backward: The singular vectors in the complex case are specified up to multiplication "
            "by e^{i phi}. The specified loss function depends on this phase term, making "
            "it ill-defined.");

// Check shared by linalg_eigh_backward and linalg_eig_backward
TORCH_CHECK(at::allclose(imdiag_VhgV, at::zeros_like(imdiag_VhgV), /*rtol=*/1e-2, /*atol=*/1e-2),
            is_hermitian ? "linalg_eigh_backward" : "linalg_eig_backward",
            ": The eigenvectors in the complex case are specified up to multiplication ",
            "by e^{i phi}. The specified loss function depends on this quantity, so it is ill-defined.");

There may be other linalg operators that failed before this PR due to TF32, but I didn't check them one by one. Since there is a chance to fix the backward functions, I think it's better to have all of them fixed.
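
As a rough illustration of what the two checks above test, here is a pure-Python sketch of the element-wise rule behind allclose, |a - b| <= atol + rtol * |b|, applied with the same tolerances of 1e-2. The helper `allclose` below is a simplified stand-in that works on plain lists, and the numeric values are hypothetical, chosen only to show one passing and one failing case.

```python
def allclose(a, b, rtol=1e-2, atol=1e-2):
    """Return True if |x - y| <= atol + rtol * |y| for every pair (x, y)."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Hypothetical imaginary diagonals; a well-defined loss needs
# imdiag_UhgU == -imdiag_VhgV up to numerical error.
imdiag_UhgU = [0.500, -1.200]

# Small discrepancy: within tolerance, so the check would pass.
imdiag_VhgV_ok = [-0.503, 1.195]
passes = allclose(imdiag_UhgU, [-v for v in imdiag_VhgV_ok])

# Larger discrepancy in the first entry: the check would fire.
imdiag_VhgV_bad = [-0.530, 1.195]
fails = not allclose(imdiag_UhgU, [-v for v in imdiag_VhgV_bad])

print(passes, fails)
```

Because the tolerance combines an absolute and a relative term, entries close to zero are effectively compared against `atol` alone, so reduced-precision error in the inputs matters most when the compared quantities are small.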

@lezcano
Collaborator

lezcano commented Feb 26, 2022

Fwiw, that's a check that I wrote as "never a false positive, fine if it's a false negative" (note the tolerances). So if it's firing, I'm fine with making it more lax if that really makes TF32 pass.

@ngimel
Collaborator

ngimel commented Feb 26, 2022

TF32 should not be used for linalg operations.

@facebook-github-bot
Contributor

@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

albanD removed their request for review on February 28, 2022 16:09
facebook-github-bot pushed a commit that referenced this pull request Feb 28, 2022
Summary:
Disable TF32 in some linalg functions

See also #67948 #50453 #44240

Pull Request resolved: #73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
@github-actions
Contributor

Hey @xwang233.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

xwang233 added the "topic: not user facing" label on Mar 2, 2022
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022
Summary:
Disable TF32 in some linalg functions

See also pytorch/pytorch#67948 #50453 pytorch/pytorch#44240

Pull Request resolved: pytorch/pytorch#73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
(cherry picked from commit cd75ec645b86c4b4a66c35696ce891d006f3833b)
facebook-github-bot pushed a commit that referenced this pull request Mar 10, 2022
#73614)

Summary:
Follow up of #73460, #73461

Pull Request resolved: #73614

Reviewed By: malfet

Differential Revision: D34772822

Pulled By: ngimel

fbshipit-source-id: 4e2bea0173d1b6b01e857ef63ef5c2d8c3802544
pytorchmergebot pushed a commit that referenced this pull request Mar 10, 2022
#73614)

Summary:
Follow up of #73460, #73461

Pull Request resolved: #73614

Reviewed By: malfet

Differential Revision: D34772822

Pulled By: ngimel

fbshipit-source-id: 4e2bea0173d1b6b01e857ef63ef5c2d8c3802544
(cherry picked from commit 5994863)
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Summary:
Disable TF32 in some linalg functions

See also pytorch#67948 pytorch#50453 pytorch#44240

Pull Request resolved: pytorch#73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
(cherry picked from commit cd75ec6)
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
pytorch#73614)

Summary:
Follow up of pytorch#73460, pytorch#73461

Pull Request resolved: pytorch#73614

Reviewed By: malfet

Differential Revision: D34772822

Pulled By: ngimel

fbshipit-source-id: 4e2bea0173d1b6b01e857ef63ef5c2d8c3802544
(cherry picked from commit 5994863)