Conversation
💊 CI failures summary and remediations
As of commit 4b8f704 (more details on the Dr. CI page):
1 failure not recognized by patterns:
Linear algebra methods for the most part call magma, and either create a cublas handle for magma with tf32 disabled, or magma itself creates a default cublas handle ( :-( ) that does not have tf32 enabled. So as far as implementations are concerned we should be fine? (Problems with cusolver are separate, cc @xwang233.)
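For background, TF32 only affects float32 matmuls, and the switch the tests toggle is exposed in Python. A minimal sketch of the effect (assuming a CUDA build; the difference is only nonzero on Ampere-or-newer GPUs, where TF32 actually kicks in):

```python
import torch

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = True
fast = a @ b    # may run as TF32 on Ampere-or-newer GPUs
torch.backends.cuda.matmul.allow_tf32 = False
exact = a @ b   # full-precision float32 kernel

# Nonzero on Ampere when the first matmul actually used TF32.
print((fast - exact).abs().max())
```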
Codecov Report

```
@@            Coverage Diff             @@
##           master   #50453      +/-   ##
==========================================
- Coverage   80.70%   80.70%   -0.01%
==========================================
  Files        1905     1905
  Lines      206813   206814       +1
==========================================
- Hits       166916   166913       -3
- Misses      39897    39901       +4
```
This should be ready.
```python
@dtypes(torch.float32, torch.float64, torch.complex64, torch.complex128)
@tf32_on_and_off(1e-3)
def test_kron(self, device, dtype):
```
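For context, `tf32_on_and_off` runs the decorated test both with TF32 disabled and (on Ampere) with TF32 enabled under the looser tolerance given as its argument. A simplified sketch of that pattern, assuming the test class exposes a `precision` attribute the way PyTorch's device-type tests do (this is not the actual implementation, which lives in `torch.testing._internal.common_cuda`):

```python
import functools
import torch

def tf32_on_and_off_sketch(tf32_precision=1e-3):
    # Sketch only: run the wrapped test once with TF32 off at the default
    # tolerance, then once with TF32 on at the looser tolerance. The real
    # decorator also skips the TF32 run on pre-Ampere hardware.
    def decorator(test):
        @functools.wraps(test)
        def wrapped(self, device, dtype):
            old_allow = torch.backends.cuda.matmul.allow_tf32
            old_precision = self.precision
            try:
                torch.backends.cuda.matmul.allow_tf32 = False
                test(self, device, dtype)
                torch.backends.cuda.matmul.allow_tf32 = True
                self.precision = tf32_precision
                test(self, device, dtype)
            finally:
                torch.backends.cuda.matmul.allow_tf32 = old_allow
                self.precision = old_precision
        return wrapped
    return decorator
```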
This is interesting. @zasdfgbnm, can you elaborate on the impact of tf32 on the Kronecker product? @ngimel is surprised that cuBLAS tries to run a tf32 kernel for its inputs.
kron internally uses tensordot, which internally uses matmul, which is running in tf32
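A hedged illustration of that chain (not the exact PyTorch internals): `tensordot` with zero contracted dimensions is an outer product, which is serviced internally as a 2-D matmul, and kron is then just a permute plus reshape on top of it:

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(4, 5)

# dims=0 contracts nothing: a pure outer product, computed under the hood
# as a 2-D matmul -- the step that can silently run in TF32 on Ampere.
outer = torch.tensordot(a, b, dims=0)  # shape (2, 3, 4, 5)

# Interleave a's and b's axes, then merge each pair into one axis
# to get the Kronecker-product layout.
kron = outer.permute(0, 2, 1, 3).reshape(2 * 4, 3 * 5)
assert torch.allclose(kron, torch.kron(a, b))
```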
I have removed all kron changes in this PR. kron will be fixed in #50927, and this PR can be landed independently.
```python
def test_pinv(self, device, dtype):
    from torch.testing._internal.common_utils import random_hermitian_pd_matrix
```

```python
    def _(tensor):
```
Instead of this function, can you just do `np_A` and `np_A_pinv` below? That would prevent excessive copying if they're on CUDA, too.
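A sketch of the suggested pattern (the names `np_A` and `np_A_pinv` come from the review comment above; the matrix and dtype here are stand-ins):

```python
import numpy as np
import torch

A = torch.randn(5, 3, dtype=torch.float64)  # stand-in for the test input

# Convert once: if A lived on CUDA, this would be the single
# device-to-host copy instead of one copy per helper invocation.
np_A = A.cpu().numpy()
np_A_pinv = np.linalg.pinv(np_A)  # reference result computed by numpy

torch.testing.assert_close(torch.linalg.pinv(A), torch.from_numpy(np_A_pinv))
```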
I don't know. It is quite strange to me that the Kronecker product internally uses matmul. From the definition, there is no sum at all...
Why can't kron be implemented as below? This looks simple and fast:

```python
import torch

def mykron(a, b):
    assert a.dim() == b.dim()
    a_view_shape = []
    b_view_shape = []
    ab_view_shape = []
    for i in range(a.dim()):
        a_view_shape.append(a.size(i))
        a_view_shape.append(1)
        b_view_shape.append(1)
        b_view_shape.append(b.size(i))
        ab_view_shape.append(a.size(i) * b.size(i))
    return (a.reshape(a_view_shape) * b.reshape(b_view_shape)).reshape(ab_view_shape)

a = torch.randn(5, 5, 5)
b = torch.randn(6, 6, 6)
r1 = torch.kron(a, b)
r2 = mykron(a, b)
diff = (r1 - r2).abs().max()
print(diff)
```
Well, based on my benchmark, it seems that we should use my implementation for kron: https://github.com/zasdfgbnm/things/blob/master/2021Q1/kron.ipynb

I will open a PR.

Edit:
I have resolved all review comments.
facebook-github-bot left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This PR does several things to relax test tolerance:
- Do not use TF32 in cuda matmul in test_c10d. See #52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See #50453.

The tolerance is increased because most linear algebra operators are not that stable in single precision.

Pull Request resolved: #56114
Reviewed By: ailzhang
Differential Revision: D28554467
Pulled By: ngimel
fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
Summary: On Ampere GPUs, matmuls are computed by default with TF32 when the dtype is `torch.float` (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices), which reduces the precision of the results. However, linear algebra usually needs higher precision, so many tests in `test_linalg.py` fail on Ampere GPUs because of precision issues. To fix this:
- Most linear algebra methods, except for matmuls, should add `NoTF32Guard`.
- Expected results in unit tests should compute matmuls using numpy instead of pytorch cuda.

Pull Request resolved: pytorch#50453
Reviewed By: glaringlee
Differential Revision: D26023005
Pulled By: ngimel
fbshipit-source-id: f0ea533494fee322b07925565b57e3b0db2570c5
Summary: Disable TF32 in some linalg functions. See also pytorch#67948, pytorch#50453, pytorch#44240.

Pull Request resolved: pytorch#73460
Reviewed By: albanD
Differential Revision: D34493487
Pulled By: ngimel
fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
(cherry picked from commit cd75ec6)
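The commit messages above mention `NoTF32Guard`, an ATen-side RAII guard. A rough Python-level analogue built from the public knobs (a sketch, not the guard itself):

```python
import contextlib
import torch

@contextlib.contextmanager
def no_tf32():
    # Rough analogue of ATen's NoTF32Guard: remember the TF32 settings,
    # force full-precision float32 matmuls/convolutions, then restore.
    prev_matmul = torch.backends.cuda.matmul.allow_tf32
    prev_cudnn = torch.backends.cudnn.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    try:
        yield
    finally:
        torch.backends.cuda.matmul.allow_tf32 = prev_matmul
        torch.backends.cudnn.allow_tf32 = prev_cudnn

# Usage: wrap precision-sensitive linalg work.
# with no_tf32():
#     q, r = torch.linalg.qr(torch.randn(64, 64, device="cuda"))
```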