
Adjust TF32 tests #44240

Closed

zasdfgbnm wants to merge 52 commits into master from tf32-tests

Conversation

@zasdfgbnm (Collaborator) commented Sep 5, 2020

  • The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with errors like "0.0059 is not smaller than 0.005". I ran test_nn.py and test_torch.py 10+ times each to check that they are no longer flaky.
  • Add @tf32_on_and_off to the new matrix_exp tests.
  • Disable TF32 on test suites other than test_nn.py and test_torch.py.

cc: @ptrblck

@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Sep 5, 2020
@dr-ci bot commented Sep 5, 2020

💊 CI failures summary and remediations

As of commit b137059 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



@zasdfgbnm zasdfgbnm marked this pull request as ready for review September 5, 2020 04:24
@zasdfgbnm zasdfgbnm requested review from mruberry and ngimel September 5, 2020 04:25
@codecov bot commented Sep 5, 2020

Codecov Report

Merging #44240 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master   #44240      +/-   ##
==========================================
+ Coverage   68.01%   68.03%   +0.02%     
==========================================
  Files         393      393              
  Lines       50847    50855       +8     
==========================================
+ Hits        34583    34601      +18     
+ Misses      16264    16254      -10     
Impacted Files                             Coverage Δ
torch/testing/_internal/common_cuda.py    64.04% <100.00%> (+9.82%) ⬆️
torch/testing/_internal/common_nn.py      83.18% <100.00%> (+0.03%) ⬆️
torch/backends/cuda/__init__.py           70.83% <0.00%> (+8.33%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da4033d...b137059.

@zasdfgbnm (Collaborator, Author) commented:

I will need to rebase and test to make sure the new tests introduced in #43980 are not broken.

@zasdfgbnm (Collaborator, Author) commented:

Done.

@ngimel (Collaborator) commented Sep 7, 2020

Running 10 times does not do anything, since the test sets the seed deterministically? Or did you find a way to disable that?

Comment thread test/test_torch.py Outdated
@skipCUDAIfNoMagma
@skipCPUIfNoLapack
@dtypes(torch.float, torch.double)
@tf32_on_and_off(5.0) # values are 40x.xx vs 40x.xx
Collaborator:

ouch! On the other hand, for matrix exp I'm not particularly surprised.

Collaborator:

So far we've disabled TF32 for linear algebra calls e.g. via MAGMA. Would it make sense to disable it for matrix_exp as well or at least raise a warning?

Collaborator (Author):

According to my test, matrix_exp with TF32 can be 6x faster than with FP32. Unlike MAGMA, which is CPU-bound, the TF32 flag does make a real difference for matrix_exp.

Collaborator (Author):

Here is the test for precision:

import torch

for i in range(5, 15):
    size = 2 ** i
    print(f"size: {size}x{size}")
    m1 = torch.randn(size, size, device='cuda')
    m2 = torch.randn(size, size, device='cuda')
    # Reference values with TF32 disabled (full-precision FP32 matmul)
    torch.backends.cuda.matmul.allow_tf32 = False
    e1 = m1 @ m2
    e2 = torch.matrix_exp(m1)
    # Same computations with TF32 enabled
    torch.backends.cuda.matmul.allow_tf32 = True
    r1 = m1 @ m2
    r2 = torch.matrix_exp(m1)
    print("matmul error:", (r1 - e1).abs().max())
    print("exp error:", (r2 - e2).abs().max())
    print("matmul relative error:", (r1 - e1).abs().max() / e1.abs().max())
    print("exp relative error:", (r2 - e2).abs().max() / e2.abs().max())
    print()

Output:

size: 32x32
matmul error: tensor(3.8147e-06, device='cuda:0')
exp error: tensor(4.7684e-05, device='cuda:0')
matmul relative error: tensor(1.3857e-07, device='cuda:0')
exp relative error: tensor(5.0852e-07, device='cuda:0')

size: 64x64
matmul error: tensor(0.0096, device='cuda:0')
exp error: tensor(7.3298, device='cuda:0')
matmul relative error: tensor(0.0004, device='cuda:0')
exp relative error: tensor(0.0063, device='cuda:0')

size: 128x128
matmul error: tensor(0.0167, device='cuda:0')
exp error: tensor(183.8311, device='cuda:0')
matmul relative error: tensor(0.0004, device='cuda:0')
exp relative error: tensor(0.0105, device='cuda:0')

size: 256x256
matmul error: tensor(0.0212, device='cuda:0')
exp error: tensor(11040.5312, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.0102, device='cuda:0')

size: 512x512
matmul error: tensor(0.0325, device='cuda:0')
exp error: tensor(14844504., device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.0215, device='cuda:0')

size: 1024x1024
matmul error: tensor(0.0432, device='cuda:0')
exp error: tensor(3.1523e+11, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.0482, device='cuda:0')

size: 2048x2048
matmul error: tensor(0.0715, device='cuda:0')
exp error: tensor(2.8972e+17, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.1296, device='cuda:0')

size: 4096x4096
matmul error: tensor(0.1049, device='cuda:0')
exp error: tensor(7.2534e+25, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.3424, device='cuda:0')

size: 8192x8192
matmul error: tensor(0.1492, device='cuda:0')
exp error: tensor(2.3980e+37, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.5230, device='cuda:0')

size: 16384x16384
matmul error: tensor(0.2262, device='cuda:0')
exp error: tensor(nan, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(nan, device='cuda:0')
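For context on the numbers above: TF32 keeps FP32's 8-bit exponent but truncates the mantissa from 23 bits to 10, so its unit roundoff is about 4.9e-4, in line with the ~3e-4 matmul relative error seen at every size; matrix_exp then compounds that error through repeated matrix products, which is consistent with the blow-up at larger sizes. A quick back-of-the-envelope check (standard floating-point arithmetic, not from the PR itself):

```python
# TF32 keeps a 10-bit mantissa; FP32 keeps 23 bits. With round-to-nearest,
# the unit roundoff is 2**-(mantissa_bits + 1).
tf32_mantissa_bits = 10
fp32_mantissa_bits = 23
u_tf32 = 2.0 ** -(tf32_mantissa_bits + 1)   # ~4.88e-04
u_fp32 = 2.0 ** -(fp32_mantissa_bits + 1)   # ~5.96e-08
print(f"TF32 unit roundoff: {u_tf32:.2e}")
print(f"FP32 unit roundoff: {u_fp32:.2e}")
print(f"ratio: {u_tf32 / u_fp32:.0f}x")     # 8192x coarser than FP32
```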

Collaborator (Author):

I don't think there is an easy way to disable TF32 for matrix_exp. It is implemented using other ATen operators, and at::matmul is where the TF32 flag plays a role. You cannot simply disable the flag at the beginning of matrix_exp, because these flags are process-global and not thread-safe.
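The thread-safety concern can be illustrated without CUDA at all: if matrix_exp saved, flipped, and restored a process-global flag, a concurrent matmul on another thread would observe the flipped value mid-call. A hypothetical sketch (class and event names are made up; the real setting is torch.backends.cuda.matmul.allow_tf32), with events forcing the interleaving deterministically:

```python
import threading

class MatmulFlags:
    allow_tf32 = True   # hypothetical stand-in for the process-global flag

flags = MatmulFlags()
observed = {}
a_disabled = threading.Event()
b_read = threading.Event()

def matrix_exp_thread():
    # Naive approach: save, flip, and restore the *global* flag around the call.
    saved = flags.allow_tf32
    flags.allow_tf32 = False
    a_disabled.set()
    b_read.wait()              # "matrix_exp" still running while B matmuls
    flags.allow_tf32 = saved

def matmul_thread():
    a_disabled.wait()
    # B asked for TF32, but sees the value A flipped for its own call.
    observed["tf32"] = flags.allow_tf32
    b_read.set()

ta = threading.Thread(target=matrix_exp_thread)
tb = threading.Thread(target=matmul_thread)
ta.start(); tb.start(); ta.join(); tb.join()
print(observed)  # {'tf32': False} -- B's matmul silently lost TF32
```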

Collaborator (Author):

@ngimel @ptrblck Any opinion?

Collaborator:

@ngimel Can we not acquire the current cuBLAS handle and use the above guard before calling matrix_exp? As long as the functions matrix_exp is composed of don't use multiple cuBLAS handles, maybe that would work?

matrix_exp appears to be very little used so we'd prefer to avoid a precision headache. If it's difficult to call it with tf32 disabled locally then throwing a warning seems OK.

Collaborator:

Are there other cases of poor precision that aren't covered yet?

@ngimel suggests all linear algebra computations other than matmul should disable TF32. Most of these operations currently use MAGMA, but #42403 is going to move them to cuSOLVER and cuBLAS calls. Do we need to update that PR, or put warnings around these operations, to handle TF32?

Collaborator (Author):

I have disabled TF32 on matrix_exp by introducing a thread_local variable to override the TF32 flag.

Collaborator (Author):

Can we deal with the cuSOLVER ops in a later PR? I don't know which ops will be affected yet. After @xwang233 has migrated the ops to cuSOLVER, I will run the tests to see whether TF32 needs to be disabled there as well.

Comment thread test/test_torch.py Outdated
Comment thread torch/testing/_internal/common_nn.py
@zasdfgbnm (Collaborator, Author) commented:

@mruberry @ngimel This should be ready

Comment thread aten/src/ATen/Context.cpp Outdated
@facebook-github-bot (Contributor) left a comment

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@ngimel merged this pull request in 3f5eee6.

facebook-github-bot pushed a commit that referenced this pull request Feb 28, 2022
Summary:
Disable TF32 in some linalg functions

See also #67948 #50453 #44240

Pull Request resolved: #73460

Reviewed By: albanD

Differential Revision: D34493487

Pulled By: ngimel

fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2022, with the same summary ("Disable TF32 in some linalg functions", #73460, D34493487), cherry picked from commit cd75ec6.
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 3, 2022, with the same summary, cherry picked from commit cd75ec645b86c4b4a66c35696ce891d006f3833b.
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, sometimes these tests fail with things like 0.0059 is not smaller than 0.005. I ran `test_nn.py` and `test_torch.py` for 10+ times to check these are no longer flaky.
- Add `tf32_on_and_off` to new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`

cc: ptrblck

Pull Request resolved: pytorch#44240

Reviewed By: mruberry

Differential Revision: D23882498

Pulled By: ngimel

fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026, with the same "Disable TF32 in some linalg functions" summary (pytorch#73460, D34493487), cherry picked from commit cd75ec6.

Labels

Merged · oncall: jit (Add this issue/PR to JIT oncall triage queue) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


7 participants