Conversation
💊 CI failures summary and remediations
As of commit b137059 (more details on the Dr. CI page):
ci.pytorch.org: 1 failed
This comment was automatically generated by Dr. CI.
Codecov Report
@@ Coverage Diff @@
## master #44240 +/- ##
==========================================
+ Coverage 68.01% 68.03% +0.02%
==========================================
Files 393 393
Lines 50847 50855 +8
==========================================
+ Hits 34583 34601 +18
+ Misses 16264 16254 -10
I will need to rebase and test to make sure the new tests introduced in #43980 are not broken.
Done.
Running 10 times does not do anything, since the test sets the seed deterministically? Or did you find a way to disable that?
@skipCUDAIfNoMagma
@skipCPUIfNoLapack
@dtypes(torch.float, torch.double)
@tf32_on_and_off(5.0)  # values are 40x.xx vs 40x.xx
Ouch! On the other hand, for matrix_exp I'm not particularly surprised.
So far we've disabled TF32 for linear algebra calls, e.g. via MAGMA. Would it make sense to disable it for matrix_exp as well, or at least raise a warning?
According to my test, matrix_exp with TF32 can be 6x faster than with FP32. Unlike MAGMA, which is CPU-bound, the TF32 flag does make a real difference for matrix_exp.
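For reference, a minimal timing sketch of the kind of comparison behind that speedup claim (the 4096 size, iteration count, and helper name are illustrative, not the benchmark actually used):

import torch

def time_matrix_exp(size, allow_tf32, iters=10):
    # Toggle the global TF32 matmul flag, then time matrix_exp.
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    m = torch.randn(size, size, device='cuda')
    torch.matrix_exp(m)  # warm-up
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matrix_exp(m)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

size = 4096
fp32_ms = time_matrix_exp(size, allow_tf32=False)
tf32_ms = time_matrix_exp(size, allow_tf32=True)
print(f"fp32: {fp32_ms:.2f} ms  tf32: {tf32_ms:.2f} ms  "
      f"speedup: {fp32_ms / tf32_ms:.1f}x")

CUDA kernels launch asynchronously, hence the events and synchronize() calls around the timed loop.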
Here is the test for precision:
import torch

for i in range(5, 15):
    size = 2 ** i
    print(f"size: {size}x{size}")
    m1 = torch.randn(size, size, device='cuda')
    m2 = torch.randn(size, size, device='cuda')
    # Reference results with TF32 disabled (full FP32 matmuls).
    torch.backends.cuda.matmul.allow_tf32 = False
    e1 = m1 @ m2
    e2 = torch.matrix_exp(m1)
    # Same computations with TF32 enabled.
    torch.backends.cuda.matmul.allow_tf32 = True
    r1 = m1 @ m2
    r2 = torch.matrix_exp(m1)
    print("matmul error:", (r1 - e1).abs().max())
    print("exp error:", (r2 - e2).abs().max())
    print("matmul relative error:", (r1 - e1).abs().max() / e1.abs().max())
    print("exp relative error:", (r2 - e2).abs().max() / e2.abs().max())
    print()

Output:

size: 32x32
matmul error: tensor(3.8147e-06, device='cuda:0')
exp error: tensor(4.7684e-05, device='cuda:0')
matmul relative error: tensor(1.3857e-07, device='cuda:0')
exp relative error: tensor(5.0852e-07, device='cuda:0')
size: 64x64
matmul error: tensor(0.0096, device='cuda:0')
exp error: tensor(7.3298, device='cuda:0')
matmul relative error: tensor(0.0004, device='cuda:0')
exp relative error: tensor(0.0063, device='cuda:0')
size: 128x128
matmul error: tensor(0.0167, device='cuda:0')
exp error: tensor(183.8311, device='cuda:0')
matmul relative error: tensor(0.0004, device='cuda:0')
exp relative error: tensor(0.0105, device='cuda:0')
size: 256x256
matmul error: tensor(0.0212, device='cuda:0')
exp error: tensor(11040.5312, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.0102, device='cuda:0')
size: 512x512
matmul error: tensor(0.0325, device='cuda:0')
exp error: tensor(14844504., device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.0215, device='cuda:0')
size: 1024x1024
matmul error: tensor(0.0432, device='cuda:0')
exp error: tensor(3.1523e+11, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.0482, device='cuda:0')
size: 2048x2048
matmul error: tensor(0.0715, device='cuda:0')
exp error: tensor(2.8972e+17, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.1296, device='cuda:0')
size: 4096x4096
matmul error: tensor(0.1049, device='cuda:0')
exp error: tensor(7.2534e+25, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.3424, device='cuda:0')
size: 8192x8192
matmul error: tensor(0.1492, device='cuda:0')
exp error: tensor(2.3980e+37, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(0.5230, device='cuda:0')
size: 16384x16384
matmul error: tensor(0.2262, device='cuda:0')
exp error: tensor(nan, device='cuda:0')
matmul relative error: tensor(0.0003, device='cuda:0')
exp relative error: tensor(nan, device='cuda:0')
I don't think there is an easy way to disable TF32 for matrix_exp. It is implemented using other ATen operators, and at::matmul is where the TF32 flag plays a role. You cannot just disable this flag at the beginning of matrix_exp, because these flags are not thread-safe.
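To illustrate the hazard being described: the flag is process-global, so a naive save/toggle/restore around matrix_exp in one thread also changes what a concurrent matmul in another thread does. A minimal sketch (the function names are illustrative):

import threading
import torch

torch.backends.cuda.matmul.allow_tf32 = True  # user wants TF32 elsewhere

def naive_matrix_exp_fp32():
    # Save/toggle/restore of the *global* flag: not thread-safe.
    saved = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    try:
        torch.matrix_exp(torch.randn(64, 64, device='cuda'))
    finally:
        torch.backends.cuda.matmul.allow_tf32 = saved

def concurrent_matmul():
    # Whether this matmul runs with TF32 now depends on timing,
    # because both threads share the same global flag.
    a = torch.randn(1024, 1024, device='cuda')
    (a @ a).sum()

t1 = threading.Thread(target=naive_matrix_exp_fp32)
t2 = threading.Thread(target=concurrent_matmul)
t1.start(); t2.start()
t1.join(); t2.join()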
@ngimel Can we not acquire the current cuBLAS handle and use the above guard before calling matrix_exp? As long as the functions matrix_exp is composed of don't use multiple cuBLAS handles, maybe that will work?
matrix_exp appears to be very little used, so we'd prefer to avoid a precision headache. If it's difficult to call it with TF32 disabled locally, then throwing a warning seems OK.
Are there other cases of poor precision that aren't covered yet?
@ngimel suggests that all linear algebra computations other than matmul should disable TF32. Most of these operations currently use MAGMA, but PR #42403 is going to move them to cuSolver and cuBLAS calls. Do we need to update that, or put warnings around these operations for how to handle TF32?
I have disabled TF32 on matrix_exp by introducing a thread_local variable to override the TF32 flag.
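The actual change is in ATen C++; as a rough Python analogy of the pattern (every name below is illustrative, not PyTorch API), a thread-local override can veto the global flag for the current thread without the race above:

import threading

_tls = threading.local()   # per-thread state, invisible to other threads
allow_tf32_global = True   # stand-in for the user-facing global flag

class DisableTF32Guard:
    # RAII-style guard: forces TF32 off for the current thread only.
    def __enter__(self):
        self._saved = getattr(_tls, 'override_off', False)
        _tls.override_off = True
    def __exit__(self, *exc):
        _tls.override_off = self._saved

def effective_allow_tf32():
    # What the matmul dispatch would consult: the global flag,
    # unless the current thread has the override set.
    return allow_tf32_global and not getattr(_tls, 'override_off', False)

with DisableTF32Guard():
    assert not effective_allow_tf32()  # off inside the guard...
assert effective_allow_tf32()          # ...restored afterwards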
Can we deal with the cuSolver ops in a later PR? I don't know which ops will be affected now. After @xwang233 has migrated ops to cuSolver, I will run tests to see whether TF32 needs to be disabled there.
facebook-github-bot left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Disable TF32 in some linalg functions. See also pytorch/pytorch#67948, #50453, pytorch/pytorch#44240.
Pull Request resolved: pytorch/pytorch#73460
Reviewed By: albanD
Differential Revision: D34493487
Pulled By: ngimel
fbshipit-source-id: 958cd968ea09df3b5a4d2b4a26aaf0dfddc53981
(cherry picked from commit cd75ec645b86c4b4a66c35696ce891d006f3833b)
Summary:
- The thresholds of some tests are bumped up. Depending on the random generator, these tests sometimes fail with things like "0.0059 is not smaller than 0.005". I ran `test_nn.py` and `test_torch.py` 10+ times to check these are no longer flaky.
- Add `tf32_on_and_off` to the new `matrix_exp` tests.
- Disable TF32 on test suites other than `test_nn.py` and `test_torch.py`.
cc: ptrblck
Pull Request resolved: pytorch#44240
Reviewed By: mruberry
Differential Revision: D23882498
Pulled By: ngimel
fbshipit-source-id: 44a9ec08802c93a2efaf4e01d7487222478b6df8
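For context, the `tf32_on_and_off` decorator used above behaves roughly as sketched below: it runs the test once with TF32 off at its normal tolerance and once with TF32 on at the relaxed tolerance passed in. This is a simplified sketch, not the real helper from torch.testing._internal, which also checks that the device actually supports TF32:

import functools
import torch

def tf32_on_and_off(tf32_precision=1e-5):
    # Simplified sketch: run the wrapped test twice, with TF32 off and on,
    # relaxing self.precision for the TF32 pass, and restore state after.
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(self, *args, **kwargs):
            saved_flag = torch.backends.cuda.matmul.allow_tf32
            saved_precision = self.precision
            try:
                torch.backends.cuda.matmul.allow_tf32 = False
                test_fn(self, *args, **kwargs)   # strict FP32 pass
                torch.backends.cuda.matmul.allow_tf32 = True
                self.precision = tf32_precision  # relaxed tolerance
                test_fn(self, *args, **kwargs)   # TF32 pass
            finally:
                torch.backends.cuda.matmul.allow_tf32 = saved_flag
                self.precision = saved_precision
        return wrapper
    return decorator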