[pytorch] cuBLAS addmm malfunction test by souravmandal · Pull Request #85084 · pytorch/pytorch

souravmandal · 2022-09-15T13:35:21Z

Summary: Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations

Test Plan:
Sample unit test output --

[...]
test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
[...]

Reviewed By: mikekgfb

Differential Revision: D39433029

pytorch-bot · 2022-09-15T13:35:25Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85084

✅ No Failures, 4 Pending

As of commit 55b4804:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2022-09-15T13:36:02Z

This pull request was exported from Phabricator. Differential Revision: D39433029

lezcano

We already test matmul quite extensively against NumPy in the tests test_matmul_small_brute_force_{1,2,3}d_Nd. What's the reason for wanting to test addmm on its own?

souravmandal · 2022-09-15T14:49:29Z

We (@mikekgfb @ngimel) have observed in the past that cuBLAS can break such that it produces random data or crashes outright, especially for large input arrays. It would be useful to test whether a new cuBLAS version triggers that.

lezcano

In any case, I think it'd be better to use NumPy to compare against, as we are sure that it will return correct results.
Also, this way, you could simply skip torch.float16 on CPU and the code could be heavily simplified. See for example how check_single_matmul does this, together with dynamic tolerances for robustness.
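
To make the "dynamic tolerances" idea concrete, here is a minimal standard-library sketch. It assumes the tolerance is scaled with the inner dimension `k` of the product; the helper names and the constants `base_atol` and `rtol` are illustrative assumptions, not the values used by check_single_matmul.

```python
import math
import random
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_float32(a, b):
    """Dot product with every intermediate result rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_float32(acc + to_float32(to_float32(x) * to_float32(y)))
    return acc


def within_dynamic_tolerance(actual, expected, k, base_atol=5e-5, rtol=1e-4):
    """Compare two scalars with an absolute tolerance that grows with k."""
    atol = max(k, 1) * base_atol  # assumed scaling rule, for illustration only
    return abs(actual - expected) <= atol + rtol * abs(expected)


random.seed(0)
k = 1000
a = [random.uniform(-1.0, 1.0) for _ in range(k)]
b = [random.uniform(-1.0, 1.0) for _ in range(k)]
expected = math.fsum(x * y for x, y in zip(a, b))  # float64 reference
actual = dot_float32(a, b)  # lower-precision accumulation
```

The rounding error of a length-`k` dot product grows with `k`, so a fixed absolute tolerance is either too strict for large matrices or too loose for small ones; scaling it with `k` avoids both.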

test/test_linalg.py

lezcano · 2022-09-15T15:09:27Z

Also, issue #84538 may be relevant. I think @srossross is looking into implementing this one.

souravmandal · 2022-09-15T21:14:36Z

The issue with NumPy is that it does not support bfloat16 (ref1, ref2). To simplify, one could make the reference tensors in each call float32 and apply the dtype only to the torch tensors on the GPU.
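
As a rough illustration of why a float32 reference needs loose bounds when compared against bfloat16 results, the sketch below emulates bfloat16 rounding with the standard library only. The function `to_bfloat16` is a hypothetical helper written for this example (round-to-nearest-even on the float32 bit pattern, with no special NaN handling); it is not part of NumPy or PyTorch.

```python
import struct


def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value, returned as a Python float.

    bfloat16 keeps the upper 16 bits of the float32 bit pattern
    (1 sign bit, 8 exponent bits, 7 explicit mantissa bits).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


print(to_bfloat16(1.001))    # 1.0
print(to_bfloat16(3.14159))  # 3.140625
```

Near 1.0 the spacing between adjacent bfloat16 values is 2**-7, so individual inputs already lose about two to three decimal digits before any matrix product is computed.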

mikekgfb · 2022-09-19T20:58:17Z

Right, this is not about numeric accuracy; it is about ensuring that cuBLAS does not crash or produce wildly incorrect results. As such, we do want to exercise bfloat16 and compare, within reasonable bounds, against the expected result obtained by computing with another numeric representation.
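
A minimal sketch of the kind of check described above, assuming a float32 CPU reference and a loose relative bound on the whole-matrix error. The function name, the input scaling, and the `rel_tol` value are assumptions made for illustration; this is not the test that was added to test/test_linalg.py.

```python
import torch


def addmm_close_to_reference(size, dtype, device, rel_tol=0.05):
    """Return True if addmm in `dtype` on `device` roughly matches float32 on CPU."""
    gen = torch.Generator().manual_seed(0)
    # Scale the inputs so the result stays well inside the float16 range.
    bias = torch.randn(size, size, generator=gen) / size
    m1 = torch.randn(size, size, generator=gen) / size ** 0.5
    m2 = torch.randn(size, size, generator=gen) / size ** 0.5

    reference = torch.addmm(bias, m1, m2)  # float32 reference on CPU

    # Apply the dtype under test only to the tensors sent to the device.
    result = torch.addmm(
        bias.to(device=device, dtype=dtype),
        m1.to(device=device, dtype=dtype),
        m2.to(device=device, dtype=dtype),
    ).to(device="cpu", dtype=torch.float32)

    # Garbage output frequently shows up as inf or NaN.
    if not torch.isfinite(result).all():
        return False

    # Relative error over the whole matrix (Frobenius norm), not per element.
    rel_err = (result - reference).norm() / reference.norm()
    return rel_err.item() <= rel_tol


device = "cuda" if torch.cuda.is_available() else "cpu"
print(addmm_close_to_reference(256, torch.float32, device))
```

The bound is deliberately loose: the goal is to catch crashes and wildly wrong output, so normal low-precision rounding should pass with a wide margin.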

zrphercule

Stamp

test/test_linalg.py

mikekgfb · 2022-09-20T17:51:16Z

@pytorchbot merge

pytorchmergebot · 2022-09-20T17:53:25Z

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot · 2022-09-20T17:53:26Z

Merge failed

Reason: The following mandatory check(s) failed (Rule superuser):

Lint
pull

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Summary:
Pull Request resolved: pytorch#85084

Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations

Test Plan:
Sample unit test output --

[...]
test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
[...]

Reviewed By: mikekgfb

Differential Revision: D39433029

fbshipit-source-id: b308ecceb44eab1afb039c98f4e1b6aa8ddb8f53

mikekgfb · 2022-09-21T13:39:17Z

@pytorchbot merge

pytorchmergebot · 2022-09-21T13:42:08Z

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

clee2000 · 2022-09-21T16:46:58Z

@pytorchbot revert -m "broke tests on trunk, https://github.com/pytorch/pytorch/actions/runs/3098347639/jobs/5017166419" -m nosignal

pytorch-bot · 2022-09-21T16:47:00Z

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.
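
The failure above is what an argument parser does when a required option is missing: the second `-m` was accepted as another message, so `-c` was never supplied. The sketch below reproduces that behavior with Python's `argparse`. It is a hypothetical reconstruction from the usage text only; the long option `--message` and the helper names are assumptions, and the real bot's parser may differ.

```python
import argparse


def build_revert_parser():
    """Hypothetical parser mirroring the usage string printed by the bot."""
    parser = argparse.ArgumentParser(prog="@pytorchbot revert", add_help=False)
    parser.add_argument("-m", "--message", required=True)  # long name assumed
    parser.add_argument(
        "-c",
        "--classification",
        required=True,
        choices=["nosignal", "ignoredsignal", "landrace", "weird", "ghfirst"],
    )
    return parser


def try_parse(argv):
    """Return the parsed namespace, or None if argparse rejects the arguments."""
    try:
        return build_revert_parser().parse_args(argv)
    except SystemExit:  # argparse reports the error on stderr, then exits
        return None


# "-m" given twice and no "-c": rejected because -c/--classification is required.
print(try_parse(["-m", "broke tests on trunk", "-m", "nosignal"]))
# Corrected command: accepted.
print(try_parse(["-m", "broke tests on trunk", "-c", "nosignal"]))
```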

clee2000 · 2022-09-21T16:47:11Z

@pytorchbot revert -m "broke tests on trunk, https://github.com/pytorch/pytorch/actions/runs/3098347639/jobs/5017166419" -c nosignal

pytorchmergebot · 2022-09-21T16:48:51Z

@pytorchbot successfully started a revert job. Check the current status here.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot · 2022-09-21T16:48:58Z

@souravmandal your PR has been successfully reverted.

This reverts commit 0297c75. Reverted #85084 on behalf of https://github.com/clee2000 due to broke tests on trunk, https://github.com/pytorch/pytorch/actions/runs/3098347639/jobs/5017166419

weiwangmeta · 2022-09-21T17:04:36Z

This link instead: https://github.com/pytorch/pytorch/actions/runs/3098294186

malfet · 2022-09-21T17:14:02Z

By the way, how much total test time does this PR add? (Though addmm, even for 10k x 10k matrices, should be pretty quick.)
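
One rough way to estimate the cost is to time a few addmm calls directly. The sketch below is an assumption-laden illustration (the function name, the default size, and the repeat count are made up for this example) and it measures only the kernel, not input generation or test harness overhead.

```python
import time

import torch


def time_addmm(size, dtype=torch.float32, device="cpu", repeats=3):
    """Return the average wall-clock seconds for one size x size addmm."""
    is_cuda = torch.device(device).type == "cuda"
    bias = torch.randn(size, size).to(device=device, dtype=dtype)
    m1 = torch.randn(size, size).to(device=device, dtype=dtype)
    m2 = torch.randn(size, size).to(device=device, dtype=dtype)

    torch.addmm(bias, m1, m2)  # warm-up call, excluded from the timing
    if is_cuda:
        torch.cuda.synchronize()  # CUDA kernels launch asynchronously

    start = time.perf_counter()
    for _ in range(repeats):
        torch.addmm(bias, m1, m2)
    if is_cuda:
        torch.cuda.synchronize()  # wait for the kernels before stopping the clock
    return (time.perf_counter() - start) / repeats


device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{time_addmm(512, device=device):.6f} s per call")
```

Without the `torch.cuda.synchronize()` calls the GPU timing would mostly measure launch overhead, since the kernels run asynchronously.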

test/test_linalg.py

Summary:
Re-submit for approved PR that was then reverted: pytorch#85084

Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations

Test Plan:
Sample unit test output --

[...]
test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
[...]

Reviewed By: mikekgfb

Differential Revision: D39433029

fbshipit-source-id: e8f5d5f722047f31d2804932539408b1beb2ad55

Summary:
Re-submit for approved PR that was then reverted: #85084

Create unit test to detect cuBLAS breakage via large differences between CPU and GPU addmm invocations

Test Plan:
Sample unit test output --

[...]
test_cublas_addmm_size_10000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_10000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_1000_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_bfloat16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float16 (test_linalg.TestLinalgCPU) ... ok
test_cublas_addmm_size_100_cpu_float32 (test_linalg.TestLinalgCPU) ... ok
[...]

Reviewed By: mikekgfb

Differential Revision: D39433029

Pull Request resolved: #85432
Approved by: https://github.com/zrphercule

souravmandal requested review from IvanYashchuk, lezcano and nikitaved as code owners September 15, 2022 13:35

pytorch-bot bot added the topic: not user facing topic category label Sep 15, 2022

facebook-github-bot added cla signed fb-exported labels Sep 15, 2022

lezcano reviewed Sep 15, 2022

zrphercule self-requested a review September 19, 2022 21:34

zrphercule approved these changes Sep 19, 2022

ngimel reviewed Sep 19, 2022

souravmandal force-pushed the export-D39433029 branch from 99764e5 to 377700f Compare September 20, 2022 02:57

souravmandal force-pushed the export-D39433029 branch from 377700f to 3797628 Compare September 20, 2022 15:05

souravmandal force-pushed the export-D39433029 branch from 3797628 to 55b4804 Compare September 20, 2022 19:51

pytorchmergebot added the Merged label Sep 21, 2022

pytorchmergebot closed this in 0297c75 Sep 21, 2022

pytorchmergebot added the Reverted label Sep 21, 2022

malfet reviewed Sep 21, 2022

souravmandal mentioned this pull request Sep 21, 2022

[pytorch] cuBLAS addmm malfunction test #85432

Closed
