Conversation
@pytorchbot rebase this please
facebook-github-bot left a comment:
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The error on the slow_test build is real and reproducible (at least for me locally):

test_cholesky_batched_many_batches_cuda (__main__.TestTorchDeviceTypeCUDA) ... CUDA runtime error: an illegal memory access was encountered (77) in magma_dpotrf_batched at /opt/conda/conda-bld/magma-cuda100_1549065924616/work/src/dpotrf_batched.cpp:234
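For context, a minimal repro-style sketch (hypothetical, not the actual test body) of the kind of call test_cholesky_batched_many_batches exercises, assuming a CUDA build that routes batched Cholesky through MAGMA:

```python
import torch

# Hypothetical sketch: factorize a large batch of small SPD matrices on CUDA.
# This is the workload shape that hits MAGMA's batched potrf path and, on
# affected MAGMA builds, can trigger the illegal-memory-access error above.
a = torch.eye(5, device='cuda').expand(4096, -1, -1).contiguous()
l = torch.cholesky(a)
print(l.shape)  # torch.Size([4096, 5, 5])
```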
@ngimel and I think MAGMA is causing this problem. Our analysis suggests there is no workaround in PyTorch short of padding calls to batched MAGMA functions with extra batches when the issue is encountered. Blocking this PR pending a follow-up on what to do about MAGMA. This is the third major issue hit simply by moving tests around; the other two were ROCm stream instability and cdist launching a kernel on the wrong stream. The latter was fixed by @ngimel, who was coincidentally working on cdist at the time.
@vishwakftw about magma!!!
Updated with a skip for test_cholesky_batched_many_batches on CUDA, citing #26996. |
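For illustration, a minimal sketch of such a skip using plain unittest; the class name and matrix sizes below are hypothetical, and the actual PR uses PyTorch's device-generic test decorators rather than this exact code:

```python
import unittest
import torch

class TestCholeskyBatched(unittest.TestCase):  # hypothetical test class
    # Unconditionally skip the CUDA variant and cite the tracking issue.
    @unittest.skip("CUDA batched cholesky disabled: MAGMA illegal memory access, see #26996")
    def test_cholesky_batched_many_batches(self):
        a = torch.eye(5, device='cuda').expand(256, -1, -1).contiguous()
        torch.cholesky(a)

if __name__ == "__main__":
    unittest.main()
```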
…orch#26789)

Summary:
- Adds slowTest to four tests

On my devfair, running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s. test_sum_dim, for example, takes 30s but was not marked as slow. test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s! test_einsum takes 7s. test_triu_tril takes 5s on CPU and 9s on CUDA for a total of 14s. Several of the current slowTests are faster than this. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.

Pull Request resolved: pytorch#26789
Differential Revision: D17574282
Pulled By: mruberry
fbshipit-source-id: 3e5e505244c09b0ae23bd8c0145828119326719b
I think ignoring the underlying issue and skipping the corresponding tests is not a good approach. Could this MAGMA issue with batched Cholesky etc. that caused these tests to fail please be investigated? This is a major problem for many PyTorch users and has existed for almost a year now. Thanks :)
The tests are not being dropped; they still run on every commit to master. They are only skipped in the test jobs run on PRs, to give developers faster signal while iterating on a PR.
The issues with MAGMA are tracked separately; see #26996.
…holesky" MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element. Fixes #41394, #26996, #48996 See also #42666, #26789 TODO --- - [ ] Benchmark to check for perf regressions [ghstack-poisoned]
…holesky" MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element. Fixes #41394, #26996, #48996 See also #42666, #26789 TODO --- - [ ] Benchmark to check for perf regressions [ghstack-poisoned]
…0957)

Summary:
Pull Request resolved: #50957

MAGMA has an off-by-one error in its batched Cholesky implementation which causes an illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**

Ran the script below both before and after my PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes #41394, #26996, #48996
See also #42666, #26789

TODO
---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D26050978
Pulled By: heitorschueroff
fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86
On my devfair, running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s.
test_sum_dim, for example, takes 30s but was not marked as slow.
test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s!
test_einsum takes 7s.
test_triu_tril takes 5s on CPU and 9s on CUDA for a total of 14s.
Several of the tests currently marked slowTest are faster than this. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.
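For reference, a minimal sketch of what one of these annotations looks like, assuming the slowTest decorator and the TestCase/run_tests helpers in torch.testing._internal.common_utils (the exact import path may differ across PyTorch versions, and the test body below is hypothetical); tests marked this way only execute when slow tests are enabled, e.g. via PYTORCH_TEST_WITH_SLOW=1:

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests, slowTest

class TestTorchTimings(TestCase):  # hypothetical class name
    @slowTest
    def test_sum_dim(self):
        # Stand-in for the long-running reduction checks; skipped in the default
        # PR runs and executed only when slow tests are enabled.
        x = torch.randn(128, 128)
        self.assertEqual(x.sum(dim=0).shape, (128,))

if __name__ == "__main__":
    run_tests()
```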