Conversation
@pytorchbot rebase this please
facebook-github-bot left a comment:
@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The error on the slow_test build is real and reproducible (at least for me locally):

test_cholesky_batched_many_batches_cuda (__main__.TestTorchDeviceTypeCUDA) ... CUDA runtime error: an illegal memory access was encountered (77) in magma_dpotrf_batched at /opt/conda/conda-bld/magma-cuda100_1549065924616/work/src/dpotrf_batched.cpp:234
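For context, a minimal repro-style sketch (hypothetical, not the actual test body) of the kind of call test_cholesky_batched_many_batches exercises, assuming a CUDA build that routes batched Cholesky through MAGMA:

```python
import torch

# Hypothetical sketch: factorize a large batch of small SPD matrices on CUDA.
# This is the workload shape that hits MAGMA's batched potrf path and, on
# affected MAGMA builds, can trigger the illegal-memory-access error above.
a = torch.eye(5, device='cuda').expand(4096, -1, -1).contiguous()
l = torch.cholesky(a)
print(l.shape)  # torch.Size([4096, 5, 5])
```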
@ngimel and I think MAGMA is causing this problem. Our analysis suggests there is no workaround in PyTorch short of padding calls to batched MAGMA functions with extra batches when the issue is encountered. Blocking this PR pending a follow-up on what to do about MAGMA. This is the third major issue hit simply by moving tests around; the other two were ROCm stream instability and cdist launching a kernel on the wrong stream. The latter was fixed by @ngimel, who was coincidentally working on cdist at the time.
@vishwakftw about magma!!!
Updated with a skip for test_cholesky_batched_many_batches on CUDA, citing #26996. |
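For illustration, a minimal sketch of such a skip using plain unittest; the class name and matrix sizes below are hypothetical, and the actual PR uses PyTorch's device-generic test decorators rather than this exact code:

```python
import unittest
import torch

class TestCholeskyBatched(unittest.TestCase):  # hypothetical test class
    # Unconditionally skip the CUDA variant and cite the tracking issue.
    @unittest.skip("CUDA batched cholesky disabled: MAGMA illegal memory access, see #26996")
    def test_cholesky_batched_many_batches(self):
        a = torch.eye(5, device='cuda').expand(256, -1, -1).contiguous()
        torch.cholesky(a)

if __name__ == "__main__":
    unittest.main()
```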
…orch#26789)

Summary:
- Adds slowTest to four tests

On my devfair, running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s. test_sum_dim, for example, takes 30s but was not marked as slow. test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s! test_einsum takes 7s. test_triu_tril takes 5s on CPU and 9s on CUDA for a total of 14s. Several of the current slowTests are faster than this. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.

Pull Request resolved: pytorch#26789
Differential Revision: D17574282
Pulled By: mruberry
fbshipit-source-id: 3e5e505244c09b0ae23bd8c0145828119326719b
I think ignoring the underlying issue and skipping the corresponding tests is not a good approach. Could this MAGMA issue with batched Cholesky etc. that caused these tests to fail please be investigated? This is a major problem for many PyTorch users and has existed for almost a year now. Thanks :)
The tests are not being dropped; they still run on every commit to master. They are only skipped in the test jobs run on PRs, to give developers faster signal while iterating on a PR.
The issues with MAGMA are tracked separately; see #26996.
…holesky" MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element. Fixes #41394, #26996, #48996 See also #42666, #26789 TODO --- - [ ] Benchmark to check for perf regressions [ghstack-poisoned]
…holesky" MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element. Fixes #41394, #26996, #48996 See also #42666, #26789 TODO --- - [ ] Benchmark to check for perf regressions [ghstack-poisoned]
…0957)

Summary:
Pull Request resolved: #50957

MAGMA has an off-by-one error in its batched Cholesky implementation which causes an illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**

Ran the script below both before and after my PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes #41394, #26996, #48996
See also #42666, #26789

TODO
---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS
Reviewed By: bdhirsh
Differential Revision: D26050978
Pulled By: heitorschueroff
fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86
On my devfair, running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s.
test_sum_dim, for example, takes 30s but was not marked as slow.
test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s!
test_einsum takes 7s.
test_triu_tril takes 5s on CPU and 9s on CUDA for a total of 14s.
Several of the tests currently marked slowTest are faster than this. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.
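For reference, a minimal sketch of what one of these annotations looks like, assuming the slowTest decorator and the TestCase/run_tests helpers in torch.testing._internal.common_utils (the exact import path may differ across PyTorch versions, and the test body below is hypothetical); tests marked this way only execute when slow tests are enabled, e.g. via PYTORCH_TEST_WITH_SLOW=1:

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests, slowTest

class TestTorchTimings(TestCase):  # hypothetical class name
    @slowTest
    def test_sum_dim(self):
        # Stand-in for the long-running reduction checks; skipped in the default
        # PR runs and executed only when slow tests are enabled.
        x = torch.randn(128, 128)
        self.assertEqual(x.sum(dim=0).shape, (128,))

if __name__ == "__main__":
    run_tests()
```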