Cuts test_torch.py runtime in half by marking four tests as slow #26789

Closed

mruberry wants to merge 1 commit into master from slowtests

Conversation

@mruberry (Collaborator) commented Sep 25, 2019

On my devfair, running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current @slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s.

test_sum_dim, for example, takes 30s but was not marked as slow.
test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s!
test_einsum takes 7s.
test_triu_tril takes 5 seconds on CPU and 9s on CUDA for a total of 14s.

Several of the tests currently marked @slowTest are faster than these. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.
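
For reference, marking a test as slow in PyTorch's test suite is a one-line decorator. Below is a minimal, illustrative sketch only: the test class and body are hypothetical, and the import path and environment variable shown are the ones used by recent PyTorch versions, so they may differ from the 2019 file layout.

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests, slowTest


class TestTimingsExample(TestCase):  # hypothetical class, for illustration only
    @slowTest
    def test_sum_dim(self):
        # Runs only when slow tests are enabled,
        # e.g. by setting PYTORCH_TEST_WITH_SLOW=1.
        x = torch.randn(128, 128)
        self.assertEqual(tuple(x.sum(dim=0).shape), (128,))


if __name__ == '__main__':
    run_tests()
```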

@mruberry (Collaborator, Author)

@pytorchbot rebase this please

@facebook-github-bot (Contributor) left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mruberry (Collaborator, Author)

Error on slow_test build is real and reproducible (at least for me locally):

```
test_cholesky_batched_many_batches_cuda (__main__.TestTorchDeviceTypeCUDA) ... CUDA runtime error: an illegal memory access was encountered (77) in magma_dpotrf_batched at /opt/conda/conda-bld/magma-cuda100_1549065924616/work/src/dpotrf_batched.cpp:234
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1549065924616/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1549065924616/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal memory access was encountered (77) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1549065924616/work/interface_cuda/interface.cpp:946
ERROR
```

@mruberry (Collaborator, Author)

@ngimel and I think MAGMA is causing this problem. Our analysis suggests no workaround in PyTorch except to pad calls to batched MAGMA functions with extra batches when the issue is encountered.

Blocking this PR pending a follow-up on what to do about MAGMA.

This is the third major issue hit simply by moving tests around, the other two being ROCm stream instability and cdist launching a kernel on the wrong stream. The latter was fixed by @ngimel, who was coincidentally working on cdist at the time.
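
For illustration only, here is a rough Python-level sketch of the padding idea mentioned above: factor one extra, well-conditioned batch element alongside the real input and discard its result. The actual workaround would live in PyTorch's C++ MAGMA bindings, and the helper name here is hypothetical.

```python
import torch


def cholesky_batched_padded(a):
    """Hypothetical illustration: pad a (batch, n, n) input with one extra
    identity matrix before the batched factorization, then drop the padded
    result. The real workaround sits inside the MAGMA bindings, not here."""
    n = a.shape[-1]
    pad = torch.eye(n, dtype=a.dtype, device=a.device).unsqueeze(0)
    padded = torch.cat([a, pad], dim=0)    # one extra batch element
    return torch.cholesky(padded)[:-1]     # discard the padding's factor


# Example usage with a trivially positive-definite batch:
b = torch.eye(4).expand(8, 4, 4).contiguous()
print(cholesky_batched_padded(b).shape)    # torch.Size([8, 4, 4])
```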

@soumith (Collaborator) left a comment

awesomesauce!

@soumith (Collaborator) commented Sep 27, 2019

@vishwakftw about magma!!!

@vishwakftw (Contributor)

Thank you @soumith for flagging this. @ngimel, @mruberry, and I have discussed it and will be discussing a hacky way to prevent this from happening, since this is not a problem on PyTorch's side.

Once I have collated a report, I will also post it on the MAGMA forums.

@mruberry (Collaborator, Author)

Updated with a skip for test_cholesky_batched_many_batches on CUDA, citing #26996.
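
Roughly, a CUDA-only skip in PyTorch's device-generic test framework looks like the sketch below. The class name and test body are simplified placeholders, and the decorator and helper names are taken from recent PyTorch, so the exact form used in this PR may differ.

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests, skipCUDAIf)


class TestCholeskyExample(TestCase):  # placeholder class, for illustration
    # Skip only the CUDA instantiation; the CPU variant keeps running.
    @skipCUDAIf(True, "MAGMA batched cholesky bug, see gh-26996")
    def test_cholesky_batched_many_batches(self, device):
        a = torch.eye(4, device=device).expand(4096, 4, 4).contiguous()
        torch.cholesky(a)


instantiate_device_type_tests(TestCholeskyExample, globals())

if __name__ == '__main__':
    run_tests()
```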

@ezyang (Contributor) left a comment

reapproving

@facebook-github-bot (Contributor) left a comment

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@mruberry merged this pull request in ec7913a.

pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request Nov 27, 2019
…orch#26789)

Summary:
- Adds slowTest to four tests

On my devfair running test_torch.py takes ~200 seconds with slow tests enabled. Running with the current slowTest annotations takes ~145s. Running with these four additional annotations takes ~64s.

test_sum_dim, for example, takes 30s but was not marked as slow.
test_det_logdet_slogdet takes 17s on CPU and 22s on CUDA for a total of 39s!
test_einsum takes 7s.
test_triu_tril takes 5 seconds on CPU and 9s on CUDA for a total of 14s.

Several of the current slowTests are faster than these. test_cholesky_solve_batched_many_batches, for example, takes ~3 seconds on CPU and ~4.5s on CUDA, for a total of 7.5s across both devices.
Pull Request resolved: pytorch#26789

Differential Revision: D17574282

Pulled By: mruberry

fbshipit-source-id: 3e5e505244c09b0ae23bd8c0145828119326719b
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020

@richardk53 commented Jun 18, 2020

I think ignoring the underlying issue and skipping the corresponding tests is not a good approach.

Could this MAGMA issue with batched cholesky etc. that caused these tests to fail please be investigated? It has been a major problem for many PyTorch users for almost a year now.

Thanks :)

@ngimel (Collaborator) commented Jun 18, 2020

The tests are not skipped; they are run on every commit to master. They are only skipped in the test jobs run on each PR, to give developers faster signal while iterating.

@mruberry (Collaborator, Author)

The issues with MAGMA are tracked separately, see: #26996.

heitorschueroff added a commit that referenced this pull request Jan 22, 2021
…holesky"


MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

Fixes #41394, #26996, #48996

See also #42666, #26789

TODO
---
- [ ] Benchmark to check for perf regressions

[ghstack-poisoned]
heitorschueroff added a commit that referenced this pull request Jan 25, 2021
facebook-github-bot pushed a commit that referenced this pull request Jan 25, 2021
…0957)

Summary:
Pull Request resolved: #50957

MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**
Ran the script below for both before and after my PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes #41394, #26996, #48996

See also #42666, #26789

TODO
 ---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26050978

Pulled By: heitorschueroff

fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026