LowRankMultivariateNormal creates Illegal Memory Access in magma_spotrf_batched #41394

Description

@ptrblck

🐛 Bug

Reported in the forum by eliaz (thanks again for reporting this issue :) )

I’m using the LowRankMultivariateNormal distribution in order to have a distribution of logits for every pixel of an image.
I run into an issue when using this distribution for square images whose side length is even and large (>= 512x512).
The following code does not work in a Colab notebook when the size is set to 512x512, but it does for 513x513.
I also tried it on different GPUs, with the same results.
It works fine on the CPU.

To Reproduce

Code to reproduce the issue was added by eliaz as well:

import torch
from torch.distributions import LowRankMultivariateNormal

DEVICE = "cuda"

torch.manual_seed(23)
for i in range(10):
    print(i)
    distrib = LowRankMultivariateNormal(
        torch.randn(1, 512, 512, 2).to(DEVICE),        # loc
        torch.randn(1, 512, 512, 2, 10).to(DEVICE),    # cov_factor (rank 10)
        torch.randn(1, 512, 512, 2).to(DEVICE).exp()   # cov_diag (must be positive)
    )

Since the illegal memory access is not always triggered in the first run, multiple iterations were added.

Reproduced using PyTorch 1.7.0.dev20200713 with CUDA 10.2.
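
As a temporary workaround (not part of the original report; just a sketch based on the observation above that the CPU path works), the distribution can be constructed on the CPU and only the sampled values moved to the GPU:

import torch
from torch.distributions import LowRankMultivariateNormal

torch.manual_seed(23)
# Build the distribution on the CPU, where the batched Cholesky succeeds.
distrib = LowRankMultivariateNormal(
    torch.randn(1, 512, 512, 2),        # loc
    torch.randn(1, 512, 512, 2, 10),    # cov_factor
    torch.randn(1, 512, 512, 2).exp()   # cov_diag
)
# Move sampled values to the GPU afterwards.
sample = distrib.rsample().to("cuda")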

Stack trace:

0
1
2
CUDA runtime error: an illegal memory access was encountered (700) in magma_spotrf_batched at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_batched.cpp:234
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:946
Traceback (most recent call last):
  File "lowrank_cuda.py", line 12, in <module>
    torch.randn(1, 512, 512, 2).to(DEVICE).exp()
  File "/home/pbialecki/anaconda3/envs/pytorch_nightly/lib/python3.7/site-packages/torch/distributions/lowrank_multivariate_normal.py", line 108, in __init__
    self._capacitance_tril = _batch_capacitance_tril(cov_factor, cov_diag)
  File "/home/pbialecki/anaconda3/envs/pytorch_nightly/lib/python3.7/site-packages/torch/distributions/lowrank_multivariate_normal.py", line 19, in _batch_capacitance_tril
    return torch.cholesky(K)
RuntimeError: CUDA error: an illegal memory access was encountered

cuda-gdb output:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555558c1a220

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 3, grid 50, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000555558c1a740 in spotf2_smlpout_anywidth_kernel_batched(int, int, float**, int, int, int, int, int, int*, int)<<<(1,1,1),(2,8,1)>>> ()

which points to MAGMA's batched Cholesky factorization kernel.
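
To check whether the crash reproduces without the distributions machinery, here is a minimal isolation sketch. It assumes the capacitance matrix built in _batch_capacitance_tril is a (1, 512, 512)-shaped batch of 10x10 SPD matrices (rank = cov_factor.size(-1) = 10), and hands torch.cholesky a batch of that shape directly:

import torch

# Hypothetical isolation sketch: feed the batched Cholesky a batch of SPD
# matrices with (assumed) the same shape as the internal capacitance matrix.
A = torch.randn(1, 512, 512, 10, 10, device="cuda")
K = A @ A.transpose(-1, -2) + 1e-3 * torch.eye(10, device="cuda")
L = torch.cholesky(K)       # batched CUDA input; the error above implicates magma_spotrf_batched
torch.cuda.synchronize()    # surface any asynchronous CUDA error here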

cc @ngimel @fritzo @neerajprad @alicanb @vishwakftw @nikitaved @jianyuh @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @vincentqb @ssnl

Metadata

Labels

- module: cuda (Related to torch.cuda, and CUDA support in general)
- module: dependency bug (Problem is not caused by us, but caused by an upstream library we use)
- module: distributions (Related to torch.distributions)
- module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul))
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
