LowRankMultivariateNormal creates Illegal Memory Access in magma_spotrf_batched #41394

Description

@ptrblck

🐛 Bug

Reported in the forum by eliaz (thanks again for reporting this issue :) )

I’m using the LowRankMultivariateNormal distribution in order to have a distribution of logits for every pixel of an image.
I run into an issue when using this distribution for square images whose side length is even and large (>= 512x512).
The following code does not work in a Colab notebook when the size is set to 512x512, but it does for 513x513.
I also tried it on different GPUs, with the same results.
It works fine on the CPU.

To Reproduce

Code to reproduce the issue was added by eliaz as well:

import torch
from torch.distributions import LowRankMultivariateNormal

DEVICE = "cuda"

torch.manual_seed(23)
for i in range(10):
    print(i)
    distrib = LowRankMultivariateNormal(
        torch.randn(1, 512, 512, 2).to(DEVICE),        # loc
        torch.randn(1, 512, 512, 2, 10).to(DEVICE),    # cov_factor (rank 10)
        torch.randn(1, 512, 512, 2).to(DEVICE).exp()   # cov_diag (must be positive)
    )

Since the illegal memory access is not always triggered in the first run, multiple iterations were added.

Reproduced using PyTorch 1.7.0.dev20200713 with CUDA 10.2.
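
As a temporary workaround (not part of the original report; just a sketch based on the observation above that the CPU path works), the distribution can be constructed on the CPU and only the sampled values moved to the GPU:

import torch
from torch.distributions import LowRankMultivariateNormal

torch.manual_seed(23)
# Build the distribution on the CPU, where the batched Cholesky succeeds.
distrib = LowRankMultivariateNormal(
    torch.randn(1, 512, 512, 2),        # loc
    torch.randn(1, 512, 512, 2, 10),    # cov_factor
    torch.randn(1, 512, 512, 2).exp()   # cov_diag
)
# Move sampled values to the GPU afterwards.
sample = distrib.rsample().to("cuda")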

Stack trace:

0
1
2
CUDA runtime error: an illegal memory access was encountered (700) in magma_spotrf_batched at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/src/spotrf_batched.cpp:234
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda102_1583546904148/work/interface_cuda/interface.cpp:946
Traceback (most recent call last):
  File "lowrank_cuda.py", line 12, in <module>
    torch.randn(1, 512, 512, 2).to(DEVICE).exp()
  File "/home/pbialecki/anaconda3/envs/pytorch_nightly/lib/python3.7/site-packages/torch/distributions/lowrank_multivariate_normal.py", line 108, in __init__
    self._capacitance_tril = _batch_capacitance_tril(cov_factor, cov_diag)
  File "/home/pbialecki/anaconda3/envs/pytorch_nightly/lib/python3.7/site-packages/torch/distributions/lowrank_multivariate_normal.py", line 19, in _batch_capacitance_tril
    return torch.cholesky(K)
RuntimeError: CUDA error: an illegal memory access was encountered

cuda-gdb output:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x555558c1a220

Thread 1 "python" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 3, grid 50, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000555558c1a740 in spotf2_smlpout_anywidth_kernel_batched(int, int, float**, int, int, int, int, int, int*, int)<<<(1,1,1),(2,8,1)>>> ()

which points to MAGMA's batched Cholesky factorization kernel.
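
To check whether the crash reproduces without the distributions machinery, here is a minimal isolation sketch. It assumes the capacitance matrix built in _batch_capacitance_tril is a (1, 512, 512)-shaped batch of 10x10 SPD matrices (rank = cov_factor.size(-1) = 10), and hands torch.cholesky a batch of that shape directly:

import torch

# Hypothetical isolation sketch: feed the batched Cholesky a batch of SPD
# matrices with (assumed) the same shape as the internal capacitance matrix.
A = torch.randn(1, 512, 512, 10, 10, device="cuda")
K = A @ A.transpose(-1, -2) + 1e-3 * torch.eye(10, device="cuda")
L = torch.cholesky(K)       # batched CUDA input; the error above implicates magma_spotrf_batched
torch.cuda.synchronize()    # surface any asynchronous CUDA error here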

cc @ngimel @fritzo @neerajprad @alicanb @vishwakftw @nikitaved @jianyuh @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @vincentqb @ssnl

Metadata

Labels

- module: cuda (Related to torch.cuda, and CUDA support in general)
- module: dependency bug (Problem is not caused by us, but caused by an upstream library we use)
- module: distributions (Related to torch.distributions)
- module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul))
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
