test_cholesky_solve_batched_many_batches_cuda_complex128 has CUDA illegal memory access #48996

@xwang233

Description

🐛 Bug

test_cholesky_solve_batched_many_batches_cuda_complex128 fails with a CUDA illegal memory access. #47047 might be related.
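
For context, a minimal sketch of the operation the failing test exercises, distilled from the traceback below (the batch count and matrix size here are assumptions; the actual values live in test_linalg.py):

import torch

# Assumed shapes: "many batches" of small Hermitian positive-definite matrices.
batch, n = 4096, 2
A = torch.randn(batch, n, n, dtype=torch.complex128, device='cuda')
# Make A Hermitian positive-definite so the factorization is well defined.
A = A @ A.conj().transpose(-2, -1) + n * torch.eye(n, dtype=torch.complex128, device='cuda')
L = torch.cholesky(A)     # batched CUDA inputs dispatch to magmaCholeskyBatched
torch.cuda.synchronize()  # surface any asynchronous CUDA error here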

To Reproduce

Steps to reproduce the behavior:

$ PYTORCH_TEST_WITH_SLOW=1 python test/test_linalg.py -v -k test_cholesky_solve_batched_many_batches_cuda_complex128
test_cholesky_solve_batched_many_batches_cuda_complex128 (__main__.TestLinalgCUDA) ... CUDA runtime error: an illegal memory access was encountered (700) in magma_zpotrf_batched at /home/xwang/Developer/magma-2.5.3/src/zpotrf_batched.cpp:234
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /home/xwang/Developer/magma-2.5.3/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /home/xwang/Developer/magma-2.5.3/interface_cuda/interface.cpp:946
CUDA runtime error: an illegal memory access was encountered (700) in magma_queue_destroy_internal at /home/xwang/Developer/magma-2.5.3/interface_cuda/interface.cpp:947
ERROR

======================================================================
ERROR: test_cholesky_solve_batched_many_batches_cuda_complex128 (__main__.TestLinalgCUDA)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/xwang/Developer/pytorch/torch/testing/_internal/common_utils.py", line 864, in wrapper
    method(*args, **kwargs)
  File "/home/xwang/Developer/pytorch/torch/testing/_internal/common_device_type.py", line 273, in instantiated_test
    result = test_fn(self, *args)
  File "/home/xwang/Developer/pytorch/torch/testing/_internal/common_utils.py", line 542, in wrapper
    fn(*args, **kwargs)
  File "/home/xwang/Developer/pytorch/torch/testing/_internal/common_device_type.py", line 545, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "/home/xwang/Developer/pytorch/torch/testing/_internal/common_device_type.py", line 545, in dep_fn
    return fn(slf, device, *args, **kwargs)
  File "test/test_linalg.py", line 1906, in test_cholesky_solve_batched_many_batches
    b, A, L = self.cholesky_solve_test_helper(A_dims, b_dims, upper, device, dtype)
  File "test/test_linalg.py", line 1845, in cholesky_solve_test_helper
    L = torch.cholesky(A, upper=upper)
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from magmaCholeskyBatched<c10::complex<double> > at /home/xwang/Developer/pytorch/aten/src/ATen/native/cuda/BatchLinearAlgebra.cu:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x68 (0x7efde8c50828 in /home/xwang/Developer/pytorch/torch/lib/libc10.so)
frame #1: void at::native::magmaCholeskyBatched<c10::complex<double> >(magma_uplo_t, int, c10::complex<double>**, int, int*, int, at::native::MAGMAQueue const&) + 0x155 (0x7efde9fffbd5 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cuda.so)
frame #2: at::native::_cholesky_helper_cuda(at::Tensor const&, bool) + 0x1709 (0x7efdea017d19 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x3336c13 (0x7efdebfdcc13 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x3336c84 (0x7efdebfdcc84 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cuda.so)
frame #5: at::_cholesky_helper(at::Tensor const&, bool) + 0x116 (0x7efdfde476a6 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #6: at::native::cholesky(at::Tensor const&, bool) + 0xa9 (0x7efdfd8b2c99 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x1c388f3 (0x7efdfe0398f3 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x1c38964 (0x7efdfe039964 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #9: at::cholesky(at::Tensor const&, bool) + 0x116 (0x7efdfde46bc6 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x31ab9c4 (0x7efdff5ac9c4 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x31abcb4 (0x7efdff5accb4 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #12: at::Tensor::cholesky(bool) const + 0x116 (0x7efdfe1a7696 in /home/xwang/Developer/pytorch/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x734e6d (0x7efe04b8de6d in /home/xwang/Developer/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>


----------------------------------------------------------------------
Ran 1 test in 2.616s

FAILED (errors=1)
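
Note that CUDA reports illegal memory accesses asynchronously, so the Python frame that surfaces the error is not necessarily where the fault occurred. Rerunning with the standard CUDA_LAUNCH_BLOCKING=1 environment variable forces synchronous kernel launches and can help localize the faulting call:

$ CUDA_LAUNCH_BLOCKING=1 PYTORCH_TEST_WITH_SLOW=1 python test/test_linalg.py -v -k test_cholesky_solve_batched_many_batches_cuda_complex128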

Expected behavior

The test should pass without errors.

Environment

Collecting environment information...
PyTorch version: 1.8.0a0+533c837
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Manjaro Linux (x86_64)
GCC version: (GCC) 10.2.0
Clang version: Could not collect
CMake version: version 3.18.4

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: GeForce RTX 2070 SUPER
GPU 1: GeForce GTX 1070 Ti

Nvidia driver version: 455.38
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.0.5
/usr/lib/libcudnn_adv_infer.so.8.0.5
/usr/lib/libcudnn_adv_train.so.8.0.5
/usr/lib/libcudnn_cnn_infer.so.8.0.5
/usr/lib/libcudnn_cnn_train.so.8.0.5
/usr/lib/libcudnn_ops_infer.so.8.0.5
/usr/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] pytorch-ignite==0.4.2
[pip3] torch==1.8.0a0
[pip3] torchvision==0.9.0a0+74de51d
[conda] Could not collect

Additional context

Seems to be MAGMA-related: the first error is reported from magma_zpotrf_batched before it propagates back through magmaCholeskyBatched in BatchLinearAlgebra.cu.
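
To confirm that a given build actually dispatches batched Cholesky to MAGMA, a quick check using only the public torch API:

import torch

print(torch.cuda.has_magma)     # True if PyTorch was built with MAGMA support
print(torch.__config__.show())  # full build configuration, including USE_MAGMA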

CC @ptrblck @mruberry

cc @ezyang @gchanan @zou3519 @bdhirsh @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @VitalyFedyunin @IvanYashchuk

Labels

high priority, module: crash, module: linear algebra, module: tests, triaged
