
Batched MAGMA calls illegally read CUDA memory #26996

@mruberry

Description

🐛 Bug

(Some?) batched MAGMA calls illegally read CUDA memory.

These illegal reads are often "silent" and harmless. If, however, they touch unallocated device memory, the program's subsequent CUDA calls will fail.

To Reproduce

See #26789. Or, once that PR lands, re-enable test_cholesky_batched_many_batches.

This can also be reproduced by calling magma_dpotrf_batched directly on a tensor allocated with cudaMalloc. Running under cuda-memcheck reports all illegal memory accesses, including "silent" ones.
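A direct repro might look like the following sketch (MAGMA 2.x API assumed; matrix size, batch count, and the omitted matrix fill are illustrative, not from the original report). It allocates exactly n*n*batch doubles with cudaMalloc, with no slack at the end of the buffer, and factors the batch:

```cuda
#include <cuda_runtime.h>
#include <magma_v2.h>
#include <vector>

int main() {
    magma_init();
    magma_queue_t queue;
    magma_queue_create(0, &queue);

    const magma_int_t n = 5, batch = 256;

    // One tightly-sized allocation for the whole batch -- no padding.
    double* dA;
    cudaMalloc(&dA, sizeof(double) * n * n * batch);
    // ... fill dA with positive-definite matrices (omitted) ...

    // Device-resident array of per-matrix pointers, as the batched API expects.
    std::vector<double*> hA(batch);
    for (magma_int_t i = 0; i < batch; ++i) hA[i] = dA + i * n * n;
    double** dA_array;
    cudaMalloc(&dA_array, sizeof(double*) * batch);
    cudaMemcpy(dA_array, hA.data(), sizeof(double*) * batch,
               cudaMemcpyHostToDevice);

    magma_int_t* dinfo;
    cudaMalloc(&dinfo, sizeof(magma_int_t) * batch);

    // cuda-memcheck flags reads past the end of dA during this call.
    magma_dpotrf_batched(MagmaLower, n, dA_array, n, dinfo, batch, queue);
    magma_queue_sync(queue);

    cudaFree(dinfo); cudaFree(dA_array); cudaFree(dA);
    magma_queue_destroy(queue);
    magma_finalize();
    return 0;
}
```

Build against MAGMA and run the binary under `cuda-memcheck ./repro` to surface the out-of-bounds reads, including ones that don't crash the process.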

Additional context

This issue was discovered in #26789 and diagnosed by @ngimel and me. We are following up with @vishwakftw.

A workaround may be to pad tensor inputs to batched MAGMA calls. This requires copying the original tensor into a buffer larger than the tensor itself needs; how much extra space is required needs further investigation.
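The padding workaround could be sketched host-side as below. The helper names and the 32-element pad are assumptions for illustration; since the actual over-read size is not yet known, the pad amount is a placeholder, not a verified bound:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: round a matrix's element count up to a multiple
// of `pad`, so reads slightly past the end of each matrix stay inside
// memory we own. 32 is an assumed placeholder, not a verified bound.
std::size_t padded_elems(std::size_t n, std::size_t pad = 32) {
    return ((n + pad - 1) / pad) * pad;
}

// Copy each n*n matrix of a contiguous batch into a buffer whose
// per-matrix stride is padded; the extra tail elements are zero-filled.
std::vector<double> pad_batch(const std::vector<double>& batch,
                              std::size_t n, std::size_t batch_count) {
    const std::size_t src_stride = n * n;
    const std::size_t dst_stride = padded_elems(src_stride);
    std::vector<double> out(dst_stride * batch_count, 0.0);
    for (std::size_t b = 0; b < batch_count; ++b) {
        std::memcpy(out.data() + b * dst_stride,
                    batch.data() + b * src_stride,
                    src_stride * sizeof(double));
    }
    return out;
}
```

The padded buffer would then be copied to the device and handed to the batched MAGMA call in place of the tightly-sized original.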

cc @ezyang @gchanan @zou3519 @vincentqb @vishwakftw @jianyuh @ssnl

Metadata

Labels

high priority · module: dependency bug · module: linear algebra · triaged
