Workaround for MAGMA accessing illegal memory in batched cholesky #50957

heitorschueroff wants to merge 3 commits into gh/heitorschueroff/34/base
Conversation
…holesky" MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element. Fixes #41394, #26996, #48996 See also #42666, #26789 TODO --- - [ ] Benchmark to check for perf regressions [ghstack-poisoned]
Awesome, thanks for fixing!
Codecov Report

```diff
@@                  Coverage Diff                    @@
##           gh/heitorschueroff/34/base    #50957    +/-   ##
==============================================================
- Coverage                       80.91%    80.91%    -0.01%
==============================================================
  Files                            1926      1926
  Lines                          210014    210014
==============================================================
- Hits                           169942    169941        -1
- Misses                          40072     40073        +1
```
The review comment below is anchored on this change, which replaces the out-of-place transpose with an in-place one:

```diff
-    return self_working_copy.transpose(-1, -2);
-  } else {
-    return self_working_copy;
+  result.transpose_(-1, -2);
```
What's the impact of changing this to be an in-place transpose?
One less Tensor view created!? The main reason I changed it was stylistic. Since `result` holds a reference to a new Tensor created with either `at::empty` or `at::clone`, the in-place transpose does not affect the input tensor.
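For readers following along, a small illustration of the two variants being discussed (plain PyTorch semantics, not code from this PR): `transpose` allocates a new view object that shares storage, while `transpose_` swaps the size/stride metadata of the existing tensor in place, so no extra view object is created.

```python
import torch

x = torch.arange(6.).reshape(2, 3)

y = x.transpose(0, 1)   # new view object; shares storage with x
assert y.data_ptr() == x.data_ptr() and y is not x

x.transpose_(0, 1)      # mutates x's own size/stride metadata; returns x itself
assert x.shape == (3, 2)
```

Because `result` owns freshly allocated memory, mutating it in place cannot be observed through the caller's input tensor.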
This looks great, nice work @heitorschueroff! cc @ptrblck and @xwang233. I made one comment suggestion and have a question about a change of transpose variant.
…holesky" MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element. Fixes #41394, #26996, #48996 See also #42666, #26789 TODO --- - [ ] Benchmark to check for perf regressions [ghstack-poisoned]
|
Thanks for the PR and the discussion on Slack! Were you somehow able to verify that the data/results won't be corrupted by this change? I've spent some time digging into the code and think the first issue is this illegal access: [linked code snippet elided]

However, once this is fixed, another IMA is triggered: [linked code snippet elided]

I'm happy to dig further into it, if it makes sense.
@heitorschueroff merged this pull request in a7cf04e.
We have existing tests for batched cholesky that this PR re-enabled. The function was always reading these values, and I don't think we've had reports of incorrect or corrupted values. One possible fix would be to change this PR's `empty` call to a `zeros` call so the memory being read is zeroed, although the effect of that change is unclear.

I don't want to discourage you from further debugging into MAGMA, but our thinking is that this fix is a definite improvement, solving all known segfaults with no harm done to correctness, and we should wait for the community to report new issues if there are any.

We also want to begin pursuing a cuSOLVER implementation of batched cholesky. For example, if we identify the issue with MAGMA and submit a fix for it, that fix will only appear in newer versions of MAGMA, and this workaround will still be required. If batched cholesky uses cuSOLVER when the CUDA toolkit version is sufficiently high, then the overlap between users with a fixed MAGMA version who are also not using cuSOLVER might be very small.
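For context, a minimal sketch of the trade-off being discussed (hypothetical shape; the actual padded allocation happens inside ATen's MAGMA wrapper, not in Python):

```python
import torch

n = 10
# Current approach: the buffer handed to MAGMA gets 1 extra element that is
# left uninitialized, so an out-of-bounds read sees arbitrary bytes.
padded = torch.empty(n * n + 1, device='cuda')

# Suggested alternative: zero-fill the buffer so the stray read sees 0,
# at the cost of an extra fill kernel per call.
padded_zeroed = torch.zeros(n * n + 1, device='cuda')
```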
@ptrblck that's a very valid concern. Do you think one way to alleviate it would be to fill the padding element with NaN and run a battery of random tests? If that value is indeed read and processed, it would pollute the results. I don't want to leave the …
@ngimel Yes, I think it's a very good idea.
You can use the Fuzzer to help with input generation (https://github.com/pytorch/pytorch/blob/47f0bda3ef8d196e0fa81a1749814dd75ffb1692/torch/utils/benchmark/examples/sparse/fuzzer.py); e.g. you can teach it to generate multiples of 512 half the time, and random sizes distributed according to some distribution for the rest.
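A hedged sketch of that suggestion, assuming the `Fuzzer`/`FuzzedParameter`/`FuzzedTensor` API from `torch.utils.benchmark`: the dict form of `distribution` assigns explicit probabilities to specific values, so half the mass can go to multiples of 512 and the rest to an arbitrary set of other sizes (the probabilities must sum to 1).

```python
import torch
from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor

# Half the probability mass on multiples of 512, half on arbitrary other sizes.
dist = {512 * k: 0.5 / 8 for k in range(1, 9)}
others = [3, 7, 33, 100, 257, 1000, 3000, 5000]
dist.update({v: 0.5 / len(others) for v in others})

fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter("batch", distribution=dist, strict=True),
        FuzzedParameter("n", minval=4, maxval=50, distribution="uniform"),
    ],
    tensors=[
        # Batched square matrices; cap the element count to keep sizes sane.
        FuzzedTensor("a", size=("batch", "n", "n"), max_elements=2 ** 24),
    ],
    seed=0,
)

for tensors, _, params in fuzzer.take(5):
    print(params["batch"], params["n"], tensors["a"].shape)
```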
Good news! I've checked 600k shapes and no invalid values were found in any result tensor. The testing script is defined as:

```python
import torch
import sys
import pandas as pd
import numpy as np


def get_mat_dist_pow():
    p10 = 0.7  # a matrix_size of 10 is highly likely to fail, so increase probability
    p_other = (1 - p10) / 46
    dist_matrix_size = {v.long().item(): p_other for v in torch.linspace(4, 50, 47)}
    dist_matrix_size[10] = p10
    return dist_matrix_size


def get_batch_dist_pow():
    return {2**e.long().item(): 1/11. for e in torch.linspace(0, 10, 11)}


def get_matrix_size():
    # https://github.com/pytorch/pytorch/blob/47f0bda3ef8d196e0fa81a1749814dd75ffb1692/torch/utils/benchmark/utils/fuzzer.py#L120
    index = state.choice(
        np.arange(len(mat_dist_pow)),
        p=tuple(mat_dist_pow.values()))
    matrix_size = list(mat_dist_pow.keys())[index]
    return matrix_size


def get_batch_size(p_switch=0.5):
    # switch between uniform and pow
    if torch.empty(1).uniform_().item() > p_switch:
        batch_size = torch.empty(1).uniform_(1, 1024*1024).long().item()
    else:
        # https://github.com/pytorch/pytorch/blob/47f0bda3ef8d196e0fa81a1749814dd75ffb1692/torch/utils/benchmark/utils/fuzzer.py#L120
        index = state.choice(
            np.arange(len(batch_dist_pow)),
            p=tuple(batch_dist_pow.values()))
        batch_size = list(batch_dist_pow.keys())[index]
    return batch_size


dtype = torch.float32
nb_iters = 600000
columns = ["matrix_size", "batch_size", "passed"]
result = pd.DataFrame(columns=columns)
device = 'cuda:0'
mat_dist_pow = get_mat_dist_pow()
batch_dist_pow = get_batch_dist_pow()
state = np.random.RandomState(2809)

for _ in range(nb_iters):
    try:
        # create fake data on device to move workload
        size = 2**torch.randint(0, 30, (1,)).item()
        fake = torch.randn(size, device=device)
        print('Created fake tensor of {}MB'.format(fake.nelement()*4/1024**2))
        matrix_size = get_matrix_size()
        batch_size = get_batch_size()
        print('Using matrix_size {}, batch_size {}'.format(matrix_size, batch_size))
        input = torch.eye(matrix_size, device=device, dtype=dtype).expand(batch_size, -1, -1)
        print('input.shape {}'.format(input.shape))
        # execute test multiple times
        for _ in range(3):
            ret = torch.cholesky(input)
            torch.cuda.synchronize()
            passed = torch.isfinite(ret).all()
            if not passed:
                print('ERROR! Invalid values found!')
                print(ret)
                torch.save(ret, 'invalid_ret01.pt')
                sys.exit(-1)
        # save results
        result = result.append(pd.DataFrame([[matrix_size, batch_size, passed.item()]], columns=columns))
        del fake
        del input
        del ret
    except Exception as e:
        print(e)
        if 'out of memory' in str(e):
            result = result.append(pd.DataFrame([[matrix_size, batch_size, 'oom']], columns=columns))
            continue
        else:
            result = result.append(pd.DataFrame([[matrix_size, batch_size, str(e)]], columns=columns))
            continue

result.to_csv('magma_ima_result01.csv', index=False)
```

For the `batch_size`, `get_batch_size` switches between a uniform distribution and the power-of-two distribution. In each iteration a fake tensor of random size is also created on the device to move the workload around in memory.

I tried to use the Fuzzer (which is a nice utility btw.), but couldn't get it working in picking randomly between the two distributions for the `batch_size`.

Attached is the result image. Also, sorry for the delay, but my estimation of the duration of this test was way off, and it took ~30h to finish. Please let me know if you see any issues in the current test and/or if I should run additional tests.
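For anyone who wants to slice the resulting CSV afterwards, a small sketch (hedged: the `passed` column mixes booleans, `'oom'`, and error strings, so it is compared as strings after the round-trip through `to_csv`):

```python
import pandas as pd

df = pd.read_csv('magma_ima_result01.csv')
print(df['passed'].astype(str).value_counts())  # True / oom / error messages

ok = df[df['passed'].astype(str) == 'True']
print(ok.groupby('matrix_size').size().sort_values(ascending=False).head())
```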
@ptrblck Thank you for this extensive exploration and for verifying that MAGMA is not corrupting the results with the out-of-bounds reads. I think at this point we can feel safe with the fix implemented here. If there are new IMA issues after this fix we can dive deeper into the issue, but as @mruberry suggested, we will likely transition most use cases to cuSOLVER.
Workaround for MAGMA accessing illegal memory in batched cholesky (pytorch#50957)

Summary:
Pull Request resolved: pytorch#50957

MAGMA has an off-by-one error in its batched cholesky implementation which causes illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**

Ran the script below for both before and after my PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes pytorch#41394, pytorch#26996, pytorch#48996
See also pytorch#42666, pytorch#26789

TODO
---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26050978

Pulled By: heitorschueroff

fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86

Stack from ghstack:
MAGMA has an off-by-one error in its batched cholesky implementation which causes illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.
Benchmark

Ran the script below for both before and after my PR and got similar results (the script, *Results before*, and *Results after* are reproduced in the merged commit message above).
Fixes #41394, #26996, #48996
See also #42666, #26789
TODO

- [x] Benchmark to check for perf regressions
Differential Revision: D26050978