[CUDA] Large tensor maxpool crash fix#165374
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165374
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 Pending as of commit 2ec7dd1 with merge base 01738a3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Diff excerpts:

```cpp
static __device__ inline int p_start(int size, int pad, int kernel, int dilation, int stride) {
  return (size + pad < ((kernel - 1) * dilation + 1)) ? 0 : (size + pad - ((kernel - 1) * dilation + 1)) / stride + 1;
```

```cpp
template <typename T>
    int64_t in_stride_n, int64_t in_stride_c,
    int64_t in_stride_h, int64_t in_stride_w)
{
  const int64_t int_max = std::numeric_limits<int>::max();
```
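The `int_max` constant above gates which index width the kernel uses. As a hedged illustration (pure Python, not PyTorch's actual dispatch code; `needs_int64_indexing` is a hypothetical helper), the decision reduces to comparing the flat element count against `INT_MAX`:

```python
# Sketch of the dispatch decision: if the tensor's flat element count
# exceeds INT_MAX, the kernel must index with int64 instead of int32.
INT32_MAX = 2**31 - 1

def needs_int64_indexing(shape):
    """Return True when the flat element count cannot fit in a 32-bit int."""
    n = 1
    for d in shape:
        n *= d
    return n > INT32_MAX

# The shape from the regression test (issue #165297) crosses the threshold:
print(needs_int64_indexing((70, 64, 512, 960)))  # True: ~2.20e9 > ~2.15e9
```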
test/test_nn.py (outdated diff)
```python
# https://github.com/pytorch/pytorch/issues/165297
N, C, H, W = 70, 64, 512, 960  # dims to extend > int32
device = torch.device("cuda")
x_cuda = torch.randn(N, C, H, W, device=device)
```
Do memory requirements go down if a narrower dtype such as half is used?
Yes, that decreased the needed memory, and I updated the test to use float16. Initially I wanted to trigger the same illegal memory access as in the issue with float32, but testing in float16 should be sufficient as well, since we compare against the NCHW format for correctness.
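As a rough back-of-the-envelope sketch of the reviewer's point (assuming dense contiguous tensors and standard element sizes; `tensor_bytes` is a hypothetical helper, not part of the test), halving the itemsize halves each tensor's footprint while keeping the element count above the int32 threshold:

```python
# Memory estimate for the test tensor in float32 vs float16.
def tensor_bytes(shape, itemsize):
    """Bytes needed for a dense tensor of the given shape and element size."""
    n = 1
    for d in shape:
        n *= d
    return n * itemsize

shape = (70, 64, 512, 960)
fp32 = tensor_bytes(shape, 4)  # 4 bytes per float32 element
fp16 = tensor_bytes(shape, 2)  # 2 bytes per float16 element
print(fp32 / 2**30, fp16 / 2**30)  # ~8.2 GiB vs ~4.1 GiB per tensor
```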
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased c9904d3 to 054d398.
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those, or skip-checks keywords were added to the PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
@eqy Need an approval of the workflow here, and then we can merge, I guess.
```cpp
template <typename index_t>
__device__ inline index_t dmin(index_t a, index_t b) {
  return a <= b ? a : b;
}
```
What's wrong with std::min?
```cpp
template <typename index_t>
static __device__ inline index_t p_start(index_t size, int pad, int kernel, int dilation, int stride) {
  const index_t kernel_extent = static_cast<index_t>((kernel - 1) * dilation + 1);
```

Nit — suggested change:

```cpp
const auto kernel_extent = static_cast<index_t>((kernel - 1) * dilation + 1);
```
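For context, a minimal pure-Python sketch of why the index arithmetic had to widen for this shape (an illustration of the failure mode, not the PR's kernel code; `flat_offset` is a hypothetical helper): flattening an NCHW index in 32-bit arithmetic wraps negative once a stride product exceeds `INT_MAX`, which is what produced the illegal memory access.

```python
# ctypes.c_int32 mimics C's 32-bit wraparound when computing a flat offset.
import ctypes

N, C, H, W = 70, 64, 512, 960       # shape from issue #165297
strides = (C * H * W, H * W, W, 1)  # contiguous NCHW strides

def flat_offset(idx, strides, wrap=lambda v: v):
    """Flatten an (n, c, h, w) index; `wrap` models the index width."""
    return sum(wrap(i * s) for i, s in zip(idx, strides))

last = (N - 1, C - 1, H - 1, W - 1)
ok = flat_offset(last, strides)                                      # int64-style
bad = flat_offset(last, strides, lambda v: ctypes.c_int32(v).value)  # int32-style
print(ok, bad)  # the 32-bit version wraps negative -> out-of-bounds address
```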
```python
"fractional_max_pool2d requires output_ratio to either be a single Int or tuple of Ints."):
    res = arg_class(*arg_3)
```
```python
@unittest.skipIf(not TEST_CUDA, "CUDA not available")
```
This makes me sad, we really should use device and @onlyCUDA decorator
```python
device = torch.device("cuda")
x_cuda = torch.randn(N, C, H, W, device=device, dtype=torch.float16)
```

Nit — suggested change:

```python
x_cuda = torch.randn(N, C, H, W, device="cuda", dtype=torch.float16)
```
…nto cuda-maxpool-int64
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes pytorch#165297 Pull Request resolved: pytorch#165374 Approved by: https://github.com/eqy, https://github.com/malfet
Fixes #165297
cc @ptrblck @msaroufim @eqy @jerryzh168