
Add 64-bit indexing support to THC index reductions #33405

Closed

peterbell10 wants to merge 2 commits into pytorch:master from peterbell10:thc-index-reduce

Conversation

@peterbell10
Collaborator

Fixes #32863, (together with #33310 for the TensorIterator reductions)

This adds 64-bit indexed kernels for THC_reduceDimIndex and uses THCTensor_canUse32BitIndexMath to switch between the two at runtime.

I have a test for this locally but haven't included it here because max is much slower than argmax, to the point where the test takes several minutes to call max on just one 2**32 element tensor. That seems excessive, even for a slow test, but I can push it if preferred.
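For readers unfamiliar with the failure mode, here is a small sketch (hypothetical names, not the PR's actual C++) of why the runtime switch is needed: once a tensor's linear indices exceed the signed 32-bit range, truncated index math wraps negative, so a 64-bit instantiation of the kernel has to be selected instead.

```python
# Illustrative sketch only. reduce_dim_index and its return strings are
# stand-ins for the real THC_reduceDimIndex template instantiations.
INT32_MAX = 2**31 - 1

def to_int32(i: int) -> int:
    """Simulate truncating a linear index to a signed 32-bit value."""
    i &= 0xFFFFFFFF
    return i - 2**32 if i >= 2**31 else i

# The last index of a 2**32-element tensor wraps negative under 32-bit math:
assert to_int32(2**32 - 1) == -1

def reduce_dim_index(numel: int) -> str:
    # Mirrors the role of THCTensor_canUse32BitIndexMath: pick the cheaper
    # 32-bit indexed kernel when every linear index fits, else the 64-bit one.
    if numel <= INT32_MAX:
        return "THC_reduceDimIndex<uint32_t>"
    return "THC_reduceDimIndex<uint64_t>"
```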

@dr-ci

dr-ci bot commented Feb 19, 2020

💊 CircleCI build failures summary and remediations

As of commit e1b6094:

None of the build failures appear to be your fault.

  • 1/1 broken upstream at merge base 8b6a898 since Feb 18

Please rebase on the viable/strict branch:

    Since your merge base is older than viable/strict, run these commands:

    git fetch origin viable/strict
    git rebase viable/strict
    


Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:



@peterbell10
Collaborator Author

Rebased on #33376 to fix conflicts. Unfortunately, that's ahead of upstream/viable/strict so the tests will fail for the time being.

@ezyang
Contributor

ezyang commented Feb 19, 2020

The test won't trigger on CircleCI, but it would be good to add the ">4G CUDA tensor" test that would actually trigger this.

@ezyang (Contributor) left a comment:

Great, waiting on the test

@ezyang
Contributor

ezyang commented Feb 20, 2020

This seems to irrecoverably fail the XLA build:

Feb 19 22:14:08 ======================================================================
Feb 19 22:14:08 ERROR [4.475s]: test_minmax_large_axis_xla (__main__.TestTorchDeviceTypeXLA)
Feb 19 22:14:08 ----------------------------------------------------------------------
Feb 19 22:14:08 Traceback (most recent call last):
Feb 19 22:14:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 181, in instantiated_test
Feb 19 22:14:08     return test(self, device_arg)
Feb 19 22:14:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 382, in wrapper
Feb 19 22:14:08     fn(*args, **kwargs)
Feb 19 22:14:08   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 8526, in test_minmax_large_axis
Feb 19 22:14:08     self.assertEqual(idx, x.shape[0] - 1)
Feb 19 22:14:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 336, in assertEqual
Feb 19 22:14:08     return DeviceTypeTestBase.assertEqual(self, x, y, prec, message, allow_inf)
Feb 19 22:14:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 784, in assertEqual
Feb 19 22:14:08     self.assertEqual(x.item(), y, prec=prec, message=message,
Feb 19 22:14:08 RuntimeError: Resource exhausted: From /job:localservice/replica:0/task:0:
Feb 19 22:14:08 Failed to allocate request for 33.00GiB (35433480192B) on device ordinal 0
Feb 19 22:14:08 	 [[{{node XRTExecute}}]]
Feb 19 22:14:08 Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Feb 19 22:14:08 

We should follow previous practice for large allocation and use the @unittest.skipIf(not TEST_LARGE_TENSOR, "not enough memory") decorator to turn this off when large memory tests are not requested.

@peterbell10
Collaborator Author

@ezyang I can only see TEST_LARGE_TENSOR defined for test_cuda.py. Also, I can't see how a global variable would work here since the available memory depends on the device.

Would it make sense to report the apparent memory leak to XLA and just add a test skip to XLA for the time being?

@ezyang
Contributor

ezyang commented Feb 20, 2020

Well, you can still do something similar to what the test does in the end: something like torch.cuda.get_device_properties(0).total_memory >= 12e9. It makes sense that one size fits all, since the majority of big-tensor bugs come down to 32-bit indexing. And yes, I'd make this test run only on CUDA.

@peterbell10
Collaborator Author

Okay, moved the test to test_cuda.py and am now guarding it with TEST_LARGE_TENSOR.
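The resulting guard might look roughly like the sketch below. This is an illustration, not the merged code: `_total_gpu_memory` and its 8 GiB figure are stand-ins so the sketch runs without a GPU; in the real test_cuda.py the value would come from torch.cuda.get_device_properties(0).total_memory, as suggested above.

```python
import unittest

# Hypothetical stand-in for torch.cuda.get_device_properties(0).total_memory,
# hard-coded so this sketch runs without a GPU.
def _total_gpu_memory() -> int:
    return 8 * 1024**3  # pretend the device has 8 GiB

# Mirrors the TEST_LARGE_TENSOR guard discussed above: one memory threshold,
# on the reasoning that most big-tensor bugs come down to 32-bit indexing.
TEST_LARGE_TENSOR = _total_gpu_memory() >= 12e9

class TestLargeReductions(unittest.TestCase):
    @unittest.skipIf(not TEST_LARGE_TENSOR, "not enough memory")
    def test_minmax_large_axis(self):
        # Would allocate a > 4 GiB CUDA tensor and check max/argmax here.
        pass
```

With the hard-coded 8 GiB device, the guard evaluates false and the test is skipped rather than failing with an out-of-memory error.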

@facebook-github-bot (Contributor) left a comment:

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in c882425.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Fixes pytorch#32863, (together with pytorch#33310 for the `TensorIterator` reductions)

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call max on just one `2**32` element tensor. That seems excessive, even for a slow test, but I can push it if preferred.
Pull Request resolved: pytorch#33405

Differential Revision: D20010769

Pulled By: ezyang

fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3


Successfully merging this pull request may close these issues.

Incorrect result of argmax on CUDA with large non-contig tensors and on non-contig dimensions

4 participants