Add 64-bit indexing support to THC index reductions#33405
Add 64-bit indexing support to THC index reductions#33405peterbell10 wants to merge 2 commits intopytorch:masterfrom
Conversation
db08b5e to
66ede08
Compare
66ede08 to
b47e5bb
Compare
💊 CircleCI build failures summary and remediationsAs of commit e1b6094: None of the build failures appear to be your fault.
Detailed failure analysisOne may explore the probable reasons each build failed interactively on the Dr. CI website. 🚧 1 upstream failure recognized by patterns:These builds matched patterns, but were probably caused by upstream breakages:
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 5 times. |
|
Rebased on #33376 to fix conflicts. Unfortunately, that's ahead of |
|
The test won't trigger on CircleCI, but it would be good to add the ">4G CUDA tensor" test that would actually trigger this. |
ezyang
left a comment
There was a problem hiding this comment.
Great, waiting on the test
|
This seems to irrecoverably fail the XLA build: We should follow previous practice for large allocation and use the |
|
@ezyang I can only see Would it make sense to report the apparent memory leak to XLA and just add a test skip to XLA for the time being? |
|
Well, you can still do something similar to what the test does in the end, which is something like |
bf0848b to
e1b6094
Compare
|
Okay, moved the test to |
facebook-github-bot
left a comment
There was a problem hiding this comment.
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Fixes pytorch#32863, (together with pytorch#33310 for the `TensorIterator` reductions) This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime. I have a test for this locally but haven't included it here because `max` is much slower than `argmax`. To the point where the test takes several minutes to call max on just one `2**32` element tensor. That seems excessive, even for a slow test but I can push it if preferred. Pull Request resolved: pytorch#33405 Differential Revision: D20010769 Pulled By: ezyang fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
Fixes #32863, (together with #33310 for the
TensorIteratorreductions)This adds 64-bit indexed kernels for
THC_reduceDimIndexand usesTHCTensor_canUse32BitIndexMathto switch between the two at runtime.I have a test for this locally but haven't included it here because
maxis much slower thanargmax. To the point where the test takes several minutes to call max on just one2**32element tensor. That seems excessive, even for a slow test but I can push it if preferred.