
Fix index truncation in argmin/max for large tensors#33310

Closed
peterbell10 wants to merge 3 commits into pytorch:master from peterbell10:argminmax-indexing
Conversation

@peterbell10
Collaborator

Fixes the TensorIterator parts of #32863 (THC is still broken)

TensorIterator::split now keeps track of the view_offsets into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops.
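For intuition, here is a minimal Python sketch of the scheme described above. The real change lives in the C++ `TensorIterator` code; the function and variable names here are purely illustrative:

```python
# Illustrative sketch (not the actual C++): reducing argmin over a long
# axis in 32-bit-indexable chunks. Each chunk computes a *local* index;
# adding the chunk's base offset (the "view_offset") translates it into
# the index space of the full tensor before combining partial results.
def chunked_argmin(values, chunk=4):
    best_val, best_idx = None, None
    for base in range(0, len(values), chunk):
        sub = values[base:base + chunk]
        local = min(range(len(sub)), key=sub.__getitem__)  # small-index work
        global_idx = base + local  # translated once per partial result
        if best_val is None or sub[local] < best_val:
            best_val, best_idx = sub[local], global_idx
    return best_idx

# chunked_argmin([5, 4, 3, 2, 9, 8, 1, 7]) == 6
```

The key point is the single `base + local` addition per partial result: the inner loop still manipulates only small, 32-bit-sized local indices.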

@dr-ci

dr-ci bot commented Feb 13, 2020

💊 CircleCI build failures summary and remediations

As of commit c26c2f6:

  • 1/4 failures introduced in this PR
  • 3/4 recognized as flaky ❄️

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_test (1/1)

Step: "Test" (full log | pattern match details)

Feb 14 22:32:47 ERROR [0.018s]: test_atan2_edgecases_xla (__main__.TestTorchDeviceTypeXLA)
Feb 14 22:32:47     return DeviceTypeTestBase.assertEqual(self, x, y, prec, message, allow_inf) 
Feb 14 22:32:47   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 784, in assertEqual 
Feb 14 22:32:47     self.assertEqual(x.item(), y, prec=prec, message=message, 
Feb 14 22:32:47 RuntimeError: Resource exhausted: From /job:localservice/replica:0/task:0: 
Feb 14 22:32:47 Failed to allocate request for 66.00GiB (70866960384B) on device ordinal 0 
Feb 14 22:32:47 	 [[{{node XRTExecute}}]] 
Feb 14 22:32:47 Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. 
Feb 14 22:32:47  
Feb 14 22:32:47  
Feb 14 22:32:47 ====================================================================== 
Feb 14 22:32:47 ERROR [0.018s]: test_atan2_edgecases_xla (__main__.TestTorchDeviceTypeXLA) 
Feb 14 22:32:47 ---------------------------------------------------------------------- 
Feb 14 22:32:47 Traceback (most recent call last): 
Feb 14 22:32:47   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 181, in instantiated_test 
Feb 14 22:32:47     return test(self, device_arg) 
Feb 14 22:32:47   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 11081, in test_atan2_edgecases 
Feb 14 22:32:47     _test_atan2(0, 0, 0, device, dtype) 
Feb 14 22:32:47   File "/var/lib/jenkins/workspace/xla/test/../../test/test_torch.py", line 11078, in _test_atan2 
Feb 14 22:32:47     self.assertTrue(torch.allclose(expected_tensor, actual, rtol=0, atol=0.02)) 
Feb 14 22:32:47 RuntimeError: Resource exhausted: From /job:localservice/replica:0/task:0: 
Feb 14 22:32:47 Failed to allocate request for 66.00GiB (70866960384B) on device ordinal 0 

❄️ 3 failures recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build caffe2_onnx_py2_gcc5_ubuntu16_04_test (1/3)

Step: "Test" (full log | pattern match details) ❄️

Feb 14 22:22:23 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice_negative_index /var/lib/jenkins/workspace/scripts/onnx/test.sh: line 57: 24977 Segmentation fault (core dumped) pytest "${args[@]}" --ignore "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py" --ignore "$top_dir/test/onnx/test_custom_ops.py" --ignore "$top_dir/test/onnx/test_models_onnxruntime.py" "${test_paths[@]}"
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns_opset10::test_is_in_onnx_export PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns_opset10::test_strip_doc_string PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns_opset10::test_validate_dynamic_axes_invalid_input_output_name PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_concat PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_div PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_lstm PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_mul PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_reshape PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice PASSED [ 99%] 
Feb 14 22:22:22 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice_index_exceeds_dim PASSED [ 99%] 
Feb 14 22:22:23 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice_negative_index /var/lib/jenkins/workspace/scripts/onnx/test.sh: line 57: 24977 Segmentation fault      (core dumped) pytest "${args[@]}" --ignore "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py" --ignore "$top_dir/test/onnx/test_custom_ops.py" --ignore "$top_dir/test/onnx/test_models_onnxruntime.py" "${test_paths[@]}" 

See CircleCI build pytorch_macos_10_13_py3_test (2/3)

Step: "Test" (full log | pattern match details) ❄️

Feb 14 14:40:28 RuntimeError: Process 0 terminated or timed out after 300.12688302993774 seconds
Feb 14 14:40:28 ====================================================================== 
Feb 14 14:40:28 ERROR [300.151s]: test_backward_python_udf_error (__main__.DistAutogradTestWithSpawn) 
Feb 14 14:40:28 ---------------------------------------------------------------------- 
Feb 14 14:40:28 Traceback (most recent call last): 
Feb 14 14:40:28   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Feb 14 14:40:28     self._join_processes(fn) 
Feb 14 14:40:28   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 256, in _join_processes 
Feb 14 14:40:28     self._check_return_codes(elapsed_time) 
Feb 14 14:40:28   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 275, in _check_return_codes 
Feb 14 14:40:28     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time)) 
Feb 14 14:40:28 RuntimeError: Process 0 terminated or timed out after 300.12688302993774 seconds 
Feb 14 14:40:28  
Feb 14 14:40:28 ---------------------------------------------------------------------- 
Feb 14 14:40:28 Ran 54 tests in 371.021s 
Feb 14 14:40:28  
Feb 14 14:40:28 FAILED (errors=1, skipped=2) 
Feb 14 14:40:28  
Feb 14 14:40:28 Generating XML reports... 
Feb 14 14:40:29 Traceback (most recent call last): 
Feb 14 14:40:29   File "test/run_test.py", line 486, in <module> 
Feb 14 14:40:29     main() 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (3/3)

Step: "Test" (full log | pattern match details) ❄️

Feb 14 23:41:13 AssertionError: 6 not less than or equal to 1e-05 :
Feb 14 23:41:13 ---------------------------------------------------------------------- 
Feb 14 23:41:13 Traceback (most recent call last): 
Feb 14 23:41:13   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 175, in wrapper 
Feb 14 23:41:13     self._join_processes(fn) 
Feb 14 23:41:13   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 256, in _join_processes 
Feb 14 23:41:13     self._check_return_codes(elapsed_time) 
Feb 14 23:41:13   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 280, in _check_return_codes 
Feb 14 23:41:13     self.assertEqual(first_process.exitcode, 0) 
Feb 14 23:41:13   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 893, in assertEqual 
Feb 14 23:41:13     super(TestCase, self).assertLessEqual(abs(x - y), prec, message) 
Feb 14 23:41:13 AssertionError: 6 not less than or equal to 1e-05 :  
Feb 14 23:41:13  
Feb 14 23:41:13 ---------------------------------------------------------------------- 
Feb 14 23:41:13 Ran 88 tests in 110.033s 
Feb 14 23:41:13  
Feb 14 23:41:13 FAILED (failures=1, skipped=1) 
Feb 14 23:41:13  
Feb 14 23:41:13 Generating XML reports... 
Feb 14 23:41:13 Traceback (most recent call last): 
Feb 14 23:41:13   File "test/run_test.py", line 486, in <module> 
Feb 14 23:41:13     main() 

This comment was automatically generated by Dr. CI.

@ezyang
Contributor

ezyang commented Feb 14, 2020

nifty. @VitalyFedyunin do you think you can take full reviewership on this or should I look?

@ezyang ezyang requested a review from ngimel February 14, 2020 04:53
@ezyang
Contributor

ezyang commented Feb 14, 2020

Or maybe Natalia is the best reviewer here.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@peterbell10
Collaborator Author

Test failure looks genuine, but for some reason it's the CPU version that failed:

Feb 14 01:40:27 FAIL [8.918s]: test_argminmax_large_axis_cpu (__main__.TestTorchDeviceTypeCPU)
Feb 14 01:40:27 ----------------------------------------------------------------------
Feb 14 01:40:27 Traceback (most recent call last):
Feb 14 01:40:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 180, in instantiated_test
Feb 14 01:40:27     return test(self, device_arg)
Feb 14 01:40:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 382, in wrapper
Feb 14 01:40:27     fn(*args, **kwargs)
Feb 14 01:40:27   File "test_torch.py", line 8411, in test_argminmax_large_axis
Feb 14 01:40:27     self.assertEqual(x.argmin(1), [x.shape[1] - 1] * 2)
Feb 14 01:40:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 883, in assertEqual
Feb 14 01:40:27     allow_inf=allow_inf)
Feb 14 01:40:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 785, in assertEqual
Feb 14 01:40:27     allow_inf=allow_inf)
Feb 14 01:40:27   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 893, in assertEqual
Feb 14 01:40:27     super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
Feb 14 01:40:27 AssertionError: 1 not less than or equal to 1e-05 : 

Collaborator

@ngimel ngimel left a comment


Approving, modulo failing test. Do you need help with debugging it?

@peterbell10
Collaborator Author

It looks like there are places where the CPU reduction also splits iterators. I'll see if the same fix helps there.

@peterbell10
Collaborator Author

I applied the fix to the CPU reduction but it turns out the actual cause here was just that I was assigning -1 to a uint8 value.
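The wraparound in question is easy to demonstrate. This standalone snippet (not the actual kernel code) shows what happens when -1 is stored into an unsigned 8-bit slot:

```python
import ctypes

# Storing -1 into a uint8 silently wraps to 255 instead of keeping a
# negative sentinel -- the kind of bug described above.
sentinel = ctypes.c_uint8(-1).value
print(sentinel)  # 255
```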

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@vadimkantorov
Contributor

I hope #33370 doesn't interfere with this one

@ngimel
Collaborator

ngimel commented Feb 15, 2020

#33370 is unrelated.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 495bd58.

@vadimkantorov
Contributor

Are there any consequences of this mixed 64-bit/32-bit indexing? e.g. is it faster to do argmax across a contiguous small chunk vs a discontiguous small chunk that may require 64-bit indexing?

@peterbell10
Collaborator Author

Are there any consequences of this mixed 64-bit/32-bit indexing?

This PR only introduces 1 extra add per reduced element in each kernel, so I don't expect performance to change much.

is it faster to do argmax across a contiguous small chunk vs a discontiguous small chunk that may require 64-bit indexing

There is a potential cost from splitting the operation over multiple kernel launches; it will limit the parallelism available within each kernel. For large contiguous chunks, that's inconsequential but, in the extreme, I suppose a slice x[0::2**32].argmax() would launch a new kernel for every element in the input. That would be terrible for performance.

However, I'd note that this has always been the case for TensorIterator reductions.

@vadimkantorov
Contributor

vadimkantorov commented Feb 16, 2020

So in theory there is a tradeoff between multiple kernel launches that do 32-bit indexing and a single kernel launch that must do 64-bit indexing?

@peterbell10
Collaborator Author

In theory, yes. In practice I don't think it matters much: 32-bit indexing covers ~2 GB of memory, so you could only have 10-20 tensor splits before you've covered all the memory on a top-of-the-line Quadro card.
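As a back-of-the-envelope check of that claim (the 48 GiB figure below is a hypothetical card size, not taken from the thread):

```python
# One 32-bit-signed-indexed split covers up to 2**31 byte offsets (~2 GiB),
# so even a large device needs only a couple of dozen splits.
split_bytes = 2**31
card_bytes = 48 * 2**30               # hypothetical 48 GiB of device memory
max_splits = card_bytes // split_bytes
print(max_splits)  # 24
```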

facebook-github-bot pushed a commit that referenced this pull request Feb 20, 2020
Summary:
Fixes #32863, (together with #33310 for the `TensorIterator` reductions)

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call max on just one `2**32`-element tensor. That seems excessive, even for a slow test, but I can push it if preferred.
Pull Request resolved: #33405

Differential Revision: D20010769

Pulled By: ezyang

fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Fixes the `TensorIterator` parts of pytorch#32863 (THC is still broken)

`TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops.
Pull Request resolved: pytorch#33310

Differential Revision: D19906136

Pulled By: ngimel

fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Fixes pytorch#32863, (together with pytorch#33310 for the `TensorIterator` reductions)

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call max on just one `2**32`-element tensor. That seems excessive, even for a slow test, but I can push it if preferred.
Pull Request resolved: pytorch#33405

Differential Revision: D20010769

Pulled By: ezyang

fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
facebook-github-bot pushed a commit that referenced this pull request Jun 16, 2020
Summary:
Closes gh-39060

The `TensorIterator` splitting is based on `can_use_32bit_indexing`, which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. I also verified on an old commit that the test failure reproduces with just a 1d tensor, overall quartering the memory requirement for the test.

https://github.com/pytorch/pytorch/blob/4c7d81f8479bce320cc11d1eb3adaf8ab0b90099/aten/src/ATen/native/TensorIterator.cpp#L879

For reference, the test was first added in gh-33310.
Pull Request resolved: #40036

Differential Revision: D22068690

Pulled By: ezyang

fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3
xwang233 pushed a commit to xwang233/pytorch that referenced this pull request Jun 20, 2020
Summary:
Closes pytorchgh-39060

The `TensorIterator` splitting is based on `can_use_32bit_indexing`, which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. I also verified on an old commit that the test failure reproduces with just a 1d tensor, overall quartering the memory requirement for the test.

https://github.com/pytorch/pytorch/blob/4c7d81f8479bce320cc11d1eb3adaf8ab0b90099/aten/src/ATen/native/TensorIterator.cpp#L879

For reference, the test was first added in pytorchgh-33310.
Pull Request resolved: pytorch#40036

Differential Revision: D22068690

Pulled By: ezyang

fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3