Fix index truncation in argmin/max for large tensors #33310
peterbell10 wants to merge 3 commits into pytorch:master from
Conversation
Force-pushed from 09bb2b3 to f2b4178
💊 CircleCI build failures summary and remediations

As of commit c26c2f6:

Detailed failure analysis: one may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns. The following build failures do not appear to be due to upstream breakage.
Nifty. @VitalyFedyunin do you think you can take full reviewership on this or should I look?
Or maybe Natalia is the best reviewer here. |
facebook-github-bot
left a comment
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Test failure looks genuine, but for some reason it's the CPU version that failed:
ngimel
left a comment
Approving, modulo failing test. Do you need help with debugging it?
It looks like there are places where the CPU reduction also splits iterators. I'll see if the same fix helps there.
Force-pushed from 4fe198f to c26c2f6
I applied the fix to the CPU reduction, but it turns out the actual cause here was just that I was assigning
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I hope #33370 doesn't interfere with this one.

#33370 is unrelated.
Are there any consequences of this mixed 64-bit/32-bit indexing? E.g. is it faster to do argmax across a contiguous small chunk vs. a discontiguous small chunk that may require 64-bit indexing?
This PR only introduces 1 extra add per reduced element in each kernel, so I don't expect performance to change much.
There is a potential cost from splitting the operation over multiple kernel launches; it will limit the parallelism available within each kernel. For large contiguous chunks that's inconsequential but, in the extreme, I suppose a slice
However, I'd note that this has always been the case for
So in theory there is a tradeoff between multiple kernel launches that do 32-bit indexing and a single kernel launch that must do 64-bit indexing?
In theory, yes. In practice I don't think it matters much. 32-bit indexing covers ~2GB of memory, so you could only have 10-20 tensor splits before you've covered all the memory in a top-of-the-line Quadro card.
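For intuition, the arithmetic behind that "10-20 splits" estimate can be sketched as follows (a rough sketch assuming, as the comment does, that each 32-bit-indexable split covers at most 2**31 bytes; the constant and helper name are illustrative, not PyTorch code):

```python
# Each split is limited to what a signed 32-bit offset can address: 2**31 bytes.
SPLIT_BYTES = 2**31

def min_splits(tensor_bytes):
    # Ceiling division: how many 32-bit-addressable splits cover the tensor.
    return -(-tensor_bytes // SPLIT_BYTES)

# A hypothetical card with 24 GiB of memory filled by a single tensor:
print(min_splits(24 * 2**30))  # -> 12 splits, in line with the estimate above
```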
Summary:
Fixes #32863 (together with #33310 for the `TensorIterator` reductions).

This adds 64-bit indexed kernels for `THC_reduceDimIndex` and uses `THCTensor_canUse32BitIndexMath` to switch between the two at runtime.

I have a test for this locally but haven't included it here because `max` is much slower than `argmax`, to the point where the test takes several minutes to call max on just one `2**32` element tensor. That seems excessive, even for a slow test, but I can push it if preferred.

Pull Request resolved: #33405
Differential Revision: D20010769
Pulled By: ezyang
fbshipit-source-id: a8a86f662598d5fade4d90448436418422c699a3
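To illustrate the failure mode these 64-bit kernels avoid, here is a minimal sketch (a pure-Python emulation of two's-complement truncation; `index_as_int32` is a hypothetical helper, not a PyTorch function):

```python
# When a kernel does its index math in 32-bit signed ints, a flat index
# past 2**31 - 1 wraps around and the returned argmax index is garbage.
def index_as_int32(flat_index):
    # Emulate storing a 64-bit value in a signed 32-bit int (two's complement).
    wrapped = flat_index & 0xFFFFFFFF
    return wrapped - 2**32 if wrapped >= 2**31 else wrapped

ok = 2**31 - 1       # largest index that survives 32-bit index math
too_big = 2**31 + 5  # a valid index in a tensor with more than 2**31 elements

assert index_as_int32(ok) == ok
assert index_as_int32(too_big) != too_big  # wrapped to a negative value
```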
Summary:
Fixes the `TensorIterator` parts of pytorch#32863 (THC is still broken).

`TensorIterator::split` now keeps track of the `view_offsets` into the full tensor range. With this, I can take the base offset for the reduced dimension and translate partial results from the sub-iter into the index range of the full tensor. This happens only once for each intermediate result, so we should still benefit from the performance of 32-bit indexing in loops.

Pull Request resolved: pytorch#33310
Differential Revision: D19906136
Pulled By: ngimel
fbshipit-source-id: 3372ee4b8d5b115a53be79aeafc52e80ff9c490b
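The split-and-translate scheme described in the summary can be sketched in miniature (illustrative names only; the real implementation lives in C++ inside `TensorIterator`, not in Python):

```python
# Each chunk is small enough for fast "32-bit" local index math; the local
# argmax is translated by the chunk's base offset into the full index range
# exactly once per chunk, mirroring the view_offsets translation above.
def chunked_argmax(values, chunk_len):
    best_val = None
    best_idx = -1
    for base in range(0, len(values), chunk_len):
        chunk = values[base:base + chunk_len]
        # Local argmax uses only small, chunk-relative indices.
        local = max(range(len(chunk)), key=chunk.__getitem__)
        if best_val is None or chunk[local] > best_val:
            best_val = chunk[local]
            best_idx = base + local  # translate once per partial result
    return best_idx

print(chunked_argmax([3, 1, 4, 1, 5, 9, 2, 6], chunk_len=3))  # -> 5
```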
Summary:
Closes gh-39060.

The `TensorIterator` splitting is based on `can_use_32bit_indexing`, which assumes 32-bit signed ints, so we can get away with just 2**31 as the axis length. Also tested on an old commit that I can reproduce the test failure on just a 1d tensor, overall quartering the memory requirement for the test.

https://github.com/pytorch/pytorch/blob/4c7d81f8479bce320cc11d1eb3adaf8ab0b90099/aten/src/ATen/native/TensorIterator.cpp#L879

For reference, the test was first added in gh-33310.

Pull Request resolved: #40036
Differential Revision: D22068690
Pulled By: ezyang
fbshipit-source-id: 83199fd31647d1ef106b08f471c0e9517d3516e3
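The sizing argument in that summary can be stated numerically; a small sketch (assuming the signed 32-bit limit mentioned above; `forces_split` is a hypothetical helper, not the actual `can_use_32bit_indexing`):

```python
# The splitting threshold assumes signed 32-bit ints, so the smallest axis
# length that forces a split is 2**31, i.e. one past INT32_MAX.
INT32_MAX = 2**31 - 1

def forces_split(axis_len):
    return axis_len > INT32_MAX

assert not forces_split(INT32_MAX)  # still fits 32-bit indexing
assert forces_split(2**31)          # minimal length that triggers a split
```

This is why the test can use a 2**31-long axis rather than 2**32, quartering the memory needed to exercise the split path.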