
Remove gpu_kernel_with_index #33370

Closed
zasdfgbnm wants to merge 26 commits into pytorch:master from zasdfgbnm:range

Conversation

@zasdfgbnm
Collaborator

@zasdfgbnm zasdfgbnm commented Feb 15, 2020

Although `gpu_kernel_with_index` might look like a fairly general helper function at first glance, it actually isn't.

The problem is not just 32-bit indexing but something more fundamental: `TensorIterator` reorders dims and shapes, so for a non-contiguous tensor such as `torch.empty(5, 5).t()` the index won't be correct. Since the whole point of `TensorIterator` is to manipulate shapes/strides to speed up loops, it is fundamentally impossible to recover the correct linear index without enormous effort.
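The mismatch can be seen with a small pure-Python sketch (the helper name is illustrative, not real PyTorch API): a transposed view of a contiguous 5×5 buffer has strides (1, 5), so a kernel that writes value `k` at memory offset `k` stores the wrong value at each logical position.

```python
# Sketch of why memory-order indexing breaks for torch.empty(5, 5).t():
# the transposed view has strides (1, 5), so logical element (i, j) lives
# at memory offset i*1 + j*5.  A kernel that writes value k at memory
# offset k therefore stores i + 5*j at (i, j), not the row-major index
# 5*i + j that a correct range factory must produce.
ROWS = 5

def memory_offset(i, j, strides=(1, ROWS)):
    # offset of logical element (i, j) in the transposed view
    return i * strides[0] + j * strides[1]

# logical element (0, 1) would receive value 5, while a correct arange
# into the transposed view must put 1 there
print(memory_offset(0, 1))  # -> 5
```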

Currently, the range factories do not fail on an `out=non_contiguous_tensor` only because, luckily, `has_internal_overlap` is simplistic enough to classify everything non-contiguous as `TOO_HARD`.

Since `gpu_kernel_with_index` is not general, we should move it from `Loops.cuh` to `RangeFactories.cu`. And since the kernel is so simple to implement, it makes no sense to use `TensorIterator`, which goes through tons of unnecessary checks such as `compute_dtypes`.
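The per-element computation such a dedicated kernel performs is trivial, which is why none of TensorIterator's machinery is needed. A Python sketch of the logic (the function name is illustrative; the real implementation is a CUDA kernel writing into a contiguous output):

```python
def arange_values(start, step, n):
    # per-element computation of a dedicated range kernel:
    # element i of the contiguous output is start + i * step
    return [start + i * step for i in range(n)]

print(arange_values(0, 2, 5))  # -> [0, 2, 4, 6, 8]
```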

`torch.range` is not tested for 64-bit indexing, and I will file a new PR to remove it (it was supposed to be removed in 0.5).

Benchmark:
The device is a GTX 1650; I don't have a good GPU at home.

Code:

```python
import torch
print(torch.__version__)

for i in range(100):
    torch.randn(1000, device='cuda')
torch.cuda.synchronize()

for i in range(15, 29):
    %timeit torch.arange(2 ** i, device='cuda'); torch.cuda.synchronize()
```
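`%timeit` above is IPython-specific; outside a notebook, a rough stdlib equivalent might look like the following sketch (the CUDA workload is replaced with a cheap placeholder so it runs anywhere):

```python
import timeit

def bench(fn, number=100, repeat=5):
    # rough stand-in for IPython's %timeit: best-of-`repeat` average
    # seconds per call of fn
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number

# placeholder workload standing in for
# torch.arange(2 ** 15, device='cuda'); torch.cuda.synchronize()
per_call = bench(lambda: list(range(2 ** 15)))
print(f"{per_call * 1e6:.1f} µs per loop")
```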

Before:

```
1.5.0a0+c37a9b8
11.9 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.9 µs ± 923 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.4 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.7 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
162 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
312 µs ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
618 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.22 ms ± 9.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.45 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After:

```
1.5.0a0+7960d19
11 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.4 µs ± 550 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.4 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27.6 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.2 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.3 µs ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
603 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.77 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.51 ms ± 933 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

@dr-ci

dr-ci bot commented Feb 15, 2020

💊 CircleCI build failures summary and remediations

As of commit ec09ca7:

  • 1/2 failures introduced in this PR
  • 1/2 recognized as flaky ❄️

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (1/1)

Step: "Test" (full log | pattern match details)

RuntimeError: test_jit_fuser failed!
 
---------------------------------------------------------------------- 
Ran 46 tests in 11.552s 
 
FAILED (errors=4, skipped=10) 
Traceback (most recent call last): 
  File "run_test.py", line 486, in <module> 
    main() 
  File "run_test.py", line 479, in main 
    raise RuntimeError(message) 
RuntimeError: test_jit_fuser failed! 
 
(base) circleci@PACKER-5E29F737 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x

❄️ 1 failure recognized as flaky

The following build failures have been detected as flaky and may not be your fault:

See CircleCI build caffe2_onnx_py2_gcc5_ubuntu16_04_test (1/1)

Step: "Test" (full log | pattern match details) ❄️

Feb 16 04:49:52 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice_negative_index /var/lib/jenkins/workspace/scripts/onnx/test.sh: line 57: 29209 Segmentation fault (core dumped) pytest "${args[@]}" --ignore "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py" --ignore "$top_dir/test/onnx/test_custom_ops.py" --ignore "$top_dir/test/onnx/test_models_onnxruntime.py" "${test_paths[@]}"
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns_opset10::test_is_in_onnx_export PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns_opset10::test_strip_doc_string PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns_opset10::test_validate_dynamic_axes_invalid_input_output_name PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_concat PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_div PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_lstm PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_mul PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_reshape SKIPPED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice PASSED [ 99%] 
Feb 16 04:49:51 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice_index_exceeds_dim PASSED [ 99%] 
Feb 16 04:49:52 test/onnx/test_utility_funs.py::TestUtilityFuns::test_constant_fold_slice_negative_index /var/lib/jenkins/workspace/scripts/onnx/test.sh: line 57: 29209 Segmentation fault      (core dumped) pytest "${args[@]}" --ignore "$top_dir/test/onnx/test_pytorch_onnx_onnxruntime.py" --ignore "$top_dir/test/onnx/test_custom_ops.py" --ignore "$top_dir/test/onnx/test_models_onnxruntime.py" "${test_paths[@]}" 

This comment was automatically generated by Dr. CI.

@zasdfgbnm changed the title from [WIP] Remove gpu_kernel_with_index to Remove gpu_kernel_with_index on Feb 15, 2020
@zasdfgbnm requested a review from ngimel on February 15, 2020 09:25
@ngimel
Collaborator

ngimel commented Feb 15, 2020

The test failures in test_print and test_pickle might be related?

@zasdfgbnm
Collaborator Author

@ngimel I don't know. Let me rebase and see.

@zasdfgbnm
Collaborator Author

@pytorchbot rebase this please

@ngimel
Collaborator

ngimel commented Feb 15, 2020

This generally looks good, but if you want to be awesome, can you implement a @largeTensorTest wrapper in common_device_type.py? It seems like it would be useful; TEST_LARGE_TENSOR is used in a lot of places.
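A minimal sketch of what such a wrapper might look like (hypothetical signature; a real common_device_type.py decorator would query the actual device, e.g. via torch.cuda.get_device_properties, rather than take a callable):

```python
import functools
import unittest

def largeTensorTest(size_bytes, available_bytes):
    """Skip the wrapped test unless the device reports at least `size_bytes`
    of memory.  `available_bytes` is a callable standing in for a real
    device-memory query."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if available_bytes() < size_bytes:
                raise unittest.SkipTest(f"requires {size_bytes} bytes of device memory")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# a 6 GiB test on a device reporting only 4 GiB is skipped, not failed
@largeTensorTest(6 * 1024 ** 3, available_bytes=lambda: 4 * 1024 ** 3)
def test_huge_alloc():
    return "ran"
```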

@zasdfgbnm
Collaborator Author

@ngimel Sure

@zasdfgbnm
Collaborator Author

@ngimel The failures are related. They are about n-dimensional zero-sized tensors.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Feb 15, 2020

test_nn now segfaults.

@zasdfgbnm
Collaborator Author

@ngimel

CUDA out of memory. Tried to allocate 6.00 GiB (GPU 0; 14.85 GiB total capacity; 8.00 GiB already allocated; 1.96 GiB free; 12.13 GiB reserved in total by PyTorch)

It is strange that this test is not skipped on CI; it was correctly skipped locally on my machine. Could it be that the GPU has 32 GB of total RAM, but PyTorch is only allowed to use 16 GB?

@zasdfgbnm
Collaborator Author

@ngimel I see what's wrong: I forgot a not in the condition and was testing against the wrong PyTorch build locally.

@ngimel
Collaborator

ngimel commented Feb 16, 2020

Which job are you looking at? Generally, in the Set up CI environment section you can see the output of nvidia-smi. E.g. here https://app.circleci.com/jobs/github/pytorch/pytorch/4510128 nvidia-smi reports an M60 with 8 GB of memory, and the test fails with the correct report (tried to allocate 4 GB, 7.44 GB or so available), so it's not being skipped correctly?
Non-GPU tests fail with a segfault, not with OOM.

@zasdfgbnm
Collaborator Author

@ngimel Yes, you are right, it is not skipped correctly. I was looking at a build in the wrong directory, in which it was successfully skipped; that's why I said

It is strange that this test is not skipped on CI. It was correctly skipped locally on my machine.

The SIGSEGV could be because some test was previously skipped on both CPU and CUDA (due to the skipIf), but now it is only skipped on CUDA, and the CI machine does not have that much host memory to run it on CPU. Let me check on a computer with 256 GB of RAM.
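The skip pattern in question can be sketched with plain unittest (the memory condition is a placeholder; the real tests use torch's device-aware skip helpers). When the condition guards every device variant, the CPU test never runs; if it only covers CUDA, the CPU variant runs and can exhaust host memory on a small CI machine.

```python
import unittest

ENOUGH_HOST_MEMORY = False  # placeholder for a real host-RAM check

class LargeTensorTests(unittest.TestCase):
    # guarded on every device: the CPU variant is skipped too, so a small
    # CI machine never attempts the huge allocation
    @unittest.skipIf(not ENOUGH_HOST_MEMORY, "needs a large amount of host memory")
    def test_large_cpu(self):
        pass

result = unittest.TestResult()
unittest.TestLoader().loadTestsFromTestCase(LargeTensorTests).run(result)
print(len(result.skipped))  # -> 1
```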

@zasdfgbnm
Collaborator Author

@ngimel Ready

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in 55fa133.

@zasdfgbnm zasdfgbnm deleted the range branch February 18, 2020 04:44
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Pull Request resolved: pytorch#33370

Differential Revision: D19925990

Pulled By: ngimel

fbshipit-source-id: f4a732fe14a5582b35a56618941120d62e82fdce