Conversation
💊 CircleCI build failures summary and remediations
As of commit ec09ca7:
Detailed failure analysis: one may explore the probable reasons each build failed interactively on the Dr. CI website.
🕵️ 1 new failure recognized by patterns
The following build failures do not appear to be due to upstream breakage:
Test failure in test_print and test_pickle might be related?

@ngimel I don't know. Let me rebase and see.

@pytorchbot rebase this please

This generally looks good, but if you want to be awesome, can you implement a @largeTensorTest wrapper in common_device_type.py? It seems to be useful; TEST_LARGE_TENSOR is used in a lot of places.
@ngimel Sure |
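The wrapper requested above could look roughly like the following sketch (pure Python, no torch; `_gpu_free_memory_gb` is a hypothetical stand-in for a real device query such as `torch.cuda.mem_get_info()`):

```python
import functools
import unittest


def _gpu_free_memory_gb():
    # Placeholder: a real implementation would query the device,
    # e.g. via torch.cuda. Hard-coded here so the sketch is
    # self-contained.
    return 4.0


def largeTensorTest(size_gb):
    """Skip the decorated test unless at least `size_gb` GB of GPU
    memory appears to be available."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            free_gb = _gpu_free_memory_gb()
            if free_gb < size_gb:
                raise unittest.SkipTest(
                    f"test needs {size_gb} GB, only {free_gb:.1f} GB free")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```

A test method would then be written as `@largeTensorTest(8)` above its definition, and would be skipped (rather than OOM) on small devices.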
@ngimel The failures are related. They are about n-dimensional zero-sized tensors.
facebook-github-bot
left a comment
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
test_nn now segfaults.

It is strange that this test is not skipped on CI. It was correctly skipped locally on my machine. Could it be that the GPU has 32 GB of RAM in total, but PyTorch is only allowed to use 16 GB?

@ngimel I see what's wrong: I forgot a

Which job are you looking at? Generally in

@ngimel Yes, you are right, it is not skipped correctly. I was looking at a build in the wrong directory, in which it was successfully skipped; that's why I was saying that. The SIGSEGV fault could be because some test was skipped on both CPU and CUDA (due to the

@ngimel Ready
Summary:
Although `gpu_kernel_with_index` might look like a quite general helper function at first glance, it actually isn't.
The problem is not only 32-bit indexing, but something more fundamental: `TensorIterator` reorders dims and shapes, so if you have a non-contiguous tensor such as `torch.empty(5, 5).t()`, the index won't be correct. Since the whole point of `TensorIterator` is to manipulate shapes/strides to speed up loops, it is fundamentally impossible to get the correct linear index without a great deal of effort.
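The reindexing problem can be illustrated without CUDA. A minimal pure-Python sketch (the buffer and strides here are stand-ins for a real tensor's storage; sizes are illustrative):

```python
# Why a "write value k at storage offset k" kernel breaks on a
# non-contiguous view such as torch.empty(5, 5).t(): a transposed
# view shares storage but swaps strides, so storage order no longer
# matches logical (row-major) iteration order.

rows, cols = 2, 3
storage = [0] * (rows * cols)      # underlying contiguous buffer

# Strides of the original row-major tensor: (cols, 1).
# Strides of its transpose view (shape cols x rows): (1, cols).
t_strides = (1, cols)

# Naive linear-index kernel: storage offset k gets value k.
for k in range(rows * cols):
    storage[k] = k

# Read the transposed view in logical row-major order:
logical = [storage[i * t_strides[0] + j * t_strides[1]]
           for i in range(cols) for j in range(rows)]

print(logical)   # [0, 3, 1, 4, 2, 5] -- not [0, 1, 2, 3, 4, 5]
```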
The only reason the range factories currently don't fail on `out=non_contiguous_tensor` is luck: `has_internal_overlap` is simplistic enough to classify everything non-contiguous as `TOO_HARD`.
Since `gpu_kernel_with_index` is not general, we should move it from `Loops.cuh` to `RangeFactories.cu`. And since the kernel is so simple to implement, it makes no sense to use `TensorIterator`, which goes through a lot of unnecessary checks such as `compute_dtypes`.
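As a sketch of what the dedicated kernel boils down to (plain Python standing in for the CUDA grid-stride loop; the function name is illustrative, not the actual kernel):

```python
def arange_kernel_sketch(start, step, n):
    """Fill an n-element output with start + i*step, index-by-index."""
    out = [0] * n
    # Stand-in for the CUDA grid-stride loop: each "thread" i writes
    # its own element directly. No TensorIterator machinery is needed
    # because the output buffer is guaranteed contiguous.
    for i in range(n):
        out[i] = start + i * step
    return out
```

For example, `arange_kernel_sketch(0, 2, 5)` yields `[0, 2, 4, 6, 8]`. The per-element computation depends only on the linear index, which is exactly why a plain kernel suffices here.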
`torch.range` is not tested for 64-bit indexing, and I will file a new PR to remove it (it was supposed to be removed at 0.5).
Benchmark:
The device is a GTX 1650; I don't have a good GPU at home.
Code:
```python
import torch
print(torch.__version__)
# warm-up
for i in range(100):
    torch.randn(1000, device='cuda')
torch.cuda.synchronize()
for i in range(15, 29):
    %timeit torch.arange(2 ** i, device='cuda'); torch.cuda.synchronize()
```
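For readers running outside IPython, the `%timeit` lines can be approximated with the stdlib `timeit` module. This is a generic sketch with a stand-in workload, not the actual benchmark (substitute the `torch.arange` call plus `torch.cuda.synchronize()` when timing on a real CUDA device):

```python
import timeit

def workload():
    # Stand-in for: torch.arange(2 ** i, device='cuda');
    # torch.cuda.synchronize()
    return sum(range(1 << 15))

# Rough %timeit equivalent: best of 7 runs, 1000 loops each.
per_loop = min(timeit.repeat(workload, repeat=7, number=1000)) / 1000
print(f"{per_loop * 1e6:.1f} µs per loop")
```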
Before:
```
1.5.0a0+c37a9b8
11.9 µs ± 412 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.7 µs ± 309 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.9 µs ± 923 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
48.4 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.7 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
162 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
312 µs ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
618 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.22 ms ± 9.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.45 ms ± 97.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.9 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After:
```
1.5.0a0+7960d19
11 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
12.4 µs ± 550 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
18.4 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
27.6 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
46.2 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
83.3 µs ± 5.61 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
158 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
307 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
603 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.2 ms ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.4 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.77 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.51 ms ± 933 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Pull Request resolved: pytorch#33370
Differential Revision: D19925990
Pulled By: ngimel
fbshipit-source-id: f4a732fe14a5582b35a56618941120d62e82fdce