Speed up tensor.resize_(sizes) when tensor has correct size #12824
zou3519 wants to merge 5 commits into pytorch:master
Conversation
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Nice. Not a blocker. But now you can call ATen functions in TH, I believe. :)
ezyang
left a comment
Sorry, I know you're just trying to make this fast, but I'd like you to do a few more things:
1. Don't call `set_size` manually.
2. Stack on top of #12845 so you can get contiguous strides allocation for free. (Or copy-paste it in; it's an easy enough merge conflict to resolve.)
3. Delete the old implementation, replacing it with a call to the `at::native::` function.

Thanks!
@ezyang I've addressed points (1) and (2). The conclusion for (3) is that there's no way to call ATen native functions from TH at the moment, but if that is not too hard to add, I can punt this PR until that happens.
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ezyang I figured out a compromise for (3): I've replaced the old TH implementation with the new implementation in this PR. I modified …
ezyang
left a comment
OK. The code still looks a bit copy-pastey, but I understand if you want to unblock.
Force-pushed from d95c991 to 31798dd
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: std::memcpy has UB when either src or dest is NULL, even if length is 0. This can and does happen when the input tensors are scalar tensors. This triggered UBSAN on #12824, but it is strange that it has not been triggered before.
Pull Request resolved: #13121
Differential Revision: D10853113
Pulled By: zou3519
fbshipit-source-id: c4b4ad5e41de6f73dc755e0c25bc9947576a742d
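The fix pattern is a size check before the call. A minimal sketch, with a hypothetical wrapper name (not the actual PyTorch helper):

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical guard: std::memcpy is UB if src or dest is NULL,
// even when count == 0, so skip the call entirely in that case.
inline void safe_memcpy(void* dest, const void* src, std::size_t count) {
  if (count == 0) {
    return;  // nothing to copy; NULL src/dest is acceptable here
  }
  std::memcpy(dest, src, count);
}
```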
While using gbenchmark, I found `tensor.resize_({0})` would take 300ns
if the tensor already has the correct size. This is important for
`at::empty({0})` perf because `at::empty` always calls `resize_`, which
in turn is important for JIT perf: the fusion compiler creates empty
tensors and then `resize_`s them to computed sizes. Most of the 300ns is
due to DeviceGuard (200ns).
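A microbenchmark in this spirit, sketched with Google Benchmark against the ATen C++ API (an illustration of the setup, assuming a CUDA build, not the exact harness behind the numbers below):

```cpp
#include <benchmark/benchmark.h>
#include <ATen/ATen.h>

// resize_ on a tensor that already has the requested size:
// this call should be (nearly) a no-op.
static void BM_ResizeSameSize(benchmark::State& state) {
  at::Tensor t = at::empty({0}, at::kCUDA);
  for (auto _ : state) {
    t.resize_({0});
  }
}
BENCHMARK(BM_ResizeSameSize);

// at::empty itself, which always calls resize_ internally.
static void BM_EmptyCuda(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(at::empty({0}, at::kCUDA));
  }
}
BENCHMARK(BM_EmptyCuda);

BENCHMARK_MAIN();
```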
Summary of findings:
- `at::empty({0}, cuda)`: 851ns
- `empty_tensor.resize_({0})`: 308ns
- `DeviceGuard(tensor)`: ctor + dtor: 200ns (going to look into this
next because it impacts `resize_` perf).
- virtual dispatch overhead (`tensor.resize_()` vs
`at::native::resize__cuda(tensor)`): ~10ns
This PR rips out the TH `resize_` implementation and adds it to ATen
with the following modifications (see the sketch after this list):
- DeviceGuard is used only after the same-size check.
- The same-size check is rewritten for simplicity. The new check doesn't
affect perf.
- empty_cpu / empty_cuda avoid the dispatch overhead of calling
tensor.resize_.
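A rough sketch of that control flow (the function name and body are illustrative, not the actual ATen source):

```cpp
#include <ATen/ATen.h>

// Sketch: check for a same-size no-op before constructing DeviceGuard,
// since the guard's ctor/dtor alone cost ~200ns.
at::Tensor& resize_sketch(at::Tensor& self, at::IntArrayRef sizes) {
  if (self.sizes() == sizes) {
    return self;  // fast path: nothing to do, no DeviceGuard
  }
  at::DeviceGuard guard(self.device());  // slow path only
  // ... reallocate storage and set contiguous strides here ...
  return self;
}
```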
Timing with this PR:
- `at::empty({0}, cuda)`: 363ns
- `empty_tensor.resize_({0})`: 17ns
Future:
- Investigate `resize_(sizes)` slowness when `tensor.sizes() != sizes`
- Should tell resize_as_ to use the new resize_ implementation...
(because resize_as_ is in TH, it is calling the old TH resize_)
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Pull Request resolved: pytorch/pytorch#12824
Differential Revision: D10449209
Pulled By: zou3519
fbshipit-source-id: cecae5e6caf390017c07cd44a8eaf2fa6e3fdeb6
This PR has broken the PyTorch ROCm tests: cc @iotamudelta
Follow-up to pytorch#12824. This PR makes some of the THStorage/THCStorage functions backend-agnostic by using templates and guarding with compiler macros, so that the resize_ logic can be de-duplicated. This helps because I am going to rewrite THTensor_setStorageNd later on and need some of these functions.
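A schematic of that template-plus-macro pattern (the macro and function names here are assumptions for illustration, not the actual TH/THC code):

```cpp
#include <cstddef>

// Hypothetical sketch: one templated implementation shared by the CPU
// and CUDA storage code, with backend differences behind a macro guard.
// BACKEND_IS_CUDA and storage_resize_impl are illustrative names only.
template <typename StorageT>
void storage_resize_impl(StorageT* storage, std::ptrdiff_t new_size) {
#ifdef BACKEND_IS_CUDA
  // device-side reallocation would go here
#else
  // host-side reallocation would go here
#endif
  (void)storage;   // silence unused-parameter warnings in this stub
  (void)new_size;
}
```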
This looks like more cases of "hipMemset with size 0".
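The usual workaround for this class of failure is a zero-size guard around the call. A minimal sketch with a hypothetical wrapper name:

```cpp
#include <hip/hip_runtime.h>

// Skip hipMemset entirely when there is nothing to set, avoiding
// zero-byte memsets that some HIP versions reject.
inline hipError_t safe_hip_memset(void* ptr, int value, size_t count) {
  if (count == 0) {
    return hipSuccess;
  }
  return hipMemset(ptr, value, count);
}
```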
so please skip for now, this will be fixed with …
@iotamudelta Got it, could you submit a PR to skip them?
Summary: For attention: bddppq
Pull Request resolved: pytorch#13181
Differential Revision: D12811207
Pulled By: bddppq
fbshipit-source-id: de1c92e5a8cf4fc634c4644376d07374441c24e3