
Speed up tensor.resize_(sizes) when tensor has correct size #12824

Closed
zou3519 wants to merge 5 commits into pytorch:master from zou3519:resize

Conversation

@zou3519
Contributor

zou3519 commented Oct 18, 2018

While using gbenchmark, I found `tensor.resize_({0})` would take 300ns
if the tensor already has the correct size. This is important for
`at::empty({0})` perf because `at::empty` always calls `resize_`, which
in turn is important for JIT perf: the fusion compiler creates empty
tensors and then `resize_`s them to computed sizes. Most of the 300ns is
due to DeviceGuard (200ns).

Summary of findings:

- `at::empty({0}, cuda)`: 851ns
- `empty_tensor.resize_({0})`: 308ns
- `DeviceGuard(tensor)` ctor + dtor: 200ns (going to look into this
  next because it impacts `resize_` perf)
- virtual dispatch overhead (`tensor.resize_()` vs
  `at::native::resize__cuda(tensor)`): ~10ns
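
For reference, a minimal sketch of how numbers like these could be collected with Google Benchmark, using today's ATen API; this is not the author's actual harness, and `BM_ResizeSameSize` is a hypothetical name:

```cpp
#include <ATen/ATen.h>
#include <benchmark/benchmark.h>

// Hypothetical micro-benchmark (not the author's harness): resize_ a
// tensor to the size it already has, so only the early-exit path runs.
static void BM_ResizeSameSize(benchmark::State& state) {
  auto tensor = at::empty({0}, at::kCUDA);
  for (auto _ : state) {
    tensor.resize_({0});  // same size: should hit the fast path
  }
}
BENCHMARK(BM_ResizeSameSize);

BENCHMARK_MAIN();
```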

This PR rips out the TH `resize_` implementation and adds it to ATen
with the following modifications:

- DeviceGuard is used only after the same-size check (see the sketch
  after this list).
- The same-size check is rewritten for simplicity. The new check doesn't
  affect perf.
- `empty_cpu` / `empty_cuda` avoid the dispatch overhead of calling
  `tensor.resize_`.
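
A simplified sketch of the reordering in the first bullet, using today's ATen names; the real `resize_` also reallocates storage, which is elided here:

```cpp
#include <ATen/ATen.h>

// Simplified sketch of the fast path described above (not the PR's exact
// code): check sizes before constructing a DeviceGuard, so a no-op
// resize never pays the ~200ns guard cost.
at::Tensor& resize_sketch(at::Tensor& self, at::IntArrayRef sizes) {
  if (self.sizes() == sizes) {
    return self;  // already the right shape: skip DeviceGuard entirely
  }
  at::DeviceGuard guard(self.device());  // only needed on a real resize
  // ... reallocate storage and update sizes/strides as the TH code did ...
  return self;
}
```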

Timing with this PR:

- `at::empty({0}, cuda)`: 363ns
- `empty_tensor.resize_({0})`: 17ns

Future:

- Investigate `resize_(sizes)` slowness when `tensor.sizes() != sizes`.
- Tell `resize_as_` to use the new `resize_` implementation (because
  `resize_as_` is in TH, it is still calling the old TH `resize_`).

zou3519 requested review from ezyang and gchanan, October 18, 2018 16:47
@facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl
Collaborator

ssnl commented Oct 18, 2018

Nice. Not a blocker, but I believe you can now call ATen functions in TH. :)


@ezyang
Contributor

ezyang left a comment

Sorry, I know you're just trying to make this fast, but I'd like you to do a few more things:

  1. Don't call `set_size` manually.
  2. Stack on top of #12845 so you can get contiguous strides allocation for free. (Or copy-paste it in; it's an easy enough merge conflict to resolve.)
  3. Delete the old implementation, replacing it with a call to the `at::native::` function.

Thanks!
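
For context, "contiguous strides allocation" here means deriving row-major strides from the sizes. A hedged sketch of that computation; the helper name is illustrative, not #12845's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative helper (not #12845's actual code): contiguous strides
// are row-major, with the innermost dimension tightest.
std::vector<int64_t> contiguous_strides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size());
  int64_t acc = 1;
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
    strides[i] = acc;
    acc *= std::max<int64_t>(sizes[i], 1);  // size-0 dims still get a sane stride
  }
  return strides;
}
// e.g. sizes {2, 3, 4} -> strides {12, 4, 1}
```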

@zou3519
Contributor Author

zou3519 commented Oct 19, 2018

@ezyang I've addressed points (1) and (2). The conclusion for (3) is that there's no way to call ATen native functions from TH at the moment, but if that is not too hard to add, I can punt this PR until that happens.

@facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@ezyang
Contributor

ezyang left a comment

okey-dokey

@zou3519
Contributor Author

zou3519 commented Oct 23, 2018

@ezyang I figured out a compromise for (3): I've replaced the old TH implementation with the new implementation in this PR. I modified `TensorImpl::set_sizes_and_strides` to match the old `THTensor_(resizeNd)` semantics -- could you take a look and see if this is okay?

@ezyang
Contributor

ezyang left a comment

OK. The code still looks a bit copy-pastey, but I understand if you want to unblock.

zou3519 mentioned this pull request Oct 24, 2018
zou3519 force-pushed the resize branch 2 times, most recently from d95c991 to 31798dd, October 24, 2018 18:33
@facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zou3519 added a commit to zou3519/pytorch that referenced this pull request Oct 25, 2018
std::memcpy has UB when either src or dest is NULL, even if the length
is 0. This can and does happen when the input tensors are scalar tensors.

This triggered UBSAN on pytorch#12824, but it is strange that it has not
been triggered before.

Test Plan: wait for tests
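
The fix pattern implied by this message, as a minimal sketch; `safe_copy` is a hypothetical name, not the actual PyTorch helper:

```cpp
#include <cstddef>
#include <cstring>

// std::memcpy requires valid pointers even when the length is 0, so the
// zero-byte case (e.g. scalar tensors with null data) must be skipped.
void safe_copy(void* dst, const void* src, std::size_t nbytes) {
  if (nbytes == 0) {
    return;  // dst/src may legally be null here; copying would be UB
  }
  std::memcpy(dst, src, nbytes);
}
```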
facebook-github-bot pushed a commit that referenced this pull request Oct 25, 2018
Pull Request resolved: #13121

Differential Revision: D10853113

Pulled By: zou3519

fbshipit-source-id: c4b4ad5e41de6f73dc755e0c25bc9947576a742d
@facebook-github-bot left a comment

zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 26, 2018
Pull Request resolved: pytorch/pytorch#12824

Differential Revision: D10449209

Pulled By: zou3519

fbshipit-source-id: cecae5e6caf390017c07cd44a8eaf2fa6e3fdeb6
@bddppq
Contributor

bddppq commented Oct 26, 2018

This PR has broken PyTorch ROCm tests:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py2-clang7-rocmdeb-ubuntu16.04-test/1261//console

cc @iotamudelta

06:33:27 test_zeros_like (__main__.TestUncoalescedSparse) ... ok
06:33:28 
06:33:28 ======================================================================
06:33:28 ERROR: test_add_dense_sparse_mismatch (__main__.TestCudaSparse)
06:33:28 ----------------------------------------------------------------------
06:33:28 Traceback (most recent call last):
06:33:28   File "test_sparse.py", line 933, in test_add_dense_sparse_mismatch
06:33:28     test_shape([3, 4, 0], [1, 4], [4, 4, 4, 0], [3, 4, 4, 0])
06:33:28   File "test_sparse.py", line 923, in test_shape
06:33:28     x = torch.zeros(dense_size, dtype=self.value_dtype, device=self.device)
06:33:28 RuntimeError: cuda runtime error (1011) : hipErrorInvalidValue at /var/lib/jenkins/workspace/aten/src/THC/generic/THCTensorMath.cu:25
06:33:28 
06:33:28 ======================================================================
06:33:28 ERROR: test_add_dense_sparse_mismatch (__main__.TestCudaUncoalescedSparse)
06:33:28 ----------------------------------------------------------------------
06:33:28 Traceback (most recent call last):
06:33:28   File "test_sparse.py", line 933, in test_add_dense_sparse_mismatch
06:33:28     test_shape([3, 4, 0], [1, 4], [4, 4, 4, 0], [3, 4, 4, 0])
06:33:28   File "test_sparse.py", line 923, in test_shape
06:33:28     x = torch.zeros(dense_size, dtype=self.value_dtype, device=self.device)
06:33:28 RuntimeError: cuda runtime error (1011) : hipErrorInvalidValue at /var/lib/jenkins/workspace/aten/src/THC/generic/THCTensorMath.cu:25
06:33:28 
06:33:28 ======================================================================
06:33:28 ERROR: test_reduction_empty (test_torch.TestTorch)
06:33:28 ----------------------------------------------------------------------
06:33:28 Traceback (most recent call last):
06:33:28   File "/var/lib/jenkins/workspace/test/test_torch.py", line 1047, in test_reduction_empty
06:33:28     self.assertEqual(torch.empty((2, 0, 1), device=device), fn(x, dim=2, keepdim=True))
06:33:28   File "/var/lib/jenkins/workspace/test/test_torch.py", line 1024, in <lambda>
06:33:28     ('sum', lambda *args, **kwargs: torch.sum(*args, **kwargs), 0),
06:33:28 RuntimeError: cuda runtime error (1011) : hipErrorInvalidValue at /var/lib/jenkins/workspace/aten/src/THC/generic/THCTensorMath.cu:25
06:33:28 
06:33:28 ----------------------------------------------------------------------

zou3519 added a commit to zou3519/pytorch that referenced this pull request Oct 26, 2018
Follow-up to pytorch#12824. This PR makes some of the THStorage/THCStorage
functions backend-agnostic by using templates and guarding with compiler
macros so that the resize_ logic can be de-duplicated.

This helps because I am going to rewrite THTensor_setStorageNd later on
and need some of these functions.
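
A heavily simplified sketch of the pattern that commit describes; all names here are illustrative, not the actual THStorage API:

```cpp
#include <cstddef>

// Illustrative only (names are not the real THStorage API): one shared
// implementation, with the backend-specific allocation step selected at
// compile time instead of maintaining two copied functions.
template <bool kIsCuda>
void storage_resize_bytes(void*& data, std::ptrdiff_t& size_bytes,
                          std::ptrdiff_t new_size_bytes) {
  if (new_size_bytes == size_bytes) {
    return;  // nothing to do
  }
  if (kIsCuda) {
    // device allocator free/alloc would go here, updating `data`
  } else {
    // host allocator free/alloc would go here, updating `data`
  }
  size_bytes = new_size_bytes;
}
```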
@jithunnair-amd
Collaborator
This looks like more cases of "hipMemset with size 0"

@iotamudelta
Contributor
So please skip them for now; this will be fixed with ROCm 1.9.2.
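
A sketch of the zero-size guard such a workaround boils down to; the wrapper name is hypothetical, while `hipMemset` itself is the real HIP API:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical wrapper: affected ROCm versions returned
// hipErrorInvalidValue for zero-byte memsets, so skip them explicitly.
inline hipError_t memsetOrSkip(void* ptr, int value, size_t nbytes) {
  if (nbytes == 0) {
    return hipSuccess;  // empty tensors have nothing to clear
  }
  return hipMemset(ptr, value, nbytes);
}
```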

@bddppq
Contributor

bddppq commented Oct 26, 2018

@iotamudelta Got it, could you submit a PR to skip them?

iotamudelta added a commit to ROCm/pytorch that referenced this pull request Oct 26, 2018
facebook-github-bot pushed a commit that referenced this pull request Oct 27, 2018
Summary:
For attention: bddppq
Pull Request resolved: #13181

Differential Revision: D12811207

Pulled By: bddppq

fbshipit-source-id: de1c92e5a8cf4fc634c4644376d07374441c24e3
ezyang added the merged label Jun 25, 2019