Speed up tensor.resize_(sizes) when tensor has correct size #12824
zou3519 wants to merge 5 commits into pytorch:master
Conversation
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Nice. Not a blocker. But now you can call ATen functions in TH, I believe. :)
ezyang
left a comment
Sorry, I know you're just trying to make this fast, but I'd like you to do a few more things:
1. Don't call `set_size` manually.
2. Stack on top of #12845 so you can get contiguous strides allocation for free. (Or copy-paste it in; it's an easy enough merge conflict to resolve.)
3. Delete the old implementation, replacing it with a call to the `at::native::` function.

Thanks!
@ezyang I've addressed points (1) and (2). The conclusion for (3) is that there's no way to call ATen native functions from TH at the moment, but if that is not too hard to add, I can punt this PR until that happens.
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ezyang I figured out a compromise for (3): I've replaced the old TH implementation with the new implementation in this PR. I modified …
ezyang
left a comment
OK. The code still looks a bit copy-pastey, but I understand if you want to unblock.
Force-pushed from d95c991 to 31798dd
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: std::memcpy has UB when either src or dest is NULL, even if length is 0. This can and does happen when the input tensors are scalar tensors. This triggered UBSAN on #12824, but it is strange that it has not been triggered before.
Pull Request resolved: #13121
Differential Revision: D10853113
Pulled By: zou3519
fbshipit-source-id: c4b4ad5e41de6f73dc755e0c25bc9947576a742d
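The fix pattern is a size check before the call. A minimal sketch, with a hypothetical wrapper name (not the actual PyTorch helper):

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical guard: std::memcpy is UB if src or dest is NULL,
// even when count == 0, so skip the call entirely in that case.
inline void safe_memcpy(void* dest, const void* src, std::size_t count) {
  if (count == 0) {
    return;  // nothing to copy; NULL src/dest is acceptable here
  }
  std::memcpy(dest, src, count);
}
```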
While using gbenchmark, I found `tensor.resize_({0})` would take 300ns
if the tensor already has the correct size. This is important for
`at::empty({0})` perf because `at::empty` always calls `resize_`, which
in turn is important for JIT perf: the fusion compiler creates empty
tensors and then `resize_`s them to computed sizes. Most of the 300ns is
due to DeviceGuard (200ns).
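A microbenchmark in this spirit, sketched with Google Benchmark against the ATen C++ API (an illustration of the setup, assuming a CUDA build, not the exact harness behind the numbers below):

```cpp
#include <benchmark/benchmark.h>
#include <ATen/ATen.h>

// resize_ on a tensor that already has the requested size:
// this call should be (nearly) a no-op.
static void BM_ResizeSameSize(benchmark::State& state) {
  at::Tensor t = at::empty({0}, at::kCUDA);
  for (auto _ : state) {
    t.resize_({0});
  }
}
BENCHMARK(BM_ResizeSameSize);

// at::empty itself, which always calls resize_ internally.
static void BM_EmptyCuda(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(at::empty({0}, at::kCUDA));
  }
}
BENCHMARK(BM_EmptyCuda);

BENCHMARK_MAIN();
```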
Summary of findings:
- `at::empty({0}, cuda)`: 851ns
- `empty_tensor.resize_({0})`: 308ns
- `DeviceGuard(tensor)`: ctor + dtor: 200ns (going to look into this
next because it impacts `resize_` perf).
- virtual dispatch overhead (`tensor.resize_()` vs
`at::native::resize__cuda(tensor)`): ~10ns
This PR rips out the TH `resize_` implementation and adds it to ATen
with the following modifications (see the sketch after this list):
- DeviceGuard is used only after the same-size check.
- The same-size check is rewritten for simplicity. The new check doesn't
affect perf.
- empty_cpu / empty_cuda avoid the dispatch overhead of calling
tensor.resize_.
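A rough sketch of that control flow (the function name and body are illustrative, not the actual ATen source):

```cpp
#include <ATen/ATen.h>

// Sketch: check for a same-size no-op before constructing DeviceGuard,
// since the guard's ctor/dtor alone cost ~200ns.
at::Tensor& resize_sketch(at::Tensor& self, at::IntArrayRef sizes) {
  if (self.sizes() == sizes) {
    return self;  // fast path: nothing to do, no DeviceGuard
  }
  at::DeviceGuard guard(self.device());  // slow path only
  // ... reallocate storage and set contiguous strides here ...
  return self;
}
```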
Timing with this PR:
- `at::empty({0}, cuda)`: 363ns
- `empty_tensor.resize_({0})`: 17ns
Future:
- Investigate `resize_(sizes)` slowness when `tensor.sizes() != sizes`
- Should tell resize_as_ to use the new resize_ implementation...
(because resize_as_ is in TH, it is calling the old TH resize_)
facebook-github-bot
left a comment
zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Pull Request resolved: pytorch/pytorch#12824
Differential Revision: D10449209
Pulled By: zou3519
fbshipit-source-id: cecae5e6caf390017c07cd44a8eaf2fa6e3fdeb6
This PR has broken the PyTorch ROCm tests: cc @iotamudelta
Follow-up to pytorch#12824. This PR makes some of the THStorage/THCStorage functions backend-agnostic by using templates and guarding with compiler macros, so that the resize_ logic can be de-duplicated. This helps because I am going to rewrite THTensor_setStorageNd later on and need some of these functions.
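A schematic of that template-plus-macro pattern (the macro and function names here are assumptions for illustration, not the actual TH/THC code):

```cpp
#include <cstddef>

// Hypothetical sketch: one templated implementation shared by the CPU
// and CUDA storage code, with backend differences behind a macro guard.
// BACKEND_IS_CUDA and storage_resize_impl are illustrative names only.
template <typename StorageT>
void storage_resize_impl(StorageT* storage, std::ptrdiff_t new_size) {
#ifdef BACKEND_IS_CUDA
  // device-side reallocation would go here
#else
  // host-side reallocation would go here
#endif
  (void)storage;   // silence unused-parameter warnings in this stub
  (void)new_size;
}
```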
This looks like more cases of "hipMemset with size 0".
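The usual workaround for this class of failure is a zero-size guard around the call. A minimal sketch with a hypothetical wrapper name:

```cpp
#include <hip/hip_runtime.h>

// Skip hipMemset entirely when there is nothing to set, avoiding
// zero-byte memsets that some HIP versions reject.
inline hipError_t safe_hip_memset(void* ptr, int value, size_t count) {
  if (count == 0) {
    return hipSuccess;
  }
  return hipMemset(ptr, value, count);
}
```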
so please skip for now, this will be fixed with …
@iotamudelta Got it, could you submit a PR to skip them?
Summary: For attention: bddppq
Pull Request resolved: pytorch#13181
Differential Revision: D12811207
Pulled By: bddppq
fbshipit-source-id: de1c92e5a8cf4fc634c4644376d07374441c24e3