Reduce needless copying when returning lists of tensors in the JIT interpreter. #21690
resistor wants to merge 1 commit into pytorch:master
Conversation
facebook-github-bot
left a comment
@resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
torch/csrc/jit/register_prim_ops.cpp
(Outdated)
Does moving all the elements of `a` and `b` create a use-after-move issue if `a` or `b` is accessed later?
Is it possible for someone to have another reference to them?
oh, yes it is. Add is a functional operation.
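A minimal sketch of the hazard discussed in this thread, with `std::shared_ptr` standing in for the JIT's refcounted list (the names below are illustrative, not the actual PR diff): because list `add` is a functional op, other references to `a` may exist, so stealing its elements with `std::move` is only safe when this op holds the sole reference.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a refcounted tensor list (hypothetical alias).
using IntList = std::shared_ptr<std::vector<int>>;

inline IntList concat(IntList a, const IntList& b) {
  IntList ret;
  if (a.use_count() == 1) {
    // Sole owner: reuse a's storage; nobody else can observe the move.
    ret = std::move(a);
  } else {
    // Shared: copy, or another reference would see a moved-from list.
    ret = std::make_shared<std::vector<int>>(*a);
  }
  ret->insert(ret->end(), b->begin(), b->end());
  return ret;
}
```

The caller passes `std::move(a)`, so `use_count() == 1` inside `concat` really does mean sole ownership; if any other handle to the list survives, the slow copy path runs and that handle stays valid.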
Force-pushed from fbefa94 to 5b0e8b9
facebook-github-bot
left a comment
@resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
driazati
left a comment
Looks good, though the resulting ops could be a little simpler if the dispatch on `use_count` were done at compile time, by returning an `Operation` instead.
This currently conflicts with #21170, which I think might have the same use-after-free bug as the original version of this...

I assume you meant to link to the

I'd rather have the optimization inside the List class; then more code could potentially profit from it. Afaik, #21896 is already doing that?
facebook-github-bot
left a comment
@resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Is this going in soon? I have a PR to rebase past it.
Summary: In talks with smessmer, we decided that it'd be better to put the logic in `list`, as optimal behavior requires knowing `.capacity()`.

Results on my CPU (for the benchmark here: https://twitter.com/VahidK/status/1138674536679821312) now look like this:

```
Pytorch batch_gather took 0.018311 seconds.
Pytorch batch_gather jit took 0.013921 seconds.
Pytorch vectorized batch_gather took 0.001384 seconds.
```

Previously, `batch_gather jit` took 3x as long as `batch_gather`. Some logic taken from pytorch/pytorch#21690. Note that these two PRs are somewhat orthogonal: that PR handles this benchmark by looking at the alias analysis, while this PR specializes for `+=`. Note that we can't jit the vectorized version, as we think `torch.arange` returns a float tensor.

Pull Request resolved: pytorch/pytorch#21896
Differential Revision: D15998628
Pulled By: Chillee
fbshipit-source-id: b0085960da4613578b94deb98ac62c0a4532a8c3
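A hedged sketch of the `+=` specialization that summary describes, again using `std::shared_ptr` as a stand-in for the refcounted list (the names are illustrative, not the actual `c10::List` API): when the left operand is uniquely owned, append into its existing storage so spare `capacity()` is reused; otherwise copy first, preserving functional semantics.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical alias for a refcounted tensor list.
using TensorList = std::shared_ptr<std::vector<int>>;

inline TensorList append_inplace_if_unique(TensorList a, const TensorList& b) {
  if (a.use_count() != 1) {
    // Shared: someone else can still observe `a`, so detach by copying.
    a = std::make_shared<std::vector<int>>(*a);
  }
  // Unique owner now: grow in place; any existing spare capacity is reused.
  a->insert(a->end(), b->begin(), b->end());
  return a;
}
```

Reusing the left operand's buffer is what makes a `xs += ys` loop amortized linear instead of quadratic, which is the effect the benchmark numbers above reflect.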
facebook-github-bot
left a comment
@resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
for (T b_element : b) {
  ret.push_back(std::move(b_element));
}
```

```cpp
if (a.use_count() == 1) {
```
Wondering if we should have a `List::make_unshared_copy()` that ensures that after calling it, `use_count == 1`, and does a copy if it needs to, or something for this. Feels like this could be useful in other ops as well.
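A hedged sketch of the helper suggested here (the name and the `shared_ptr` stand-in are hypothetical; `c10::List` has no such method at this point): after the call, `use_count() == 1` is guaranteed, and a copy happens only if the list was actually shared, in classic copy-on-write style.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Detach `list` from any other owners: after this call, use_count() == 1.
// Copies only when the list is actually shared.
template <typename T>
void make_unshared_copy(std::shared_ptr<std::vector<T>>& list) {
  if (list.use_count() != 1) {
    // Shared with someone else: detach by copying.
    list = std::make_shared<std::vector<T>>(*list);
  }
  // Post-condition: sole ownership; callers may mutate or move freely.
}
```

With such a helper, the op body shrinks to `make_unshared_copy(a);` followed by unconditional in-place mutation, instead of branching on `use_count` at every call site.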
This pull request has been merged in 7cc8f37.
Reduce needless copying when returning lists of tensors in the JIT interpreter. (pytorch#21690)

Summary: This fixes the JIT performance gap reported in https://twitter.com/VahidK/status/1138677898439561216

Pull Request resolved: pytorch#21690
Differential Revision: D15783709
fbshipit-source-id: 23bb4acda6b60c27e95667e1d53c7d261a87167d