TensorIterator: Avoid nesting two levels of function_ref in for_each #53613
peterbell10 wants to merge 2 commits into pytorch:master from
Conversation
💊 CI failures summary (as of commit 310fb71): 💚 Looks good so far! There are no failures yet. 💚
I especially like this part. IIRC from looking at this previously, SmallVector overhead was a small problem.
Unfortunately, this doesn't work. `data` here is potentially shared between multiple threads, so it does need to be created inside the lambda after all.
Could we avoid duplicating what used to be `LOOP_WRAPPER` with a template function of its own that took `loop`, `data`, and `ntensor` (by reference/value as appropriate)?
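For illustration, a minimal sketch of that suggestion might look like the following. The name `iterate_2d` and the exact signature are assumptions made here for the example, not the actual ATen code:

```cpp
// Hypothetical helper factoring out the old LOOP_WRAPPER body, so the 2d
// iteration is written once and shared by every caller. Names and signatures
// are illustrative assumptions, not ATen's real implementation.
#include <cstdint>

template <typename loop1d_t>
void iterate_2d(const loop1d_t& loop,
                char** data,             // per-thread base pointers
                const int64_t* strides,  // ntensors inner strides, then ntensors outer strides
                int ntensors,
                int64_t size0,
                int64_t size1) {
  const int64_t* outer_strides = strides + ntensors;
  for (int64_t i = 0; i < size1; ++i) {
    loop(data, strides, size0);      // run the 1d body over one row
    for (int j = 0; j < ntensors; ++j) {
      data[j] += outer_strides[j];   // advance along the outer dimension
    }
  }
}
```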
ezyang left a comment:
This is really nice work, thanks!
facebook-github-bot left a comment:
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
TensorIterator: Avoid nesting two levels of function_ref in for_each (pytorch#53613)

Summary: When calling `TensorIterator::for_each` with a 1d loop, it creates a `function_ref` for the 1d iteration, then wraps it with `LOOP_WRAPPER` to transform it into a 2d loop. That 2d loop then gets wrapped in another `function_ref`. This can result in significant overhead if the 1d inner loop is over a small number of elements.

Instead, this wraps the 1d loop before type erasure so only one level of `function_ref` is introduced. A simple benchmark demonstrates this is a win:

```python
import torch
a = torch.rand((10000, 2))[::2]
%timeit a + a
```

Note the 2D tensor cannot be coalesced into 1D, and both `cpu_kernel` and `cpu_kernel_vec` use the 1D `for_each`. On master this takes 42 us; with this change it's down to 32 us.

Pull Request resolved: pytorch#53613
Reviewed By: VitalyFedyunin
Differential Revision: D26947143
Pulled By: ezyang
fbshipit-source-id: 5189ada0d82bbf74170fb446763753f02478abf6
When calling `TensorIterator::for_each` with a 1d loop, it creates a `function_ref` for the 1d iteration, then wraps it with `LOOP_WRAPPER` to transform it into a 2d loop. That 2d loop then gets wrapped in another `function_ref`. This can result in significant overhead if the 1d inner loop is over a small number of elements.

Instead, this wraps the 1d loop before type erasure so only one level of `function_ref` is introduced. A simple benchmark (`a = torch.rand((10000, 2))[::2]; %timeit a + a`) demonstrates this is a win.

Note the 2D tensor cannot be coalesced into 1D, and both `cpu_kernel` and `cpu_kernel_vec` use the 1D `for_each`. On master this takes 42 us; with this change it's down to 32 us.
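To make the change concrete, here is a minimal sketch of the idea under stated assumptions: the name `loop_2d_from_1d` and the fixed 16-operand bound are illustrative, and the real code uses `c10::function_ref` and ATen's own types rather than what is shown here.

```cpp
// Sketch only: wrap the 1d loop into a 2d loop *before* type erasure, so
// for_each stores a single function_ref instead of a function_ref that calls
// another function_ref on every row. Names and the fixed bound are
// illustrative assumptions, not the actual ATen implementation.
#include <algorithm>
#include <cstdint>

template <typename loop1d_t>
auto loop_2d_from_1d(loop1d_t loop, int ntensors) {
  return [loop, ntensors](char** base,
                          const int64_t* strides,
                          int64_t size0,
                          int64_t size1) {
    // Each thread invokes the wrapper on its own sub-range, so the working
    // copy of the base pointers must be created inside the lambda.
    char* data[16];
    std::copy(base, base + ntensors, data);
    const int64_t* outer_strides = strides + ntensors;
    for (int64_t i = 0; i < size1; ++i) {
      loop(data, strides, size0);      // direct (inlinable) call to the 1d body
      for (int j = 0; j < ntensors; ++j) {
        data[j] += outer_strides[j];   // step to the next row
      }
    }
  };
}
```

The lambda returned here is only type-erased once, when it is stored as the 2d loop callback, so the per-row call into the 1d body stays a direct call rather than a second indirection.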