
TensorIterator: Avoid nesting two levels of function_ref in for_each #53613

Closed
peterbell10 wants to merge 2 commits into pytorch:master from peterbell10:TI_for_each

Conversation

@peterbell10
Collaborator

When calling `TensorIterator::for_each` with a 1D loop, it creates a `function_ref` for the 1D iteration, then wraps it with `LOOP_WRAPPER` to transform it into a 2D loop. That 2D loop then gets wrapped in another `function_ref`. This can result in significant overhead if the 1D inner loop is over a small number of elements.

Instead, this wraps the 1D loop before type-erasure so only one level of `function_ref` is introduced. A simple benchmark demonstrates this is a win:

```python
import torch
a = torch.rand((10000, 2))[::2]
%timeit a + a
```

Note that the 2D tensor cannot be coalesced into 1D, and both `cpu_kernel` and `cpu_kernel_vec` use the 1D `for_each`. On master, this takes 42 µs; with this change it's down to 32 µs.
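
For readers less familiar with the pattern, here is a minimal sketch of the two wrapping orders. The helper names are made up for illustration, `std::function` stands in for `c10::function_ref`, and the loop signatures are simplified relative to the real TensorIterator ones:

```cpp
// Simplified stand-ins for TensorIterator's loop callbacks. std::function is
// used here in place of c10::function_ref purely for illustration.
#include <cstdint>
#include <functional>

using loop1d_t = std::function<void(char** data, const int64_t* strides, int64_t n)>;
using loop2d_t = std::function<void(char** data, const int64_t* strides,
                                    int64_t size0, int64_t size1)>;

// Before: the 1D loop is type-erased first, then wrapped again. Every inner
// call goes through two erased indirections.
loop2d_t wrap_after_erasure(loop1d_t loop) {
  return [loop](char** data, const int64_t* strides, int64_t size0, int64_t size1) {
    for (int64_t i = 0; i < size1; ++i) {
      loop(data, strides, size0);   // indirect call through the erased 1D loop
      // (advancing `data` by the outer strides is omitted for brevity)
    }
  };
}

// After: the concrete lambda is wrapped first, so only the resulting 2D loop
// is type-erased and the inner call can be inlined.
template <typename Loop1D>
loop2d_t wrap_before_erasure(Loop1D loop) {
  return [loop](char** data, const int64_t* strides, int64_t size0, int64_t size1) {
    for (int64_t i = 0; i < size1; ++i) {
      loop(data, strides, size0);   // direct call into the concrete lambda
    }
  };
}
```

In the first version every call on the inner dimension pays for two type-erased indirections; in the second, the only erased call is the outer 2D one, which is exactly the saving that matters when `size0` is small.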

@facebook-github-bot
Contributor

facebook-github-bot commented Mar 9, 2021

💊 CI failures summary and remediations

As of commit 310fb71 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Comment thread on aten/src/ATen/TensorIterator.h (Outdated)
Contributor


I especially like this part. IIRC from looking at this previously, `SmallVector` overhead was a small problem.

Collaborator Author


Unfortunately, this doesn't work. `data` here is potentially shared between multiple threads, so it does need to be created inside the lambda after all.
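
Roughly what this constraint looks like in code (an illustrative sketch, not the actual TensorIterator source; the names and the stride layout are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch only: wrap a concrete 1D loop into a 2D loop. The data
// pointers are copied *inside* the lambda because the same callable may be
// invoked concurrently from several threads, each over a different range.
template <typename Loop1D>
auto make_2d_loop(Loop1D loop, int ntensor) {
  return [loop, ntensor](char** base, const int64_t* strides,
                         int64_t size0, int64_t size1) {
    // Per-call (and therefore per-thread) copy of the operand pointers.
    std::vector<char*> data(base, base + ntensor);
    // Assumption: the outer strides follow the ntensor inner strides.
    const int64_t* outer_strides = &strides[ntensor];
    for (int64_t i = 0; i < size1; ++i) {
      loop(data.data(), strides, size0);
      for (int arg = 0; arg < ntensor; ++arg) {
        data[arg] += outer_strides[arg];  // advance this call's own copy
      }
    }
  };
}
```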

Comment thread on aten/src/ATen/TensorIterator.h (Outdated)
Comment on lines 252 to 260
Contributor


Could we avoid duplicating what used to be `LOOP_WRAPPER` with a template function of its own that took `loop`, `data`, and `ntensor` (by reference/value as appropriate)?
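
One possible shape of the helper being suggested here, sketched with a hypothetical name and signature rather than the actual patch:

```cpp
#include <cstdint>

// Hypothetical shared helper: the 2D iteration that LOOP_WRAPPER used to
// expand, written once as a template so both call sites can reuse it.
template <typename Loop1D>
void iterate_2d(Loop1D&& loop, char** data, const int64_t* strides,
                int64_t size0, int64_t size1, int ntensor) {
  // Assumption: outer strides are stored after the ntensor inner strides.
  const int64_t* outer_strides = &strides[ntensor];
  for (int64_t i = 0; i < size1; ++i) {
    loop(data, strides, size0);
    for (int arg = 0; arg < ntensor; ++arg) {
      data[arg] += outer_strides[arg];
    }
  }
}
```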

@peterbell10 requested a review from swolchok March 10, 2021 14:02
@mrshenli requested review from ezyang and izdeby March 10, 2021 15:16
@mrshenli added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 10, 2021
Contributor

@ezyang left a comment


This is really nice work, thanks!

Contributor

@facebook-github-bot left a comment


@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 895735c.

xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
TensorIterator: Avoid nesting two levels of function_ref in for_each (pytorch#53613)

Pull Request resolved: pytorch#53613

Reviewed By: VitalyFedyunin

Differential Revision: D26947143

Pulled By: ezyang

fbshipit-source-id: 5189ada0d82bbf74170fb446763753f02478abf6

Labels

cla signed, Merged, module: TensorIterator, open source, triaged
