
Refactor TensorIterator to do allocations via MetaBase::set_output #48659

Closed
ezyang wants to merge 4 commits into gh/ezyang/875/base from gh/ezyang/875/head

Conversation

@ezyang
Contributor

@ezyang ezyang commented Dec 1, 2020

Stack from ghstack:

Detailed RFC at
https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md#handling-tensoriterator

What this diff does:

  • Refactor allocation of outputs in TensorIterator into a call to a single function, `TensorIterator::set_output`. This nicely centralizes restriding logic and mostly eliminates the need for a separate named-tensor propagation pass. The one exception is in-place operations (`add_`), where previously we never actually called `set_output` once we determined that resizing was unnecessary; there is an extra name-propagation step in `allocate_or_resize_outputs` to handle this case (I audited all other `set_output` sites and found that we always hit this path in that situation). Although hypothetically this could cause problems for structured kernels (which require a `set_output` call in all cases), the codepath is irrelevant for them, as a TensorIterator will never be constructed with an explicit out argument in that setting (remember, structured kernels handle the out/functional/inplace variants themselves). There is also a tricky case in `compute_types`; see the comments there for more details.
  • Split TensorIterator into a TensorIteratorBase, which contains most of the logic but doesn't define `set_output`. A decent chunk of the diff is just the mechanical rename of TensorIterator to TensorIteratorBase. However, there are a few cases where we create fresh TensorIterator objects from another TensorIterator. In those cases we always construct a plain TensorIterator (rather than preserving the subclass of TensorIteratorBase that induced the construction). This makes sense, because a structured function class will contain metadata that isn't relevant for these downstream uses. It is done by *intentionally* permitting object slicing with the `TensorIterator(const TensorIteratorBase&)` constructor.
  • Introduce a new `MetaBase` class which contains the canonical virtual method definition for `set_output`. This will allow structured classes to use it directly without going through TensorIterator (not in this PR).

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D25261844

ezyang added a commit that referenced this pull request Dec 1, 2020
ghstack-source-id: 65f9fb4
Pull Request resolved: #48659
@dr-ci

dr-ci bot commented Dec 1, 2020

💊 CI failures summary and remediations

As of commit 6969196 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚




ezyang added a commit that referenced this pull request Dec 1, 2020
ghstack-source-id: f7f16f6
Pull Request resolved: #48659
@ezyang ezyang requested review from bhosmer and ngimel December 1, 2020 20:54
ezyang added a commit to ezyang/pytorch that referenced this pull request Dec 1, 2020
ghstack-source-id: f7f16f6
Pull Request resolved: pytorch#48659
ezyang added a commit to ezyang/pytorch that referenced this pull request Dec 2, 2020
ghstack-source-id: 999f9c0
Pull Request resolved: pytorch#48659
Collaborator

@ngimel ngimel left a comment


lgtm


@bhosmer bhosmer left a comment


Looks good!

Only real question is about some mildly redundant logic that I think fell out of the refactoring straddle between `allocate_or_resize_outputs` and the `fast_set_up` call sites. But I'd buy that this is the right sweet spot, at least for now.

if (common_device == kCPU) {
  // Casts to outputs by creating temporaries of the correct dtype (if needed)
  if (config.cast_common_dtype_to_outputs_ && op.is_output && op.current_dtype != common_dtype_) {
    TORCH_INTERNAL_ASSERT(op.tensor.defined());


I'm seeing the continue on line 372... not seeing how we get here with op.tensor undefined, but maybe I'm missing it

Contributor Author


You can't get here with `op.tensor` being undefined; that's why it's an assert :) Actually, I had missed the continue on line 372 too. I was fairly confident (for other reasons) that this case was impossible, but not 100% sure.

auto& op = operands_[output_idx];
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(output_idx < num_outputs_);
if (!op.tensor.defined()) {
  TORCH_INTERNAL_ASSERT(op.is_type_defined(), "no type for operand", output_idx);


On a quick read it looks like a less aggressive factoring would avoid some redundant tests when this is called from allocate_or_resize_outputs, at the cost of leaving more code in fast_set_up - but I'm assuming you figure the slow path can afford it (or that there's some API requirement that it be done in set_output that I'm missing)

Contributor Author


This is a good question. When I wrote this to start, I tried to assume as few preconditions on the call-site as possible, ergo the redundant tests. If I understand you correctly, it's just this assert that could be removed (I don't think any of the other asserts are removable). Seems reasonable.

// of a lot of boilerplate above
TensorIterator build() {
return TensorIterator(*this);
TensorIterator iter;


did you kill the TensorIterator(TensorIteratorConfig&) constructor just to clean things up?

Contributor Author


You are right, I don't think I was supposed to delete this constructor, and I ended up just refactoring around the problem. Not sure if it's worth restoring it though!



Yeah I think this is better - more transparent. The other just hid the build call

ezyang added a commit to ezyang/pytorch that referenced this pull request Dec 2, 2020

ghstack-source-id: d403964
Pull Request resolved: pytorch#48659
@ezyang
Contributor Author

ezyang commented Dec 2, 2020

@bhosmer I merged this because I got a clean merge window but I still owe you updates for your comments. Coming soon.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 6ba7709.

ezyang added a commit that referenced this pull request Dec 2, 2020
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Dec 2, 2020
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
shaibagon pushed a commit to shaibagon/pytorch that referenced this pull request Dec 3, 2020
…ytorch#48659)

Summary:
Pull Request resolved: pytorch#48659


Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25261844

Pulled By: ezyang

fbshipit-source-id: 34a9830cccbc07eaaf7c4f75114cd00953e3db7d
ezyang added a commit to ezyang/pytorch that referenced this pull request Dec 3, 2020
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 1de9cce
Pull Request resolved: pytorch#48731
facebook-github-bot pushed a commit that referenced this pull request Dec 3, 2020
Summary:
Pull Request resolved: #48731

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D25278034

Pulled By: ezyang

fbshipit-source-id: 73652311b48d8d80c06e9385b7ff18ef3a158ae8
@facebook-github-bot facebook-github-bot deleted the gh/ezyang/875/head branch December 6, 2020 15:17