Skip to content

Class-based structured kernels, with migration of add to framework#48718

Closed
ezyang wants to merge 22 commits intogh/ezyang/879/basefrom
gh/ezyang/879/head
Closed

Class-based structured kernels, with migration of add to framework#48718
ezyang wants to merge 22 commits intogh/ezyang/879/basefrom
gh/ezyang/879/head

Conversation

@ezyang
Copy link
Copy Markdown
Contributor

@ezyang ezyang commented Dec 2, 2020

Stack from ghstack:

This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):

  • TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call set_output to allocate/resize their outputs. MetaBase gets a new maybe_get_output virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
  • TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
  • tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
  • tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of set_output, which by in large replicates the logic in TensorIterator::set_output. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
  • aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
  • aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
  • To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:

  • Make Tensor-Scalar addition structured to fix perf regression
  • Make empty_strided work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang ezyang@fb.com

Differential Revision: D25278031

(description to be written)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added cla signed oncall: jit Add this issue/PR to JIT oncall triage queue labels Dec 2, 2020
@ezyang ezyang requested review from bhosmer, ngimel and smessmer December 2, 2020 16:00
@dr-ci
Copy link
Copy Markdown

dr-ci bot commented Dec 2, 2020

💊 CI failures summary and remediations

As of commit e507b0d (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 1/3 non-CircleCI failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Dec 08 17:53:16 sccache: error: couldn't connect to server
Dec 08 17:53:16 +++ eval 'extract_trap_cmd ' 
Dec 08 17:53:16 ++++ extract_trap_cmd 
Dec 08 17:53:16 ++++ printf '%s\n' '' 
Dec 08 17:53:16 +++ printf '%s\n' cleanup 
Dec 08 17:53:16 ++ trap -- ' 
Dec 08 17:53:16 cleanup' EXIT 
Dec 08 17:53:16 ++ [[ pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4-build != *pytorch-win-* ]] 
Dec 08 17:53:16 ++ which sccache 
Dec 08 17:53:16 ++ sccache --stop-server 
Dec 08 17:53:16 Stopping sccache server... 
Dec 08 17:53:16 sccache: error: couldn't connect to server 
Dec 08 17:53:16 sccache: caused by: Connection refused (os error 111) 
Dec 08 17:53:16 ++ true 
Dec 08 17:53:16 ++ rm /var/lib/jenkins/sccache_error.log 
Dec 08 17:53:16 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Dec 08 17:53:16 ++ true 
Dec 08 17:53:16 ++ [[ pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4-build == *rocm* ]] 
Dec 08 17:53:16 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Dec 08 17:53:16 ++ SCCACHE_IDLE_TIMEOUT=1200 
Dec 08 17:53:16 ++ RUST_LOG=sccache::server=error 
Dec 08 17:53:16 ++ sccache --start-server 

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Dec 08 19:56:11 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future
Dec 08 19:56:11 At: 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 08 19:56:11  
Dec 08 19:56:11 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 08 19:56:11  
Dec 08 19:56:11 At: 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 08 19:56:11  
Dec 08 19:56:11 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 08 19:56:11  
Dec 08 19:56:11 At: 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 08 19:56:11  
Dec 08 19:56:11 ok (1.740s) 
Dec 08 19:56:12   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Dec 08 19:56:12 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Dec 08 19:56:12 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Dec 08 19:56:12 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 

Extra GitHub checks: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 108 times.

…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

Stuff in this PR that should be split out:
* Header hygiene improvements; e.g., removing unnecessary ATen/ATen.h includes from headers. There is enough here so that I was able to have NativeFunctions.h (indirectly) include TensorIterator.h without causing a cycle.
* Make various functions that previously took `TensorIterator&` take `TensorIteratorBase&` so that they work with class-based structured kernels

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

Stuff in this PR that should be split out:
* Header hygiene improvements; e.g., removing unnecessary ATen/ATen.h includes from headers. There is enough here so that I was able to have NativeFunctions.h (indirectly) include TensorIterator.h without causing a cycle.
* Make various functions that previously took `TensorIterator&` take `TensorIteratorBase&` so that they work with class-based structured kernels

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
@ezyang ezyang mentioned this pull request Dec 2, 2020
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Dec 2, 2020
(description to be written)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 0660291
Pull Request resolved: #48718
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:
* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Refactor tools/codegen/gen.py for clarity

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Dec 2, 2020
(description to be written)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: ba1a582
Pull Request resolved: #48718
@ezyang ezyang requested a review from bwasti December 2, 2020 20:40
@ezyang
Copy link
Copy Markdown
Contributor Author

ezyang commented Dec 2, 2020

@bwasti you're added for static runtime changes

Copy link
Copy Markdown

@bhosmer bhosmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per "this is a first draft" and the TODOs, you're not looking to land this right away, right? some comments inline - it's definitely getting trickier as it grows the necessary complexity to fit into the rest of the system, but it still feels pretty reasonable. Te control flow around set_output is pretty gnarly though.

Given the description comments I didn't spend too much time on the codegen and am waiting on the subsequent work to accept, but LMK if I misunderstood your timeline.

//
// Example usage:
//
// TORCH_META_FUNC(add_cpu) (
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TORCH_IMPL_FUNC

// functional/out/inplace, and could also be specialized for CPU/CUDA/etc
// (although presently it isn't).
//
// A notable subclass of this interface is TensorIterator(Base).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure the parens clarify

// A notable subclass of this interface is TensorIterator(Base).
struct CAFFE2_API MetaBase {
virtual void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) = 0;
virtual const Tensor& maybe_get_output(int64_t output_idx) = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self, ask for profile data


// Retrieve a reference to the function for an operator. This remains
// valid for as long as the underlying function si valid.
const KernelFunction& getKernel(const OperatorHandle& op, DispatchKey dispatchKey) const;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty significant addition to the API, given its potential for (mis)use in devirtualization-type use cases. Is it worth adding a warning of some sort? If anything I think the current comment kind of underplays the trouble you could get into. (also "si")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*warning in the comments

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm dumping this, I can't use it due to compiler bug, see #48763

.cast_common_dtype_to_outputs(true)
.enforce_safe_casting_to_output(true)
.build();
return binary_op_config(out, a, b).build();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this one factored and not the others?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I only did add. Would make sense to do the rest of them in one go in a separate PR.

for (int i = 0; i < config.tensors_.size(); i++) {
operands_.emplace_back(std::move(config.tensors_[i]));
if (operands_[i].tensor.is_meta()) {
is_meta_ = true;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this logic silently echoes emergent behavior of the current multidispatch algorithm + dispatch key declaration order, might be worth a comment

}
}

// Not actually used by anything, here for completeness
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might make the design clearer to just omit it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't, it's a pure virtual method and I don't implement it then the class is abstract

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok then I'd elaborate the comment, if possible in like a sentence or two - i.e. why it is that maybe_get_output() isn't part of the control flow at all when TensorIterator is the leaf class.

// to the TRUE output, but op.tensor will be the different
// scalar type output
if (op.original_tensor.defined()) {
// Replay on op.tensor too!
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what's going on here - I'd gotten to the point of grokking the earlier, flatter version but I'm going to have to assume this logic is right

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is crushing and we might have to talk it through on VC. I was surprised when this one worked. It should be less complicated but I haven't figured out how to simplify it.

view_offsets_ = DimVector(ndim_offsets, 0);
}

void TensorIteratorBase::set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth a comment above each of these set_output overrides with the brief description of the calling contexts from the PR description.

@ezyang
Copy link
Copy Markdown
Contributor Author

ezyang commented Dec 3, 2020

holy shit all tests passed

…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
ezyang added a commit that referenced this pull request Dec 8, 2020
(description to be written)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 6d010a1
Pull Request resolved: #48718
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
…ramework"


This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

High level structure of this PR (the order you should review files):
* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by in large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the forseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.
* To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it manually. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way which is that you can easily skip shape checking (omit meta call, not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.

TODO:
* Make Tensor-Scalar addition structured to fix perf regression
* Make `empty_strided` work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D25278031](https://our.internmc.facebook.com/intern/diff/D25278031)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@ezyang merged this pull request in 16b8e6a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants