Class-based structured kernels, with migration of add to framework by ezyang · Pull Request #48718 · pytorch/pytorch

ezyang · 2020-12-02T15:59:57Z

Stack from ghstack:

#49122 Introduce tools.codegen.api.translate
#49043 Delete cpp.group_arguments
#49042 Rename positional and kwarg_only to have flat prefix
#49041 Delete some dead functions from tools.codegen.api.meta
#48718 Class-based structured kernels, with migration of add to framework

This PR rewrites structured kernels to use the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high-level description of what's going on here.

High level structure of this PR (the order you should review files):

* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call set_output to allocate/resize their outputs. MetaBase gets a new maybe_get_output virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes. First, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is that someone calls the subclassed set_output, which allocates the output, and then we call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator-based kernels.
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old-style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc. won't actually use structured kernels to start. The most complicated code generation is the instantiation of set_output, which by and large replicates the logic in TensorIterator::set_output. This will continue to live in codegen for the foreseeable future, as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

To get static runtime to work, I manually extended the structured class that is available from NativeFunctions.h and then called it directly. This isn't really intended to be public API and I don't want a lot of call sites, but there is one happy side effect of doing it this way: you can easily skip shape checking (omit the meta call; not done in this PR) or skip stringent resize checking (done in this PR). This speeds up static runtime.
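
A minimal sketch of the set_output control flow described for TensorIterator.cpp above, written in Python rather than C++ to keep it short. Everything here is an illustrative assumption, not the generated code: `FakeTensor`, `StructuredAddFunctional`, `recorded_shapes`, and `contiguous_strides` are hypothetical names, the real `MetaBase::set_output` also takes options and names, and the shape check ignores broadcasting. It only shows the ordering: the subclass allocates, then defers to the parent so the iterator can record the output.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Sequence, Tuple


def contiguous_strides(sizes: Sequence[int]) -> Tuple[int, ...]:
    """Row-major strides for the given sizes (illustrative helper)."""
    strides = []
    running = 1
    for size in reversed(sizes):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))


class FakeTensor:
    """Hypothetical stand-in for at::Tensor: holds only sizes and strides."""

    def __init__(self, sizes: Sequence[int], strides: Sequence[int]) -> None:
        self.sizes = tuple(sizes)
        self.strides = tuple(strides)


class MetaBase(ABC):
    """Models the interface a meta function sees."""

    @abstractmethod
    def set_output(self, output_idx: int, sizes: Sequence[int],
                   strides: Sequence[int]) -> None:
        ...

    @abstractmethod
    def maybe_get_output(self, output_idx: int) -> Optional[FakeTensor]:
        ...


class TensorIteratorBase(MetaBase):
    """The "light" set_output: records bookkeeping, never allocates."""

    def __init__(self) -> None:
        self.recorded_shapes: Dict[int, Tuple[int, ...]] = {}

    def set_output(self, output_idx, sizes, strides) -> None:
        # Allocation is assumed to have been handled by the subclass already.
        self.recorded_shapes[output_idx] = tuple(sizes)

    def maybe_get_output(self, output_idx) -> Optional[FakeTensor]:
        return None  # no output is known at this level of the sketch


class StructuredAddFunctional(TensorIteratorBase):
    """Hypothetical generated subclass for a functional variant of add."""

    def __init__(self) -> None:
        super().__init__()
        self.outputs: Dict[int, FakeTensor] = {}

    def set_output(self, output_idx, sizes, strides) -> None:
        # Step 1: the structured kernels layer allocates the output.
        self.outputs[output_idx] = FakeTensor(sizes, strides)
        # Step 2: call the parent so later iterator phases can track it.
        super().set_output(output_idx, sizes, strides)

    def maybe_get_output(self, output_idx) -> Optional[FakeTensor]:
        return self.outputs.get(output_idx)

    def meta(self, self_sizes: Sequence[int], other_sizes: Sequence[int]) -> None:
        # Simplified check: this sketch requires equal shapes (no broadcasting).
        if tuple(self_sizes) != tuple(other_sizes):
            raise ValueError("shape mismatch")
        self.set_output(0, self_sizes, contiguous_strides(self_sizes))
```

Calling `meta((2, 3), (2, 3))` on a fresh `StructuredAddFunctional` leaves an allocated output in `outputs[0]` and the same shape in `recorded_shapes[0]`, which is the split of responsibilities the description attributes to the subclass and to TensorIteratorBase.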

TODO:

* Make Tensor-Scalar addition structured to fix perf regression
* Make empty_strided work with an empty stride list, so we can remove special case in codegen for empty strides

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D25278031

(description to be written) Signed-off-by: Edward Z. Yang <ezyang@fb.com> [ghstack-poisoned]

dr-ci · 2020-12-02T16:10:19Z

💊 CI failures summary and remediations

As of commit e507b0d (more details on the Dr. CI page):

3/3 failures possibly* introduced in this PR
- 1/3 non-CircleCI failure(s)

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (1/2)

Step: "Build"

Dec 08 17:53:16 sccache: error: couldn't connect to server

Dec 08 17:53:16 +++ eval 'extract_trap_cmd ' 
Dec 08 17:53:16 ++++ extract_trap_cmd 
Dec 08 17:53:16 ++++ printf '%s\n' '' 
Dec 08 17:53:16 +++ printf '%s\n' cleanup 
Dec 08 17:53:16 ++ trap -- ' 
Dec 08 17:53:16 cleanup' EXIT 
Dec 08 17:53:16 ++ [[ pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4-build != *pytorch-win-* ]] 
Dec 08 17:53:16 ++ which sccache 
Dec 08 17:53:16 ++ sccache --stop-server 
Dec 08 17:53:16 Stopping sccache server... 
Dec 08 17:53:16 sccache: error: couldn't connect to server 
Dec 08 17:53:16 sccache: caused by: Connection refused (os error 111) 
Dec 08 17:53:16 ++ true 
Dec 08 17:53:16 ++ rm /var/lib/jenkins/sccache_error.log 
Dec 08 17:53:16 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Dec 08 17:53:16 ++ true 
Dec 08 17:53:16 ++ [[ pytorch-linux-xenial-cuda9.2-cudnn7-py3-gcc5.4-build == *rocm* ]] 
Dec 08 17:53:16 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Dec 08 17:53:16 ++ SCCACHE_IDLE_TIMEOUT=1200 
Dec 08 17:53:16 ++ RUST_LOG=sccache::server=error 
Dec 08 17:53:16 ++ sccache --start-server

pytorch_linux_bionic_py3_6_clang9_test (2/2)

Step: "Run tests"

Dec 08 19:56:11 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future

Dec 08 19:56:11 At: 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 08 19:56:11  
Dec 08 19:56:11 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 08 19:56:11  
Dec 08 19:56:11 At: 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 08 19:56:11  
Dec 08 19:56:11 [E request_callback_no_python.cpp:636] Received error while processing request type 258: RuntimeError: Can not pickle torch.futures.Future 
Dec 08 19:56:11  
Dec 08 19:56:11 At: 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(120): serialize 
Dec 08 19:56:11   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(172): serialize 
Dec 08 19:56:11  
Dec 08 19:56:11 ok (1.740s) 
Dec 08 19:56:12   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Dec 08 19:56:12 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Dec 08 19:56:12 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend. 
Dec 08 19:56:12 RPC was initialized with the PROCESS_GROUP backend which is deprecated and slated to be removed and superseded by the TENSORPIPE backend. It is recommended to migrate to the TENSORPIPE backend.

Extra GitHub checks: 1 failed

Failed: GitHub Actions - clang-tidy

This comment was automatically generated by Dr. CI.

…ramework"

This PR rewrites structured kernels to do the class-based mechanism (instead of defining a meta and impl function, they are methods on a class), and adds enough customizability on the class to support TensorIterator. To show it works, add is made a structured kernel. Don't forget to check pytorch/rfcs#9 for a mostly up-to-date high level description of what's going on here.

Stuff in this PR that should be split out:

* Header hygiene improvements; e.g., removing unnecessary ATen/ATen.h includes from headers. There is enough here so that I was able to have NativeFunctions.h (indirectly) include TensorIterator.h without causing a cycle.
* Make various functions that previously took `TensorIterator&` take `TensorIteratorBase&` so that they work with class-based structured kernels

High level structure of this PR (the order you should review files):

* TensorMeta.h - TensorMeta is deleted entirely; instead, meta functions will call `set_output` to allocate/resize their outputs. MetaBase gets a new `maybe_get_output` virtual method for retrieving the (possibly non-existent) output tensor in a meta function; this makes it easier to do special promotion behavior, e.g., as in TensorIterator.
* TensorIterator.cpp - Two major changes: first, we add TensorIteratorBase::set_output, which is a "light" version of TensorIterator::set_output; it sets up the internal data structures in TensorIterator, but it doesn't do allocation (that is assumed to have been handled by the structured kernels framework). The control flow here is someone will call the subclassed set_output, which will allocate output, and then we will call the parent class (TensorIteratorBase) to populate the fields in TensorIterator so that other TensorIterator phases can keep track of it. Second, we add some tests for meta tensors, and skip parts of TensorIterator which are not necessary when data is not available.
* tools/codegen/model.py - One new field in native_functions.yaml, structured_inherits. This lets you override the parent class of a structured meta class; normally it's MetaBase, but you can make it point at TensorIteratorBase instead for TensorIterator based kernels
* tools/codegen/gen.py - Now generate all of the classes we promised. It's kind of hairy because this is the first draft. Check the RFC for what the output looks like, and then follow the logic here. There are some complications: I need to continue to generate old style wrapper functions even if an operator is structured, because SparseCPU/SparseCUDA/etc won't actually use structured kernels to start. The most complicated code generation is the instantiation of `set_output`, which by and large replicates the logic in `TensorIterator::set_output`. This will continue to live in codegen for the foreseeable future as we would like to specialize this logic per device.
* aten/src/ATen/native/UpSampleNearest1d.cpp - The previous structured kernel is ported to the new format. The changes are very modest.
* aten/src/ATen/native/BinaryOps.cpp - Add is ported to structured.

TODO:

* Work out an appropriate entry point for static runtime, since native:: function stubs no longer are generated
* Refactor TensorIteratorConfig construction into helper functions, like before
* Make Tensor-Scalar addition structured to fix perf regression
* Fix `verify_api_visibility.cpp`
* Refactor tools/codegen/gen.py for clarity
* Figure out why header changes resulted in undefined reference to `at::Tensor::operator[](long) const`

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]

(description to be written) Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: 0660291 Pull Request resolved: #48718

(description to be written) Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: ba1a582 Pull Request resolved: #48718

ezyang · 2020-12-02T20:40:29Z

@bwasti you're added for static runtime changes

bhosmer

per "this is a first draft" and the TODOs, you're not looking to land this right away, right? some comments inline - it's definitely getting trickier as it grows the necessary complexity to fit into the rest of the system, but it still feels pretty reasonable. The control flow around set_output is pretty gnarly though.

Given the description comments I didn't spend too much time on the codegen and am waiting on the subsequent work to accept, but LMK if I misunderstood your timeline.

bhosmer · 2020-12-02T23:40:30Z

aten/src/ATen/TensorMeta.h

+//
+// Example usage:
+//
+//    TORCH_META_FUNC(add_cpu) (


TORCH_IMPL_FUNC

bhosmer · 2020-12-02T23:56:45Z

aten/src/ATen/TensorMeta.h

+// functional/out/inplace, and could also be specialized for CPU/CUDA/etc
+// (although presently it isn't).
+//
+// A notable subclass of this interface is TensorIterator(Base).


Not sure the parens clarify

bhosmer · 2020-12-03T00:01:08Z

aten/src/ATen/TensorMeta.h

+// A notable subclass of this interface is TensorIterator(Base).
+struct CAFFE2_API MetaBase {
  virtual void set_output(int64_t output_idx, IntArrayRef sizes, IntArrayRef strides, TensorOptions options, DimnameList names) = 0;
+  virtual const Tensor& maybe_get_output(int64_t output_idx) = 0;


note to self, ask for profile data

bhosmer · 2020-12-03T00:03:16Z

aten/src/ATen/core/dispatch/Dispatcher.h


+  // Retrieve a reference to the function for an operator.  This remains
+  // valid for as long as the underlying function si valid.
+  const KernelFunction& getKernel(const OperatorHandle& op, DispatchKey dispatchKey) const;


This is a pretty significant addition to the API, given its potential for (mis)use in devirtualization-type use cases. Is it worth adding a warning of some sort? If anything I think the current comment kind of underplays the trouble you could get into. (also "si")

*warning in the comments

I'm dumping this, I can't use it due to compiler bug, see #48763

bhosmer · 2020-12-03T00:17:56Z