beef up at::_ops API by bdhirsh · Pull Request #59115 · pytorch/pytorch

bdhirsh · 2021-05-27T23:20:53Z

This PR beefs up the at::_ops:: API as a source of truth for compile-time information about each operator.

Changes

For every op defined in native_functions.yaml, e.g. at::_ops::add_Tensor previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor).

Now, at::_ops::add_Tensor is a struct containing a few static fields and methods (declared in Operators.h, defined in Operators.cpp):

struct TORCH_API add_Tensor {
  using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &);
  using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &);
  static constexpr const char* name = "aten::add";
  static constexpr const char* overload_name = "Tensor";
  static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor";
  static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
  static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot
};

What used to be the function at::_ops::add_Tensor can now be accessed as at::_ops::add_Tensor::call, and I've added a new macro to access the entire struct (naming suggestions welcome) - ATEN_OP2(add, Tensor).

Motivation

There were two motivations for this change:

Codegen refactor
The at::_ops:: API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the at::_ops API to do the real work. The function and method API's call into at::_ops::{op}::call, while the redispatch API calls into at::_ops::{op}::redispatch.

This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files.

Extra compile-time metadata
In the boxed CPU fallback PR above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to at::_ops means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them.

The updated API looks like this (see the XLA-side PR for more examples)

return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0);

Characteristics of the at::_ops API
(I also commented this in the codegen)

(1) It follows the Dispatcher API.

This means, e.g., that it takes in the expanded arguments rather than TensorOptions. This is kind of necessary for perf, if we want to at::_ops to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again.

(2) Overload names are disambiguated.

This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call)

(3) No argument defaulting is allowed.

This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though!

(4) manual_cpp_bindings and faithful names are not included in the API.

I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that ATEN_OP2(add, out) is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API).

I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could.

Details

Instead of codegen'ing the existing 3 API's in Functions.cpp, TensorMethods.cpp and RedispatchFunctions.cpp, I codegen them directly into the headers: Functions.h, TensorBody.h, and RedispatchFunctions.h. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into at::_ops, so the compiler should just inline them all anyway.

The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that TensorBody.h now includes Operators.h (because the codegen'd method API is implemented by calling into at::_ops), but TensorBody.h also includes the definition of the Tensor class. That means that Operators.h can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to:

Not allow defaulting in the at::_ops API
Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (check_tensor_options_and_extract_memory_format)

It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to Operators.cpp is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere.

Moving the code into the headers also meant that the codegen no longer needs to deal with Functions.cpp/TensorMethods.cpp/RedispatchFunctions.cpp. All of the functions that used to be defined in TensorMethods.cpp seemed small enough for me to lump into TensorBody.h, but some of the functions in Functions.cpp looked pretty big to put in a header, so I moved the file to aten/src/ATen/native/Functions.cpp.

It might be worth keeping TensorMethods.cpp there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header.

Perf
I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling at::add(). I also saw in the output that at::add() was successfully getting inlined.

There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting libtorch.so directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for at::add and at::_ops::add_Tensor::call.

Stack from ghstack:

move all external kernels into a class for better compiler error messages #59839 move all external kernels into a class for better compiler error messages
add a boxed CPU fallback kernel #58065 add a boxed CPU fallback kernel
beef up at::_ops API #59115 beef up at::_ops API

Differential Revision: D28833086

[ghstack-poisoned]

facebook-github-bot · 2021-05-27T23:20:58Z

💊 CI failures summary and remediations

As of commit 5c0657e (more details on the Dr. CI page and at hud.pytorch.org/pr/59115):

3/3 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Jun 17 14:22:04 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory

Jun 17 14:22:04 ++++ extract_trap_cmd
Jun 17 14:22:04 ++++ printf '%s\n' ''
Jun 17 14:22:04 +++ printf '%s\n' cleanup
Jun 17 14:22:04 ++ trap -- '
Jun 17 14:22:04 cleanup' EXIT
Jun 17 14:22:04 ++ [[ pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-build != *pytorch-win-* ]]
Jun 17 14:22:04 ++ which sccache
Jun 17 14:22:04 ++ sccache --stop-server
Jun 17 14:22:04 ++ true
Jun 17 14:22:04 ++ rm /var/lib/jenkins/sccache_error.log
Jun 17 14:22:04 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory
Jun 17 14:22:04 ++ true
Jun 17 14:22:04 ++ [[ -n '' ]]
Jun 17 14:22:04 ++ [[ pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-build == *rocm* ]]
Jun 17 14:22:04 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log
Jun 17 14:22:04 ++ SCCACHE_IDLE_TIMEOUT=1200
Jun 17 14:22:04 ++ RUST_LOG=sccache::server=error
Jun 17 14:22:04 ++ sccache --start-server
Jun 17 14:22:04 sccache: Starting the server...
Jun 17 14:22:04 ++ sccache --zero-stats
Jun 17 14:22:04 Compile requests                      0

Lint / clang-tidy (2/2)

Step: "Run clang-tidy" (full log | diagnosis details | 🔁 rerun)

2021-06-17T14:14:13.3451960Z /__w/pytorch/pytor...fter top level declarator [clang-diagnostic-error]

2021-06-17T14:14:13.3441841Z /__w/pytorch/pytorch/aten/src/ATen/templates/Operators.cpp:12:15: error: expected ';' after top level declarator [clang-diagnostic-error]
2021-06-17T14:14:13.3443131Z ${definitions}
2021-06-17T14:14:13.3443457Z               ^
2021-06-17T14:14:13.3443722Z               ;
2021-06-17T14:14:13.3445119Z /__w/pytorch/pytorch/aten/src/ATen/templates/RegisterBackendSelect.cpp:17:1: error: C++ requires a type specifier for all declarations [clang-diagnostic-error]
2021-06-17T14:14:13.3446366Z ${backend_select_method_definitions}
2021-06-17T14:14:13.3446769Z ^
2021-06-17T14:14:13.3448183Z /__w/pytorch/pytorch/aten/src/ATen/templates/RegisterBackendSelect.cpp:17:3: error: use of undeclared identifier 'backend_select_method_definitions' [clang-diagnostic-error]
2021-06-17T14:14:13.3449644Z ${backend_select_method_definitions}
2021-06-17T14:14:13.3450220Z   ^
2021-06-17T14:14:13.3451960Z /__w/pytorch/pytorch/aten/src/ATen/templates/RegisterBackendSelect.cpp:17:37: error: expected ';' after top level declarator [clang-diagnostic-error]
2021-06-17T14:14:13.3453535Z ${backend_select_method_definitions}
2021-06-17T14:14:13.3453977Z                                     ^
2021-06-17T14:14:13.3454321Z                                     ;
2021-06-17T14:14:13.3470430Z ##[error]Process completed with exit code 1.
2021-06-17T14:14:13.3583246Z Post job cleanup.
2021-06-17T14:14:13.3591897Z ##[command]/usr/bin/docker exec  a5fd74491f8e15df02544e0bde4b051c0dcd093166dad7677a7470bce8302be8 sh -c "cat /etc/*release | grep ^ID"
2021-06-17T14:14:13.8024920Z [command]/usr/bin/git version
2021-06-17T14:14:13.8092758Z git version 2.32.0
2021-06-17T14:14:13.8145819Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2021-06-17T14:14:13.8186628Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :

1 failure not recognized by patterns:

Job	Step	Action
^{pytorch_linux_bionic_py3_8_gcc9_coverage_test2}	^{Run tests}	🔁 rerun

1 job timed out:

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

bdhirsh · 2021-06-01T22:05:33Z

aten/src/ATen/templates/Operators.h

+template<typename T>
+class List;
+class Stream;
+struct Storage;


forward declared a bunch of classes needed in the function declarations; We don't need to #include any of these files since we're only declaring method signatures in this file.

bdhirsh · 2021-06-01T22:06:20Z

aten/src/ATen/templates/TensorBody.h

  /// Returns the `TensorOptions` corresponding to this `Tensor`. Defined in
  /// TensorOptions.h.
-  TensorOptions options() const;
+  TensorOptions options() const {


I could also move all of these methods back into TensorMethods.cpp like I mentioned in the description, and keep that file around. These all just seemed small enough to inline though

This will also probably be a perf boost, I have some local benchmarks showing that options() function-call overhead is noticeable

bdhirsh · 2021-06-01T22:07:27Z

caffe2/contrib/aten/gen_op.py

    'bool': 'assignToValue<int64_t>(Output(${offset}),${output});',
    'int64_t': 'assignToValue<int64_t>(Output(${offset}),${output});',
-    'std::vector<at::Tensor>': 'assignListStartingAt(${offset}, ${output});',
+    '::std::vector<at::Tensor>': 'assignListStartingAt(${offset}, ${output});',


this was terrible; we have a native operator called std, so the compiler confused this with at::_ops::std 😛

This is indeed terrible lol. Wonder if we should munge the at::_ops:: names haha

bdhirsh · 2021-06-01T22:32:29Z

I'll rebase and look more closely at tests once #59018 lands

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. [ghstack-poisoned]

bdhirsh · 2021-06-02T13:11:45Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-02T14:05:45Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

ezyang · 2021-06-02T17:58:11Z

nit:

  using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &);

This can just be

  using ptr_schema = schema*;

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

ezyang · 2021-06-02T21:04:59Z

aten/src/ATen/templates/Functions.h

-    std::initializer_list<int64_t> padding, IntArrayRef dilation = 1, int64_t groups = 1);
-TORCH_API at::Tensor conv2d(
+    std::initializer_list<int64_t> padding_, IntArrayRef dilation = 1, int64_t groups = 1) {
+  auto padding = IntArrayRef(padding_);


What's going on here? Initializer list should implicitly construct to IntArrayRef

I copied that straight in from where it used to live in Functions.cpp. I didn't bother checking for whether there was an implicit conversion- cool, I'll fix up the other conv methods here too.

We actually might be able to get rid of these; they were added in #45667 because {1, 2} can implicitly be converted to a std::string OR an IntArrayRef and is ambiguous, but that doesn't matter now that we've moved to c10::string_view; it doesn't look like you can implicitly convert {1, 2} to a string_view. I'll confirm locally.

{1, 2} can implicitly be converted to a std::string

bruhhh LOL. OK, either way is fine then.

ezyang · 2021-06-03T16:26:54Z

tools/codegen/gen.py

+        if Variant.function not in f.variants:
            return None

-        with native_function_manager(f):


I'm really liking the simplification haha

ezyang

The ::std business is awful but otherwise this is a very nice simplification.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-10T23:34:28Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-11T00:51:17Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-11T15:14:38Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-11T19:51:21Z

aten/src/ATen/templates/DispatchKeyFunctions_inl.h

+// The only #includes we need are for custom classes that have defaults in the C++ API
+#include <c10/core/MemoryFormat.h>
+#include <c10/core/Scalar.h>
+#include <ATen/core/Reduction.h>


doesn't look like git displayed the [rename DispatchKeyFunctions.h -> DispatchKeyFunctions_inl.h] + [create DispatchKeyFunctions.h] changes nicely. The only real change was tightening the #includes.

bdhirsh · 2021-06-11T19:52:40Z

@ezyang @bhosmer I just updated to fix the static build #include cycle fiasco, so you might want to give the PR (or just the last commit) a look. See the note in DispatchKeyFunctions.h for the details.

I still need to watch CI for more failures + fix the internal build afterwards (we have new codegen'd files to add to buck again 💀 )

bdhirsh · 2021-06-11T21:43:35Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-14T14:08:16Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bdhirsh · 2021-06-14T19:37:38Z

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

bhosmer

Really nice piece of work 😁

This PR beefs up the `at::_ops::` API as a source of truth for compile-time information about each operator. ### Changes For every op defined in native_functions.yaml, e.g. `at::_ops::add_Tensor` previously defined an unambiguous function; effectively an unambiguously named version of the C++ API that you could decltype() successfully because it had no overloads with a user-facing macro: `decltype(ATEN_FN2(add, Tensor)) // expands to decltype(at::_ops::add_Tensor)`. Now, `at::_ops::add_Tensor` is a struct containing a few static fields and methods (declared in `Operators.h`, defined in `Operators.cpp`): ``` struct TORCH_API add_Tensor { using schema = at::Tensor (const at::Tensor &, const at::Tensor &, const at::Scalar &); using ptr_schema = at::Tensor (*)(const at::Tensor &, const at::Tensor &, const at::Scalar &); static constexpr const char* name = "aten::add"; static constexpr const char* overload_name = "Tensor"; static constexpr const char* schema_str = "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor"; static at::Tensor call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha); static at::Tensor redispatch(c10::DispatchKeySet dispatchKeySet, const at::Tensor & self, const at::Tensor & ot }; ``` What used to be the function `at::_ops::add_Tensor` can now be accessed as `at::_ops::add_Tensor::call`, and I've added a new macro to access the entire struct (naming suggestions welcome) - `ATEN_OP2(add, Tensor)`. ### Motivation There were two motivations for this change: **Codegen refactor** The `at::_ops::` API as it exists now is (yet another) C++ entry point into the dispatcher, in addition to the Function, Method, and Redispatch APIs. Instead, after this PR, the existing three API's are all inline-able wrapper API's that call into the `at::_ops` API to do the real work. The function and method API's call into `at::_ops::{op}::call`, while the redispatch API calls into `at::_ops::{op}::redispatch`. This will hopefully make it easier to pile in any future C++ API's that we want to code-generate. It also means that stuff like the string name, overload name, and schema of each operator is consolidated in a single place, rather than having the codegen hardcode various strings in multiple codegen output files. **Extra compile-time metadata** In the [boxed CPU fallback PR](https://github.com/pytorch/pytorch/pull/58065/files#diff-c9b55f0d692a9bea8019c6f19bc46877f1efa0f9d4fc2086cf299b52768343b4R31) above this in the stack, I added a new API that external backends can use to call directly into their boxed fallback from an unboxed context. Adding extra metadata to `at::_ops` means that XLA's usage of that API doesn't require passing in the string name and overload of each name as arguments; we can just infer them. The updated API looks like this (see [the XLA-side PR ](https://github.com/pytorch/xla/pull/2945/files#diff-5e65c3c1d847191cb691d1874732e971f09fa1aad7a980a555c3b0504a5b6470R250) for more examples) ``` return at::native::call_fallback_fn<&xla_cpu_fallback, ATEN_OP2(add, Tensor)>::call(a, b, 1.0); ``` **Characteristics of the `at::_ops` API** (I also commented this in the codegen) (1) It follows the Dispatcher API. This means, e.g., that it takes in the expanded arguments rather than `TensorOptions`. This is kind of necessary for perf, if we want to `at::_ops` to serve as the main implementation of the existing C++ API's. For example: if it followed the C++ API, then all of the faithful C++ factory functions would need to wrap their arguments into TensorOptions only to unwrap them again. (2) Overload names are disambiguated. This is the same as before; it's helpful for pytorch extenders who would like to decltype() an aten operator, that has overloads, e.g. decltype(at::_ops::mul_Tensor::call) (3) No argument defaulting is allowed. This is more of an implementation detail to avoid #include cycles, since TensorBody.h (which defines the Tensor class) needs to include this file. The #include situation is precarious though! (4) manual_cpp_bindings and faithful names are not included in the API. I think that this is one we have a choice with. This applies to stuff like __dispatch__is_complex(), and add_outf(). These aren't "real native_functions.yaml ops", they're just additional functions provided by the C++ API. They're implemented as wrappers in Functions.h that call into the actual operators defined here, i.e. at::_ops::is_complex::call() and at::_ops::add_out::call(). This means that ATEN_OP(is_complex) will not fastpath, and will go through the dispatcher. It also means that `ATEN_OP2(add, out)` is automatically faithful and takes its out argument at the end (this is just because it follows the dispatcher API). I thought it made sense not to move these into the at::_ops API because they aren't universal: the Redispatch API doesn't care about which ops have manual_cpp_bindings. But... it could. **Details** Instead of codegen'ing the existing 3 API's in `Functions.cpp`, `TensorMethods.cpp` and `RedispatchFunctions.cpp`, I codegen them directly into the headers: `Functions.h`, `TensorBody.h`, and `RedispatchFunctions.h`. I mostly did this for perf, since we want to avoid introducing an extra function call in the hot path of every operator. These functions are also now all one-liners that call into `at::_ops`, so the compiler should just inline them all anyway. The main downside in doing that though was that I had to bend over backwards in a few cases to avoid cyclical #include statements. The issue is that `TensorBody.h` now includes `Operators.h` (because the codegen'd method API is implemented by calling into `at::_ops`), but `TensorBody.h` also includes the definition of the Tensor class. That means that `Operators.h` can't be aware of the Tensor class; it needs to forward declare everything and avoid using the Tensor class directly. To fix cyclic includes, I had to: - Not allow defaulting in the `at::_ops` API - Move some code that was called when translating from C++ to Dispatcher API's directly into the codegen template (`check_tensor_options_and_extract_memory_format`) It's not great, but I don't think this specific include cycle will break down in the near future; the only code that we need to call before getting to `Operators.cpp` is the translations from various API's to the dispatcher API; there aren't many of them, and there's no major reason for them to live an external utils file somewhere. Moving the code into the headers also meant that the codegen no longer needs to deal with `Functions.cpp`/`TensorMethods.cpp`/`RedispatchFunctions.cpp`. All of the functions that used to be defined in `TensorMethods.cpp` seemed small enough for me to lump into `TensorBody.h`, but some of the functions in `Functions.cpp` looked pretty big to put in a header, so I moved the file to `aten/src/ATen/native/Functions.cpp`. It might be worth keeping `TensorMethods.cpp` there and leaving it too, in-case we have any beefy hand-written tensor methods that we don't want to put in a header. **Perf** I ran a few benchmarks in callgrind, and didn't see a noticeable instruction count change when calling `at::add()`. I also saw in the output that `at::add()` was successfully getting inlined. There's also probably a light risk of binary size increase; I think that there's a binary size regression test that I can run in phabricator (going to try it). I can also try inspecting `libtorch.so` directly and seeing if it's any bigger, but my hope is that the inline-ing means that we aren't generated separate symbols for `at::add` and `at::_ops::add_Tensor::call`. Differential Revision: [D28833086](https://our.internmc.facebook.com/intern/diff/D28833086) [ghstack-poisoned]

facebook-github-bot · 2021-06-17T20:11:19Z

@bdhirsh merged this pull request in e2129d1.

Reland of #59115, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871) This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_CONST_EXCEPT_WIN_CUDA` (had to make a new macro that was a slight adjustment to an existing one) This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: #40675 This reverts commit 6d0fb85. Differential Revision: [D29213932](https://our.internmc.facebook.com/intern/diff/D29213932) [ghstack-poisoned]

… API"" Reland of #59115, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871) This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_CONST_EXCEPT_WIN_CUDA` (had to make a new macro that was a slight adjustment to an existing one) This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: #40675 This reverts commit 6d0fb85. Differential Revision: [D29213932](https://our.internmc.facebook.com/intern/diff/D29213932) [ghstack-poisoned]

Reland of #59115, but with a fix for windows cuda builds (example failure in master here: https://github.com/pytorch/pytorch/runs/2852662871) This is identical to the original PR except for one change in `tools/codegen/gen.py`: `static constexpr` -> `static CONSTEXPR_CONST_EXCEPT_WIN_CUDA` (had to make a new macro that was a slight adjustment to an existing one) This actually took a while to figure out, until I tracked down a previous pytorch PR that encountered a similar issue: #40675 This reverts commit 6d0fb85. Differential Revision: [D29213932](https://our.internmc.facebook.com/intern/diff/D29213932) [ghstack-poisoned]

beef up at::_ops API

c118911

[ghstack-poisoned]

This was referenced May 27, 2021

bugfix: ensure that at::{dispatch_key}:: API gets external linkage #58569

Closed

generate C++ API for meta functions using at::meta:: #58570

Closed

facebook-github-bot added the cla signed label May 27, 2021

bdhirsh mentioned this pull request May 27, 2021

add a boxed CPU fallback kernel #58065

Closed

bdhirsh commented Jun 1, 2021

View reviewed changes

bdhirsh requested review from bhosmer and ezyang June 1, 2021 22:15

bdhirsh added 2 commits June 2, 2021 07:13

ezyang reviewed Jun 2, 2021

View reviewed changes

ezyang reviewed Jun 3, 2021

View reviewed changes

ezyang approved these changes Jun 3, 2021

View reviewed changes

bdhirsh mentioned this pull request Jun 3, 2021

proof of concept for redispatch inline optimization #59395

Closed

bdhirsh mentioned this pull request Jun 10, 2021

move all external kernels into a class for better compiler error messages #59839

Closed

bdhirsh added 2 commits June 10, 2021 17:52

bdhirsh commented Jun 11, 2021

View reviewed changes

bdhirsh added 4 commits June 15, 2021 10:00

bhosmer approved these changes Jun 17, 2021

View reviewed changes

facebook-github-bot closed this in e2129d1 Jun 17, 2021

facebook-github-bot added the Merged label Jun 17, 2021

facebook-github-bot deleted the gh/bdhirsh/123/head branch June 21, 2021 14:15

bdhirsh mentioned this pull request Jun 22, 2021

Revert "Revert D28833086: beef up at::_ops API" #60214

Closed

Conversation

bdhirsh commented May 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Motivation

Uh oh!

facebook-github-bot commented May 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CI failures summary and remediations

🕵️ 2 new failures recognized by patterns

pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build (1/2)

Lint / clang-tidy (2/2)

1 failure not recognized by patterns:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bdhirsh commented Jun 1, 2021

Uh oh!

bdhirsh commented Jun 2, 2021

Uh oh!

bdhirsh commented Jun 2, 2021

Uh oh!

ezyang commented Jun 2, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bdhirsh Jun 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ezyang left a comment

Choose a reason for hiding this comment

Uh oh!

bdhirsh commented Jun 10, 2021

Uh oh!

bdhirsh commented Jun 11, 2021

Uh oh!

bdhirsh commented Jun 11, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bdhirsh commented Jun 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bdhirsh commented Jun 11, 2021

Uh oh!

bdhirsh commented Jun 14, 2021

Uh oh!

bdhirsh commented Jun 14, 2021

Uh oh!

bhosmer left a comment

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot commented Jun 17, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bdhirsh commented May 27, 2021 •

edited

Loading

facebook-github-bot commented May 27, 2021 •

edited

Loading

bdhirsh Jun 3, 2021 •

edited

Loading

bdhirsh commented Jun 11, 2021 •

edited

Loading