generate in-place/out wrappers for external kernels #56835
bdhirsh wants to merge 4 commits into gh/bdhirsh/109/base from
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediations
As of commit 847ab16 (more details on the Dr. CI page):
- 4 failures not recognized by patterns
- ci.pytorch.org: 1 failed
This comment was automatically generated by Dr. CI.
[ghstack-poisoned]
ExternalBackendFunction,
ExternalBackendFunctionsGroup,
Union[NativeFunction, NativeFunctionsGroup],
Union[NativeFunction, ExternalBackendFunction],
the blow-up continues
**Overview**

This PR adds support for auto-generating in-place/out wrappers from a functional kernel for external backends.

- If the backend provides a functional kernel but not an out kernel, we generate one for them.
- If the backend provides a functional kernel but not an in-place kernel, we generate one for them.
- If they explicitly provide either kernel, we use whatever they provided.

With this change, the codegen generates two new files for XLA that are more in line with what our core codegen generates:

- `RegisterXLA.cpp`
- `RegisterAutogradXLA.cpp`

**Functionality delta**

For context, what XLA’s codegen currently provides is:

- Support for generating `out` wrapper kernels, which are specified in a hardcoded list. This PR subsumes that, and generates out kernels whenever we encounter an op that has a functional kernel but no out kernel.
- No support for in-place wrappers. XLA has hand-written kernels for the in-place variants of most ops. I still need to confirm this, but I think we’ll be able to remove a lot of that code in favor of these codegen’d wrappers later. As a test, I have a test PR [here](pytorch/xla#2915) that removes `add_` and uses the codegen’d version instead.

**Code changes**

The PR also splits the external backend codegen logic over two places:

- `register_dispatch_key.py`. Contains all the "new" stuff: kernel wrappers + dispatcher registrations for all XLA kernels, including codegen'd in-place/out wrappers. It also provides support for namespaced definitions, e.g. `at::xla::{op}`, although this feature isn't "turned on" yet.
- `gen_external_aten_fallbacks.py`. I updated this file to *really* only do fallback-related stuff, which should make it easier to kill if we switch to a boxed fallback later. It no longer generates `out` wrappers, and it no longer generates dispatcher registrations for all kernels; only for CPU fallbacks.

**Logic re-use in the codegen**

Just want to call out my reasoning for what goes where in the codegen w.r.t. two changes:

1. For normal unstructured kernels, I almost entirely re-used `RegisterDispatchKey.gen_unstructured`, with some changes to make it work for external backend kernels.
2. For generating the out/in-place wrappers, I pushed all of that logic into its own function; it's small enough and different enough from the existing structured/unstructured logic that I had a hard time generalizing the existing logic to handle this case.
3. One note on the split between CPU fallback logic vs. everything else. There's one area where the logic bleeds over: right now in `gen_unstructured()`, I actually generate namespaced definitions for all backend kernels, including those that use a CPU fallback. My thought process was that the goal of a CPU fallback is to paper over deficiencies in what the backend has implemented, and in a similar vein it would be nice if `at::xla::{op}` "just worked" for all ops, if someone wanted to write C++ code in a uniform API meant just for XLA.

Before (pictorially):

```
RegisterDispatchKey.__call__() -------> gen_unstructured()
```

After (pictorially). My ascii art is awful but hopefully you get the idea:

```
RegisterDispatchKey.__call__() -----------------------------> gen_unstructured()
                               \                             /
                                --> gen_inplace_out_wrapper()
```

`gen_inplace_out_wrapper()` has the specific task of generating anonymous definitions (wrappers) for in-place/out kernels, and otherwise forwards all other logic, including dispatcher registrations + namespaced definitions, to `gen_unstructured()`.

One wrinkle: `gen_unstructured` takes in a single `Union[NativeFunction, ExternalBackendFunction]`, which technically isn't enough context to decide when to generate registrations/namespaced definitions. That's because it now also needs the context of the entire function group: if the backend has not implemented an `out` kernel, then whether or not to generate a registration for it depends entirely on whether the backend has implemented a `functional` kernel. I ended up adding an extra optional argument to `gen_unstructured()` to deal with that: a bool that tells `gen_unstructured` whether it's dealing with an autogenerated wrapper.

**What this PR does not do**

Structured kernel support for external backends. To add at some point in the future.

[ghstack-poisoned]
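To make the generation rule concrete, here's a minimal sketch in Python of the decision described above. `FunctionsGroup` and its fields are illustrative stand-ins, not the actual codegen data model:

```python
# Illustrative sketch of the wrapper-generation rule described above.
# FunctionsGroup and its fields are hypothetical, not the real codegen's types.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FunctionsGroup:
    functional: Optional[str]  # name of the backend's functional kernel, if any
    inplace: Optional[str]     # name of the backend's in-place kernel, if any
    out: Optional[str]         # name of the backend's out kernel, if any

def kernels_to_generate(g: FunctionsGroup) -> List[str]:
    """Return which wrappers the codegen would synthesize for this group."""
    generated = []
    # Wrappers can only be generated when a functional kernel exists to call.
    if g.functional is not None:
        if g.out is None:
            generated.append('out')
        if g.inplace is None:
            generated.append('inplace')
    return generated
```

For example, a backend that only implements the functional `add` kernel would get both wrappers generated, while one that also hand-writes `add_` would only get the out wrapper.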
)
return list(mapMaybe(structured_gen.gen_one, g.functions()))

@method_with_native_function
So, I'm going to have to add this back before I merge the PR. One annoying aspect of my change is that I added an optional is_generated_wrapper parameter to gen_unstructured. In order to re-use gen_unstructured for the code-generated inplace/out wrappers, gen_unstructured needs that extra bit of information: it's not available just from the ExternalBackendFunction, because it requires information relevant to the entire function group. i.e. we only generate a wrapper if (a) the inplace/out kernel is not implemented, and (b) the functional kernel is implemented.
Another option would have been to try to factor out the logic for registrations/definitions/declarations into their own functions that are then called by both gen_unstructured and the gen_out_inplace_wrappers code, but I tentatively found the approach I took here to be less confusing.
After some digging I couldn't find a way to make a flavor of `method_with_native_function` that properly deals with defaultable arguments, so my plan was to add a `with native_function_manager(f): ...` at the beginning of this function. That would indent the entire function and make it impossible to see the delta in this PR, though, so my plan is to wait until it's been reviewed to add that in. Or take it out completely, depending on what the feedback is :)
I didn't understand all of this (still reading), but indentation is not a problem, you can tell people to use GH ignore whitespace feature (or review on phab)
Probably just need to rework @method_with_native_function entirely. Tis really annoying to do decorators and strongly typed Python at the same time.
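For what it's worth, one way to make a `method_with_native_function`-style decorator tolerate extra defaultable arguments is to forward `*args`/`**kwargs` through the wrapper. A hedged sketch; the context manager below is a stand-in, not the real `native_function_manager`:

```python
import functools
from contextlib import contextmanager

@contextmanager
def native_function_context(f):
    # Stand-in for native_function_manager: would set up per-function
    # error-reporting context keyed on f.
    yield

def method_with_native_function(func):
    # Forwarding *args/**kwargs lets the wrapped method keep extra
    # defaultable parameters (e.g. is_generated_wrapper=False).
    @functools.wraps(func)
    def wrapper(self, f, *args, **kwargs):
        with native_function_context(f):
            return func(self, f, *args, **kwargs)
    return wrapper
```

This keeps the decorated method's defaulted arguments callable both positionally and by keyword, at the cost of losing the precise typed signature (which is the usual pain point with decorators and strongly typed Python).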
# This is needed in order for namespaced definitions to call into CPU fallbacks,
# which live in a different namespace.
# TODO: this logic will change if we implement a boxed CPU fallback.
if isinstance(native_or_external, ExternalBackendFunction) \
if we want at::xla::{op} to work for all ops (including cpu fallbacks), this is where the extra logic lives.
I take it back, maybe we shouldn't generate at::xla 😂
# This isn't an important characteristic of the dispatcher API, and could be removed if we want to further unify
# the dispatcher and C++ APIs. There happen to be a few places in the codegen where we need to guarantee name uniqueness.
name = f'{name}_{func.name.overload_name}'
return name
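The uniqueness scheme in the snippet above can be illustrated in isolation; the helper below is hypothetical, not the actual codegen function:

```python
def unique_kernel_name(base: str, overload_name: str) -> str:
    """Append the overload name so overloads like add.Tensor vs. add.Scalar
    get distinct generated names; default overloads keep the bare name."""
    if overload_name:
        return f'{base}_{overload_name}'
    return base
```

So `add.Tensor` and `add.Scalar` would map to distinct generated names, while the default (unnamed) overload stays as just `add`.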
elif isinstance(f, NativeFunction):
elif isinstance(f, ExternalBackendFunctionsGroup):
    if f.structured:
        raise AssertionError("structured kernels not implemented yet for external backends")
Do you even ever fill structured field on ExternalBackendFunctionsGroup?
yield self.functional
yield self.out
if self.inplace is not None:
    yield self.inplace
so... when do you ever need the out variant to come first?
yeah, you don't. I guess I just left the out-first version in to be consistent with the in-tree codegen, but that's not really necessary. I'll just kill the flag and add a comment explaining why we need functional to come first.
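The functional-first iteration the reply above settles on could look like this sketch (class and field names are illustrative stand-ins for the real data model):

```python
# Sketch of functional-first iteration over a function group. The functional
# variant is yielded first because the generated inplace/out wrappers depend
# on the functional kernel having been processed.
class Group:
    def __init__(self, functional, out, inplace=None):
        self.functional = functional
        self.out = out
        self.inplace = inplace

    def functions(self):
        yield self.functional
        yield self.out
        if self.inplace is not None:
            yield self.inplace
```

With the out-first/functional-first flag removed, callers always see the functional variant first, and the inplace slot is simply skipped for groups that don't have one.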
This is a re-write of #56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details. The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes: - I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops; if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR. [ghstack-poisoned]
Summary: Pull Request resolved: #57510 This is a re-write of #56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details. The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes: - I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR. Test Plan: Imported from OSS Reviewed By: navahgar Differential Revision: D28474364 Pulled By: bdhirsh fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
Stack from ghstack: