
generate in-place/out wrappers for external kernels#56835

Closed
bdhirsh wants to merge 4 commits into gh/bdhirsh/109/base from gh/bdhirsh/109/head

Conversation

Collaborator

@bdhirsh bdhirsh commented Apr 23, 2021

**Overview**
This PR adds support for auto-generating in-place/out wrappers for external backends, given a functional kernel.

  • If the backend provides a functional kernel but not an out kernel, we generate one for them
  • If the backend provides a functional kernel but not an inplace kernel, we generate one for them
  • If they explicitly provide either kernel, we use whatever they provided
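As a rough illustration, the decision in the bullets above can be sketched like this (`KernelGroup` and the kernel names are hypothetical stand-ins, not the actual codegen data model):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KernelGroup:
    # hypothetical stand-in for the codegen's function-group data model
    functional: Optional[str]  # backend-provided functional kernel, if any
    inplace: Optional[str]
    out: Optional[str]

def wrappers_to_generate(g: KernelGroup) -> List[str]:
    # Only generate a wrapper when the backend supplied the functional
    # kernel but omitted the inplace/out variant; explicitly provided
    # kernels are always used as-is.
    generated = []
    if g.functional is not None:
        if g.out is None:
            generated.append("out")
        if g.inplace is None:
            generated.append("inplace")
    return generated
```

For example, a backend that hand-wrote `add` and `add_` but no `add.out` kernel would get only an out wrapper generated.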

With this change, the codegen generates two new files for XLA that are more in line with what our core codegen generates:

  • `RegisterXLA.cpp`
  • `RegisterAutogradXLA.cpp`

**functionality delta**
For context, what XLA’s codegen currently provides is:

  • Support for generating `out` wrapper kernels, specified in a hardcoded list. This PR subsumes that, generating out kernels whenever we encounter an op that has a functional kernel but no out kernel
  • No support for in-place wrappers. XLA has hand-written kernels for the in-place variants of most ops. I still need to confirm this, but I think we’ll be able to remove a lot of that code in favor of these codegen’d wrappers later. As a test, I have a test PR here (pytorch/xla#2915) that removes `add_` and uses the codegen’d version instead.
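Semantically, a generated wrapper just calls the functional kernel and copies the result into the destination. A toy Python model of that behavior (the real wrappers are generated C++; `add`/`add_`/`add_out` here are purely illustrative):

```python
def add(a, b):
    # stand-in for the backend's hand-written functional kernel
    return [x + y for x, y in zip(a, b)]

def add_out(a, b, out):
    # generated out wrapper: run the functional kernel, then copy
    # the result into the caller-provided output buffer
    out[:] = add(a, b)
    return out

def add_(a, b):
    # generated inplace wrapper: same idea, copying back into `a`
    a[:] = add(a, b)
    return a
```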

**code changes**

The PR also splits the external backend codegen logic across two places:

  • `register_dispatch_key.py`. Contains all the "new" stuff: kernel wrappers + dispatcher registrations for all XLA kernels, including codegen'd inplace/out wrappers. It also provides support for namespaced definitions, e.g. `at::xla::{op}`, although this feature isn't "turned on" yet.
  • `gen_external_aten_fallbacks.py`. I updated this file to *really* only do fallback-related stuff, which should make it easier to kill if we switch to a boxed fallback later. It no longer generates `out` wrappers, and it no longer generates dispatcher registrations for all kernels; only for CPU fallbacks.

**logic re-use in the codegen**
Just want to call out my reasoning for what goes where in the codegen, with respect to a few changes:

  1. For normal unstructured kernels, I almost entirely re-used `RegisterDispatchKey.gen_unstructured`, with some changes to make it work for external backend kernels.
  2. For generating the out/inplace wrappers, I pushed all of that logic into its own function; it's small enough + different enough from the existing structured/unstructured logic that I had a hard time trying to generalize the existing logic to handle this case.
  3. One note on the split between CPU fallback logic vs. everything else. There's one area where the logic bleeds over: right now in `gen_unstructured()`, I actually generate namespaced definitions for all backend kernels, including those with a CPU fallback. My thought process was that the goal of a CPU fallback is to paper over gaps in what the backend has implemented, and in a similar vein it would be nice if `at::xla::{op}` "just worked" for all ops, in case someone wanted to write C++ code in a uniform API meant just for XLA.

Before (pictorially)

```
RegisterDispatchKey.__call__() -------> gen_unstructured()
```

After (pictorially). My ascii art is awful but hopefully you get the idea

```
RegisterDispatchKey.__call__() -------------------------------------> gen_unstructured()
                             \                                       /
                               \                                   /
                                     --> gen_inplace_out_wrapper()
```
gen_inplace_out_wrapper() has the specific task of generating anonymous definitions (wrappers) for inplace/out kernels, and otherwise forwards all other logic, including dispatcher registrations + namespaced definitions, to gen_unstructured().

One wrinkle: `gen_unstructured` takes in a single `Union[NativeFunction, ExternalBackendFunction]`, which technically isn't enough context to decide when to generate registrations/namespaced definitions. That's because it now also needs the context of the entire function group: if the backend has not implemented an `out` kernel, then whether to generate a registration for it depends entirely on whether the backend has implemented a `functional` kernel. I ended up adding an extra optional argument to `gen_unstructured()` to deal with that: a bool that tells `gen_unstructured` if it's dealing with an autogenerated wrapper.
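A simplified sketch of the resulting call structure (signatures and emitted strings are illustrative, not the real codegen):

```python
def gen_unstructured(kernel_name: str, is_generated_wrapper: bool = False) -> str:
    # handles the dispatcher registration (and, eventually, namespaced
    # definitions) for a single kernel; the extra bool tells it the kernel
    # is a codegen'd wrapper rather than a backend-provided one
    comment = "// registration for a codegen'd wrapper\n" if is_generated_wrapper else ""
    return f'{comment}m.impl("{kernel_name}", TORCH_FN(&{kernel_name}));'

def gen_inplace_out_wrapper(functional_name: str, out_name: str) -> str:
    # emits only the anonymous wrapper definition, then forwards the
    # rest (registrations etc.) to gen_unstructured
    wrapper = f"// wrapper: call {functional_name}, copy result into out\n"
    return wrapper + gen_unstructured(out_name, is_generated_wrapper=True)
```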

**What this PR does not do**
Structured kernel support for external backends, to be added at some point in the future.

Stack from ghstack:

@facebook-github-bot
Contributor

facebook-github-bot commented Apr 23, 2021

💊 CI failures summary and remediations

As of commit 847ab16 (more details on the Dr. CI page):


  • 5/5 failures possibly* introduced in this PR
    • 1/5 non-scanned failure(s)

4 failures not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| GitHub Actions flake8-py3 | Unknown | 🔁 rerun |
| GitHub Actions clang-tidy | Unknown | 🔁 rerun |
| GitHub Actions mypy | Unknown | 🔁 rerun |
| GitHub Actions quick-checks | Unknown | 🔁 rerun |

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


bdhirsh added a commit that referenced this pull request Apr 23, 2021
```python
ExternalBackendFunction,
ExternalBackendFunctionsGroup,
Union[NativeFunction, NativeFunctionsGroup],
Union[NativeFunction, ExternalBackendFunction],
```
Collaborator Author

the blow-up continues

bdhirsh added a commit that referenced this pull request Apr 26, 2021
```python
)
return list(mapMaybe(structured_gen.gen_one, g.functions()))

@method_with_native_function
```
Collaborator Author

So, I'm going to have to add this back before I merge the PR. One annoying aspect of my change is that I added an optional `is_generated_wrapper` parameter to `gen_unstructured`. In order to re-use `gen_unstructured` for the code-generated inplace/out wrappers, `gen_unstructured` needs that extra bit of information: it's not available just from the `ExternalBackendFunction`, because it requires information about the entire function group, i.e. we only generate a wrapper if (a) the inplace/out kernel is not implemented, and (b) the functional kernel is implemented.

Another option would have been to factor out the logic for registrations/definitions/declarations into their own functions that are then called by both `gen_unstructured` and the `gen_out_inplace_wrappers` code, but I tentatively found the approach I took here to be less confusing.

After some digging I couldn't find a way to make a flavor of `method_with_native_function` that properly deals with defaultable arguments, so my plan was to add a `with native_function_manager(f): ...` at the beginning of this function. That would indent the entire function and make it impossible to see the delta in this PR though, so my plan is to wait until it's been reviewed to add that in. Or, take it out completely depending on what the feedback is :)
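For concreteness, here's a rough sketch of the decorator shape being discussed. These are simplified stand-ins, not the real `method_with_native_function`/`native_function_manager`; the typing difficulty is precisely in preserving the precise static signature while also allowing extra defaulted arguments:

```python
import contextlib
import functools

@contextlib.contextmanager
def native_function_manager(f):
    # stand-in: the real manager sets up error-reporting context for `f`
    yield

def method_with_native_function(func):
    # untyped flavor: forwards any extra (possibly defaulted) arguments,
    # but loses the precise static signature of `func` in the process
    @functools.wraps(func)
    def wrapper(self, f, *args, **kwargs):
        with native_function_manager(f):
            return func(self, f, *args, **kwargs)
    return wrapper

class Codegen:
    @method_with_native_function
    def gen_unstructured(self, f, is_generated_wrapper=False):
        # defaulted extra argument passes through the decorator unchanged
        return (f, is_generated_wrapper)
```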

Contributor

I didn't understand all of this (still reading), but indentation is not a problem; you can tell people to use GitHub's ignore-whitespace feature (or review on phab)

Contributor

Probably just need to rework @method_with_native_function entirely. Tis really annoying to do decorators and strongly typed Python at the same time.

```python
# This is needed in order for namespaced definitions to call into CPU fallbacks,
# which live in a different namespace.
# TODO: this logic will change if we implement a boxed CPU fallback.
if isinstance(native_or_external, ExternalBackendFunction) \
```
Collaborator Author

if we want `at::xla::{op}` to work for all ops (including CPU fallbacks), this is where the extra logic lives.

Contributor

I take it back, maybe we shouldn't generate at::xla 😂

@bdhirsh bdhirsh requested review from bhosmer and ezyang April 26, 2021 23:20
@bdhirsh bdhirsh changed the title [WIP] generate in-place/out wrappers for external kernels generate in-place/out wrappers for external kernels Apr 26, 2021
```python
# This isn't an important characteristic of the dispatcher API, and could be removed if we want to further unify
# the dispatcher and C++ APIs. There happen to be a few places in the codegen where we need to guarantee name uniqueness.
name = f'{name}_{func.name.overload_name}'
return name
```
Contributor

Hmmm 🤔

```python
elif isinstance(f, NativeFunction):
elif isinstance(f, ExternalBackendFunctionsGroup):
    if f.structured:
        raise AssertionError("structured kernels not implemented yet for external backends")
```
Contributor

Do you even ever fill structured field on ExternalBackendFunctionsGroup?

```python
yield self.functional
yield self.out
if self.inplace is not None:
    yield self.inplace
```
Contributor

so... when do you ever need the out variant to come first?

Collaborator Author

yeah, you don't. I guess I just left the out-first version in to be consistent with the in-tree codegen, but that's not really necessary. I'll just kill the flag and add a comment explaining why we need functional to come first.

bdhirsh added a commit that referenced this pull request May 4, 2021
This is a re-write of #56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request May 4, 2021
bdhirsh added a commit that referenced this pull request May 11, 2021
bdhirsh added a commit that referenced this pull request May 11, 2021
bdhirsh added a commit that referenced this pull request May 12, 2021
bdhirsh added a commit that referenced this pull request May 12, 2021
bdhirsh added a commit that referenced this pull request May 14, 2021
bdhirsh added a commit that referenced this pull request May 14, 2021
facebook-github-bot pushed a commit that referenced this pull request May 17, 2021
Summary:
Pull Request resolved: #57510


Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474364

Pulled By: bdhirsh

fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
bdhirsh added a commit that referenced this pull request May 20, 2021
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Apr 13, 2022
@github-actions github-actions bot closed this May 13, 2022
@facebook-github-bot facebook-github-bot deleted the gh/bdhirsh/109/head branch June 12, 2022 14:18
3 participants