
generate inplace/out kernels for xla#57510

Closed
bdhirsh wants to merge 7 commits into gh/bdhirsh/113/base from gh/bdhirsh/113/head

Conversation

@bdhirsh
Collaborator

@bdhirsh bdhirsh commented May 3, 2021

This is a re-write of #56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:

  • I don't bother generating at::xla::{op} entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have at::cpu::{op} entries for composite ops; if you really want to bypass the dispatcher, you need to call at::compositeimplicitautograd::{op}. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.

Stack from ghstack:

Differential Revision: D28474364

@facebook-github-bot
Contributor

facebook-github-bot commented May 3, 2021

💊 CI failures summary and remediations

As of commit 9656b73 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

bdhirsh added a commit that referenced this pull request May 3, 2021
ghstack-source-id: 2b0d1e3
Pull Request resolved: #57510
bdhirsh added a commit that referenced this pull request May 3, 2021
ghstack-source-id: 2fcc843
Pull Request resolved: #57510
bdhirsh added a commit that referenced this pull request May 4, 2021
ghstack-source-id: 3742d8b
Pull Request resolved: #57510
@bdhirsh bdhirsh requested a review from ezyang May 4, 2021 17:33
inplace_meta = True
else:
def gen_unstructured(self, f: NativeFunction, *, is_generated_wrapper: bool = False) -> Optional[str]:
with native_function_manager(f):
Collaborator Author

The `with native_function_manager(f)` makes the diff painful, but there aren't too many changes in this function. As per the comment in the last PR, I can't easily use a decorator, since I have a defaultable argument.

Contributor

It's fine; reviewers can ignore whitespace in the GitHub UI.

@bdhirsh
Collaborator Author

bdhirsh commented May 4, 2021

Hooking this up to an XLA-side PR to sanity check.

@bdhirsh bdhirsh requested a review from bhosmer May 4, 2021 20:09
bdhirsh added a commit that referenced this pull request May 4, 2021
ghstack-source-id: 172973c
Pull Request resolved: #57510
def functions(self) -> Iterator[NativeFunction]:
yield self.out
yield self.functional
def functions(self, *, functional_first: bool = False) -> Iterator[NativeFunction]:
Contributor

Kind of goofy

Collaborator Author

Yeah, you're right; I think we should be able to unconditionally use the functional kernel first.
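The simplification being agreed on could look roughly like this (a sketch using a pared-down stand-in for NativeFunctionsGroup; the real class holds NativeFunction objects): yield the functional kernel first unconditionally instead of threading a functional_first flag through:

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class NativeFunctionsGroup:
    # Pared-down stand-in for the real torchgen group.
    functional: str
    out: str
    inplace: Optional[str] = None

    def functions(self) -> Iterator[str]:
        # Generated out/inplace wrappers always call the functional kernel,
        # so it is safe to yield it first unconditionally; no flag needed.
        yield self.functional
        yield self.out
        if self.inplace is not None:
            yield self.inplace

g = NativeFunctionsGroup(functional="add", out="add.out", inplace="add_")
```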

TORCH_CHECK(options.device() == out.device(),
"Expected out tensor to have device ", options.device(), ", but got ", out.device(), " instead");
bool resized = at::native::resize_output(outputs_[output_idx], sizes);
bool resized = {self.cpp_namespace}::resize_output(outputs_[output_idx], sizes);
Contributor

This seems a little questionable; are you actually changing which namespace we reference resize_output from here?

Collaborator Author

@bdhirsh bdhirsh May 5, 2021

Yep, that's a bug, thanks! at::native::resize_output is the thing we want to call to resize the output tensor, regardless of which backend we're generating for. We wouldn't have hit it until we eventually tried to enable structured external kernels.

I think that's true for the as_strided_ call below too

Contributor

Also for the empty calls above, I think

# We also skip registrations if there is a functional backend kernel,
# because we generate out/inplace wrappers in that case (handled in register_dispatch_key.py).
if self.backend_index.get(f) is not None or \
(isinstance(g, NativeFunctionsGroup) and self.backend_index.get(g.functional) is not None):
Contributor

The downside to having this split in a separate dest here is that you have to keep the logic here in sync with register_dispatch_key.py for when to generate out/inplace wrappers. This is probably... OK, because there are fairly large benefits to having this be a separate file, but it's probably good to explicitly cross-ref the corresponding condition and/or have a single function where you do the test that commons up the logic (even though it is short to type).

else:
assert_never(f)

def gets_generated_out_inplace_wrapper(self, f: NativeFunction, g: NativeFunctionsGroup) -> bool:
Contributor

This is the code to xref, I think

Collaborator Author

Yep, thanks; this should be shared with gen_external_aten_fallbacks.py. It's another thing to not worry about later when/if the boxed CPU fallback works out. (That's also why I kind of wanted to keep all the fallback stuff in a separate file, so it'll be easier to disentangle if we want to.)
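The cross-referenced condition could be commoned up roughly like this (a sketch with a simplified stand-in for BackendIndex; the real types and signatures live in torchgen): one predicate, imported by both files, decides when a wrapper gets generated:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class BackendIndex:
    # Simplified stand-in: maps operator name -> backend kernel name.
    kernels: Dict[str, str]

    def get(self, op: str) -> Optional[str]:
        return self.kernels.get(op)

def gets_generated_out_inplace_wrapper(op: str, functional_op: str,
                                       backend_index: BackendIndex) -> bool:
    # Single source of truth for "do we generate an out/inplace wrapper?":
    # the backend has no kernel for this variant, but it does have one for
    # the functional variant. Sharing this predicate keeps the fallback
    # codegen and register_dispatch_key.py from drifting out of sync.
    return backend_index.get(op) is None and \
        backend_index.get(functional_op) is not None

idx = BackendIndex(kernels={"add": "xla_add"})
```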

# All other functionality can be directly pulled from gen_unstructured.
is_generated_wrapper = self.gets_generated_out_inplace_wrapper(f, g)
if self.target is not Target.ANONYMOUS_DEFINITION:
return self.gen_unstructured(f, is_generated_wrapper=is_generated_wrapper)
Contributor

This short circuit smacks of "we put the interception point for gen_inplace_out_wrapper too early": to tell if this function is actually implemented correctly, I still have to understand all of gen_unstructured (and there is now an extra illegal state, when is_generated_wrapper=True but self.target is Target.ANONYMOUS_DEFINITION). Instead of the condition at https://github.com/pytorch/pytorch/pull/57510/files?diff=unified&w=1#diff-e9998363493c5dcd97969d077dcdea1e835c95bf39f2bbb5afde127a3c1ef728R82, maybe you should just directly embed this logic in gen_unstructured. You can guarantee that you end up in there even in the structured case by adjusting the condition at https://github.com/pytorch/pytorch/pull/57510/files?diff=unified&w=1#diff-e9998363493c5dcd97969d077dcdea1e835c95bf39f2bbb5afde127a3c1ef728R170 appropriately. If the complaint is that gen_unstructured is too long, I'd suggest remedying that by refactoring the internal codegen pieces of gen_unstructured into helper methods.

Collaborator Author

Agreed that all the conditional logic that decides when to call gen_unstructured vs. gen_inplace_out_wrapper is hard to follow, and I like the idea of keeping it all in the same place: that section of gen_structured() that you pointed out.

My main motivation for keeping this in a separate function was that this is the only bit of unstructured codegen that actually cares about the entire NativeFunctionsGroup object, so it would be confusing to pass an extra (optional) NativeFunctionsGroup object to gen_unstructured that's only used under certain conditions, but you're right, factoring out bits of gen_unstructured into helper functions should be fine.

Contributor

Might be a good idea to just pass it unconditionally, tbh, in case more codegen in unstructured kernels starts making use of the group information.

updates = f'at::_copy_from_and_resize({functional_result_name}, {ret_name});'
returns = ret_name

functional_sig = self.wrapper_kernel_sig(g.functional)
Contributor

Although you can skip the translate call here because everything is the same, I wonder if it would be clearer for readers to translate anyway... (maybe there's some stuff translate chokes on?)

Collaborator Author

The argument for consistency makes sense to me; I'll add it in (just tested it, and translate doesn't choke).

return_names = cpp.return_names(f)
if len(return_names) > 1:
updates = '\n '.join(
f'at::_copy_from_and_resize(std::get<{i}>({functional_result_name}), {ret_name});'
Contributor

I guess you shouldn't actually be willing to resize if it's an inplace; this is only a very modest oversight, though.

Collaborator Author

Oh, good catch. I think that any size mismatch errors would be caught in the meta function, so we should never end up in that situation. But getting the codegen to explicitly call a non-resizing function for inplace wrappers definitely expresses intent more clearly.

Collaborator Author

Actually, yeah, that was a straight-up bug, since a lot of XLA kernels don't do shape checks (LazyTensor meta() support can't come soon enough). So before, when I was using _copy_from_and_resize() for inplace ops, we would just resize self on XLA, while CPU would correctly error in the meta function.

Using _copy_from gives you a pretty crappy error message, but there's probably no point trying to make that better until we start using meta functions.
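The fix can be sketched as a small codegen helper (simplified; at::_copy_from and at::_copy_from_and_resize are the calls discussed above, the surrounding codegen plumbing is elided): out= wrappers may resize the user-provided output, but inplace wrappers copy without resizing so that a shape mismatch surfaces as an error instead of silently resizing self:

```python
def copy_back_expr(kind: str, functional_result: str, ret_name: str) -> str:
    # Out wrappers are allowed to resize the user-provided out= tensor;
    # inplace wrappers must never resize self, so they emit the
    # non-resizing copy and rely on shape checks to catch mismatches.
    if kind == "out":
        return f"at::_copy_from_and_resize({functional_result}, {ret_name});"
    assert kind == "inplace"
    return f"at::_copy_from({functional_result}, {ret_name});"
```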

Contributor

@ezyang ezyang left a comment

please check reviews

# The schema this signature is derived from
func: FunctionSchema

prefix: str = ""

worth a comment? it opens up a pretty big axis of variation for what the class represents...

Collaborator Author

Definitely fair, I'll add it in

@staticmethod
def from_schema(func: FunctionSchema) -> 'DispatcherSignature':
return DispatcherSignature(func)
def from_schema(func: FunctionSchema, *, prefix: str = '') -> 'DispatcherSignature':

ditto re comment

# Helper functions

def kernel_signature(f: NativeFunction, backend_index: BackendIndex) -> Union['NativeSignature', 'DispatcherSignature']:
def kernel_signature(

ditto re comment for prefix - also I'm not far enough into the PR yet to understand why it's a freestanding param and not part of the data model.


def wrapper_kernel_sig(self, f: NativeFunction) -> Union[NativeSignature, DispatcherSignature]:
# The prefix is just to ensure uniqueness. The Dispatcher API doesn't guarantee unique kernel names.
return kernel_signature(f, self.backend_index, prefix=f'wrapper_{f.func.name.overload_name}_')

Just a high level observation and not to overengineer, but here prefix is derived from f in a pretty straightforward way. If that's true everywhere, then it might be cleaner to host the prefix-making logic in NativeFunction.

Collaborator Author

So, let me know if we disagree, but the value of prefix here seems more to me like an implementation detail of how we code-generate the stuff that we register to the dispatcher than a property of a NativeFunction. Rather than directly registering native kernels to the dispatcher, the codegen creates anonymous wrappers that call into the native kernels, and registers those wrappers instead. We just prepend wrapper_ to each wrapper function's name for clarity, and to avoid name collisions with the namespaced definitions that we generate in the same file.

So I kind of see NativeSignature/DispatcherSignature as a lens into how to display a given NativeFunction's signature, like what schema and naming transformations to apply, and the wrapper prefix is just another knob along those same lines.

@bhosmer bhosmer May 12, 2021

[Edit: to be clear, this is a minor point in the context of the PR! It just touches on an interesting data modeling design point. Totally fine with me to let this land as-is if you'd rather move things along.]

Ah sorry! Not NativeFunction but rather NativeSignature and DispatchSignature. You're right that prefixes shouldn't be part of NativeFunction's world. But the only "free" value here is the 'wrapper_{}_' - the function name in the middle comes from f, which is 1-1 with whichever Signature wrapper_kernel_sig() constructs - so making the caller supply the correspondence here would only be useful if we ever wanted to break that relationship, which I don't think we would, right? It seems more like an opportunity for the caller to mess things up by passing a constant string or whatever.

Anyway to cut to the chase, the whole correct by construction thing would say, only pass the constant part of the prefix here and onward into the Signatures, and make them (the Signatures) responsible for folding f.func.name.overload_name into nonempty prefixes.

(BTW, if you don't actually use multiple prefixes, you could just have the Signatures carry a bit saying whether to mangle in the overload name. And if kernel_signature() is always called with a nonempty prefix, you could just have it unilaterally construct the Signature instances with prefixes turned on.)
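The correct-by-construction variant being suggested could be sketched like this (simplified stand-in fields; the real DispatcherSignature has more to it): the Signature itself folds the overload name into any nonempty prefix, so callers only supply the constant part and cannot break the 1-1 correspondence:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DispatcherSignature:
    # Simplified stand-in for the real signature type.
    base_name: str
    overload_name: str
    prefix: str = ""  # constant part only, e.g. "wrapper"

    def name(self) -> str:
        # The Signature, not the caller, mangles the overload name into
        # the prefix; callers can no longer pass an arbitrary string that
        # accidentally collides across overloads.
        if self.prefix:
            return f"{self.prefix}_{self.overload_name}_{self.base_name}"
        return self.base_name

sig = DispatcherSignature(base_name="add", overload_name="out", prefix="wrapper")
```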

bdhirsh added 2 commits May 12, 2021 07:05
@bdhirsh
Collaborator Author

bdhirsh commented May 17, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@bdhirsh merged this pull request in 1a9efbb.

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#57510

This is a re-write of pytorch#56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474364

Pulled By: bdhirsh

fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
bdhirsh added a commit that referenced this pull request May 20, 2021
This was referenced May 20, 2021
@facebook-github-bot facebook-github-bot deleted the gh/bdhirsh/113/head branch May 21, 2021 14:17


4 participants