generate inplace/out kernels for xla #57510
bdhirsh wants to merge 7 commits into gh/bdhirsh/113/base
Conversation
💊 CI failures summary: As of commit 9656b73, none of the CI failures appear to be caused by this PR. ✅
🚧 1 fixed upstream failure: this was probably caused by upstream breakages that were already fixed; rebasing should resolve it.
```python
    inplace_meta = True
else:

def gen_unstructured(self, f: NativeFunction, *, is_generated_wrapper: bool = False) -> Optional[str]:
    with native_function_manager(f):
```
The `with native_function_manager(f)` makes the diff painful, but there aren't too many changes in this function. As per the comment in the last PR, I can't easily use a decorator since I have a defaultable argument.
It's fine, reviewers can ignore whitespace in GH UI
Hooking this up to an XLA-side PR to sanity check.
This is a re-write of #56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details. The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops; if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.
tools/codegen/model.py (Outdated)

```python
def functions(self) -> Iterator[NativeFunction]:
    yield self.out
    yield self.functional

def functions(self, *, functional_first: bool = False) -> Iterator[NativeFunction]:
```
yeah you're right, I think we should be able to unconditionally use the functional kernel first.
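A minimal sketch of what the flagged iterator could look like, using stand-in dataclasses rather than the real model.py types (names here are illustrative only):

```python
from dataclasses import dataclass
from typing import Iterator

# Hypothetical stand-ins for the real NativeFunction/NativeFunctionsGroup
# data model; the real group also carries inplace and mutable variants.
@dataclass
class NativeFunction:
    name: str

@dataclass
class NativeFunctionsGroup:
    functional: NativeFunction
    out: NativeFunction

    def functions(self, *, functional_first: bool = False) -> Iterator[NativeFunction]:
        # Yield the functional variant first when the caller (e.g. the
        # out/inplace wrapper codegen) treats it as the canonical kernel.
        if functional_first:
            yield self.functional
            yield self.out
        else:
            yield self.out
            yield self.functional

g = NativeFunctionsGroup(NativeFunction("add"), NativeFunction("add.out"))
print([f.name for f in g.functions(functional_first=True)])  # ['add', 'add.out']
```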
```cpp
TORCH_CHECK(options.device() == out.device(),
    "Expected out tensor to have device ", options.device(), ", but got ", out.device(), " instead");
bool resized = at::native::resize_output(outputs_[output_idx], sizes);
bool resized = {self.cpp_namespace}::resize_output(outputs_[output_idx], sizes);
```
This seems a little questionable; are you actually changing what namespace we reference `resize_output` from here?
Yep, that's a bug, thanks! `at::native::resize_output` is the thing we want to call to resize the output tensor, regardless of which backend we're generating for. We wouldn't have hit it until we eventually tried to enable structured external kernels.
I think that's true for the `as_strided_` call below too.
Also for the empty calls above, I think
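A sketch of the fix being discussed, assuming the generated C++ is assembled from Python f-strings as elsewhere in the codegen; the helper name `gen_resize_call` is hypothetical:

```python
# The point of the fix: the emitted call should always be
# at::native::resize_output (the helper that actually resizes the output
# tensor), never the external backend's namespace substituted via
# {self.cpp_namespace}, since that helper only exists under at::native.
def gen_resize_call(output_idx: int) -> str:
    return (
        f"bool resized = at::native::resize_output("
        f"outputs_[{output_idx}], sizes);"
    )

print(gen_resize_call(0))
```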
```python
# We also skip registrations if there is a functional backend kernel,
# because we generate out/inplace wrappers in that case (handled in register_dispatch_key.py).
if self.backend_index.get(f) is not None or \
        (isinstance(g, NativeFunctionsGroup) and self.backend_index.get(g.functional) is not None):
```
The downside to having this split out into a separate dest here is that you have to keep the logic in sync with register_dispatch_key.py for when to generate out/inplace wrappers. This is probably OK, because there are fairly large benefits to having this be a separate file, but it would be good to explicitly cross-reference the corresponding condition and/or have a single function that commons up the logic (even though it is short to type).
```python
else:
    assert_never(f)

def gets_generated_out_inplace_wrapper(self, f: NativeFunction, g: NativeFunctionsGroup) -> bool:
```
This is the code to xref, I think
Yep, thanks; this should be shared with gen_external_aten_fallbacks.py. It's another thing we won't have to worry about later, when/if the boxed CPU fallback works out. (That's also why I kind of wanted to keep all the fallback stuff in a separate file, so it'll be easier to disentangle if we want to.)
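A sketch of what a single shared predicate could look like, with stub stand-ins for the real NativeFunction/BackendIndex types (all names here are illustrative, not the actual codegen API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set

# Minimal stand-ins for the real data model.
class SchemaKind(Enum):
    functional = 0
    inplace = 1
    out = 2

@dataclass
class NativeFunction:
    name: str
    kind: SchemaKind

@dataclass
class NativeFunctionsGroup:
    functional: NativeFunction
    out: NativeFunction

class BackendIndex:
    def __init__(self, kernels: Set[str]):
        self.kernels = kernels  # names of ops this backend implements directly

    def get(self, f: NativeFunction) -> Optional[str]:
        return f.name if f.name in self.kernels else None

def gets_generated_out_inplace_wrapper(
    backend_index: BackendIndex, f: NativeFunction, g: NativeFunctionsGroup
) -> bool:
    # A wrapper is generated when the backend has no kernel for this
    # out/inplace op but does have one for the functional op in its group.
    return (
        f.kind is not SchemaKind.functional
        and backend_index.get(f) is None
        and backend_index.get(g.functional) is not None
    )

add = NativeFunction("add", SchemaKind.functional)
add_out = NativeFunction("add.out", SchemaKind.out)
idx = BackendIndex({"add"})  # e.g. XLA implements only the functional kernel
print(gets_generated_out_inplace_wrapper(idx, add_out, NativeFunctionsGroup(add, add_out)))
```

Both register_dispatch_key.py and the external-fallback codegen could then import this one function instead of duplicating the condition.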
```python
# All other functionality can be directly pulled from gen_unstructured.
is_generated_wrapper = self.gets_generated_out_inplace_wrapper(f, g)
if self.target is not Target.ANONYMOUS_DEFINITION:
    return self.gen_unstructured(f, is_generated_wrapper=is_generated_wrapper)
```
This short circuit smacks of "we put the interception point for gen_inplace_out_wrapper too early"--to tell if this function is actually implemented correctly, I still have to understand all of gen_unstructured (and there is now an extra illegal state which is when is_generated_wrapper=True but self.target is Target.ANONYMOUS_DEFINITION). Instead of the condition at https://github.com/pytorch/pytorch/pull/57510/files?diff=unified&w=1#diff-e9998363493c5dcd97969d077dcdea1e835c95bf39f2bbb5afde127a3c1ef728R82 ; maybe you should just directly embed this logic in gen_unstructured. You can guarantee that you end up in there even in the structured case by adjusting the condition at https://github.com/pytorch/pytorch/pull/57510/files?diff=unified&w=1#diff-e9998363493c5dcd97969d077dcdea1e835c95bf39f2bbb5afde127a3c1ef728R170 appropriately. If the complaint is that gen_unstructured is too long, I'd suggest remedying that by refactoring the internal codegen pieces of gen_unstructured into helper methods.
Agreed that all the conditional logic that decides when to call gen_unstructured vs. gen_inplace_out_wrapper is hard to follow, and I like the idea of keeping it all in the same place: that section of gen_structured() that you pointed out.
My main motivation for keeping this in a separate function was that this is the only bit of unstructured codegen that actually cares about the entire NativeFunctionsGroup object, so it would be confusing to pass an extra (optional) NativeFunctionsGroup object to gen_unstructured that's only used under certain conditions. But you're right, factoring out bits of gen_unstructured into helper functions should be fine.
Might be a good idea to just pass it unconditionally, tbh, in case more codegen in unstructured kernels starts making use of the group information.
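A rough sketch of that shape, with stub types, where the group is passed unconditionally and the wrapper decision lives inside gen_unstructured (all names and the dict-based group are illustrative stand-ins):

```python
from typing import Optional

# Stub class sketching the suggested refactor: gen_unstructured receives the
# (possibly absent) group directly, and derives is_generated_wrapper itself,
# instead of having the decision threaded in from gen_structured's call site.
class RegisterDispatchKey:
    def __init__(self, functional_kernels):
        self.functional_kernels = functional_kernels

    def gets_generated_out_inplace_wrapper(self, f: str, g: dict) -> bool:
        return f != g["functional"] and g["functional"] in self.functional_kernels

    def gen_out_inplace_wrapper(self, f: str, g: dict) -> str:
        return f"// generated wrapper for {f} calling {g['functional']}"

    def gen_unstructured(self, f: str, g: Optional[dict] = None) -> Optional[str]:
        if g is not None and self.gets_generated_out_inplace_wrapper(f, g):
            return self.gen_out_inplace_wrapper(f, g)
        return f"// direct registration for {f}"

r = RegisterDispatchKey({"add"})
print(r.gen_unstructured("add.out", {"functional": "add"}))
```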
```python
updates = f'at::_copy_from_and_resize({functional_result_name}, {ret_name});'
returns = ret_name

functional_sig = self.wrapper_kernel_sig(g.functional)
```
Although you can skip the translate call here because everything is the same, I wonder if it would be clearer for readers to translate anyway... (maybe there's some stuff translate chokes on?)
The argument for consistency makes sense to me, I'll add it in (just tested it and translate doesn't choke)
```python
return_names = cpp.return_names(f)
if len(return_names) > 1:
    updates = '\n '.join(
        f'at::_copy_from_and_resize(std::get<{i}>({functional_result_name}), {ret_name});'
```
I guess you shouldn't actually be willing to resize if it's an inplace; this is only a very modest oversight though.
Oh, good catch. I think that any size mismatch errors would be caught in the meta function, so we should never end up in that situation. But getting the codegen to explicitly call a non-resizing function for inplace wrappers definitely expresses intent more clearly.
Actually, yeah, that was a straight-up bug, since a lot of XLA kernels don't do shape checks (LazyTensor meta() support can't come soon enough). Before, when I was using _copy_from_and_resize() for inplace ops, we would just resize self on XLA, while CPU would correctly error in the meta function.
Using _copy_from gives you a pretty crappy error message, but there's probably no point trying to make that better until we start using meta functions.
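A sketch of the resulting codegen branch; `gen_copy_back` is a hypothetical helper name, and the emitted strings are the C++ calls discussed above:

```python
# Inplace wrappers copy the functional result back with at::_copy_from
# (sizes must already match), while out wrappers use
# at::_copy_from_and_resize, which is allowed to resize the out tensor.
def gen_copy_back(ret_name: str, dst_name: str, *, is_inplace: bool) -> str:
    if is_inplace:
        # Never resize self: a size mismatch should surface as an error
        # (ideally in the meta function), not be silently papered over.
        return f"at::_copy_from({ret_name}, {dst_name});"
    return f"at::_copy_from_and_resize({ret_name}, {dst_name});"

print(gen_copy_back("tmp_output", "self", is_inplace=True))
```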
```python
# The schema this signature is derived from
func: FunctionSchema

prefix: str = ""
```
worth a comment? it opens up a pretty big axis of variation for what the class represents...
Definitely fair, I'll add it in
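A sketch of what the commented prefix field might look like, simplified: the real class carries a FunctionSchema rather than a plain name, so the fields here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DispatcherSignature:
    func_name: str
    # prefix is prepended to the kernel name by name(). It exists purely so
    # that generated wrapper functions get unique names within one output
    # file; it does not change the signature's types or the schema it is
    # derived from.
    prefix: str = ""

    def name(self) -> str:
        return self.prefix + self.func_name

sig = DispatcherSignature("add_out", prefix="wrapper_out_")
print(sig.name())  # wrapper_out_add_out
```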
```python
@staticmethod
def from_schema(func: FunctionSchema) -> 'DispatcherSignature':
    return DispatcherSignature(func)

def from_schema(func: FunctionSchema, *, prefix: str = '') -> 'DispatcherSignature':
```
```python
# Helper functions

def kernel_signature(f: NativeFunction, backend_index: BackendIndex) -> Union['NativeSignature', 'DispatcherSignature']:

def kernel_signature(
```
ditto re comment for prefix - also I'm not far enough into the PR yet to understand why it's a freestanding param and not part of the data model.
```python
def wrapper_kernel_sig(self, f: NativeFunction) -> Union[NativeSignature, DispatcherSignature]:
    # The prefix is just to ensure uniqueness. The Dispatcher API doesn't guarantee unique kernel names.
    return kernel_signature(f, self.backend_index, prefix=f'wrapper_{f.func.name.overload_name}_')
```
Just a high level observation and not to overengineer, but here prefix is derived from f in a pretty straightforward way. If that's true everywhere, then it might be cleaner to host the prefix-making logic in NativeFunction.
So, let me know if we disagree, but the value of prefix here seems to me more like an implementation detail of how we code-generate the stuff that we register to the dispatcher than a property of a NativeFunction. Rather than directly registering native kernels to the dispatcher, the codegen creates anonymous wrappers that call into the native kernels, and registers those wrappers instead. We just prepend wrapper_ to each wrapper function's name for clarity, and to avoid name collisions with the namespaced definitions that we generate in the same file.
So I kind of see NativeSignature/DispatcherSignature as a lens into how to display a given NativeFunction's signature, like what schema and naming transformations to apply, and the wrapper prefix is just another knob along those same lines.
[Edit: to be clear, this is a minor point in the context of the PR! It just touches on an interesting data modeling design point. Totally fine with me for this to land as-is if you'd rather move things along.]
Ah sorry! Not NativeFunction but rather NativeSignature and DispatcherSignature. You're right that prefixes shouldn't be part of NativeFunction's world. But the only "free" value here is the 'wrapper_{}_' - the function name in the middle comes from f, which is 1-1 with whichever Signature wrapper_kernel_sig() constructs - so making the caller supply the correspondence here would only be useful if we ever wanted to break that relationship, which I don't think we would, right? It seems more like an opportunity for the caller to mess things up by passing a constant string or whatever.
Anyway to cut to the chase, the whole correct by construction thing would say, only pass the constant part of the prefix here and onward into the Signatures, and make them (the Signatures) responsible for folding f.func.name.overload_name into nonempty prefixes.
(BTW if you don't actually use multiple prefixes, you could just have the Signatures carry a bit saying whether to mangle in the overload name. (And if kernel_signature() is always called with a nonempty prefix, you could just have it unilaterally construct the Signature instances with prefixes turned on.)
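A sketch of that alternative, where the Signature carries a mangling bit instead of a caller-supplied prefix string (the field and method names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DispatcherSignature:
    base_name: str
    overload_name: str
    # When set, name() folds the overload name into a wrapper_ prefix itself,
    # so callers can't pass an inconsistent constant string.
    mangle_overload: bool = False

    def name(self) -> str:
        if self.mangle_overload:
            return f"wrapper_{self.overload_name}_{self.base_name}"
        return self.base_name

sig = DispatcherSignature("add", "out", mangle_overload=True)
print(sig.name())  # wrapper_out_add
```

The "correct by construction" benefit: the overload-name correspondence is enforced inside the Signature, and the caller only toggles a flag.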
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Pull Request resolved: #57510. This is a re-write of #56835; see the full description above.
Test Plan: Imported from OSS
Reviewed By: navahgar
Differential Revision: D28474364
Pulled By: bdhirsh
fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876