[PyTorch] Put functor at the end of wrap_kernel_functor_unboxed params#51164
[PyTorch] Put functor at the end of wrap_kernel_functor_unboxed params#51164swolchok wants to merge 1 commit intogh/swolchok/92/basefrom
Conversation
In the (hopefully) common case where the unboxed functor is really a `c10::CompileTimeFunctionPointer`, this enables a jump straight to said functor. Note that I had no end of trouble getting this to have decent performance without the previous diff, and even with it, all the manual C10_ALWAYS_INLINE I added were necessary. Differential Revision: [D26069629](https://our.internmc.facebook.com/intern/diff/D26069629/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit a5bb6dd (more details on the Dr. CI page):
🕵️ 5 new failures recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
| Job | Step | Action |
|---|---|---|
| Run tests | 🔁 rerun |
🚧 3 fixed upstream failures:
These were probably caused by upstream breakages that were already fixed.
Please rebase on the viable/strict branch (expand for instructions)
If your commit is older than viable/strict, run these commands:
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
Check out the recency history of this "viable master" tracking branch.
- pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 on Jan 26 from 1:29pm to 6:17pm (6f3aa58 - 5748410)
- pytorch_linux_xenial_py3_6_gcc5_4_test on Jan 26 from 1:51pm to 5:48pm (f08464f - 5748410)
- pytorch_linux_bionic_py3_6_clang9_test on Jan 26 from 1:29pm to 5:39pm (6f3aa58 - 5748410)
This comment was automatically generated by Dr. CI (expand for details).
Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group.
| compute_dk = f"DispatchKey _dk = {dispatch_key};" | ||
| return f"""\ | ||
| // aten::{f.func} | ||
| C10_ALWAYS_INLINE |
There was a problem hiding this comment.
Are you sure this does anything? This is getting generated to a cpp file so it's hard for me to imagine that you're actually ever going to get to inline it.
There was a problem hiding this comment.
Surprisingly, yes! It can get inlined into wrap_kernel_functor_unboxed_::call; the call chain is Library::impl -> CppFunction ctor -> KernelFunction::makeFromUnboxedFunction -> KernelFunction::makeFromUnboxedFunctor -> wrap_kernel_functor_unboxed. (My working theory is that the fragility of inlining decisions for dispatcher code and the high compile times for generated aten files like Functions.cpp is the huge amount of inlineable stuff going on between this path and the call out to Dispatcher::call{,WithDispatchKey}, repeated for every at:: function.)
| "Parameter types mismatch"); | ||
|
|
||
| static ReturnType call(OperatorKernel* functor, ParameterTypes... args) { | ||
| static ReturnType call(ParameterTypes... args, OperatorKernel* functor) { |
There was a problem hiding this comment.
More for my understanding than anything else- how does this help out the compiler?
There was a problem hiding this comment.
Function arguments go in fixed places defined by the System V x86_64 ABI, at least on Linux & Mac. You can see Low Level System Interface > Function Calling Interface for the gory details (in which case you'll also want to refer to the Itanium C++ ABI to see how C++ things work), but roughly speaking and ignoring a bunch of subtleties and exceptions, the first 8 pointers' worth of arguments go in a fixed set of 8 registers. (I'll call them r0-r7 for simplicity, but those are definitely not their actual names on x86_64.) Supposing that functor takes at most 7 arguments, then with functor at the front, we have r0 = functor and r1-r7 = the first 7 args to functor. We then need to shift r1-r7 into r0-r6 before jumping to functor. With functor at the back, we have r0-r6 = the first 7 args to functor and r7 = functor, and we can jump straight to functor without moving more things around.
(If functor takes 8 arguments or more, then we still have the args to it in r0-r7 and the remaining arguments on the stack. Arguments that go on the stack are pushed from right to left and thus popped from left to right, so I haven't checked the assembly for this case but I think that we could just leave functor in place, load it from the stack and call it, and then pop it off the stack when functor returns.)
Splitting this out from #51164 (D26069629) to allow it to land separately; I'm sure this is a good idea but I'm less sure about #51164. Differential Revision: [D26112627](https://our.internmc.facebook.com/intern/diff/D26112627/) [ghstack-poisoned]
|
After splitting out the C10_ALWAYS_INLINE part and re-measuring, I am seeing a small regression from this change. For the sake of being able to move forward, I am shelving it for now. I have split out #51245 containing the additions of C10_ALWAYS_INLINE. |
Splitting this out from #51164 (D26069629) to allow it to land separately; I'm sure this is a good idea but I'm less sure about #51164. Differential Revision: [D26112627](https://our.internmc.facebook.com/intern/diff/D26112627/) [ghstack-poisoned]
Splitting this out from #51164 (D26069629) to allow it to land separately; I'm sure this is a good idea but I'm less sure about #51164. Differential Revision: [D26112627](https://our.internmc.facebook.com/intern/diff/D26112627/) [ghstack-poisoned]
Summary: Pull Request resolved: #51245 Splitting this out from #51164 (D26069629) to allow it to land separately; I'm sure this is a good idea but I'm less sure about #51164. ghstack-source-id: 120697499 Test Plan: double-check effect on empty benchmark with perf stat; didn't move Reviweers: ezyang, messmer Reviewed By: ezyang Differential Revision: D26112627 fbshipit-source-id: 50d4418d351527bcedd5ccdc49106bc642699870
…#51245) Summary: Pull Request resolved: pytorch#51245 Splitting this out from pytorch#51164 (D26069629) to allow it to land separately; I'm sure this is a good idea but I'm less sure about pytorch#51164. ghstack-source-id: 120697499 Test Plan: double-check effect on empty benchmark with perf stat; didn't move Reviweers: ezyang, messmer Reviewed By: ezyang Differential Revision: D26112627 fbshipit-source-id: 50d4418d351527bcedd5ccdc49106bc642699870
Stack from ghstack:
In the (hopefully) common case where the unboxed functor is
really a
c10::CompileTimeFunctionPointer, this enables a jumpstraight to said functor.
Note that I had no end of trouble getting this to have decent
performance without the previous diff, and even with it, all the
manual C10_ALWAYS_INLINE I added were necessary.
Differential Revision: D26069629