[PyTorch] Refactor Dispatcher to inline less code in fast path #51163
swolchok wants to merge 4 commits into gh/swolchok/91/base
Conversation
The Dispatcher seems to have been in a precarious local maximum: I tried to make several different changes to parameter passing and ended up with regressions due to reduced inlining that swamped any gains I might have gotten from the parameter passing changes. This diff reduces the amount of inline code on the fast path. It should both reduce code size and provide a platform for making further improvements to the dispatcher code. It is a slight performance regression, but it unblocked the following two diffs (which seem to get us back where we were) from landing. Differential Revision: [D26091377](https://our.internmc.facebook.com/intern/diff/D26091377/) [ghstack-poisoned]
💊 CI failures summary and remediations
As of commit 41bbe13 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
This comment was automatically generated by Dr. CI.
```cpp
}

template<class Return, class... Args>
inline Return Dispatcher::callWithDispatchKeySlowPath(const TypedOperatorHandle<Return(Args...)>& op, bool pre_sampled, DispatchKey dispatchKey, const KernelFunction& kernel, Args... args) {
```
Just curious: if everything here is technically the slow path, is there a reason not to dump this whole function in Dispatcher.cpp to prevent inlining? It looks like you did that, but only for `runRecordFunction()`, which is probably the "slower path".
My understanding of the relative levels of hotness in the Dispatcher path:

- hot: `shouldRunRecordFunction`, `kernel.call<...>()`
- less hot (`shouldRun` returned true): create the `RecordFunction` guard and give it inputs
- least hot (guard also must be active): `RecordFunction::before`

So does putting just the last of those 3 in a .cpp file yield better results than the bottom 2?
> is there a reason not to dump this whole function in Dispatcher.cpp to prevent inlining?
I would be happy in theory to dump it in Dispatcher.cpp, but unfortunately it has to be templated on the type signature of the operator (just like Dispatcher::call and Dispatcher::callWithDispatcherKey) so that we can call out to impl::boxArgs, so it must be visible inline in order for a copy to be instantiated for every distinct operator signature.
It's possible that we could extract impl::boxArgs(args...) into a lambda and pass it into a hypothetical callWithDispatchKeySlowPathHelper that wasn't templated, but I'm not sure how to do that without introducing a round of copying or moving on the arguments and a std::function, which would penalize the profiling use case to uncertain benefit. Maybe we'll need to do it in the future, but I think that the space of inlineable dispatch-related code is bigger than just the Dispatcher and we may be able to see bigger gains for less work in other places.
By the way, I've diffed the annotated assembly for […] I will see if I can claw this regression back in a separate PR.
This pull request has been merged in 673687e.
Summary:
Pull Request resolved: pytorch#51163

The Dispatcher seems to have been in a precarious local maximum: I tried to make several different changes to parameter passing and ended up with regressions due to reduced inlining that swamped any gains I might have gotten from the parameter passing changes. This diff reduces the amount of inline code on the fast path. It should both reduce code size and provide a platform for making further improvements to the dispatcher code. It is a slight performance regression, but it unblocked the following two diffs (which seem to get us back where we were) from landing.

ghstack-source-id: 120693163

Test Plan: CI, framework overhead benchmarks to check the size of the regression.

Compared timing for the empty framework overhead benchmark before/after.

Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`

Run with `numactl -m 0 -C 3 path/to/cpp_benchmark -op empty -niter 100`

Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```

After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```

Yes, it's a regression, but I compared D26069629 stacked on this diff vs. not:

With this diff:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```

Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```

So we do seem to have accomplished something w.r.t. not overwhelming the inliner.

Reviewed By: ezyang

Differential Revision: D26091377

fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1