[PyTorch] Refactor Dispatcher to inline less code in fast path #51163
swolchok wants to merge 4 commits into gh/swolchok/91/base
Conversation
The Dispatcher seems to have been in a precarious local maximum: I tried to make several different changes to parameter passing and ended up with regressions due to reduced inlining that swamped any gains I might have gotten from the parameter passing changes. This diff reduces the amount of inline code on the fast path. It should both reduce code size and provide a platform for making further improvements to the dispatcher code. It is a slight performance regression, but it unblocked the following two diffs (which seem to get us back where we were) from landing. Differential Revision: [D26091377](https://our.internmc.facebook.com/intern/diff/D26091377/) [ghstack-poisoned]
💊 CI failures summary and remediations
As of commit 41bbe13 (more details on the Dr. CI page):
Extra GitHub checks: 1 failed
This comment was automatically generated by Dr. CI.
```cpp
}

template<class Return, class... Args>
inline Return Dispatcher::callWithDispatchKeySlowPath(const TypedOperatorHandle<Return(Args...)>& op, bool pre_sampled, DispatchKey dispatchKey, const KernelFunction& kernel, Args... args) {
```
Just curious: if everything here is technically the slow path, is there a reason not to dump this whole function in Dispatcher.cpp to prevent inlining? It looks like you did that, but only for `runRecordFunction()`, which is probably the "slower path".
My understanding of the relative levels of hotness in the Dispatcher path:

- hot: `shouldRunRecordFunction`, `kernel.call<...>()`
- less hot (`shouldRun` returned true): create the `RecordFunction` guard and give it inputs
- least hot (guard also must be active): `RecordFunction::before`

So does putting just the last of those 3 in a .cpp file yield better results than the bottom 2?
> is there a reason not to dump this whole function in Dispatcher.cpp to prevent inlining?
I would be happy in theory to dump it in Dispatcher.cpp, but unfortunately it has to be templated on the type signature of the operator (just like Dispatcher::call and Dispatcher::callWithDispatcherKey) so that we can call out to impl::boxArgs, so it must be visible inline in order for a copy to be instantiated for every distinct operator signature.
It's possible that we could extract impl::boxArgs(args...) into a lambda and pass it into a hypothetical callWithDispatchKeySlowPathHelper that wasn't templated, but I'm not sure how to do that without introducing a round of copying or moving on the arguments and a std::function, which would penalize the profiling use case to uncertain benefit. Maybe we'll need to do it in the future, but I think that the space of inlineable dispatch-related code is bigger than just the Dispatcher and we may be able to see bigger gains for less work in other places.
By the way, I've diffed the annotated assembly for […] I will see if I can claw this regression back in a separate PR.
This pull request has been merged in 673687e.
Summary:
Pull Request resolved: pytorch#51163

The Dispatcher seems to have been in a precarious local maximum: I tried to make several different changes to parameter passing and ended up with regressions due to reduced inlining that swamped any gains I might have gotten from the parameter passing changes. This diff reduces the amount of inline code on the fast path. It should both reduce code size and provide a platform for making further improvements to the dispatcher code. It is a slight performance regression, but it unblocked the following two diffs (which seem to get us back where we were) from landing.

ghstack-source-id: 120693163

Test Plan: CI, framework overhead benchmarks to check the size of the regression.

Compared timing for the empty framework overhead benchmark before/after.

Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`

Run with `numactl -m 0 -C 3 path/to/cpp_benchmark -op empty -niter 100`

Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```

After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```

Yes, it's a regression, but I compared D26069629 stacked on this diff vs. not:

With this diff:
```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```

Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```

So we do seem to have accomplished something w.r.t. not overwhelming the inliner.

Reviewed By: ezyang

Differential Revision: D26091377

fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1