
[PyTorch] Refactor Dispatcher to inline less code in fast path#51163

Closed
swolchok wants to merge 4 commits into gh/swolchok/91/base from gh/swolchok/91/head

Conversation

@swolchok (Contributor) commented Jan 27, 2021

Stack from ghstack:

The Dispatcher seems to have been in a precarious local
maximum: I tried to make several different changes to parameter
passing and ended up with regressions due to reduced inlining that
swamped any gains I might have gotten from the parameter passing
changes.

This diff reduces the amount of inline code on the fast path. It
should both reduce code size and provide a platform for making further
improvements to the dispatcher code.

It is a slight performance regression, but it unblocked the following
two diffs (which seem to get us back where we were) from landing.

Differential Revision: [D26091377](https://our.internmc.facebook.com/intern/diff/D26091377/)

@facebook-github-bot (Contributor) commented Jan 27, 2021

💊 CI failures summary and remediations

As of commit 41bbe13 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


This comment was automatically generated by Dr. CI.

```cpp
}

template<class Return, class... Args>
inline Return Dispatcher::callWithDispatchKeySlowPath(const TypedOperatorHandle<Return(Args...)>& op, bool pre_sampled, DispatchKey dispatchKey, const KernelFunction& kernel, Args... args) {
```
@bdhirsh (Collaborator) commented Jan 27, 2021

Just curious- if everything here is technically slow path, is there a reason not to dump this whole function in Dispatcher.cpp to prevent inlining? It looks like you did that, but only for runRecordFunction(), which is probably "slower path".

My understanding of the relative levels of hotness in the Dispatcher path:

  • hot: shouldRunRecordFunction, kernel.call<...>()
  • less hot (shouldRun returned true): create RecordFunction guard and give it inputs
  • least hot (guard also must be active): RecordFunction.before

So does putting just the last of those 3 in a .cpp file yield better results than the bottom 2?

@swolchok (Contributor, Author) replied:

> is there a reason not to dump this whole function in Dispatcher.cpp to prevent inlining?

I would be happy in theory to dump it in Dispatcher.cpp, but unfortunately it has to be templated on the type signature of the operator (just like Dispatcher::call and Dispatcher::callWithDispatchKey) so that we can call out to impl::boxArgs, so it must be visible inline in order for a copy to be instantiated for every distinct operator signature.

It's possible that we could extract impl::boxArgs(args...) into a lambda and pass it into a hypothetical callWithDispatchKeySlowPathHelper that wasn't templated, but I'm not sure how to do that without introducing a round of copying or moving on the arguments and a std::function, which would penalize the profiling use case to uncertain benefit. Maybe we'll need to do it in the future, but I think that the space of inlineable dispatch-related code is bigger than just the Dispatcher and we may be able to see bigger gains for less work in other places.

@swolchok (Contributor, Author) commented:

By the way, I've diffed the annotated assembly for at::empty (which is where Dispatcher::call{,WithDispatchKey} end up getting inlined) in a benchmark that calls torch::empty in a loop. The source of the regression appears to be slightly higher register pressure. I haven't checked exactly why, but I suspect we are keeping one or more values live in registers in case we have to call into callWithDispatchKeySlowPath, even though it's behind a C10_UNLIKELY branch; register allocation is hard and we can't expect it to be perfect. As a result, we have to reload the address of the relevant OperatorEntry between checking KernelFunction::isValid (which checks boxed_kernel_func_ != nullptr) and executing KernelFunction::call (which loads unboxed_kernel_func_ instead and checks it against nullptr).

I will see if I can claw this regression back in a separate PR.

@swolchok swolchok requested review from bhosmer and smessmer January 29, 2021 05:24
swolchok added a commit that referenced this pull request Jan 29, 2021
Pull Request resolved: #51163

@facebook-github-bot (Contributor) commented:

This pull request has been merged in 673687e.

@facebook-github-bot facebook-github-bot deleted the gh/swolchok/91/head branch February 5, 2021 15:22
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
[PyTorch] Refactor Dispatcher to inline less code in fast path (pytorch#51163)

Summary:
Pull Request resolved: pytorch#51163

ghstack-source-id: 120693163

Test Plan:
CI, framework overhead benchmarks to check the size of the
regression

Compared timing for empty framework overhead benchmark before/after.

Build command: `buck build mode/no-gpu //caffe2/caffe2/fb/high_perf_models/pytorch/benchmark_framework_overheads:cpp_benchmark mode/opt-clang --show-output`
Run with `numactl -m 0 -C 3 path/to/cpp_benchmark -op empty -niter 100`

Before:
```
I0126 16:02:04.373075 2135872 bench.cpp:139] Mean 0.266272
I0126 16:02:04.373106 2135872 bench.cpp:140] Median 0.266347
I0126 16:02:04.373111 2135872 bench.cpp:141] Min 0.263585
I0126 16:02:04.373117 2135872 bench.cpp:142] stddev 0.0021264
I0126 16:02:04.373131 2135872 bench.cpp:143] stddev / mean 0.00798581
```

After:
```
I0126 16:02:30.377992 2137048 bench.cpp:139] Mean 0.27579
I0126 16:02:30.378023 2137048 bench.cpp:140] Median 0.275281
I0126 16:02:30.378029 2137048 bench.cpp:141] Min 0.270617
I0126 16:02:30.378034 2137048 bench.cpp:142] stddev 0.00308287
I0126 16:02:30.378044 2137048 bench.cpp:143] stddev / mean 0.0111783
```

Yes, it's a regression, but I compared D26069629 stacked on this diff vs not:

With this diff:

```
I0126 16:02:50.662864 2137574 bench.cpp:139] Mean 0.268645
I0126 16:02:50.662891 2137574 bench.cpp:140] Median 0.267485
I0126 16:02:50.662896 2137574 bench.cpp:141] Min 0.266485
I0126 16:02:50.662901 2137574 bench.cpp:142] stddev 0.00219359
I0126 16:02:50.662915 2137574 bench.cpp:143] stddev / mean 0.00816537
```

Without:
```
I0126 20:40:27.815824 3240699 bench.cpp:139] Mean 0.270755
I0126 20:40:27.815860 3240699 bench.cpp:140] Median 0.268998
I0126 20:40:27.815866 3240699 bench.cpp:141] Min 0.268306
I0126 20:40:27.815873 3240699 bench.cpp:142] stddev 0.00260365
I0126 20:40:27.815886 3240699 bench.cpp:143] stddev / mean 0.00961624
```

So we do seem to have accomplished something w.r.t. not overwhelming the inliner.

Reviewed By: ezyang

Differential Revision: D26091377

fbshipit-source-id: c9b7f4e187059fa15452b7c75fc29816022b92b1

5 participants