Pass optional by reference#46598
Conversation
Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Pull Request resolved: #46598 ghstack-source-id: 114736156 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
💊 CI failures summary and remediationsAs of commit e0bb8af (more details on the Dr. CI page):
XLA failureJob pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by 🚧 1 fixed upstream failure:These were probably caused by upstream breakages that were already fixed.
Please rebase on the
|
Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Pull Request resolved: #46598 ghstack-source-id: 114801324 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
|
At least some test failures look real though |
|
Probably bc breaking for backend implementers (I see xla failing, at the very least) |
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 114925816 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115362982 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
|
#47015 is an alternative to this change. |
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115468618 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115684330 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
Summary: Pull Request resolved: #47015 c10::optional has non-trivial copy and move operations always. This change specializes it for 32-bit scalars so that it has trivial copy and move operations in that case. Ideally, we would instead rely on P0602 "variant and optional should propagate copy/move triviality" and use `std::optional` (or implement that functionality ourselves). We can't use `std::optional` because we are stuck with C++14. Implementing the full P0602 ourselves would add even more complexity. We could do it, but this should be a helpful first step. ghstack-source-id: 115886743 Test Plan: Collect Callgrind instruction counts for `torch.empty(())`. Data: Make empty c10-ful (#46092): ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7ffaed1128e0> torch.empty(()) All Noisy symbols removed Instructions: 648005 632899 Baseline: 4144 3736 100 runs per measurement, 1 thread ``` This diff atop #46092: ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f943f1dc8e0> torch.empty(()) All Noisy symbols removed Instructions: 602347 591005 Baseline: 4106 3736 100 runs per measurement, 1 thread ``` (6.6% improvement vs #46092) Pass optionals by const reference (#46598) ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f1abb3988e0> torch.empty(()) All Noisy symbols removed Instructions: 601349 590005 Baseline: 4162 3736 100 runs per measurement, 1 thread ``` (6.8% improvement vs #46092) This diff atop #46598 (i.e., both together) ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f9577c22850> torch.empty(()) All Noisy symbols removed Instructions: 596095 582451 Baseline: 4162 3736 100 runs per measurement, 1 thread Warning: PyTorch was not built with debug symbols. Source information may be limited. Rebuild with REL_WITH_DEB_INFO=1 for more detailed results. ``` (another 1.3% savings!) #46598 outperformed this change slightly, and combining the two leads to further benefits. I guess we should do both! (Though I still don't understand why passing optionals that should fit in a register by const reference would help...) Reviewed By: smessmer Differential Revision: D24552280 fbshipit-source-id: 4d93bfcffafebd8c01559398513fa6b9db959d11
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115948320 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 116189891 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
|
We decided to not land this because the benefit is much smaller after @swolchok's fixes to c10::optional (only 1.2% improvement now) and since passing by value is better code style. |
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 116189891 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Stack from ghstack:
This fixes perf issues due to optional not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts).
See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/
Differential Revision: D24425836