Conversation
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit dfb7acb (more details on the Dr. CI page):
XLA failureJob pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by ci.pytorch.org: 1 failedThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 223 times. |
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 113982141 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 114074272 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 114177162 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)
There was a problem hiding this comment.
[edit: at least some of the test failures look real, happy to re-review if they need nontrivial further changes ofc]
Nice to have empty() back in the c10-full family 😁 one nontrivial suggestion inline (I'd pitch removing that new assert), otherwise just formatting notes.
Oh also I can't remember, were you able to confirm that this is ~instruction-count neutral?
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Summary: Pull Request resolved: #47015 c10::optional has non-trivial copy and move operations always. This change specializes it for 32-bit scalars so that it has trivial copy and move operations in that case. Ideally, we would instead rely on P0602 "variant and optional should propagate copy/move triviality" and use `std::optional` (or implement that functionality ourselves). We can't use `std::optional` because we are stuck with C++14. Implementing the full P0602 ourselves would add even more complexity. We could do it, but this should be a helpful first step. ghstack-source-id: 115886743 Test Plan: Collect Callgrind instruction counts for `torch.empty(())`. Data: Make empty c10-ful (#46092): ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7ffaed1128e0> torch.empty(()) All Noisy symbols removed Instructions: 648005 632899 Baseline: 4144 3736 100 runs per measurement, 1 thread ``` This diff atop #46092: ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f943f1dc8e0> torch.empty(()) All Noisy symbols removed Instructions: 602347 591005 Baseline: 4106 3736 100 runs per measurement, 1 thread ``` (6.6% improvement vs #46092) Pass optionals by const reference (#46598) ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f1abb3988e0> torch.empty(()) All Noisy symbols removed Instructions: 601349 590005 Baseline: 4162 3736 100 runs per measurement, 1 thread ``` (6.8% improvement vs #46092) This diff atop #46598 (i.e., both together) ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f9577c22850> torch.empty(()) All Noisy symbols removed Instructions: 596095 582451 Baseline: 4162 3736 100 runs per measurement, 1 thread Warning: PyTorch was not built with debug symbols. Source information may be limited. Rebuild with REL_WITH_DEB_INFO=1 for more detailed results. ``` (another 1.3% savings!) #46598 outperformed this change slightly, and combining the two leads to further benefits. I guess we should do both! (Though I still don't understand why passing optionals that should fit in a register by const reference would help...) Reviewed By: smessmer Differential Revision: D24552280 fbshipit-source-id: 4d93bfcffafebd8c01559398513fa6b9db959d11
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]
Pull Request resolved: #46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 116544203 @override-unit-failures (Note: this ignores all push blocking failures!) Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D24219944/)!
|
This pull request has been merged in edf751c. |
Summary: Pull Request resolved: pytorch#46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 116544203 (Note: this ignores all push blocking failures!) Test Plan: vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/ after c10::optional fix: https://www.internalfb.com/intern/fblearner/details/231391773/ Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression. Reviewed By: ezyang Differential Revision: D24219944 fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
Summary: Pull Request resolved: pytorch/pytorch#46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 116544203 (Note: this ignores all push blocking failures!) Test Plan: vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/ after c10::optional fix: https://www.internalfb.com/intern/fblearner/details/231391773/ Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression. Reviewed By: ezyang Differential Revision: D24219944 fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
Summary: Pull Request resolved: pytorch/pytorch#46092 Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. ghstack-source-id: 116544203 (Note: this ignores all push blocking failures!) Test Plan: vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/ after c10::optional fix: https://www.internalfb.com/intern/fblearner/details/231391773/ Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression. Reviewed By: ezyang Differential Revision: D24219944 fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
Stack from ghstack:
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.
This PR also changes the signature of some helpers called by empty to the new style.
Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481
Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598.
Differential Revision: D24219944