Pass optional by reference by smessmer · Pull Request #46598 · pytorch/pytorch

smessmer · 2020-10-20T18:07:32Z

Stack from ghstack:

Pass optional by reference #46598 Pass optional by reference
Make empty c10-full #46092 Make empty c10-full

This fixes perf issues due to optional not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts).

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/

Differential Revision: D24425836

Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Pull Request resolved: #46598 ghstack-source-id: 114736156 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

dr-ci · 2020-10-20T19:33:23Z

💊 CI failures summary and remediations

As of commit e0bb8af (more details on the Dr. CI page):

2/3 failures possibly* introduced in this PR
- 1/2 non-CircleCI failure(s)
1/3 broken upstream at merge base 6248e06 from Nov 06 until Nov 09 (50 commits; 160db3d - 4159191)

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.

binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build from Nov 06 until Nov 09 (50 commits; 160db3d - 4159191)
- 🔁 rerun

Extra GitHub checks: 1 failed

Failed: GitHub Actions - flake8-py3

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 165 times.

Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Pull Request resolved: #46598 ghstack-source-id: 114801324 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

bhosmer · 2020-10-21T18:30:56Z

At least some test failures look real though

ezyang · 2020-10-21T19:46:05Z

Probably bc breaking for backend implementers (I see xla failing, at the very least)

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 114925816 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115362982 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

swolchok · 2020-10-28T21:01:51Z

#47015 is an alternative to this change.

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115468618 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115684330 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

Summary: Pull Request resolved: #47015 c10::optional has non-trivial copy and move operations always. This change specializes it for 32-bit scalars so that it has trivial copy and move operations in that case. Ideally, we would instead rely on P0602 "variant and optional should propagate copy/move triviality" and use `std::optional` (or implement that functionality ourselves). We can't use `std::optional` because we are stuck with C++14. Implementing the full P0602 ourselves would add even more complexity. We could do it, but this should be a helpful first step. ghstack-source-id: 115886743 Test Plan: Collect Callgrind instruction counts for `torch.empty(())`. Data: Make empty c10-ful (#46092): ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7ffaed1128e0> torch.empty(()) All Noisy symbols removed Instructions: 648005 632899 Baseline: 4144 3736 100 runs per measurement, 1 thread ``` This diff atop #46092: ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f943f1dc8e0> torch.empty(()) All Noisy symbols removed Instructions: 602347 591005 Baseline: 4106 3736 100 runs per measurement, 1 thread ``` (6.6% improvement vs #46092) Pass optionals by const reference (#46598) ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f1abb3988e0> torch.empty(()) All Noisy symbols removed Instructions: 601349 590005 Baseline: 4162 3736 100 runs per measurement, 1 thread ``` (6.8% improvement vs #46092) This diff atop #46598 (i.e., both together) ``` <torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f9577c22850> torch.empty(()) All Noisy symbols removed Instructions: 596095 582451 Baseline: 4162 3736 100 runs per measurement, 1 thread Warning: PyTorch was not built with debug symbols. Source information may be limited. Rebuild with REL_WITH_DEB_INFO=1 for more detailed results. ``` (another 1.3% savings!) #46598 outperformed this change slightly, and combining the two leads to further benefits. I guess we should do both! (Though I still don't understand why passing optionals that should fit in a register by const reference would help...) Reviewed By: smessmer Differential Revision: D24552280 fbshipit-source-id: 4d93bfcffafebd8c01559398513fa6b9db959d11

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 115948320 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ Benchmark: https://www.internalfb.com/intern/paste/P146297481/ Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 116189891 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

smessmer · 2020-11-09T17:28:20Z

We decided to not land this because the benefit is much smaller after @swolchok's fixes to c10::optional (only 1.2% improvement now) and since passing by value is better code style.

@swolchok

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

@swolchok

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pull Request resolved: #46598 This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in D24219944 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/ ghstack-source-id: 116189891 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

@swolchok

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

@swolchok

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature. This PR also changes the signature of some helpers called by empty to the new style. Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression. This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481 Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598. Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/) [ghstack-poisoned]

Pass optional by reference

7531399

Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

smessmer requested review from albanD and apaszke as code owners October 20, 2020 18:07

smessmer mentioned this pull request Oct 20, 2020

Make empty c10-full #46092

Closed

Update on "Pass optional by reference"

dfa571d

Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

smessmer added a commit that referenced this pull request Oct 20, 2020

Pass optional by reference

c121ef4

Pull Request resolved: #46598 ghstack-source-id: 114736156 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

Update on "Pass optional by reference"

9ab3ae6

Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/) [ghstack-poisoned]

smessmer added a commit that referenced this pull request Oct 21, 2020

Pass optional by reference

25e8521

Pull Request resolved: #46598 ghstack-source-id: 114801324 Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

smessmer requested review from bhosmer and ezyang October 21, 2020 12:53

bhosmer approved these changes Oct 21, 2020

View reviewed changes

ezyang added the module: bc-breaking Related to a BC-breaking change label Oct 21, 2020

swolchok mentioned this pull request Oct 28, 2020

[Pytorch] Specialize guts of c10::optional for 32-bit scalars #47015

Closed

facebook-github-bot added the cla signed label Oct 30, 2020

smessmer closed this Nov 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pass optional by reference#46598

Pass optional by reference#46598
smessmer wants to merge 22 commits intogh/smessmer/263/basefrom
gh/smessmer/263/head

smessmer commented Oct 20, 2020 •

edited

Loading

Uh oh!

dr-ci bot commented Oct 20, 2020 •

edited

Loading

Uh oh!

bhosmer commented Oct 21, 2020

Uh oh!

ezyang commented Oct 21, 2020

Uh oh!

swolchok commented Oct 28, 2020

Uh oh!

smessmer commented Nov 9, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

smessmer commented Oct 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dr-ci bot commented Oct 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CI failures summary and remediations

XLA failure

🚧 1 fixed upstream failure:

Extra GitHub checks: 1 failed

Uh oh!

bhosmer commented Oct 21, 2020

Uh oh!

ezyang commented Oct 21, 2020

Uh oh!

swolchok commented Oct 28, 2020

Uh oh!

smessmer commented Nov 9, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

smessmer commented Oct 20, 2020 •

edited

Loading

dr-ci bot commented Oct 20, 2020 •

edited

Loading