Skip to content

Make empty c10-full#46092

Closed
smessmer wants to merge 29 commits intogh/smessmer/261/basefrom
gh/smessmer/261/head
Closed

Make empty c10-full#46092
smessmer wants to merge 29 commits intogh/smessmer/261/basefrom
gh/smessmer/261/head

Conversation

@smessmer
Copy link
Copy Markdown
Contributor

@smessmer smessmer commented Oct 9, 2020

Stack from ghstack:

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598.

Differential Revision: D24219944

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
@dr-ci
Copy link
Copy Markdown

dr-ci bot commented Oct 9, 2020

💊 CI failures summary and remediations

As of commit dfb7acb (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 223 times.

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 9, 2020
Pull Request resolved: #46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 113982141

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 12, 2020
Pull Request resolved: #46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 114074272

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
@smessmer smessmer requested a review from apaszke as a code owner October 13, 2020 14:30
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 13, 2020
Pull Request resolved: #46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 114177162

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)
Copy link
Copy Markdown

@bhosmer bhosmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[edit: at least some of the test failures look real, happy to re-review if they need nontrivial further changes ofc]

Nice to have empty() back in the c10-full family 😁 one nontrivial suggestion inline (I'd pitch removing that new assert), otherwise just formatting notes.

Oh also I can't remember, were you able to confirm that this is ~instruction-count neutral?

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 28, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 28, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 28, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 29, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Oct 29, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Nov 2, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Nov 5, 2020
Summary:
Pull Request resolved: #47015

c10::optional has non-trivial copy and move operations always. This change specializes it for 32-bit scalars so that it has trivial copy and move operations in that case. Ideally, we would instead rely on P0602 "variant and optional should propagate copy/move triviality" and use `std::optional` (or implement that functionality ourselves). We can't use `std::optional` because we are stuck with C++14. Implementing the full P0602 ourselves would add even more complexity. We could do it, but this should be a helpful first step.
ghstack-source-id: 115886743

Test Plan:
Collect Callgrind instruction counts for `torch.empty(())`. Data:

Make empty c10-ful (#46092):

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7ffaed1128e0>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       648005                     632899
    Baseline:             4144                       3736
100 runs per measurement, 1 thread
```

This diff atop #46092:

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f943f1dc8e0>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       602347                     591005
    Baseline:             4106                       3736
100 runs per measurement, 1 thread
```

(6.6% improvement vs #46092)

Pass optionals by const reference (#46598)

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f1abb3988e0>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       601349                     590005
    Baseline:             4162                       3736
100 runs per measurement, 1 thread
```
(6.8% improvement vs #46092)

This diff atop #46598 (i.e., both together)

```
<torch.utils.benchmark.utils.valgrind_wrapper.timer_interface.CallgrindStats object at 0x7f9577c22850>
torch.empty(())
                           All          Noisy symbols removed
    Instructions:       596095                     582451
    Baseline:             4162                       3736
100 runs per measurement, 1 thread
Warning: PyTorch was not built with debug symbols.
         Source information may be limited. Rebuild with
         REL_WITH_DEB_INFO=1 for more detailed results.
```

(another 1.3% savings!)

#46598 outperformed this change slightly, and combining the two leads to further benefits. I guess we should do both! (Though I still don't understand why passing optionals that should fit in a register by const reference would help...)

Reviewed By: smessmer

Differential Revision: D24552280

fbshipit-source-id: 4d93bfcffafebd8c01559398513fa6b9db959d11
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Nov 5, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Nov 9, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Nov 10, 2020

This fixes perf issues due to optional<T> not being trivially relocatable even if T is trivially relocatable. We had seen a 4% regression because of that in #46092 and this PR here is stacked on top and improves perf by 6%, so both together improve perf by 2% (measured by callgrind instruction counts). 

See for details: https://fb.workplace.com/groups/pytorch.dev/permalink/768545893723893/
Benchmark: https://www.internalfb.com/intern/paste/P146297481/


Differential Revision: [D24425836](https://our.internmc.facebook.com/intern/diff/D24425836/)

[ghstack-poisoned]
Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.

Bench summary: With noisy symbols removed, instruction count goes from 661k to 692k, which is a 4-5% regression.
This regression is fixed in #46598 and a combination of both PRs actually improves perf by 2-3%, see P146297481

Update: with @swolchok's fixes to c10::optional, the regression went down to 2%. We decided to not land #46598.

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

[ghstack-poisoned]
smessmer added a commit that referenced this pull request Nov 12, 2020
Pull Request resolved: #46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203


@override-unit-failures
(Note: this ignores all push blocking failures!)

Differential Revision: [D24219944](https://our.internmc.facebook.com/intern/diff/D24219944/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D24219944/)!
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in edf751c.

@facebook-github-bot facebook-github-bot deleted the gh/smessmer/261/head branch November 16, 2020 15:17
tugsbayasgalan pushed a commit to tugsbayasgalan/pytorch that referenced this pull request Nov 16, 2020
Summary:
Pull Request resolved: pytorch#46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203

(Note: this ignores all push blocking failures!)

Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/

after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/

Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.

Reviewed By: ezyang

Differential Revision: D24219944

fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Pull Request resolved: pytorch/pytorch#46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203

(Note: this ignores all push blocking failures!)

Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/

after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/

Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.

Reviewed By: ezyang

Differential Revision: D24219944

fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary:
Pull Request resolved: pytorch/pytorch#46092

Make empty c10-full without using hacky-wrapper, i.e. port the kernel to the new style signature.

This PR also changes the signature of some helpers called by empty to the new style.
ghstack-source-id: 116544203

(Note: this ignores all push blocking failures!)

Test Plan:
vs prev diff (outdated, before c10::optional fix): https://www.internalfb.com/intern/fblearner/details/224735103/

after c10::optional fix:
https://www.internalfb.com/intern/fblearner/details/231391773/

Also, after the c10::optional fix, the instruction counting benchmark shows a 2% regression for calling empty from Python. We decided this is acceptable and decided against landing D24425836 which would fix the regression.

Reviewed By: ezyang

Differential Revision: D24219944

fbshipit-source-id: e554096e90ce438c75b679131c3151ff8e5c5d50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants