Skip to content

Add inline fast paths for SymInt operators#161586

Closed
swolchok wants to merge 5 commits intogh/swolchok/810/basefrom
gh/swolchok/810/head
Closed

Add inline fast paths for SymInt operators#161586
swolchok wants to merge 5 commits intogh/swolchok/810/basefrom
gh/swolchok/810/head

Conversation

@swolchok
Copy link
Copy Markdown
Contributor

@swolchok swolchok commented Aug 27, 2025

Stack from ghstack (oldest at bottom):

If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from #160580
and Linux perf.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Differential Revision: D81530107

If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from #160580
and Linux perf.

[ghstack-poisoned]
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Aug 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161586

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (3 Unrelated Failures)

As of commit adecc7e with merge base dcf3853 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok added a commit that referenced this pull request Aug 27, 2025
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from #160580
and Linux perf.

ghstack-source-id: 67895a5
Pull Request resolved: #161586
@swolchok swolchok added the topic: not user facing topic category label Aug 27, 2025
pytorchmergebot pushed a commit that referenced this pull request Sep 5, 2025
…161590)

This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from #160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: #161590
Approved by: https://github.com/albanD
ghstack dependencies: #161466, #161586
daisyden pushed a commit to daisyden/pytorch that referenced this pull request Sep 8, 2025
…ytorch#161590)

This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from pytorch#160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: pytorch#161590
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#161466, pytorch#161586
pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025
…lready know (#161591)

We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: #161591
Approved by: https://github.com/ezyang
ghstack dependencies: #161466, #161586, #161590
pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2025
…161595)

This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: #161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: #161466, #161586, #161590, #161591
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from pytorch#160580
and Linux perf.

Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: pytorch#161586
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…ytorch#161590)

This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from pytorch#160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: pytorch#161590
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#161466, pytorch#161586
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…lready know (pytorch#161591)

We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: pytorch#161591
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…ytorch#161595)

This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: pytorch#161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from pytorch#160580
and Linux perf.

Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: pytorch#161586
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…ytorch#161590)

This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from pytorch#160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: pytorch#161590
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#161466, pytorch#161586
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…lready know (pytorch#161591)

We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: pytorch#161591
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…ytorch#161595)

This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: pytorch#161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…ytorch#161590)

This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from pytorch#160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: pytorch#161590
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#161466, pytorch#161586
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…lready know (pytorch#161591)

We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: pytorch#161591
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…ytorch#161595)

This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: pytorch#161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

Observed time spent in SymInt functions in computeStorageNBytes to
drop (and not cost shift elsewhere in the function) after this change,
profiling detach() using code similar to the benchmark from pytorch#160580
and Linux perf.

Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107)
Pull Request resolved: pytorch#161586
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…ytorch#161590)

This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from pytorch#160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: pytorch#161590
Approved by: https://github.com/albanD
ghstack dependencies: pytorch#161466, pytorch#161586
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…lready know (pytorch#161591)

We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap.

Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099)
Pull Request resolved: pytorch#161591
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…ytorch#161595)

This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better.

Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095)

Pull Request resolved: pytorch#161595
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
ghstack dependencies: pytorch#161466, pytorch#161586, pytorch#161590, pytorch#161591
@github-actions github-actions Bot deleted the gh/swolchok/810/head branch October 4, 2025 02:05
swolchok added a commit that referenced this pull request Nov 13, 2025
…ue()?

is_symbolic() appears to be inconsistent with the rest of the interface currently. This is a behavior change, but I believe the old behavior was a bug. Please review carefully.

Motivation: #161586 (comment)

Differential Revision: [D86982216](https://our.internmc.facebook.com/intern/diff/D86982216/)

[ghstack-poisoned]
@swolchok
Copy link
Copy Markdown
Contributor Author

SymInt bugs

Attempted to fix this in #167759 (CC @ezyang), but the assertion I added to attempt to make sure we aren't setting SymInt nbytes on non-meta tensor storage (per Ed) is firing.

Khanaksahu pushed a commit to Khanaksahu/pytorch-fork that referenced this pull request Nov 17, 2025
…ue()?

is_symbolic() appears to be inconsistent with the rest of the interface currently. This is a behavior change, but I believe the old behavior was a bug. Please review carefully.

Motivation: pytorch/pytorch#161586 (comment)

Differential Revision: [D86982216](https://our.internmc.facebook.com/intern/diff/D86982216/)

ghstack-source-id: 323055106
Pull Request resolved: pytorch/pytorch#167758
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants