Add inline fast paths for SymInt operators by swolchok · Pull Request #161586 · pytorch/pytorch

swolchok · 2025-08-27T01:24:26Z

Stack from ghstack (oldest at bottom):

If SymInt::maybe_as_int() returns non-empty, then we get an inline
fast path. The philosophy here (as with the previous PR) is to
preserve performance in the "plain old ints" case.

I observed time spent in SymInt functions in computeStorageNBytes drop
(without the cost shifting elsewhere in the function) after this change,
when profiling detach() with Linux perf using code similar to the
benchmark from #160580.
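
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what an inline fast path guarded by maybe_as_int() can look like. It is not the code from this PR: `ToySymInt`, `add_slow_path`, `g_slow_path_calls`, and `demo` are hypothetical names invented for illustration, and the real c10::SymInt representation is more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Counts slow-path calls so the example can show which path ran.
int g_slow_path_calls = 0;

// Hypothetical stand-in for c10::SymInt (NOT the real class): it holds either
// a plain int64_t or an opaque "symbolic" marker.
class ToySymInt {
 public:
  explicit ToySymInt(int64_t v) : value_(v), symbolic_(false) {}

  // Builds a value that has no plain-int representation.
  static ToySymInt symbolic() {
    ToySymInt s(0);
    s.symbolic_ = true;
    return s;
  }

  // Non-empty only in the "plain old int" case.
  std::optional<int64_t> maybe_as_int() const {
    if (symbolic_) {
      return std::nullopt;
    }
    return value_;
  }

  // Inline fast path: when both operands are plain ints, add them right here
  // in the header so the compiler can inline the whole operation.
  ToySymInt operator+(const ToySymInt& other) const {
    if (auto lhs = maybe_as_int()) {
      if (auto rhs = other.maybe_as_int()) {
        return ToySymInt(*lhs + *rhs);
      }
    }
    return add_slow_path(other);
  }

 private:
  // Declared here, defined out of line below.
  ToySymInt add_slow_path(const ToySymInt& other) const;

  int64_t value_;
  bool symbolic_;
};

// Out-of-line slow path. In a real build this would sit in a .cpp file; in
// this sketch it only records the call and returns a symbolic result.
ToySymInt ToySymInt::add_slow_path(const ToySymInt& /*other*/) const {
  ++g_slow_path_calls;
  return ToySymInt::symbolic();
}

// Usage: plain ints never reach the slow path; a symbolic operand does.
void demo() {
  const int before = g_slow_path_calls;
  ToySymInt sum = ToySymInt(3) + ToySymInt(4);
  assert(sum.maybe_as_int().has_value());
  assert(g_slow_path_calls == before);

  ToySymInt mixed = ToySymInt(3) + ToySymInt::symbolic();
  assert(!mixed.maybe_as_int().has_value());
  assert(g_slow_path_calls == before + 1);
}
```

The design point is that the common case costs only a branch and an integer add at the call site, while anything symbolic still goes through a single out-of-line call.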

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

Differential Revision: D81530107

pytorch-bot · 2025-08-27T01:24:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161586

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

ROCm MI2xx CI/CD workflows failing due to: download from https://api.github.com/repos/pytorch/pytorch timed out.

✅ You can merge normally! (3 Unrelated Failures)

As of commit adecc7e with merge base dcf3853:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / win-vs2022-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral) (gh) (similar failure)
'Test'

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

inductor / unit-test / inductor-halide-build / build (gh) (trunk failure)
undefined reference to `NVPW_InitializeHost'
inductor / unit-test / inductor-triton-cpu-build / build (gh) (trunk failure)
undefined reference to `NVPW_InitializeHost'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. Observed time spent in SymInt functions in computeStorageNBytes to drop (and not cost shift elsewhere in the function) after this change, profiling detach() using code similar to the benchmark from #160580 and Linux perf. ghstack-source-id: 67895a5 Pull Request resolved: #161586

…161590) This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from #160580 (120ish usec -> 110ish usec) Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105) Pull Request resolved: #161590 Approved by: https://github.com/albanD ghstack dependencies: #161466, #161586

…lready know (#161591) We already know when we're called from make_wrapper_subclass or make_dtensor. The check isn't particularly cheap. Differential Revision: [D81530099](https://our.internmc.facebook.com/intern/diff/D81530099) Pull Request resolved: #161591 Approved by: https://github.com/ezyang ghstack dependencies: #161466, #161586, #161590

…161595) This seems to have been an especially slow one because of the repeated pybind access (schema is a pybind, as is arguments, and then we hit each argument). It's still ~~1% of total benchmark runtime because of the repeated single pybind function call, but that's a lot better. Differential Revision: [D81530095](https://our.internmc.facebook.com/intern/diff/D81530095) Pull Request resolved: #161595 Approved by: https://github.com/ezyang, https://github.com/bdhirsh ghstack dependencies: #161466, #161586, #161590, #161591

If SymInt::maybe_as_int() returns non-empty, then we get an inline fast path. The philosophy here (as with the previous PR) is to preserve performance in the "plain old ints" case. Observed time spent in SymInt functions in computeStorageNBytes to drop (and not cost shift elsewhere in the function) after this change, profiling detach() using code similar to the benchmark from pytorch#160580 and Linux perf. Differential Revision: [D81530107](https://our.internmc.facebook.com/intern/diff/D81530107) Pull Request resolved: pytorch#161586 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161466

…ue()? is_symbolic() appears to be inconsistent with the rest of the interface currently. This is a behavior change, but I believe the old behavior was a bug. Please review carefully. Motivation: #161586 (comment) Differential Revision: [D86982216](https://our.internmc.facebook.com/intern/diff/D86982216/) [ghstack-poisoned]

swolchok · 2025-11-15T00:11:42Z

SymInt bugs

I attempted to fix this in #167759 (CC @ezyang), but the assertion I added to make sure we aren't setting SymInt nbytes on non-meta tensor storage (per Ed) is firing.

…ue()? is_symbolic() appears to be inconsistent with the rest of the interface currently. This is a behavior change, but I believe the old behavior was a bug. Please review carefully. Motivation: pytorch/pytorch#161586 (comment) Differential Revision: [D86982216](https://our.internmc.facebook.com/intern/diff/D86982216/) ghstack-source-id: 323055106 Pull Request resolved: pytorch/pytorch#167758

This was referenced Aug 27, 2025

Fix non-const reference arguments in torch/csrc/jit/python/init.cpp #161300

Closed

Fix forced copying def_property_readonly for FunctionSchema & friends #161301

Closed

Stop accessing func._schema in _python_dispatch.correct_storage_aliasing #161292

Closed

swolchok mentioned this pull request Aug 26, 2025

Use is, not ==, to check exact type matches in _python_dispatch #161304

Closed

pytorch-bot Bot added the ciflow/inductor label Aug 27, 2025

swolchok mentioned this pull request Aug 27, 2025

Add C++ function to accelerate DTensor.__new__ #161588

Closed

swolchok added the topic: not user facing topic category label Aug 27, 2025

swolchok requested review from Skylion007, ezyang and malfet August 27, 2025 03:43

This was referenced Sep 4, 2025

[easy] Don't force copy result of getAllOperatorsFor in init.cpp #162218

Closed

Overload _get_operation_for_overload_or_packet & friends to accept ArrayRef #162219

Closed

Dynamo: set_eval_frame microoptimization #162220

Closed

This was referenced Sep 5, 2025

Add DISABLE_JUSTKNOBS to torch/_utils_internal.py and use it for dynamo _maybe_set_eval_frame #162298

Closed

Fix TODO in make_tensor_for_subclass_helper #162336

Closed

Remove __torch_dispatch__ check in THPVariable_make_dtensor #162337

Closed

github-actions Bot deleted the gh/swolchok/810/head branch October 4, 2025 02:05

swolchok mentioned this pull request Nov 13, 2025

Shouldn't SymInt::is_symbolic() be defined as !maybe_as_int().has_value()? #167758

Closed

swolchok wants to merge 5 commits into gh/swolchok/810/base from gh/swolchok/810/head
