Fix forced copying def_property_readonly for FunctionSchema & friends#161301
Fix forced copying def_property_readonly for FunctionSchema & friends#161301swolchok wants to merge 6 commits intogh/swolchok/797/basefrom
Conversation
This took me a bit to figure out and I'm pretty sure I've looked at this code before. Pybind uses `return_value_policy::reference_internal` for `def_property`, which [causes the owning object to be kept alive for the lifespan of the return value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html), allowing the getter to safely avoid copying the property value. However, lambdas act like they return `auto`, not `decltype(auto)`, so our lambdas themselves were forcing copies! Testing: observed std::vector<Argument> copying disappear in Linux perf profile of someOpInfo._schema.arguments/returns (in _python_dispatch.correct_storage_aliasing). [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161301
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ You can merge normally! (2 Unrelated Failures)As of commit 6ecf852 with merge base 05eeb29 ( FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…a & friends" This took me a bit to figure out and I'm pretty sure I've looked at this code before. Pybind uses `return_value_policy::reference_internal` for `def_property`, which [causes the owning object to be kept alive for the lifespan of the return value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html), allowing the getter to safely avoid copying the property value. However, lambdas act like they return `auto`, not `decltype(auto)`, so our lambdas themselves were forcing copies! Testing: observed std::vector<Argument> copying disappear in Linux perf profile of someOpInfo._schema.arguments/returns (in _python_dispatch.correct_storage_aliasing). cc EikanWang jgong5 wenzhe-nrv sanchitintel [ghstack-poisoned]
…patch.correct_storage_aliasing" func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. [ghstack-poisoned]
…orage_aliasing" func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. [ghstack-poisoned]
torch/csrc/jit/python/init.cpp
Outdated
| .def_property_readonly( | ||
| "name", [](const FunctionSchema& self) { return self.name(); }) | ||
| "name", | ||
| [](const FunctionSchema& self) -> decltype(auto) { |
There was a problem hiding this comment.
May want to add a comment about why the explicit rvalue is important here even if it looks like it's implicitly inferred. An overzealous refactor could easily see these as redundant and remove them. I for sure wouldn't know the difference without an explanation in the comments.
There was a problem hiding this comment.
You can also just explicitly specify the return value type as another arg to def can't you?
There was a problem hiding this comment.
You can also just explicitly specify the return value type as another arg to def can't you?
I would not be surprised if an attempt to do that ended up trying to take a const reference to a temporary. Being explicit should be simple and effective.
(next ghstack update will add a comment)
…a & friends" This took me a bit to figure out and I'm pretty sure I've looked at this code before. Pybind uses `return_value_policy::reference_internal` for `def_property`, which [causes the owning object to be kept alive for the lifespan of the return value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html), allowing the getter to safely avoid copying the property value. However, lambdas act like they return `auto`, not `decltype(auto)`, so our lambdas themselves were forcing copies! Testing: observed std::vector<Argument> copying disappear in Linux perf profile of someOpInfo._schema.arguments/returns (in _python_dispatch.correct_storage_aliasing). cc EikanWang jgong5 wenzhe-nrv sanchitintel [ghstack-poisoned]
…ing (#161292) func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though #161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. Pull Request resolved: #161292 Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: #161301
…61304) `is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check. Pull Request resolved: #161304 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: #161301, #161292
…#161308) - Empty containers are Falsey - Hoist cheap checks first - Microbenchmarked single-element set access method Benchmark code: ``` import timeit to_test = [ ('list(x)', 'x = set([3])'), ('x[0]', 'x = [3]'), ('list(x)[0]', 'x = set([3])'), ('next(iter(x))', 'x = set([3])'), ] for (stmt, setup) in to_test: res = timeit.timeit(stmt=stmt, setup=setup) print(f"Time for `{stmt}`: {res}") ``` Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux) ``` Time for `list(x)`: 0.03418 Time for `x[0]`: 0.00852 Time for `list(x)[0]`: 0.03561 Time for `next(iter(x))`: 0.02278 ``` FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark! Pull Request resolved: #161308 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304
) Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation. This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish. Pull Request resolved: #161315 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308
…61317) This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly. Pull Request resolved: #161317 Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308, #161315
`auto` forces a copy. Confirmed this did something noticable with perf. Pull Request resolved: #161329 Approved by: https://github.com/zpcore, https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328
If we want them interned, we should intern at callsites. (The numpy reference has bit rotted; see numpy/numpy@b222eb6#diff-6bdb6105198083838f51c57b55b3a49472ed23043bb40018f1ea41138e687163) Profiling a simple torchdispatch benchmark with perf before/after seems to show that time spent copying std::strings and interning Python strings is gone, though there is some noise and the improvement is very small. Pull Request resolved: #161432 Approved by: https://github.com/ezyang ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329
) symbools are not identical with Py_True or PyFalse, so we can do those cheap checks first and at least get plain old bools to go fast. Pull Request resolved: #161455 Approved by: https://github.com/Skylion007 ghstack dependencies: #161301, #161292, #161304, #161308, #161315, #161317, #161328, #161329, #161432
…pytorch#161301) This took me a bit to figure out and I'm pretty sure I've looked at this code before. Pybind uses `return_value_policy::reference_internal` for `def_property`, which [causes the owning object to be kept alive for the lifespan of the return value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html), allowing the getter to safely avoid copying the property value. However, lambdas act like they return `auto`, not `decltype(auto)`, so our lambdas themselves were forcing copies! Testing: observed std::vector<Argument> copying disappear in Linux perf profile of someOpInfo._schema.arguments/returns (in _python_dispatch.correct_storage_aliasing). Pull Request resolved: pytorch#161301 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab
…ing (pytorch#161292) func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though pytorch#161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. Pull Request resolved: pytorch#161292 Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: pytorch#161301
…torch#161304) `is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check. Pull Request resolved: pytorch#161304 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292
…pytorch#161308) - Empty containers are Falsey - Hoist cheap checks first - Microbenchmarked single-element set access method Benchmark code: ``` import timeit to_test = [ ('list(x)', 'x = set([3])'), ('x[0]', 'x = [3]'), ('list(x)[0]', 'x = set([3])'), ('next(iter(x))', 'x = set([3])'), ] for (stmt, setup) in to_test: res = timeit.timeit(stmt=stmt, setup=setup) print(f"Time for `{stmt}`: {res}") ``` Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux) ``` Time for `list(x)`: 0.03418 Time for `x[0]`: 0.00852 Time for `list(x)[0]`: 0.03561 Time for `next(iter(x))`: 0.02278 ``` FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark! Pull Request resolved: pytorch#161308 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304
…rch#161315) Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation. This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish. Pull Request resolved: pytorch#161315 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308
…torch#161317) This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly. Pull Request resolved: pytorch#161317 Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315
…61328) Not a huge cost, but free win is free. Pull Request resolved: pytorch#161328 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317
`auto` forces a copy. Confirmed this did something noticable with perf. Pull Request resolved: pytorch#161329 Approved by: https://github.com/zpcore, https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317, pytorch#161328
…h#161432) If we want them interned, we should intern at callsites. (The numpy reference has bit rotted; see numpy/numpy@b222eb6#diff-6bdb6105198083838f51c57b55b3a49472ed23043bb40018f1ea41138e687163) Profiling a simple torchdispatch benchmark with perf before/after seems to show that time spent copying std::strings and interning Python strings is gone, though there is some noise and the improvement is very small. Pull Request resolved: pytorch#161432 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317, pytorch#161328, pytorch#161329
…rch#161455) symbools are not identical with Py_True or PyFalse, so we can do those cheap checks first and at least get plain old bools to go fast. Pull Request resolved: pytorch#161455 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317, pytorch#161328, pytorch#161329, pytorch#161432
…pytorch#161301) This took me a bit to figure out and I'm pretty sure I've looked at this code before. Pybind uses `return_value_policy::reference_internal` for `def_property`, which [causes the owning object to be kept alive for the lifespan of the return value](https://pybind11.readthedocs.io/en/stable/advanced/functions.html), allowing the getter to safely avoid copying the property value. However, lambdas act like they return `auto`, not `decltype(auto)`, so our lambdas themselves were forcing copies! Testing: observed std::vector<Argument> copying disappear in Linux perf profile of someOpInfo._schema.arguments/returns (in _python_dispatch.correct_storage_aliasing). Pull Request resolved: pytorch#161301 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/wconstab
…ing (pytorch#161292) func._schema is a pybind, accessing the arguments/returns is expensive, we have no reason to do it anyway, and even though pytorch#161301 makes accessing the arguments/returns less expensive, this still seems to improve performance. Pull Request resolved: pytorch#161292 Approved by: https://github.com/wconstab, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: pytorch#161301
…torch#161304) `is` checks object identity and is more efficient. Google seems to confirm it is the correct way to do an exact type check. Pull Request resolved: pytorch#161304 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292
…pytorch#161308) - Empty containers are Falsey - Hoist cheap checks first - Microbenchmarked single-element set access method Benchmark code: ``` import timeit to_test = [ ('list(x)', 'x = set([3])'), ('x[0]', 'x = [3]'), ('list(x)[0]', 'x = set([3])'), ('next(iter(x))', 'x = set([3])'), ] for (stmt, setup) in to_test: res = timeit.timeit(stmt=stmt, setup=setup) print(f"Time for `{stmt}`: {res}") ``` Result with Python 3.13 on Mac (with excess digits manually trimmed; directionally matches result on Linux) ``` Time for `list(x)`: 0.03418 Time for `x[0]`: 0.00852 Time for `list(x)[0]`: 0.03561 Time for `next(iter(x))`: 0.02278 ``` FWIW, I was surprised by this result, so I guess I'm glad I wrote the benchmark! Pull Request resolved: pytorch#161308 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304
…rch#161315) Scanning a list of pybind enums with `in` is slow. See NOTE in code for full explanation. This is a significant optimization; will be updating the torchdispatch/return_and_correct_aliasing portion of this stack with benchmark and results soonish. Pull Request resolved: pytorch#161315 Approved by: https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308
…torch#161317) This assertion was expensive because of is_traceable_wrapper_subclass. Finding a cheap check to run first that's likely to let us skip the rest seems to improve things significantly. Pull Request resolved: pytorch#161317 Approved by: https://github.com/ezyang, https://github.com/XilunWu, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315
…61328) Not a huge cost, but free win is free. Pull Request resolved: pytorch#161328 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317
`auto` forces a copy. Confirmed this did something noticable with perf. Pull Request resolved: pytorch#161329 Approved by: https://github.com/zpcore, https://github.com/fduwjj, https://github.com/Skylion007, https://github.com/bdhirsh ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317, pytorch#161328
…h#161432) If we want them interned, we should intern at callsites. (The numpy reference has bit rotted; see numpy/numpy@b222eb6#diff-6bdb6105198083838f51c57b55b3a49472ed23043bb40018f1ea41138e687163) Profiling a simple torchdispatch benchmark with perf before/after seems to show that time spent copying std::strings and interning Python strings is gone, though there is some noise and the improvement is very small. Pull Request resolved: pytorch#161432 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317, pytorch#161328, pytorch#161329
…rch#161455) symbools are not identical with Py_True or PyFalse, so we can do those cheap checks first and at least get plain old bools to go fast. Pull Request resolved: pytorch#161455 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#161301, pytorch#161292, pytorch#161304, pytorch#161308, pytorch#161315, pytorch#161317, pytorch#161328, pytorch#161329, pytorch#161432
Stack from ghstack (oldest at bottom):
is, not ==, to check exact type matches in _python_dispatch #161304This took me a bit to figure out and I'm pretty sure I've looked at
this code before. Pybind uses
return_value_policy::reference_internalfordef_property, whichcauses the owning object to be kept alive for the lifespan of the
return
value,
allowing the getter to safely avoid copying the property
value. However, lambdas act like they return
auto, notdecltype(auto), so our lambdas themselves were forcing copies!Testing: observed std::vector copying disappear in Linux
perf profile of someOpInfo._schema.arguments/returns (in
_python_dispatch.correct_storage_aliasing).
cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel