[inductor][cpp] complete vectorization for int32/int64#122961
[inductor][cpp] complete vectorization for int32/int64#122961jgong5 wants to merge 40 commits intogh/jgong5/33/basefrom
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122961
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 29501d9 with merge base bce0cab ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
|
According to the fp32 inference benchmarking on the three suites, It improved perf on some of the models but caused regression on others. Need to figure out why.
|
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
**Summary** Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation. **Test Plan** ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn ``` Pull Request resolved: #130317 Approved by: https://github.com/jgong5 ghstack dependencies: #122961
|
@pytorchmergebot revert -c nosignal -m "Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d GH job link HUD commit link" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@jgong5 your PR has been successfully reverted. |
…)" This reverts commit 99b3b58. Reverted #122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d) ([comment](#122961 (comment)))
|
Adding |
**Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
… int32/int64" **Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
|
Hi @atalman, thanks for the informing. I think I have fixed the failure of |
**Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
… int32/int64" **Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]
Hi @atalman, the failure posted previously passes now and the only failure marked as unrelated in Hud. I am going to re-land these 2 PRs. Thanks. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
**Summary** Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation. **Test Plan** ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn ``` Pull Request resolved: #130317 Approved by: https://github.com/jgong5 ghstack dependencies: #122961
|
@jgong5 @leslie-fang-intel I've got an internal report of a failure that I've been able to repro. The backtrace looks like: Is this stack trace enough to allow you to debug without a full repro (the internal test is rather large and complicated). More importantly, do you have any objection to reverting while we investigate? I can help vet any fixes on the internal workload. |
|
Hi @masnesral, could you share the commit ID you used? There are several PR landed in cpp.py recently. So, I am not sure |
This PR is important to us for PyTorch 2.5, may we try to open a hotfix with new PR at first? |
| )[ | ||
| _node.args[ | ||
| 1 if _node.target == "index_expr" else 2 | ||
| ].args[0] |
There was a problem hiding this comment.
@leslie-fang-intel this is the line that raises on the internal workload
There was a problem hiding this comment.
Thanks, will try to submit a hotfix soon. Still not sure yet, how could the _node.args[1 if _node.target == "index_expr" else 2] be a str, but I guess we can bypass this special case.
There was a problem hiding this comment.
I'll keep debugging on my side; unfortunately the failing test takes a while to run :/
There was a problem hiding this comment.
Thanks, here we just want to get the name of index expression. So for all of the cases we see in the 3 test suits (TorchBench, Timms, Huggingface) we expect, take index_expr as example:
_nodeis a FX Node with targetindex_expr_node.args[1 if _node.target == "index_expr" else 2]is another FX node with targetget_index_node.args[1 if _node.target == "index_expr" else 2].args[0]is a str for the name of this index expression
For your case, it seems _node.args[1 if _node.target == "index_expr" else 2] is a str instead of the FX node.
There was a problem hiding this comment.
@leslie-fang-intel Sorry for the delay. I was trying to debug last night and I'm seeing some extremely strange behavior I can't explain. I thought I might be hallucinating and just needed sleep. When I poke around in the debugger, the element in question is not a string. Amazingly, when I change the code to the following, there is no error and the test runs successfully:
iii = 1 if _node.target == "index_expr" else 2
...
_node.args[
iii
].args[0]
What?!
Yeah, it's a little tricky because we mirror OSS commits to an internal repo asynchronously. But I was able to validate that backing out this PR removes the problem. I'll comment on the line that's raising the exception. |
I honestly don't know how difficult that is or how close we are to the deadline. Maybe @atalman can help out here (who's helped me recently with some other, unrelated hotfixes) |
Stack from ghstack (oldest at bottom):
Summary
Implement the complete vectorization of
index_exprfunctionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:index_expr/load/storeexceeds the threshold, we disable the vectorization.tiling_factor/2, we disable the vectorization.cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang