[cpu] Modify inductor opt flag --- ftree-loop-vectorize#121782
Valentine233 wants to merge 6 commits into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121782
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit ae50a74 with merge base e7141d1.
NEW FAILURE - The following job has failed.
BROKEN TRUNK - The following job failed but was already present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jgong5
left a comment
Please share performance numbers to make sure there is no regression.
    if not config.cpp.enable_tree_loop_vec_opt_flag:
        base_flags += " -fno-tree-loop-vectorize"
Please add the issue links as a comment to explain why we have to disable this by default.
Thanks and added!
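The diff under review gates the flag on a config option. A minimal sketch of that pattern (the `CppConfig` class, default value, and base flags below are illustrative, not the actual `torch._inductor.config` definitions):

```python
from dataclasses import dataclass


@dataclass
class CppConfig:
    # When False (the proposed default), tree-loop vectorization is
    # explicitly disabled to avoid known correctness issues
    # (see #115261 and #113017).
    enable_tree_loop_vec_opt_flag: bool = False


def build_cpp_flags(cfg: CppConfig) -> str:
    # Base optimization flags are placeholders for this sketch.
    base_flags = "-O3 -ffast-math"
    if not cfg.enable_tree_loop_vec_opt_flag:
        base_flags += " -fno-tree-loop-vectorize"
    return base_flags


print(build_cpp_flags(CppConfig()))
```

Keeping the opt-in behind a config knob lets users who are unaffected by the functional issues re-enable the vectorization pass without patching the flag string.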
resnet50,float32,dynamic,default,2.28794694
timm_efficientnet,float32,static,cpp,2.72195686
mobilenet_v3_large,float32,static,cpp,3.02274304
mobilenet_v3_large,float32,static,cpp,2.9000000
To pass CI, the expected speedup is modified. According to the validation, this model doesn't have a perf regression.
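The CI check in question compares measured speedups against expected values recorded in a CSV. A hedged sketch of such a check (field order taken from the rows above; the relative-tolerance value is an assumption, not the actual CI threshold):

```python
import csv
import io

# Expected-speedup rows in the same shape as the snippet above:
# model,dtype,mode,backend,expected_speedup
EXPECTED = "mobilenet_v3_large,float32,static,cpp,2.9000000\n"


def passes(measured: float, expected: float, rel_tol: float = 0.1) -> bool:
    # A model "passes" if the measured speedup is no more than
    # rel_tol below the recorded expectation (tolerance assumed).
    return measured >= expected * (1.0 - rel_tol)


rows = list(csv.reader(io.StringIO(EXPECTED)))
model, dtype, mode, backend, exp = rows[0]
print(model, passes(2.95, float(exp)))
```

Lowering the recorded expectation, as done in the diff above, widens the margin so a model whose measured speedup dipped slightly still passes.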
3612edd to ae50a74
@Valentine233 please help to check whether we can enable the vectorization for the regression models.
Updated in PR description.
We may wait until all the regressions are fixed, if they can be solved in the short term.
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as
Reopen #121782, as more optimizations have landed. Fixes #115261, #113017.

For the CPU inductor path, remove `-ftree-loop-vectorize` from the optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32

[speedup chart image]

Outlier models (speedup < 0.8, single socket): none.

#### BF16

[speedup chart image]

Outlier models (speedup < 0.8, single socket, multi-threaded):

- functorch_dp_cifar10: 0.58
- opacus_cifar10: 0.57

Pull Request resolved: #136827
Approved by: https://github.com/jansel, https://github.com/jgong5
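The outlier criterion used in the validation (speedup < 0.8) is straightforward to apply to a results table. A sketch using the two BF16 outliers reported in the description plus one invented non-outlier entry:

```python
# Filter benchmark results down to outlier models (speedup < 0.8),
# mirroring the criterion stated in the PR description. The two
# outlier values come from the description; resnet50 is illustrative.
results = {
    "functorch_dp_cifar10": 0.58,
    "opacus_cifar10": 0.57,
    "resnet50": 2.29,  # invented non-outlier for contrast
}

outliers = {model: s for model, s in results.items() if s < 0.8}

for model, speedup in sorted(outliers.items()):
    print(f"{model}: {speedup}")
```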
Fixes #115261, #113017.

For the CPU inductor path, remove `-ftree-loop-vectorize` from the optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32

Outlier models (speedup < 0.8, single socket):

- atomic_add (scatter_add) @CaoE
- index_expr (batch_norm): expected to be fixed by [inductor][cpp] complete vectorization for int32/int64 #122961

#### BF16

Outlier models (speedup < 0.8, single socket):

- atomic_add (scatter_add) @CaoE
- index_expr (batch_norm): expected to be fixed by [inductor][cpp] complete vectorization for int32/int64 #122961
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang