
[inductor] enable mkldnn op weight pre-packing on aarch64#115037

Closed
snadampal wants to merge 1 commit into pytorch:main from snadampal:aarch64_torch_inductor

Conversation

@snadampal
Collaborator

@snadampal commented Dec 3, 2023

This PR enables the fx passes and mkldnn optimizations for aarch64. It improved BERT inference performance by up to 5.8x on an AWS c7g instance when comparing torch.compile() against the no-compile path. This is enabled when PyTorch is built with the USE_MKLDNN_ACL option for aarch64.
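For context, here is a minimal pure-Python sketch of the idea behind weight pre-packing. The function names and the blocked layout are illustrative only, not the mkldnn/ACL implementation: the point is that the weight is reordered once into a kernel-friendly layout ahead of time, then reused on every inference instead of being reordered per call.

```python
def prepack_weight(weight, block=4):
    """Reorder a row-major matrix into groups of `block` columns.

    Real mkldnn/ACL packing picks a hardware-specific blocked layout;
    this only illustrates the pack-once / reuse-many-times idea.
    """
    rows, cols = len(weight), len(weight[0])
    packed = []
    for c0 in range(0, cols, block):
        for r in range(rows):
            packed.extend(weight[r][c0:c0 + block])
    return packed

def linear_with_packed(x, packed, rows, cols, block=4):
    """Matrix-vector product that reads the weight in its packed order."""
    out = [0.0] * rows
    i = 0
    for c0 in range(0, cols, block):
        width = min(block, cols - c0)
        for r in range(rows):
            for j in range(width):
                out[r] += packed[i] * x[c0 + j]
                i += 1
    return out

w = [[1.0, 2.0], [3.0, 4.0]]  # toy 2x2 weight
packed = prepack_weight(w, block=2)  # done once, ahead of time
y = linear_with_packed([1.0, 1.0], packed, rows=2, cols=2, block=2)
# y == [3.0, 7.0], matching the plain matmul of w with [1.0, 1.0]
```

The compile-time cost of packing is paid once, which is why it pays off for inference workloads that reuse the same weights across many calls.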

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@pytorch-bot

pytorch-bot bot commented Dec 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115037

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit a97e61b with merge base 71bf4f3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions bot added the module: cpu (CPU specific problem, e.g., perf, algorithm), module: inductor, and ciflow/inductor labels on Dec 3, 2023
@snadampal
Collaborator Author

Hi @jgong5, @XiaobingSuper, @CaoE, can you please review this PR? It would be great if it could be merged for the PyTorch 2.2 release. Thank you!

@snadampal added the release notes: fx and release notes: cpp labels on Dec 3, 2023
@snadampal force-pushed the aarch64_torch_inductor branch from db59207 to cb44ee2 on December 3, 2023 at 20:59
@snadampal
Collaborator Author

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Dec 4, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: Approval needed from one of the following:
lc0, lyoka, nimin98, minjungkim85, yns88, ...

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@snadampal
Collaborator Author

Hi lc0, @lyoka, @nimin98, minjungkim85, @yns88, I would appreciate it if any of you could review this PR. Thank you!

@snadampal
Collaborator Author

By the way, I had already tested BERT sentiment analysis with torch.compile() and the results look correct on aarch64. My questions above were just to better understand the fx_pass and mkldnn rewrite behavior for non-fusion cases; I will dig into the code.

and this PR is ready for merge once approved by the module owners.

@snadampal force-pushed the aarch64_torch_inductor branch from cb44ee2 to fade410 on December 5, 2023 at 22:59
@snadampal
Collaborator Author

I have updated the PR to allow dynamic shapes on aarch64 even for fp32 inputs.
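Conceptually, pre-packing is compatible with dynamic shapes because the packed layout depends only on the weight's shape, never the input's. A hypothetical pure-Python sketch (not PyTorch or mkldnn API) of why one packed weight can serve inputs of varying batch size:

```python
def prepack(weight):
    # "Packing" here is just a transpose into column-major order;
    # real kernels use hardware-specific blocked layouts instead.
    return [list(col) for col in zip(*weight)]

def linear(x_batch, packed_t):
    # x_batch may have any number of rows: the packed weight is
    # independent of the batch dimension.
    return [[sum(xi * wij for xi, wij in zip(x, col)) for col in packed_t]
            for x in x_batch]

w = [[1.0, 0.0], [0.0, 2.0]]  # toy 2x2 weight matrix
pt = prepack(w)               # packed once, before any input is seen
for batch in ([[1.0, 1.0]],                       # batch size 1
              [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]):  # batch size 3
    y = linear(batch, pt)     # same packed weight for both shapes
```

Since only the input shape varies between calls, the pack-once invariant holds and no repacking is needed when the batch dimension is dynamic.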

@snadampal
Collaborator Author

Hi lc0, @lyoka, @nimin98, minjungkim85, @yns88, I would appreciate it if any of you could review this PR. Given the performance gains from torch.compile(), it would be really great if it could be merged into PyTorch 2.2. Thank you!

Contributor

@malfet left a comment


LGTM, but I wonder if is_mkldnn_acl_supported() is in any way fundamentally different from _is_mkldnn_bf16_supported()? If not, why not merge the two functions?

@snadampal
Collaborator Author

LGTM, but I wonder if is_mkldnn_acl_supported() is in any way fundamentally different from _is_mkldnn_bf16_supported()? If not, why not merge the two functions?

These two are different. For example, on Neoverse N1 we have ACL supported but not bf16, so we can still take advantage of weight pre-packing for fp32.
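As a hedged illustration of why the two predicates stay separate: the feature table and function names below are hypothetical, not real detection code, but they capture the point that ACL availability and bf16 support vary independently across CPUs.

```python
# Hypothetical per-CPU feature table; real code queries the hardware.
CPU_FEATURES = {
    "neoverse-n1": {"acl": True, "bf16": False},  # fp32 pre-packing only
    "neoverse-v1": {"acl": True, "bf16": True},   # both paths available
}

def is_acl_supported(cpu):
    return CPU_FEATURES.get(cpu, {}).get("acl", False)

def is_bf16_supported(cpu):
    return CPU_FEATURES.get(cpu, {}).get("bf16", False)

# On Neoverse N1, fp32 weight pre-packing applies even though bf16 does not,
# so collapsing the two checks into one would lose that case.
assert is_acl_supported("neoverse-n1") and not is_bf16_supported("neoverse-n1")
```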

@snadampal force-pushed the aarch64_torch_inductor branch from fade410 to a97e61b on December 6, 2023 at 21:43
@malfet
Contributor

malfet commented Dec 7, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


snadampal added a commit to snadampal/pytorch that referenced this pull request Dec 7, 2023

Pull Request resolved: pytorch#115037
Approved by: https://github.com/jgong5, https://github.com/malfet
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023

Labels

ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem, e.g., perf, algorithm), module: inductor, open source, release notes: cpp, release notes: fx


5 participants