
[inductor] enable mkldnn op weight pre-packing on aarch64#115037

Closed
snadampal wants to merge 1 commit into pytorch:main from snadampal:aarch64_torch_inductor

Conversation

@snadampal
Collaborator

@snadampal commented Dec 3, 2023

This PR enables the fx passes and mkldnn optimizations for aarch64. It improved BERT inference performance by up to 5.8x on an AWS c7g instance when comparing torch.compile() against the no-compile path. This is enabled when PyTorch is built with the USE_MKLDNN_ACL option for aarch64.
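For context, here is a minimal pure-Python sketch of the idea behind weight pre-packing. The function names and the blocked layout are illustrative only, not the mkldnn/ACL implementation: the point is that the weight is reordered once into a kernel-friendly layout ahead of time, then reused on every inference instead of being reordered per call.

```python
def prepack_weight(weight, block=4):
    """Reorder a row-major matrix into groups of `block` columns.

    Real mkldnn/ACL packing picks a hardware-specific blocked layout;
    this only illustrates the pack-once / reuse-many-times idea.
    """
    rows, cols = len(weight), len(weight[0])
    packed = []
    for c0 in range(0, cols, block):
        for r in range(rows):
            packed.extend(weight[r][c0:c0 + block])
    return packed

def linear_with_packed(x, packed, rows, cols, block=4):
    """Matrix-vector product that reads the weight in its packed order."""
    out = [0.0] * rows
    i = 0
    for c0 in range(0, cols, block):
        width = min(block, cols - c0)
        for r in range(rows):
            for j in range(width):
                out[r] += packed[i] * x[c0 + j]
                i += 1
    return out

w = [[1.0, 2.0], [3.0, 4.0]]  # toy 2x2 weight
packed = prepack_weight(w, block=2)  # done once, ahead of time
y = linear_with_packed([1.0, 1.0], packed, rows=2, cols=2, block=2)
# y == [3.0, 7.0], matching the plain matmul of w with [1.0, 1.0]
```

The compile-time cost of packing is paid once, which is why it pays off for inference workloads that reuse the same weights across many calls.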

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

@pytorch-bot

pytorch-bot bot commented Dec 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/115037

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit a97e61b with merge base 71bf4f3:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions bot added the module: cpu (CPU specific problem, e.g., perf, algorithm), module: inductor, and ciflow/inductor labels on Dec 3, 2023
@snadampal
Collaborator Author

Hi @jgong5, @XiaobingSuper, @CaoE, can you please review this PR? It would be great if it could be merged for the PyTorch 2.2 release. Thank you!

@snadampal added the release notes: fx and release notes: cpp labels on Dec 3, 2023
@snadampal force-pushed the aarch64_torch_inductor branch from db59207 to cb44ee2 on December 3, 2023 at 20:59
@snadampal
Collaborator Author

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Dec 4, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: Approval needed from one of the following:
lc0, lyoka, nimin98, minjungkim85, yns88, ...

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@snadampal
Collaborator Author

Hi lc0, @lyoka, @nimin98, minjungkim85, @yns88, I would appreciate it if any of you could review this PR. Thank you!

@snadampal
Collaborator Author

By the way, I had already tested BERT sentiment analysis with torch.compile() and the results look correct on aarch64. My questions above were just to better understand the fx_pass and mkldnn rewrite behavior for non-fusion cases; I will dig into the code.

and this PR is ready for merge once approved by the module owners.

@snadampal force-pushed the aarch64_torch_inductor branch from cb44ee2 to fade410 on December 5, 2023 at 22:59
@snadampal
Collaborator Author

I have updated the PR to allow dynamic shapes on aarch64 even for fp32 inputs.
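Conceptually, pre-packing is compatible with dynamic shapes because the packed layout depends only on the weight's shape, never the input's. A hypothetical pure-Python sketch (not PyTorch or mkldnn API) of why one packed weight can serve inputs of varying batch size:

```python
def prepack(weight):
    # "Packing" here is just a transpose into column-major order;
    # real kernels use hardware-specific blocked layouts instead.
    return [list(col) for col in zip(*weight)]

def linear(x_batch, packed_t):
    # x_batch may have any number of rows: the packed weight is
    # independent of the batch dimension.
    return [[sum(xi * wij for xi, wij in zip(x, col)) for col in packed_t]
            for x in x_batch]

w = [[1.0, 0.0], [0.0, 2.0]]  # toy 2x2 weight matrix
pt = prepack(w)               # packed once, before any input is seen
for batch in ([[1.0, 1.0]],                       # batch size 1
              [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]):  # batch size 3
    y = linear(batch, pt)     # same packed weight for both shapes
```

Since only the input shape varies between calls, the pack-once invariant holds and no repacking is needed when the batch dimension is dynamic.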

@snadampal
Collaborator Author

Hi lc0, @lyoka, @nimin98, minjungkim85, @yns88, I would appreciate it if any of you could review this PR. Given the performance gains from torch.compile(), it would be really great if it could be merged into PyTorch 2.2. Thank you!

Contributor

@malfet left a comment


LGTM, but I wonder if is_mkldnn_acl_supported() is in any way fundamentally different from _is_mkldnn_bf16_supported()? If not, why not merge the two functions?

@snadampal
Collaborator Author

LGTM, but I wonder if is_mkldnn_acl_supported() is in any way fundamentally different from _is_mkldnn_bf16_supported()? If not, why not merge the two functions?

These two are different. For example, on Neoverse N1 we have ACL supported but not bf16, so we can still take advantage of weight pre-packing for fp32.
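As a hedged illustration of why the two predicates stay separate: the feature table and function names below are hypothetical, not real detection code, but they capture the point that ACL availability and bf16 support vary independently across CPUs.

```python
# Hypothetical per-CPU feature table; real code queries the hardware.
CPU_FEATURES = {
    "neoverse-n1": {"acl": True, "bf16": False},  # fp32 pre-packing only
    "neoverse-v1": {"acl": True, "bf16": True},   # both paths available
}

def is_acl_supported(cpu):
    return CPU_FEATURES.get(cpu, {}).get("acl", False)

def is_bf16_supported(cpu):
    return CPU_FEATURES.get(cpu, {}).get("bf16", False)

# On Neoverse N1, fp32 weight pre-packing applies even though bf16 does not,
# so collapsing the two checks into one would lose that case.
assert is_acl_supported("neoverse-n1") and not is_bf16_supported("neoverse-n1")
```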

@snadampal force-pushed the aarch64_torch_inductor branch from fade410 to a97e61b on December 6, 2023 at 21:43
@malfet
Contributor

malfet commented Dec 7, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


snadampal added a commit to snadampal/pytorch that referenced this pull request Dec 7, 2023

Pull Request resolved: pytorch#115037
Approved by: https://github.com/jgong5, https://github.com/malfet
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023

Labels

ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: cpu (CPU specific problem, e.g., perf, algorithm), module: inductor, open source, release notes: cpp, release notes: fx


5 participants