[PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ by swolchok · Pull Request #137914 · pytorch/pytorch

swolchok · 2024-10-14T17:44:43Z

Stack from ghstack (oldest at bottom):

This is in preparation for supporting x86 as well; the kernels need to
live in this directory so that they get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). It also incidentally starts
fulfilling the request from @malfet to split the ARM64 fast-path code
into its own file. BFloat16 will follow in a later diff.
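To illustrate why the directory matters, here is a minimal, self-contained sketch of the "build one kernel several times, pick one at runtime" idea. It is not the ATen code: `CpuCapability`, `DEFINE_DOT_KERNEL`, the `*_cap` namespaces, and `select_dot_kernel` are hypothetical names, and plain `float` stands in for FP16. A macro-stamped namespace plays the role of one source file compiled once per CPU_CAPABILITY setting.

```cpp
#include <cstddef>

// Hypothetical stand-in for the capability levels named in the description.
enum class CpuCapability { Default, Avx2, Avx512 };

using DotFn = float (*)(const float*, const float*, std::size_t);

// In a multi-capability build the same source is compiled once per setting,
// each time with different compiler flags. Here every "copy" is simply a
// namespace holding the same scalar body (an assumption made for brevity).
#define DEFINE_DOT_KERNEL(ns)                                         \
  namespace ns {                                                      \
  inline float dot(const float* a, const float* b, std::size_t n) {   \
    float sum = 0.0f;                                                 \
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];           \
    return sum;                                                       \
  }                                                                   \
  }

DEFINE_DOT_KERNEL(default_cap)
DEFINE_DOT_KERNEL(avx2_cap)
DEFINE_DOT_KERNEL(avx512_cap)

// Return the kernel copy that matches the capability detected at runtime.
inline DotFn select_dot_kernel(CpuCapability cap) {
  switch (cap) {
    case CpuCapability::Avx512:
      return &avx512_cap::dot;
    case CpuCapability::Avx2:
      return &avx2_cap::dot;
    case CpuCapability::Default:
      break;
  }
  return &default_cap::dot;
}
```

A caller would fetch the pointer once, e.g. `select_dot_kernel(CpuCapability::Avx2)(a, b, n)`, and reuse it for every call.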

Differential Revision: D64265755

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10



pytorch-bot · 2024-10-14T17:44:46Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137914


✅ No Failures

As of commit a49029a with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-14T17:45:29Z

This pull request was exported from Phabricator. Differential Revision: D64265755

facebook-github-bot · 2024-10-15T18:30:55Z

This pull request was exported from Phabricator. Differential Revision: D64265755

malfet

LGTM, but it would be good to have a test plan of sorts, e.g. run a benchmark and make sure the results are the same.
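As a rough illustration of the "results are the same" half of such a test plan, the sketch below compares a reordered GEMV against a straightforward reference within a tolerance. Everything here is hypothetical: `gemv_reference`, `gemv_candidate`, and `all_close` are illustrative names, `float` stands in for FP16, and the candidate is just a four-accumulator loop, not the moved kernel. Timing is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference: y[i] = sum_j a[i * n + j] * x[j], row-major, one accumulator.
inline std::vector<float> gemv_reference(const std::vector<float>& a,
                                         const std::vector<float>& x,
                                         std::size_t m, std::size_t n) {
  std::vector<float> y(m, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) y[i] += a[i * n + j] * x[j];
  }
  return y;
}

// Candidate: same math, but four partial sums per row plus a scalar tail,
// so the summation order differs from the reference.
inline std::vector<float> gemv_candidate(const std::vector<float>& a,
                                         const std::vector<float>& x,
                                         std::size_t m, std::size_t n) {
  std::vector<float> y(m, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    const float* row = a.data() + i * n;
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {
      for (std::size_t k = 0; k < 4; ++k) acc[k] += row[j + k] * x[j + k];
    }
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; j < n; ++j) sum += row[j] * x[j];
    y[i] = sum;
  }
  return y;
}

// Elementwise |got - want| <= atol + rtol * |want|.
inline bool all_close(const std::vector<float>& got,
                      const std::vector<float>& want, float rtol, float atol) {
  if (got.size() != want.size()) return false;
  for (std::size_t i = 0; i < got.size(); ++i) {
    if (std::fabs(got[i] - want[i]) > atol + rtol * std::fabs(want[i])) {
      return false;
    }
  }
  return true;
}
```

A tolerance is used rather than exact equality because reordering floating-point additions can change the last bits of the result.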

facebook-github-bot · 2024-10-17T22:34:10Z

This pull request was exported from Phabricator. Differential Revision: D64265755

facebook-github-bot · 2024-10-22T16:39:07Z

This pull request was exported from Phabricator. Differential Revision: D64265755

facebook-github-bot · 2024-10-24T17:56:04Z

This pull request was exported from Phabricator. Differential Revision: D64265755

swolchok · 2024-10-25T17:49:00Z

@pytorchbot merge

pytorchmergebot · 2024-10-25T17:50:46Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-25T19:10:54Z

Merge failed

Reason: Approvers from one of the following sets are needed:

superuser (pytorch/metamates)
Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

malfet

ReducedPrecisionFloatGemvFastPathKernel.cpp feels a bit long; how about Float16Gemv.cpp?

facebook-github-bot · 2024-10-28T17:21:20Z

This pull request was exported from Phabricator. Differential Revision: D64265755

facebook-github-bot · 2024-10-29T17:47:23Z

This pull request was exported from Phabricator. Differential Revision: D64265755

In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.) Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/) Pull Request resolved: #137915 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914
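One way the register counts above can feed into a kernel is as a compile-time unroll factor. The sketch below is an assumption about the general technique, not the code from #137915: `kVectorRegisters`, `kAccumulators`, and `dot_unrolled` are hypothetical names, each scalar accumulator stands in for one vector register, and reserving half the registers for operands is an arbitrary illustrative choice.

```cpp
#include <cstddef>

// Register budget by instruction set: 32 for AArch64 NEON and AVX-512,
// otherwise 16 (the AVX/AVX2 figure; treating it as the fallback for every
// other target is an assumption of this sketch).
#if defined(__aarch64__) || defined(__AVX512F__)
constexpr std::size_t kVectorRegisters = 32;
#else
constexpr std::size_t kVectorRegisters = 16;
#endif

// Hypothetical policy: half the registers hold accumulators, the rest are
// left for loaded operands and temporaries.
constexpr std::size_t kAccumulators = kVectorRegisters / 2;

// Dot product with NumAcc independent partial sums.
template <std::size_t NumAcc>
float dot_unrolled(const float* a, const float* b, std::size_t n) {
  float acc[NumAcc] = {};
  std::size_t i = 0;
  for (; i + NumAcc <= n; i += NumAcc) {
    for (std::size_t k = 0; k < NumAcc; ++k) acc[k] += a[i + k] * b[i + k];
  }
  float sum = 0.0f;
  for (std::size_t k = 0; k < NumAcc; ++k) sum += acc[k];
  for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
  return sum;
}

// Entry point that picks the unroll factor from the register budget.
inline float dot(const float* a, const float* b, std::size_t n) {
  return dot_unrolled<kAccumulators>(a, b, n);
}
```

Keeping the unroll factor in one constant means a port to a 16-register target changes a single number rather than the loop body.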

…whole vector register instead of half (#137916) The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler. Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/) Pull Request resolved: #137916 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
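The "last 7 elements" remark fits a main loop that consumes whole 8-lane vectors (a 128-bit register holds eight FP16 values) and hands whatever is left to a plain scalar loop. The sketch below shows that shape under stated assumptions: `kLanes`, `fixup_count`, and `dot_with_scalar_fixup` are hypothetical names, `float` stands in for FP16, and an array of eight scalars stands in for one vector register.

```cpp
#include <cstddef>

// Eight lanes per vector, matching eight 16-bit values in a 128-bit register.
constexpr std::size_t kLanes = 8;

// Number of trailing elements left for the scalar fixup loop (0 to 7).
inline std::size_t fixup_count(std::size_t n) { return n % kLanes; }

inline float dot_with_scalar_fixup(const float* a, const float* b,
                                   std::size_t n) {
  float lanes[kLanes] = {};
  const std::size_t vec_end = n - fixup_count(n);

  // Main loop: only whole vectors, never a partial one.
  for (std::size_t i = 0; i < vec_end; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      lanes[lane] += a[i + lane] * b[i + lane];
    }
  }

  // Horizontal reduction of the per-lane partial sums.
  float sum = 0.0f;
  for (std::size_t lane = 0; lane < kLanes; ++lane) sum += lanes[lane];

  // Fixup loop: at most kLanes - 1 = 7 elements, handled without vectors.
  for (std::size_t i = vec_end; i < n; ++i) sum += a[i] * b[i];
  return sum;
}
```

Because the tail never touches a half-width vector, nothing in the loop depends on an instruction set offering a half-register form.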

…s for non-ARM architectures too (#137917) Remove reasons to gate it on ARM. Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/) Pull Request resolved: #137917 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916


…pu/ (pytorch#137914) This is in preparation for supporting x86 as well; we need to be in this directory so that we can get rebuilt with different CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts fulfilling request from @malfet to split the ARM64 fast path stuff into its own file. BFloat16 will be in a later diff. Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/) Pull Request resolved: pytorch#137914 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 14, 2024

facebook-github-bot added the fb-exported label Oct 14, 2024

Skylion007 approved these changes Oct 14, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 14, 2024

swolchok mentioned this pull request Oct 15, 2024

[PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures #138005

Closed

malfet approved these changes Oct 15, 2024

swolchok mentioned this pull request Oct 17, 2024

[PyTorch] Support non-zero beta in fp16_gemv_trans #138275

Closed

This was referenced Oct 22, 2024

[PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where #138486

Closed

[PyTorch] Fix inductor bug with unrolled vectorized prod #138542

Closed

swolchok added the topic: not user facing topic category label Oct 22, 2024

pytorchmergebot added the merging label Oct 25, 2024

pytorchmergebot removed the merging label Oct 25, 2024

malfet reviewed Oct 26, 2024

This was referenced Oct 29, 2024

Build bf16 gemv fast path & entry points for non-ARM architectures too #139208

Closed

Hook up bf16_gemv_trans to x86 bf16 GEMM #139220

Closed

pytorchmergebot added the Merged label Oct 29, 2024

pytorchmergebot closed this in aafbea4 Oct 29, 2024

github-actions bot deleted the gh/swolchok/662/head branch November 29, 2024 02:13
