
[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized#137912

Closed
swolchok wants to merge 9 commits into gh/swolchok/660/base from gh/swolchok/660/head

Conversation

@swolchok
Contributor

swolchok commented Oct 14, 2024

Stack from ghstack (oldest at bottom):

Migrated as much of the kernel as possible and convenient, focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512 FP16/BF16 support, to fix
pytorch/torchchat#1253.)

Differential Revision: D64218206

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot Bot commented Oct 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137912

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c8f47af with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64218206

swolchok added a commit that referenced this pull request Oct 29, 2024
[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized

Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…pu/ (#137914)

This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). This also incidentally starts
fulfilling @malfet's request to split the ARM64 fast path code
into its own file. BFloat16 will be handled in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: #137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: #137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…whole vector register instead of half (#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: #137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…s for non-ARM architectures too (#137917)

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: #137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
@malfet added the ciflow/linux-aarch64 (linux aarch64 CI workflow) label and removed the Merged label on Oct 29, 2024
swolchok added a commit that referenced this pull request Oct 31, 2024
[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized


Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).

Testing: checked for regression with llm_experiments' benchmarks/benchmark_torch_mm.py llm on an M1 Mac; it appeared to be performance-neutral. Supported this assessment by inspecting the assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt` from the pytorch root directory after `python setup.py develop`); observed minor instruction-scheduling changes but nothing more.

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized (pytorch#137912)

Migrated as much as possible and convenient; focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512fp16/bf16 to fix
pytorch/torchchat#1253 .)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: pytorch#137912
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911
@github-actions github-actions Bot deleted the gh/swolchok/660/head branch November 29, 2024 02:13

Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow), ciflow/trunk (trigger trunk jobs on your pull request), fb-exported, module: cpu (CPU specific problem, e.g., perf, algorithm), topic: not user facing
