[PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized#137912
Closed
swolchok wants to merge 9 commits into gh/swolchok/660/base from …
Conversation
Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized

Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512 fp16/bf16 support, to fix pytorch/torchchat#1253.)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

[ghstack-poisoned]
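For context, a minimal sketch of what this migration looks like, not the PR's actual kernel: a dot product written against ATen's architecture-neutral `at::vec::Vectorized` API instead of hand-written NEON intrinsics. The real kernel loads `c10::Half` and accumulates in fp32 with deeper unrolling; the function name and the plain-float element type here are illustrative only.

```cpp
// Sketch only: the same source compiles for NEON, AVX2, AVX-512, or the
// scalar fallback, depending on build flags.
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

float dot_sketch(const float* a, const float* b, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  // Main loop: Vec::size() lanes per step, whatever the target ISA is.
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = at::vec::fmadd(Vec::loadu(a + i), Vec::loadu(b + i), acc);
  }
  // Horizontal reduction of the accumulator lanes.
  float buf[Vec::size()];
  acc.store(buf);
  float result = 0.0f;
  for (int64_t k = 0; k < Vec::size(); ++k) {
    result += buf[k];
  }
  // Scalar fixup for the remaining n % Vec::size() elements.
  for (; i < n; ++i) {
    result += a[i] * b[i];
  }
  return result;
}
```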
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137912
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c8f47af with merge base b9618c9.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Contributor
This pull request was exported from Phabricator. Differential Revision: D64218206
This was referenced Oct 14, 2024
This was referenced Oct 22, 2024
swolchok added a commit that referenced this pull request on Oct 29, 2024

…el from intrinsics to vec::Vectorized"

Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request on Oct 29, 2024

…137913)

float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)
Pull Request resolved: #137913
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912
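A small hedged illustration of the point (the function name is made up): `c10::Half` is a portable 16-bit float type, while ARM's `float16_t` only exists under `<arm_neon.h>`/ACLE, so code written against it cannot build for x86.

```cpp
// Sketch only: c10::Half compiles on every platform PyTorch targets, whereas
// float16_t would tie this loop to ARM builds.
#include <c10/util/Half.h>
#include <cstdint>

float sum_fp16(const c10::Half* x, int64_t n) {
  float acc = 0.0f;
  for (int64_t i = 0; i < n; ++i) {
    acc += static_cast<float>(x[i]);  // Half -> float works on any arch
  }
  return acc;
}
```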
pytorchmergebot pushed a commit that referenced this pull request on Oct 29, 2024

…pu/ (#137914)

This is in preparation for supporting x86 as well; we need to be in this directory so that we can get rebuilt with different CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts fulfilling a request from @malfet to split the ARM64 fast path stuff into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)
Pull Request resolved: #137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
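For readers unfamiliar with the build setup, a hedged sketch of ATen's DispatchStub pattern that the move enables; the stub and kernel names below are illustrative, not the PR's actual symbols. Files under aten/src/ATen/native/cpu/ are compiled once per CPU_CAPABILITY value, and each compilation registers its variant so the best one is picked at runtime.

```cpp
// Sketch of the DispatchStub pattern (names are hypothetical). A file in
// aten/src/ATen/native/cpu/ is compiled several times, once per capability
// (DEFAULT, AVX2, AVX512, ...), each time with different ISA compiler flags.
#include <ATen/native/DispatchStub.h>
#include <cstdint>

namespace at::native {

using fp16_dot_fn = float (*)(const void* a, const void* b, int64_t n);
DECLARE_DISPATCH(fp16_dot_fn, fp16_dot_stub);  // normally in a shared header
DEFINE_DISPATCH(fp16_dot_stub);                // normally in a non-cpu/ file

inline namespace CPU_CAPABILITY {
float fp16_dot_kernel(const void* a, const void* b, int64_t n) {
  // vec::Vectorized code here gets compiled with this capability's ISA flags.
  return 0.0f;  // placeholder body
}
} // namespace CPU_CAPABILITY

// Expands to a registration for whichever capability this TU was built as.
REGISTER_DISPATCH(fp16_dot_stub, &fp16_dot_kernel);

} // namespace at::native
```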
pytorchmergebot pushed a commit that referenced this pull request on Oct 29, 2024

In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)
Pull Request resolved: #137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914
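A hedged sketch of the kind of parameterization this refers to; the constant name and values are illustrative, not copied from the PR. The idea is to pick an unroll depth that keeps every accumulator resident in a vector register on the target ISA.

```cpp
#include <cstdint>

// Unroll depth chosen by register budget (name/values are hypothetical):
// NEON and AVX-512 expose 32 vector registers, so a deep unroll keeps all
// accumulators in registers; AVX/AVX2 have only 16, so halve the depth to
// avoid spilling accumulators to the stack.
#if defined(__aarch64__) || defined(CPU_CAPABILITY_AVX512)
constexpr int kAccumulatorsPerIteration = 16;
#else
constexpr int kAccumulatorsPerIteration = 8;
#endif
```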
pytorchmergebot pushed a commit that referenced this pull request on Oct 29, 2024

…whole vector register instead of half (#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)
Pull Request resolved: #137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
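To make the tradeoff concrete, a hedged before/after sketch, assuming an 8-lane fp16 main loop; both function names are made up. The "before" tail squeezes one half-width NEON operation out of the last few elements and therefore only compiles on ARM; the "after" tail is a scalar loop over at most 7 elements, which ports to x86 unchanged.

```cpp
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>
// Before (ARM-only): use a half-width 4-lane vector for part of the tail.
inline float tail_before(const __fp16* a, const __fp16* b,
                         int64_t i, int64_t n, float acc) {
  if (n - i >= 4) {
    float32x4_t prod = vmulq_f32(vcvt_f32_f16(vld1_f16(a + i)),
                                 vcvt_f32_f16(vld1_f16(b + i)));
    acc += vaddvq_f32(prod);  // horizontal add across the 4 lanes
    i += 4;
  }
  for (; i < n; ++i) acc += float(a[i]) * float(b[i]);
  return acc;
}
#endif

// After (portable): a plain scalar loop over at most 7 leftover elements.
template <typename T>
inline float tail_after(const T* a, const T* b,
                        int64_t i, int64_t n, float acc) {
  for (; i < n; ++i) acc += float(a[i]) * float(b[i]);
  return acc;
}
```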
pytorchmergebot pushed a commit that referenced this pull request on Oct 29, 2024

…s for non-ARM architectures too (#137917)

Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)
Pull Request resolved: #137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
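The shape of that change, sketched with a hypothetical guard; the actual preprocessor condition in the source may differ.

```cpp
// Hypothetical before/after of removing the ARM gate; not the PR's exact code.
// Before: the fast path was compiled only into ARM builds.
//
//   #if defined(__aarch64__) && !defined(C10_MOBILE)
//     // ... fp16 gemv fast path ...
//   #endif
//
// After #137917 the ARM check is gone, so any architecture whose build
// registers a kernel (e.g. x86 with AVX2/AVX-512) can take the fast path.
```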
swolchok added a commit that referenced this pull request on Oct 31, 2024

…el from intrinsics to vec::Vectorized"

Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
swolchok added a commit that referenced this pull request on Oct 31, 2024

…el from intrinsics to vec::Vectorized"

Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, and in particular on x86 for machines without AVX512BF16).

Testing: checked for regression with llm_experiments' benchmarks/benchmark_torch_mm.py llm on an M1 Mac and it appeared to be neutral. Supported this assessment by inspecting the assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt` from the pytorch root directory after `python setup.py develop`); observed minor instruction scheduling changes but nothing more.

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request on Nov 5, 2024

…Vectorized (pytorch#137912)

Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512 fp16/bf16 support, to fix pytorch/torchchat#1253.)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)
Pull Request resolved: pytorch#137912
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911
Stack from ghstack (oldest at bottom):
Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512 fp16/bf16 support, to fix pytorch/torchchat#1253.)
Differential Revision: D64218206
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10