[PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available #137911
swolchok wants to merge 9 commits into gh/swolchok/659/base
Conversation
Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available. We can do most of what this header does (by line count) anyway by converting to and from float. Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/) [ghstack-poisoned]
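For readers skimming the stack, the trick being referenced is worth spelling out: AArch64 guarantees fp16/fp32 *conversion* instructions (FCVTL/FCVTN) even when the fp16 *arithmetic* extension (`__ARM_FEATURE_FP16_VECTOR_ARITHMETIC`) is absent, so a `Vectorized<Half>` op can widen to float, compute, and narrow back. A minimal sketch of that pattern for an 8-lane fp16 add (the helper name `add_half8` is illustrative, not from this PR):

```cpp
#include <arm_neon.h>

// Hedged sketch (not the PR's actual code): widen fp16 lanes to fp32, add,
// and narrow back. vcvt_f32_f16 / vcvt_f16_f32 are baseline AArch64 NEON,
// so this compiles without the fp16 arithmetic extension.
static inline float16x8_t add_half8(float16x8_t a, float16x8_t b) {
  float32x4_t lo = vaddq_f32(vcvt_f32_f16(vget_low_f16(a)),
                             vcvt_f32_f16(vget_low_f16(b)));
  float32x4_t hi = vaddq_f32(vcvt_f32_f16(vget_high_f16(a)),
                             vcvt_f32_f16(vget_high_f16(b)));
  return vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi));
}
```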
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137911
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 6c24f9c with merge base b9618c9. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D64265757
(it's hard to succinctly explain the benefits of this one, so I gave it a "not user facing" label for release notes)
this failure doesn't seem to repro locally on a Linux machine, nor does it pass the sniff test: this diff only affects ARM, and the failure is on Windows...
…Vectorized (#137912) Migrated as much as was possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512 fp16/bf16 support, to fix pytorch/torchchat#1253.) Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/) Pull Request resolved: #137912 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911
…137913) float16_t is ARM-specific. Half is not. Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/) Pull Request resolved: #137913 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912
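To illustrate the portability point with a hypothetical helper (not code from this PR): `float16_t` comes from `<arm_neon.h>` and only exists on ARM targets, while `c10::Half` is defined by `c10/util/Half.h` on every platform, so signatures written against `Half` compile for x86 as well.

```cpp
#include <c10/util/Half.h>  // portable 16-bit float; available on x86 and ARM

// Hypothetical helper: a float16_t signature would only compile where
// <arm_neon.h> provides that type, but c10::Half works everywhere.
c10::Half halve(c10::Half x) {
  // c10::Half converts to/from float, so the arithmetic happens in fp32.
  return static_cast<c10::Half>(static_cast<float>(x) * 0.5f);
}
```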
…pu/ (#137914) This is in preparation for supporting x86 as well; we need to be in this directory so that we can get rebuilt with different CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts fulfilling a request from @malfet to split the ARM64 fast path stuff into its own file. BFloat16 will be in a later diff. Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/) Pull Request resolved: #137914 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913
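A rough sketch of why the directory matters (simplified; the kernel name below is illustrative): the build compiles each source under `aten/src/ATen/native/cpu/` once per capability, defining `CPU_CAPABILITY` differently each time and placing the code in a distinct inline namespace, which is why the build errors later in this thread mention symbols like `at::vec::SVE256::Vectorized`.

```cpp
// Simplified sketch of the per-capability build pattern for files in
// aten/src/ATen/native/cpu/. The build recompiles this source with
// CPU_CAPABILITY defined as DEFAULT, AVX2, AVX512, SVE256, ... and with
// matching -m compiler flags, so each copy can be vectorized differently.
namespace at::native {
inline namespace CPU_CAPABILITY {

// Illustrative kernel: each per-ISA compile emits its own version of this.
void fp16_fast_path_example(const float* x, float* y, long n) {
  for (long i = 0; i < n; ++i) {
    y[i] += x[i];
  }
}

} // namespace CPU_CAPABILITY
} // namespace at::native
```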
In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.) Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/) Pull Request resolved: #137915 Approved by: https://github.com/Skylion007, https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914
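As a concrete (and purely illustrative) version of that register-count reasoning, a kernel's unroll factor can be derived from the ISA's vector register file rather than hardwired for NEON; the constant names here are assumptions, not the PR's code:

```cpp
// Illustrative only: pick an unroll factor from the ISA's vector register
// count so one kernel template can serve 32-register ISAs (NEON, AVX-512)
// and 16-register ones (AVX/AVX2) without spilling.
#if defined(CPU_CAPABILITY_AVX2)
constexpr int kVectorRegisters = 16;
#else
constexpr int kVectorRegisters = 32;  // NEON and AVX-512
#endif
// Reserve some registers for loop-carried accumulators and temporaries.
constexpr int kMaxUnroll = kVectorRegisters / 2;
```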
…whole vector register instead of half (#137916) The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler. Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/) Pull Request resolved: #137916 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
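Sketch of the tail-handling strategy under assumed shapes (not the PR's exact code): the main loop consumes full 8-lane fp16 vectors, and the at-most-7 leftover elements are finished scalar, which is the part that ports to x86 without masked or half-register loads:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Assumed example kernel: scale an fp16 array by s. Full 8-lane vectors in
// the main loop; the trailing elements are handled in a scalar fixup loop.
void scale_fp16(const float16_t* x, float16_t* y, float s, int64_t n) {
  int64_t i = 0;
  for (; i + 8 <= n; i += 8) {
    float16x8_t v = vld1q_f16(x + i);
    float32x4_t lo = vmulq_n_f32(vcvt_f32_f16(vget_low_f16(v)), s);
    float32x4_t hi = vmulq_n_f32(vcvt_f32_f16(vget_high_f16(v)), s);
    vst1q_f16(y + i, vcombine_f16(vcvt_f16_f32(lo), vcvt_f16_f32(hi)));
  }
  for (; i < n; ++i) {  // scalar tail: at most 7 iterations
    y[i] = (float16_t)((float)x[i] * s);
  }
}
```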
…s for non-ARM architectures too (#137917) Remove reasons to gate it on ARM. Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/) Pull Request resolved: #137917 Approved by: https://github.com/malfet ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
`mask` is already defined as `uint16x8_t`, so there is no need to reinterpret it: https://github.com/pytorch/pytorch/blob/bd369bb18258fc3be5ee91f8fcaf06a4b6fc41a7/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h#L220

Fixes

```
/var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h: In static member function 'static at::vec::DEFAULT::Vectorized<c10::Half> at::vec::DEFAULT::Vectorized<c10::Half>::set(const at::vec::DEFAULT::Vectorized<c10::Half>&, const at::vec::DEFAULT::Vectorized<c10::Half>&, int64_t)':
/var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h:227:39: error: cannot convert 'uint16x8_t' to 'float16x8_t'
  227 |           vreinterpretq_u16_f16(mask),
      |                                 ^~~~
      |                                 |
      |                                 uint16x8_t
In file included from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/intrinsics.h:23,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec.h:6,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.h:2,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:5841:36: note: initializing argument 1 of 'uint16x8_t vreinterpretq_u16_f16(float16x8_t)'
 5841 | vreinterpretq_u16_f16 (float16x8_t __a)
      |                        ~~~~~~~~~~~~^~~
```

introduced by #137911.

Also, guard any use of NEON intrinsics in `ReducedPrecisionFloatGemvFastPathKernel.cpp` with `!defined(CPU_CAPABILITY_SVE)`; otherwise compilation fails with

```
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::VectorizedN<c10::Half, 16>&)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:77:24: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
   77 |   return vaddvq_f32(t0 + t1);
      |                     ~~~^~~~
      |                        |
      |                        at::vec::SVE256::Vectorized<float>
In file included from /var/lib/jenkins/workspace/c10/util/Half.h:51,
                 from /var/lib/jenkins/workspace/c10/util/Float8_e5m2.h:17,
                 from /var/lib/jenkins/workspace/c10/core/ScalarType.h:8,
                 from /var/lib/jenkins/workspace/c10/core/TensorImpl.h:11,
                 from /var/lib/jenkins/workspace/c10/core/GeneratorImpl.h:8,
                 from /var/lib/jenkins/workspace/aten/src/ATen/core/Generator.h:18,
                 from /var/lib/jenkins/workspace/aten/src/ATen/CPUGeneratorImpl.h:3,
                 from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:2,
                 from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:10423:25: note: initializing argument 1 of 'float32_t vaddvq_f32(float32x4_t)'
10423 | vaddvq_f32 (float32x4_t __a)
      |             ~~~~~~~~~~~~^~~
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::Vectorized<float>)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:119:21: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
  119 |   return vaddvq_f32(x);
      |                     ^
      |                     |
      |                     at::vec::SVE256::Vectorized<float>
```

Pull Request resolved: #139235
Approved by: https://github.com/huydhn
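In a nutshell, the type error above comes from reinterpreting something that is already `uint16x8_t`. A minimal sketch of the corrected shape of such a blend (intrinsic choice illustrative, not the exact PyTorch code):

```cpp
#include <arm_neon.h>

// Minimal sketch of a NEON fp16 blend: the mask is already uint16x8_t and
// feeds vbslq_u16 directly; only the float16x8_t operands need
// reinterpreting to uint16x8_t and back.
float16x8_t blend_f16(uint16x8_t mask, float16x8_t a, float16x8_t b) {
  return vreinterpretq_f16_u16(vbslq_u16(
      mask,                        // no vreinterpretq_u16_f16(mask) here
      vreinterpretq_u16_f16(a),    // lanes taken where mask bits are set
      vreinterpretq_u16_f16(b)));  // lanes taken where mask bits are clear
}
```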
Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available (pytorch#137911) We can do most of what this header does (by line count) anyway by converting to and from float. Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/) Pull Request resolved: pytorch#137911 Approved by: https://github.com/jgong5, https://github.com/malfet ghstack dependencies: pytorch#137661
Stack from ghstack (oldest at bottom):
We can do most of what this header does (by line count) anyway by converting to and from float.
Differential Revision: D64265757
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10