[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert (#137661)
swolchok wants to merge 15 commits into gh/swolchok/651/base
Conversation
…vec128_convert
NEON vectors are 128-bit and don't belong with 256 stuff.
Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137661
✅ No failures as of commit 59a2c71 with merge base b9618c9. (This comment was generated by Dr. CI and updates every 15 minutes.)
This pull request was exported from Phabricator. Differential Revision: D64143615
…c isn't available (#137911)
We can do most of what this header does (by line count) anyway by converting to and from float.
Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/). Pull Request resolved: #137911. Approved by: https://github.com/jgong5, https://github.com/malfet. ghstack dependencies: #137661
…Vectorized (#137912)
Migrated as much as possible and convenient; focusing on fp16 for now. (This is building toward enabling these fast paths on x86 for machines without AVX-512 fp16/bf16 support, to fix pytorch/torchchat#1253.)
Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/). Pull Request resolved: #137912. Approved by: https://github.com/malfet. ghstack dependencies: #137661, #137911
…137913)
float16_t is ARM-specific. Half is not.
Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/). Pull Request resolved: #137913. Approved by: https://github.com/Skylion007, https://github.com/malfet. ghstack dependencies: #137661, #137911, #137912
…pu/ (#137914)
This is in preparation for supporting x86 as well; we need to be in this directory so that the code can be rebuilt with different CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts fulfilling a request from @malfet to split the ARM64 fast-path code into its own file. BFloat16 will be in a later diff.
Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/). Pull Request resolved: #137914. Approved by: https://github.com/Skylion007, https://github.com/malfet. ghstack dependencies: #137661, #137911, #137912, #137913
In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)
Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/). Pull Request resolved: #137915. Approved by: https://github.com/Skylion007, https://github.com/malfet. ghstack dependencies: #137661, #137911, #137912, #137913, #137914
…whole vector register instead of half (#137916)
The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.
Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/). Pull Request resolved: #137916. Approved by: https://github.com/malfet. ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
…s for non-ARM architectures too (#137917)
Remove reasons to gate it on ARM.
Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/). Pull Request resolved: #137917. Approved by: https://github.com/malfet. ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
…vec128_convert (pytorch#137661)
NEON vectors are 128-bit and don't belong with 256 stuff.
Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/). Pull Request resolved: pytorch#137661. Approved by: https://github.com/jgong5, https://github.com/malfet
Stack from ghstack (oldest at bottom):
NEON vectors are 128-bit and don't belong with 256 stuff.
Differential Revision: D64143615
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10