[PyTorch] Add efficient isnan for NEON half #139083
swolchok wants to merge 8 commits into gh/swolchok/679/base
Conversation
Same as the efficient isnan for float; used when f16 hardware support is available. Differential Revision: [D65003321](https://our.internmc.facebook.com/intern/diff/D65003321/)
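For context, the underlying trick is the classic self-comparison NaN test, applied to 8 half lanes at a time. A minimal sketch of the idea (assuming ARMv8.2-A FP16 vector arithmetic; `isnan_f16` is an illustrative name, not the verbatim PyTorch source):

```cpp
// Hedged sketch, not the verbatim PyTorch source: one efficient way to
// test 8 half-precision lanes for NaN on ARMv8.2-A with FP16 vector
// arithmetic. NaN is the only value that compares unequal to itself.
#include <arm_neon.h>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
uint16x8_t isnan_f16(float16x8_t v) {
  // vceqq_f16 sets a lane to 0xFFFF where v == v (i.e. not NaN);
  // inverting with vmvnq_u16 leaves 0xFFFF exactly in the NaN lanes.
  return vmvnq_u16(vceqq_f16(v, v));
}
#endif
```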
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139083
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 2d1d999 with merge base 86602a6. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D65003321
I have fixes for this but don't want to re-kick CI on ready-to-go diffs below it in the stack...
@pytorchbot merge -f "Lint + builds + relevant tests are green"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Same as the efficient one for float when f16 hardware support is available. Testing: Added exhaustive isnan test coverage. Differential Revision: [D65003321](https://our.internmc.facebook.com/intern/diff/D65003321/) Pull Request resolved: pytorch#139083 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139082
This is the first big milestone we've been building towards! (Following rev also hooks this up to actual gemv.) Testing: To check perf, I ran `python torchchat.py generate stories110M --dtype fp16 --device cpu` on an x86 machine without AVX512FP16. Observed roughly 5x tokens/sec increase. Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/) **NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)! Pull Request resolved: pytorch#137918 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139082, pytorch#139083
…rchitectures (pytorch#138005) Following up on previous rev to use fp16_gemv_trans in gemv, not just gemm-used-for-gemv. Differential Revision: [D64351092](https://our.internmc.facebook.com/intern/diff/D64351092/) Pull Request resolved: pytorch#138005 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139082, pytorch#139083, pytorch#137918
No real reason to have the zero-beta restriction, so let's lift it. Testing: intentionally broke new paths locally to verify test coverage existed. Differential Revision: [D64407752](https://our.internmc.facebook.com/intern/diff/D64407752/) Pull Request resolved: pytorch#138275 Approved by: https://github.com/malfet ghstack dependencies: pytorch#139082, pytorch#139083, pytorch#137918, pytorch#138005
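For readers unfamiliar with the restriction being lifted in pytorch#138275: a gemv fast path that requires beta == 0 only has to store alpha * A * x, while the general BLAS update is y = alpha * A * x + beta * y. A hedged scalar sketch of the difference (illustrative names and layout, assuming an ARM toolchain where `__fp16` is available; not the actual PyTorch kernel):

```cpp
// Hedged sketch, not the actual PyTorch kernel: what lifting a beta == 0
// restriction means for a gemv fast path. Assumes an ARM toolchain where
// the __fp16 type is available; names and layout are illustrative.
#include <cstdint>

void gemv_fp16_sketch(int64_t m, int64_t n, float alpha, const __fp16* a,
                      const __fp16* x, float beta, float* y) {
  for (int64_t i = 0; i < m; ++i) {
    float acc = 0.f;  // accumulate in float for accuracy
    for (int64_t j = 0; j < n; ++j) {
      acc += static_cast<float>(a[i * n + j]) * static_cast<float>(x[j]);
    }
    // A beta == 0 fast path may simply store alpha * acc; per BLAS
    // convention it must also ignore y's prior contents entirely (they
    // may be uninitialized or NaN), so the general case is not just
    // "multiply by zero".
    y[i] = (beta == 0.f) ? alpha * acc : alpha * acc + beta * y[i];
  }
}
```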
Stack from ghstack (oldest at bottom):
Same as the efficient isnan for float; used when f16 hardware support is available.
Testing: Added exhaustive isnan test coverage (fp16 has only 2^16 bit patterns, so checking every input is practical; see the sketch after this list)
Differential Revision: D65003321
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
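On the testing note above: fp16 makes exhaustive coverage practical because there are only 2^16 bit patterns. A hedged sketch of what such a test could look like (names are illustrative, not the actual PyTorch test; pairs with the `isnan_f16` sketch near the top):

```cpp
// Hedged sketch of an exhaustive fp16 isnan test, not the actual PyTorch
// test: walk all 2^16 bit patterns 8 lanes at a time and compare the
// vector result against the IEEE binary16 definition of NaN.
#include <arm_neon.h>
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
uint16x8_t isnan_f16(float16x8_t v) {  // sketch from the description above
  return vmvnq_u16(vceqq_f16(v, v));
}

int main() {
  for (uint32_t base = 0; base < 0x10000; base += 8) {
    uint16_t bits[8], mask[8];
    for (int i = 0; i < 8; ++i) bits[i] = static_cast<uint16_t>(base + i);
    float16x8_t v;
    std::memcpy(&v, bits, sizeof(v));  // reinterpret the raw bit patterns
    uint16x8_t m = isnan_f16(v);
    std::memcpy(mask, &m, sizeof(mask));
    for (int i = 0; i < 8; ++i) {
      // binary16 NaN: exponent bits all ones, mantissa nonzero.
      bool expected =
          (bits[i] & 0x7C00) == 0x7C00 && (bits[i] & 0x03FF) != 0;
      assert((mask[i] == 0xFFFF) == expected);
    }
  }
  return 0;
}
#endif
```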