Build bf16 gemv fast path & entry points for non-ARM architectures too #139208
swolchok wants to merge 15 commits into gh/swolchok/683/base
Conversation
Build bf16 gemv fast path & entry points for non-ARM architectures too

Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139208

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a56927d with merge base 419a7e1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D65155971
```diff
 // https://godbolt.org/z/z8P4Yncra
 #define COMPILER_SUPPORTS_BF16_TARGET 1
-#elif !defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 10
+#elif defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && !defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 10
```
Can this be moved to, say, a compiler_capabilities header which is included from here, with a table at the top explaining which compiler versions support what?
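A rough sketch of what such a centralized header could look like, folding in the detection logic from the diff above. The header name, the table contents, and the target-attribute macro are illustrative assumptions based on this review thread, not the actual PyTorch code:

```cpp
// compiler_capabilities.h (hypothetical name, per the review suggestion)
//
// Capability                          Known-good compilers (assumed)
// ----------------------------------  ------------------------------------------
// bf16 target attribute on aarch64    clang (see godbolt link below), gcc >= 10
//
// https://godbolt.org/z/z8P4Yncra
#pragma once

#if defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE)
#  if defined(__clang__)
#    define COMPILER_SUPPORTS_BF16_TARGET 1
#  elif defined(__GNUC__) && __GNUC__ >= 10
#    define COMPILER_SUPPORTS_BF16_TARGET 1
#  endif
#endif

#ifndef COMPILER_SUPPORTS_BF16_TARGET
#  define COMPILER_SUPPORTS_BF16_TARGET 0
#endif

// Typical consumer: annotate bf16 kernels so they may use +bf16 instructions
// even when the baseline build targets plain armv8-a.
#if COMPILER_SUPPORTS_BF16_TARGET
#  define TARGET_ARM_BF16_ATTRIBUTE __attribute__((target("arch=armv8.2-a+bf16")))
#else
#  define TARGET_ARM_BF16_ATTRIBUTE
#endif
```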
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-BF16; observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/) Pull Request resolved: #139220 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558, #139081, #139208
Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208) Very similar to #137917, but for bf16. Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/) Pull Request resolved: #139208 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558, #139081
Stack from ghstack (oldest at bottom):
Very similar to #137917, but for bf16.
Differential Revision: D65155971
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
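For readers following along without the diff: the "gemv fast path" is a reduced-precision matrix-vector kernel that the generic gemm wrapper can dispatch to when one output dimension is 1, and this PR makes the bf16 flavor of that entry point build on non-ARM targets too. The sketch below only illustrates that shape; the names are made up and a scalar inner loop stands in for the real vectorized kernels:

```cpp
#include <cstdint>
#include <cstring>

namespace sketch {

// Stand-in for c10::BFloat16: bf16 is the top 16 bits of an IEEE-754 float32.
using bf16 = uint16_t;

inline float bf16_to_float(bf16 v) {
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// y = beta * y + alpha * A * x, with A stored row-major as m x n and row stride lda.
// A real fast path would replace the scalar loop with AVX2/AVX512-BF16 or NEON code.
void gemv_bf16(int64_t m, int64_t n, float alpha, const bf16* a, int64_t lda,
               const bf16* x, float beta, float* y) {
  for (int64_t i = 0; i < m; ++i) {
    float acc = 0.f;  // accumulate in fp32 for accuracy
    for (int64_t j = 0; j < n; ++j) {
      acc += bf16_to_float(a[i * lda + j]) * bf16_to_float(x[j]);
    }
    y[i] = beta * y[i] + alpha * acc;
  }
}

// Hypothetical entry point a generic gemm wrapper might try first: take the
// fast path only when the problem is really a matrix-vector product.
bool try_gemv_fastpath(int64_t m, int64_t n, int64_t k, float alpha,
                       const bf16* a, int64_t lda, const bf16* x,
                       float beta, float* y) {
  if (n != 1) {
    return false;  // not gemv-shaped; fall back to the full gemm
  }
  gemv_bf16(m, k, alpha, a, lda, x, beta, y);
  return true;
}

}  // namespace sketch
```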