Hook up bf16_gemv_trans to x86 bf16 GEMM #139220
swolchok wants to merge 14 commits into gh/swolchok/684/base
Conversation
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
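The hookup can be pictured as a shape check in the GEMM entry point: when the output has a single column and A is transposed, the matrix-matrix call is really a matrix-vector product, so it is routed to the GEMV kernel. The sketch below is illustrative only; the type `bf16`, the signatures of `bf16_gemv_trans` and `bf16_gemm`, and the fp32-output convention are assumptions for this example, not ATen's actual interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative bf16 storage type: the top 16 bits of an IEEE-754 float.
using bf16 = uint16_t;

inline float bf16_to_float(bf16 x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

inline bf16 float_to_bf16(float f) {  // truncating conversion (no rounding)
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<bf16>(bits >> 16);
}

// y = A^T x, with A stored column-major as k rows x m columns (leading
// dimension lda). Accumulation happens in fp32.
void bf16_gemv_trans(int k, int m, const bf16* a, int lda,
                     const bf16* x, float* y) {
  for (int col = 0; col < m; ++col) {
    float acc = 0.f;
    for (int row = 0; row < k; ++row)
      acc += bf16_to_float(a[col * lda + row]) * bf16_to_float(x[row]);
    y[col] = acc;
  }
}

// Hypothetical GEMM entry point: C (m x n) = op(A) * B. When n == 1 and A
// is transposed, route to the GEMV fast path instead of the general GEMM.
void bf16_gemm(bool trans_a, int m, int n, int k,
               const bf16* a, int lda, const bf16* b, float* c) {
  if (trans_a && n == 1) {
    bf16_gemv_trans(k, m, a, lda, b, c);  // fast path
    return;
  }
  // Naive reference fallback for all other shapes.
  for (int jn = 0; jn < n; ++jn)
    for (int im = 0; im < m; ++im) {
      float acc = 0.f;
      for (int kk = 0; kk < k; ++kk) {
        float av = trans_a ? bf16_to_float(a[im * lda + kk])
                           : bf16_to_float(a[kk * lda + im]);
        acc += av * bf16_to_float(b[jn * k + kk]);
      }
      c[jn * m + im] = acc;
    }
}
```

With small-integer values (exactly representable in bf16), the fast path and the fallback produce identical results, which is the property the hookup relies on.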
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139220
✅ No Failures as of commit 81ba770 with merge base 419a7e1 (checked automatically by Dr. CI).
This pull request was exported from Phabricator. Differential Revision: D65170967
Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
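The AVX2 result is plausible because bf16-to-fp32 widening is very cheap: a bf16 value is the high 16 bits of an fp32, so widening is a 16-bit left shift of the bit pattern, after which ordinary fp32 FMA does the arithmetic; no AVX512-bf16 instructions are required. The scalar model below is an illustration of that lane conversion, not the PR's actual kernel code (`widen_bf16_lane` is a made-up name).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bf16 -> fp32 widening: shift the 16 stored bits into the high half of a
// 32-bit word and reinterpret as float. A vectorized kernel can do the same
// thing lane-wise with a plain 32-bit left-shift-by-16, so the math runs as
// ordinary fp32 FMA even on machines without native bf16 instructions.
inline float widen_bf16_lane(uint16_t bf16_bits) {
  uint32_t fp32_bits = static_cast<uint32_t>(bf16_bits) << 16;
  float f;
  std::memcpy(&f, &fp32_bits, sizeof(f));
  return f;
}
```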
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/) Pull Request resolved: pytorch#139220. Approved by: https://github.com/malfet. ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081, pytorch#139208