Hook up bf16_gemv_trans to x86 bf16 GEMM #139220
swolchok wants to merge 14 commits into gh/swolchok/684/base
Conversation
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
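The hookup can be pictured as a shape check in the GEMM entry point: when the output has a single column and A is transposed, the matrix-matrix call is really a matrix-vector product, so it is routed to the GEMV kernel. The sketch below is illustrative only; the type `bf16`, the signatures of `bf16_gemv_trans` and `bf16_gemm`, and the fp32-output convention are assumptions for this example, not ATen's actual interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative bf16 storage type: the top 16 bits of an IEEE-754 float.
using bf16 = uint16_t;

inline float bf16_to_float(bf16 x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

inline bf16 float_to_bf16(float f) {  // truncating conversion (no rounding)
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<bf16>(bits >> 16);
}

// y = A^T x, with A stored column-major as k rows x m columns (leading
// dimension lda). Accumulation happens in fp32.
void bf16_gemv_trans(int k, int m, const bf16* a, int lda,
                     const bf16* x, float* y) {
  for (int col = 0; col < m; ++col) {
    float acc = 0.f;
    for (int row = 0; row < k; ++row)
      acc += bf16_to_float(a[col * lda + row]) * bf16_to_float(x[row]);
    y[col] = acc;
  }
}

// Hypothetical GEMM entry point: C (m x n) = op(A) * B. When n == 1 and A
// is transposed, route to the GEMV fast path instead of the general GEMM.
void bf16_gemm(bool trans_a, int m, int n, int k,
               const bf16* a, int lda, const bf16* b, float* c) {
  if (trans_a && n == 1) {
    bf16_gemv_trans(k, m, a, lda, b, c);  // fast path
    return;
  }
  // Naive reference fallback for all other shapes.
  for (int jn = 0; jn < n; ++jn)
    for (int im = 0; im < m; ++im) {
      float acc = 0.f;
      for (int kk = 0; kk < k; ++kk) {
        float av = trans_a ? bf16_to_float(a[im * lda + kk])
                           : bf16_to_float(a[kk * lda + im]);
        acc += av * bf16_to_float(b[jn * k + kk]);
      }
      c[jn * m + im] = acc;
    }
}
```

With small-integer values (exactly representable in bf16), the fast path and the fallback produce identical results, which is the property the hookup relies on.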
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139220
✅ No Failures as of commit 81ba770 with merge base 419a7e1 (checked automatically by Dr. CI).
This pull request was exported from Phabricator. Differential Revision: D65170967
Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
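The AVX2 result is plausible because bf16-to-fp32 widening is very cheap: a bf16 value is the high 16 bits of an fp32, so widening is a 16-bit left shift of the bit pattern, after which ordinary fp32 FMA does the arithmetic; no AVX512-bf16 instructions are required. The scalar model below is an illustration of that lane conversion, not the PR's actual kernel code (`widen_bf16_lane` is a made-up name).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// bf16 -> fp32 widening: shift the 16 stored bits into the high half of a
// 32-bit word and reinterpret as float. A vectorized kernel can do the same
// thing lane-wise with a plain 32-bit left-shift-by-16, so the math runs as
// ordinary fp32 FMA even on machines without native bf16 instructions.
inline float widen_bf16_lane(uint16_t bf16_bits) {
  uint32_t fp32_bits = static_cast<uint32_t>(bf16_bits) << 16;
  float f;
  std::memcpy(&f, &fp32_bits, sizeof(f));
  return f;
}
```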
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/) Pull Request resolved: pytorch#139220. Approved by: https://github.com/malfet. ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081, pytorch#139208