Build bf16 gemv fast path & entry points for non-ARM architectures too #139208
swolchok wants to merge 15 commits into gh/swolchok/683/base
Conversation
Build bf16 gemv fast path & entry points for non-ARM architectures too

Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139208

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a56927d with merge base 419a7e1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D65155971
```diff
 // https://godbolt.org/z/z8P4Yncra
 #define COMPILER_SUPPORTS_BF16_TARGET 1
-#elif !defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 10
+#elif defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE) && !defined(__clang__) && defined(__GNUC__) && __GNUC__ >= 10
```
Can this be moved to, say, a compiler_capabilities header which is included from here, with a table at the top explaining which compiler versions support what?
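A rough sketch of what such a centralized header could look like, folding in the detection logic from the diff above. The header name, the table contents, and the target-attribute macro are illustrative assumptions based on this review thread, not the actual PyTorch code:

```cpp
// compiler_capabilities.h (hypothetical name, per the review suggestion)
//
// Capability                          Known-good compilers (assumed)
// ----------------------------------  ------------------------------------------
// bf16 target attribute on aarch64    clang (see godbolt link below), gcc >= 10
//
// https://godbolt.org/z/z8P4Yncra
#pragma once

#if defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE)
#  if defined(__clang__)
#    define COMPILER_SUPPORTS_BF16_TARGET 1
#  elif defined(__GNUC__) && __GNUC__ >= 10
#    define COMPILER_SUPPORTS_BF16_TARGET 1
#  endif
#endif

#ifndef COMPILER_SUPPORTS_BF16_TARGET
#  define COMPILER_SUPPORTS_BF16_TARGET 0
#endif

// Typical consumer: annotate bf16 kernels so they may use +bf16 instructions
// even when the baseline build targets plain armv8-a.
#if COMPILER_SUPPORTS_BF16_TARGET
#  define TARGET_ARM_BF16_ATTRIBUTE __attribute__((target("arch=armv8.2-a+bf16")))
#else
#  define TARGET_ARM_BF16_ATTRIBUTE
#endif
```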
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-BF16; observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/) Pull Request resolved: #139220 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558, #139081, #139208
Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208) Very similar to #137917, but for bf16. Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/) Pull Request resolved: #139208 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558, #139081
Stack from ghstack (oldest at bottom):
Very similar to #137917, but for bf16.
Differential Revision: D65155971
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
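For readers following along without the diff: the "gemv fast path" is a reduced-precision matrix-vector kernel that the generic gemm wrapper can dispatch to when one output dimension is 1, and this PR makes the bf16 flavor of that entry point build on non-ARM targets too. The sketch below only illustrates that shape; the names are made up and a scalar inner loop stands in for the real vectorized kernels:

```cpp
#include <cstdint>
#include <cstring>

namespace sketch {

// Stand-in for c10::BFloat16: bf16 is the top 16 bits of an IEEE-754 float32.
using bf16 = uint16_t;

inline float bf16_to_float(bf16 v) {
  uint32_t bits = static_cast<uint32_t>(v) << 16;
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// y = beta * y + alpha * A * x, with A stored row-major as m x n and row stride lda.
// A real fast path would replace the scalar loop with AVX2/AVX512-BF16 or NEON code.
void gemv_bf16(int64_t m, int64_t n, float alpha, const bf16* a, int64_t lda,
               const bf16* x, float beta, float* y) {
  for (int64_t i = 0; i < m; ++i) {
    float acc = 0.f;  // accumulate in fp32 for accuracy
    for (int64_t j = 0; j < n; ++j) {
      acc += bf16_to_float(a[i * lda + j]) * bf16_to_float(x[j]);
    }
    y[i] = beta * y[i] + alpha * acc;
  }
}

// Hypothetical entry point a generic gemm wrapper might try first: take the
// fast path only when the problem is really a matrix-vector product.
bool try_gemv_fastpath(int64_t m, int64_t n, int64_t k, float alpha,
                       const bf16* a, int64_t lda, const bf16* x,
                       float beta, float* y) {
  if (n != 1) {
    return false;  // not gemv-shaped; fall back to the full gemm
  }
  gemv_bf16(m, k, alpha, a, lda, x, beta, y);
  return true;
}

}  // namespace sketch
```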