Add Vectorized<c10::BFloat16> specialization for ARM by swolchok · Pull Request #139090 · pytorch/pytorch

swolchok · 2024-10-28T17:54:21Z

Stack from ghstack (oldest at bottom):

When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds; feedback is especially welcome there.

Testing: vec_test_all_types should cover correctness. For perf, it seems clear that using vectorized intrinsics should be better than the generic vec_base.h fallback?

Differential Revision: D64997747

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
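
For readers unfamiliar with the format, here is a minimal scalar sketch of the conversions that any bf16 path has to compute, with or without hardware support. It is not code from this PR: the helper names `bf16_bits_to_float`, `float_to_bf16_bits`, and `kHasHardwareBf16` are hypothetical, and round-to-nearest-even is an assumption chosen for illustration. It relies only on the fact that bf16 is the upper 16 bits of an IEEE-754 binary32 value.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper names for illustration; none of these come from the PR.

// Widening: bf16 is the top half of a binary32, so shift the bits up by 16.
inline float bf16_bits_to_float(std::uint16_t bits) {
  std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &wide, sizeof(out));
  return out;
}

// Narrowing with round-to-nearest-even (an assumption of this sketch).
// NaN inputs are mapped to a quiet NaN so rounding cannot turn them into inf.
inline std::uint16_t float_to_bf16_bits(float value) {
  std::uint32_t wide;
  std::memcpy(&wide, &value, sizeof(wide));
  if ((wide & 0x7fffffffu) > 0x7f800000u) {
    return 0x7fc0;
  }
  // Add 0x7fff, plus 1 when the lowest kept bit is odd, so ties go to even.
  std::uint32_t rounding_bias = 0x7fffu + ((wide >> 16) & 1u);
  return static_cast<std::uint16_t>((wide + rounding_bias) >> 16);
}

// Compile-time switch on the macro named in the description above.
#if defined(__ARM_FEATURE_BF16)
constexpr bool kHasHardwareBf16 = true;
#else
constexpr bool kHasHardwareBf16 = false;
#endif
```

A vectorized fallback would apply the same shift and rounding to whole registers of lanes at once, which is one plausible reading of "better than vec_base.h"; the actual implementation should be checked against the PR diff.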


pytorch-bot · 2024-10-28T17:54:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139090

✅ No Failures

As of commit 47201a1 with merge base 419a7e1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-28T17:54:30Z

This pull request was exported from Phabricator. Differential Revision: D64997747


facebook-github-bot · 2024-10-28T22:45:34Z

This pull request was exported from Phabricator. Differential Revision: D64997747


facebook-github-bot · 2024-10-29T05:41:24Z

This pull request was exported from Phabricator. Differential Revision: D64997747


facebook-github-bot · 2024-10-29T17:47:50Z

This pull request was exported from Phabricator. Differential Revision: D64997747

When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds, feedback especially welcome there.
Testing: vec_test_all_types should cover correctness. For perf, seems clear that using vectorized intrinsics should be better than vec_base?
Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/)
Pull Request resolved: pytorch#139090
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: pytorch#139084

…torch#139558)
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.
Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.
Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)
Pull Request resolved: pytorch#139558
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#139084, pytorch#139090

…rch#139081)
Following the previous move of fp16_gemv_trans.
Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`, didn't find one.
Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/)
Pull Request resolved: pytorch#139081
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558

pytorch#139208)
Very similar to pytorch#137917, but for bf16.
Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)
Pull Request resolved: pytorch#139208
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081

This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253.
Testing: Ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2.
Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
Pull Request resolved: pytorch#139220
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#139084, pytorch#139090, pytorch#139558, pytorch#139081, pytorch#139208

@swolchok

Fix typo causing compilation error on aarch64 architecture with BF16 support. (#139090)
tag: @swolchok
Pull Request resolved: #142370
Approved by: https://github.com/Skylion007, https://github.com/malfet

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 28, 2024

swolchok mentioned this pull request Oct 28, 2024

[PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert #137661

Closed

facebook-github-bot added the fb-exported label Oct 28, 2024

swolchok mentioned this pull request Oct 29, 2024

[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159

Closed

swolchok mentioned this pull request Oct 29, 2024

Build bf16 gemv fast path & entry points for non-ARM architectures too #139208

Closed

github-actions bot deleted the gh/swolchok/681/head branch December 9, 2024 02:13

aditew01 mentioned this pull request Dec 9, 2024

[cpu/aarch64] fix compilation for Vec:bf16 (128bit) #142370

Closed

tinglvv mentioned this pull request Jan 15, 2025

[aarch64] multiple inductor test failures related to vec128_bfloat16 #144818

Closed

swolchok wants to merge 17 commits into gh/swolchok/681/base from gh/swolchok/681/head

5 participants
