Unbreak vec128_half_neon comparison without FP16 hardware support #139558
swolchok wants to merge 7 commits into gh/swolchok/688/base
Conversation
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16. Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass. Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139558
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit bcb32c3 with merge base 419a7e1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D65385267
This is only a problem if FCVTN canonicalizes NaNs. It is implementation-defined whether it does so by default; the behavior is controlled by the DN bit of the FPCR, which is unknown after a reset (https://developer.arm.com/documentation/ddi0595/2021-03/AArch64-Registers/FPCR--Floating-point-Control-Register?lang=en#fieldset_0-25_25). Therefore I cannot repro the bug locally with the existing behavior, but I can check that things still work with this diff.
```cpp
  return Vectorized<c10::Half>(vcombine_f16(r00, r01));
}

Vectorized<c10::Half> map2_bitmask_with_vec_float_method(
```
Following the previous move of fp16_gemv_trans. Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`; didn't find one. Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/) Pull Request resolved: #139081 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558
#139208) Very similar to #137917, but for bf16. Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/) Pull Request resolved: #139208 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558, #139081
This is the big milestone for bf16 and should enable us to close pytorch/torchchat#1253. Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-BF16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2. Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/) Pull Request resolved: #139220 Approved by: https://github.com/malfet ghstack dependencies: #139084, #139090, #139558, #139081, #139208
Stack from ghstack (oldest at bottom):
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.
Testing: Manually disabled the FP16 feature for a local vec_test_all_types run on Mac; saw the tests pass.
Differential Revision: D65385267
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10