Add scalar conversion using avx instructions for half#102140
Add scalar conversion using avx instructions for half#102140CaoE wants to merge 2 commits intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102140
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit b7079ee with merge base 3a79621 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Move #101378 from stack to here. |
|
@ngimel Could please review this PR ? Thank you. |
|
@cpuhrsch Could you please review this PR ? Thanks. |
aten/src/ATen/cpu/vec/vec_half.h
Outdated
| inline namespace CPU_CAPABILITY { | ||
|
|
||
| #if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512) | ||
| inline uint16_t float2half_scalar(float val) { |
There was a problem hiding this comment.
Why introduce these new functions that are only defined under both of these options?
It looks like they're only used once.
There was a problem hiding this comment.
There is no direct data type conversion instruction for single Half value on CPU, so I add scalar conversion with AVX instructions for Half to speed up when AVX2 or AVX512 is supported on the platform.
If AVX2 and AVX512 are not supported, Half <-> float conversion will fallback to the original implementations.
| return float(c10::bit_cast<sycl::half>(x)); | ||
| #else | ||
| #if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512) | ||
| return at::vec::half2float_scalar(x); |
There was a problem hiding this comment.
Should we just inline the code here?
There was a problem hiding this comment.
Alternatively we can unify code that works for all CPU types in the half2float, float2half conversion functions.
There was a problem hiding this comment.
Should we just inline the code here?
I can not inline at::vec::half2float_scalar or at::vec::float2half_scalar in c10/util/Half-inl.h since PyTorch build systems does not add compilation flags related to AVX or AVX512 for c10/util/Half-inl.h during compilation. It is unable to compile AVX related instructions in c10/util/Half-inl.h. That's why I chose to add these two functions under aten/src/Aten/cpu.
cpuhrsch
left a comment
There was a problem hiding this comment.
Are we sure that CI covers these changes? Can you raise an exception within the conversion and see which tests and which environment fails? I want to make sure we have a CPU with the required capability and a test that exercises this.
|
@cpuhrsch I added a c++ test case for half conversion.
Half conversion would not raise an exception since if the cpu does not support avx2 or avx512 it will fallback to |
|
@pytorchbot revert -m "Sorry, this is still breaking internal builds. Specifically, the dynamo test test_repros.py::DynamicShapesReproTests::test_odict_get_item_index_name" -c ghfirst @cpuhrsch can help debug this |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@CaoE your PR has been successfully reverted. |
This reverts commit 1d6a446. Reverted #102140 on behalf of https://github.com/ZainRizvi due to Sorry, this is still breaking internal builds. Specifically, the dynamo test test_repros.py::DynamicShapesReproTests::test_odict_get_item_index_name ([comment](#102140 (comment)))
|
This looks like an unrelated failure. @CaoE could you please rebase and we'll try again? |
eee3577 to
f9530b4
Compare
|
@cpuhrsch Thanks for your help !!! I rebased and also removed apple check by using |
6c5f279 to
31c2cc4
Compare
|
Warning: Unknown label
Please add the new label to .github/pytorch-probot.yml |
|
|
ghstack-source-id: f151192 Pull Request resolved: pytorch#101378
31c2cc4 to
b7079ee
Compare
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Motivation
Scalar conversion between Half and Float on CPU is more time consuming compared to BFloat16 <-> Float. There is no direct data type conversion instruction for single Half value on CPU, so we add scalar conversion with avx instructions for Half to speed up.
Testing
Test maxpool, and compared with the results of #98819.
Single socket (28 cores):
Single core:
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov