add scalar conversion using avx instructions for Half on CPU #101378
CaoE wants to merge 8 commits into gh/CaoE/24/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101378
✅ No failures as of commit 945e0c0. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 5ee69c2 Pull Request resolved: pytorch#101378
```cpp
__m256 v = _mm256_set_ps(val, val, val, val, val, val, val, val);
__m128i o = _mm256_cvtps_ph(
    v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
std::memcpy(&d, &o, sizeof(d));
```
How does it compare to a vector store into an array followed by reading the first element? Have you checked the assembly code?
std::memcpy and a vector store have almost the same performance; used a vector store instead.
I checked the results with this conversion on maxpool in #101379 (including AVX2 and AVX512).
Good job, you got rid of the memcpy; that's what I was talking about.
```cpp
template <> inline Half down_scale(float val) {
  unsigned short d;
  __m256 v = _mm256_set_ps(val, val, val, val, val, val, val, val);
```
_mm256_set1_ps is simpler?
Yes. Used _mm256_set1_ps instead.
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
ghstack-source-id: 849a2e9 Pull Request resolved: pytorch#101378
mingfeima left a comment:
generally LGTM except a few minor changes!
aten/src/ATen/cpu/vec/vec_half.h
Outdated
```cpp
// @@ -0,0 +1,59 @@ (new file)
#include <ATen/cpu/vec/intrinsics.h>

#ifdef CPU_CAPABILITY_AVX512
```
Why can't we just use __at_align__ from aten/src/ATen/cpu/vec/vec_base.h?
Removed __at_align__.
aten/src/ATen/cpu/vec/vec_half.h
Outdated
```cpp
inline namespace CPU_CAPABILITY {

#if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)
inline uint16_t float_to_half(float val) {
```
Can we use Half as the output type instead of uint16_t?
Returning uint16_t is aligned with the behavior of fp16_ieee_from_fp32_value called in Half-inl.h, and it makes it easy for Half::Half(float value) to use float_to_half with uint16_t as the output.
aten/src/ATen/cpu/vec/vec_half.h
Outdated
```cpp
#endif
}

inline float half_to_float(uint16_t val) {
```
I would suggest making an overload for each branch of the #if/#elif/#else for half2float_scalar and float2half_scalar:

```cpp
#if defined(CPU_CAPABILITY_AVX512)
half2float_scalar { ... }
float2half_scalar { ... }
#elif defined(CPU_CAPABILITY_AVX2)
half2float_scalar { ... }
float2half_scalar { ... }
#else
// calls into the slow path
half2float_scalar { ... }
float2half_scalar { ... }
#endif
```
In Half-inl.h:

```cpp
#if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)
    x(at::vec::float_to_half(value))
#else
    x(detail::fp16_ieee_from_fp32_value(value))
#endif
```

If CPU_CAPABILITY_AVX2 and CPU_CAPABILITY_AVX512 are not defined, it will use x(detail::fp16_ieee_from_fp32_value(value)). If we wanted to define the slow path in vec_half.h, we would need to pull detail::fp16_ieee_from_fp32_value into vec_half.h. I think we can leave the slow path in Half-inl.h.
```cpp
__m256 v = _mm256_set1_ps(val);
__m128i o =
    _mm256_cvtps_ph(v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
return static_cast<std::uint16_t>(_mm_cvtsi128_si32(o));
```
Yes, the lower 16 bits. Copied from https://github.com/pytorch/FBGEMM/blob/main/src/QuantUtilsAvx2.cc#L1556.
What's the operation shown in the benchmark above? This patch has no dependency on the stack; I suggest moving it out as a single PR.
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing

Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
The operation is maxpool, compared with the results of #98819.
ghstack-source-id: f151192 Pull Request resolved: pytorch#101378
Move this PR to #102140.
Stack from ghstack (oldest at bottom):
Motivation
Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.
Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):
Single core:
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10