
add scalar conversion using avx instructions for Half on CPU #101378

Closed
CaoE wants to merge 8 commits into gh/CaoE/24/base from gh/CaoE/24/head

Conversation

@CaoE
Collaborator

@CaoE CaoE commented May 15, 2023

Stack from ghstack (oldest at bottom):

Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion routines implemented with AVX instructions for Half to speed this up.
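The idea can be sketched in a self-contained form. The following is a hedged illustration, not the exact PR code: with F16C/AVX available, broadcast the scalar into a vector register, convert with `vcvtps2ph`/`vcvtph2ps`, and read back one lane; otherwise fall back to a scalar bit-level conversion (the fallback below handles normal-range values, maps inf/NaN coarsely, and flushes denormals to zero for brevity).

```cpp
#include <cstdint>
#include <cstring>
#if defined(__F16C__)
#include <immintrin.h>
#endif

inline uint16_t float_to_half(float val) {
#if defined(__F16C__)
  __m256 v = _mm256_set1_ps(val);
  __m128i o = _mm256_cvtps_ph(v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
  return static_cast<uint16_t>(_mm_cvtsi128_si32(o));  // half bits live in the low 16 bits
#else
  // Scalar fallback: rebias the exponent from 8-bit (bias 127) to 5-bit (bias 15)
  // and truncate the mantissa from 23 to 10 bits. Normal-range values only.
  uint32_t x;
  std::memcpy(&x, &val, sizeof(x));
  uint32_t sign = (x >> 16) & 0x8000u;
  int32_t  exp  = static_cast<int32_t>((x >> 23) & 0xFFu) - 127 + 15;
  uint32_t mant = x & 0x7FFFFFu;
  if (exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u | (mant ? 0x200u : 0u));
  if (exp <= 0)  return static_cast<uint16_t>(sign);  // flush tiny values to signed zero
  return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | (mant >> 13));
#endif
}

inline float half_to_float(uint16_t h) {
#if defined(__F16C__)
  __m256 v = _mm256_cvtph_ps(_mm_cvtsi32_si128(h));
  return _mm256_cvtss_f32(v);
#else
  uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  uint32_t exp  = (h >> 10) & 0x1Fu;
  uint32_t mant = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0)       bits = sign;                               // zero (denormals flushed)
  else if (exp == 31) bits = sign | 0x7F800000u | (mant << 13);  // inf/NaN
  else                bits = sign | ((exp - 15 + 127) << 23) | (mant << 13);
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
#endif
}
```

The key point is that the conversion stays entirely in registers on the fast path: one broadcast, one `vcvtps2ph`, one move to a general-purpose register.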

Testing

We tested maxpool and compared against the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented May 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101378


✅ No Failures

As of commit 945e0c0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 15, 2023
@CaoE CaoE marked this pull request as draft May 15, 2023 02:59
@CaoE CaoE requested a review from mingfeima May 15, 2023 02:59
@CaoE CaoE added the topic: not user facing topic category label May 15, 2023
@CaoE CaoE changed the title add scalar conversion using avx instructions for half add scalar conversion using avx instructions for Half on CPU May 15, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request May 15, 2023
@CaoE CaoE added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels May 15, 2023
@CaoE CaoE requested a review from jgong5 May 16, 2023 02:44

    __m256 v = _mm256_set_ps(val, val, val, val, val, val, val, val);
    __m128i o = _mm256_cvtps_ph(
        v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
    std::memcpy(&d, &o, sizeof(d));
Collaborator
How does it compare to doing a vector store to an array and reading the first element? Have you checked the assembly code?

Collaborator Author

std::memcpy and a vector store have almost the same performance. I used a vector store instead.
I checked the results with this conversion on maxpool #101379 (including AVX2 and AVX512).

Collaborator

Good job, you got rid of the memcpy, that's what I was talking about ~
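The tradeoff discussed in this thread can be sketched without intrinsics, so it compiles anywhere. In this hedged model, `lane` stands in for the low 32-bit lane of the `__m128i` produced by `_mm256_cvtps_ph`; on little-endian x86 a `memcpy` of the first two bytes and an integer truncation read the same 16 bits, but the truncation form (mirroring `static_cast<uint16_t>(_mm_cvtsi128_si32(o))`) lets the compiler keep the value in registers instead of bouncing it through a store and reload.

```cpp
#include <cstdint>
#include <cstring>

// Style 1: copy the low two bytes of the lane into a scalar.
// Assumes little-endian byte order, as on x86.
inline uint16_t extract_via_memcpy(uint32_t lane) {
  uint16_t d;
  std::memcpy(&d, &lane, sizeof(d));
  return d;
}

// Style 2: truncate the 32-bit lane to its low 16 bits.
// This is what the PR's final code does after _mm_cvtsi128_si32.
inline uint16_t extract_via_truncate(uint32_t lane) {
  return static_cast<uint16_t>(lane);
}
```

Both return the same bits; the difference is only in the instruction sequence the compiler can emit.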


    template<> inline Half down_scale(float val) {
      unsigned short d;
      __m256 v = _mm256_set_ps(val, val, val, val, val, val, val, val);
Collaborator

_mm256_set1_ps is simpler?

Collaborator Author

Yes. Used _mm256_set1_ps instead.

@CaoE CaoE added the module: half Related to float16 half-precision floats label May 17, 2023
@CaoE CaoE marked this pull request as ready for review May 18, 2023 00:52
@CaoE CaoE marked this pull request as draft May 18, 2023 07:06
CaoE added a commit that referenced this pull request May 19, 2023
ghstack-source-id: 849a2e9
Pull Request resolved: #101378
CaoE added a commit to CaoE/pytorch that referenced this pull request May 22, 2023
Collaborator

@mingfeima mingfeima left a comment

generally LGTM except a few minor changes!

    @@ -0,0 +1,59 @@
    #include <ATen/cpu/vec/intrinsics.h>

    #ifdef CPU_CAPABILITY_AVX512
Collaborator

Why can't we just use __at_align__ from aten/src/ATen/cpu/vec/vec_base.h?

Collaborator Author

@CaoE CaoE May 23, 2023

Removed __at_align__.

    inline namespace CPU_CAPABILITY {

    #if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)
    inline uint16_t float_to_half(float val) {
Collaborator

Can we use Half as the output type instead of uint16_t?

Collaborator Author

@CaoE CaoE May 23, 2023

Returning uint16_t matches the behavior of fp16_ieee_from_fp32_value as called in Half-inl.h, and it makes it easy for Half::Half(float value) to use float_to_half with uint16_t as the output.
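Why a uint16_t return type composes cleanly can be shown with a minimal sketch. The names here are simplified stand-ins for the PR's types, and the converter is a truncating scalar stand-in valid only for normal-range floats: because `float_to_half` returns raw IEEE half bits, a `Half` wrapper can initialize its `uint16_t` member directly, with no extra cast through a `Half` value.

```cpp
#include <cstdint>
#include <cstring>

// Truncating scalar converter, illustrative only: rebias the exponent
// (bias 127 -> bias 15) and keep the top 10 mantissa bits. Valid for
// normal-range inputs; no inf/NaN/denormal handling.
inline uint16_t float_to_half(float v) {
  uint32_t b;
  std::memcpy(&b, &v, sizeof(b));
  return static_cast<uint16_t>(((b >> 16) & 0x8000u) |
                               ((((b >> 23) & 0xFFu) - 112u) << 10) |
                               ((b >> 13) & 0x3FFu));
}

// Minimal Half-like wrapper: stores raw half bits in a uint16_t member,
// as c10::Half does, so the constructor is a single call.
struct Half {
  uint16_t x;
  explicit Half(float value) : x(float_to_half(value)) {}
};
```

If `float_to_half` instead returned a `Half`, the constructor would need a bit-level unwrap before storing, which is exactly the indirection the uint16_t signature avoids.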

    #endif
    }

    inline float half_to_float(uint16_t val) {
Collaborator

I would suggest making an overload for each branch of the preprocessor #if/#else for half2float_scalar and float2half_scalar:

    #if defined(CPU_CAPABILITY_AVX512)
    half2float_scalar { ... }
    float2half_scalar { ... }
    #elif defined(CPU_CAPABILITY_AVX2)
    half2float_scalar { ... }
    float2half_scalar { ... }
    #else
    // calls into the slow path
    half2float_scalar { ... }
    float2half_scalar { ... }
    #endif


Collaborator Author

In Half-inl.h:

    #if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)
          x(at::vec::float_to_half(value))
    #else
          x(detail::fp16_ieee_from_fp32_value(value))

If neither CPU_CAPABILITY_AVX2 nor CPU_CAPABILITY_AVX512 is defined, it will use x(detail::fp16_ieee_from_fp32_value(value)).
If we wanted to define the slow path in vec_half.h, we would need to pull detail::fp16_ieee_from_fp32_value into vec_half.h, so I think we can leave the slow path in Half-inl.h.

    __m256 v = _mm256_set1_ps(val);
    __m128i o =
        _mm256_cvtps_ph(v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
    return static_cast<std::uint16_t>(_mm_cvtsi128_si32(o));
Collaborator

the lower 16 bits?

Collaborator Author

@mingfeima
Collaborator

What's the operation shown in the benchmark above?

This patch has no dependency on the stack; I suggest moving it out as a single PR.

@CaoE
Collaborator Author

CaoE commented May 23, 2023

> what's the operation shown in the benchmark above ?

The operation is maxpool, compared with the results of #98819.

@CaoE CaoE requested a review from mingfeima May 23, 2023 06:38
CaoE added a commit that referenced this pull request May 23, 2023
ghstack-source-id: f151192
Pull Request resolved: #101378
CaoE added a commit to CaoE/pytorch that referenced this pull request May 24, 2023
@CaoE
Collaborator Author

CaoE commented May 31, 2023

> this patch has no dependency on the stack, suggest move it out as a single pr.

Moved this PR to #102140.

@CaoE CaoE closed this May 31, 2023
@facebook-github-bot facebook-github-bot deleted the gh/CaoE/24/head branch June 30, 2023 14:16
CaoE added a commit to CaoE/pytorch that referenced this pull request Jul 12, 2023
pytorchmergebot pushed a commit to jiayisunx/pytorch that referenced this pull request Jul 25, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Jul 28, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 15, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 16, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 22, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 25, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 28, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 29, 2023
CaoE added a commit to CaoE/pytorch that referenced this pull request Aug 30, 2023
