add scalar conversion using avx instructions for Half on CPU #101378
CaoE wants to merge 8 commits into gh/CaoE/24/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101378
✅ No failures as of commit 945e0c0. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 5ee69c2 Pull Request resolved: pytorch#101378
```cpp
__m256 v = _mm256_set_ps(val, val, val, val, val, val, val, val);
__m128i o = _mm256_cvtps_ph(
    v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
std::memcpy(&d, &o, sizeof(d));
```
How does it compare to a vector store into an array followed by reading the first element? Have you checked the assembly code?
std::memcpy and a vector store have almost the same performance; used a vector store instead.
I checked the results with this conversion on maxpool in #101379 (including AVX2 and AVX512).
Good job, you got rid of the memcpy; that's what I was talking about.
```cpp
template <> inline Half down_scale(float val) {
  unsigned short d;
  __m256 v = _mm256_set_ps(val, val, val, val, val, val, val, val);
```
_mm256_set1_ps is simpler?
Yes. Used _mm256_set1_ps instead.
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
ghstack-source-id: 849a2e9 Pull Request resolved: pytorch#101378
mingfeima left a comment:
generally LGTM except a few minor changes!
aten/src/ATen/cpu/vec/vec_half.h
Outdated
```cpp
// @@ -0,0 +1,59 @@ (new file)
#include <ATen/cpu/vec/intrinsics.h>

#ifdef CPU_CAPABILITY_AVX512
```
Why can't we just use __at_align__ from aten/src/ATen/cpu/vec/vec_base.h?
Removed __at_align__.
aten/src/ATen/cpu/vec/vec_half.h
Outdated
```cpp
inline namespace CPU_CAPABILITY {

#if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)
inline uint16_t float_to_half(float val) {
```
Can we use Half as the output type instead of uint16_t?
Returning uint16_t is aligned with the behavior of fp16_ieee_from_fp32_value called in Half-inl.h, and it makes it easy for Half::Half(float value) to use float_to_half with uint16_t as the output.
aten/src/ATen/cpu/vec/vec_half.h
Outdated
```cpp
#endif
}

inline float half_to_float(uint16_t val) {
```
I would suggest making an overload for each branch of the #if/#elif/#else for half2float_scalar and float2half_scalar:

```cpp
#if defined(CPU_CAPABILITY_AVX512)
half2float_scalar { ... }
float2half_scalar { ... }
#elif defined(CPU_CAPABILITY_AVX2)
half2float_scalar { ... }
float2half_scalar { ... }
#else
// calls into the slow path
half2float_scalar { ... }
float2half_scalar { ... }
#endif
```
In Half-inl.h:

```cpp
#if defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_AVX512)
    x(at::vec::float_to_half(value))
#else
    x(detail::fp16_ieee_from_fp32_value(value))
#endif
```

If CPU_CAPABILITY_AVX2 and CPU_CAPABILITY_AVX512 are not defined, it will use x(detail::fp16_ieee_from_fp32_value(value)). If we wanted to define the slow path in vec_half.h, we would need to pull detail::fp16_ieee_from_fp32_value into vec_half.h. I think we can leave the slow path in Half-inl.h.
```cpp
__m256 v = _mm256_set1_ps(val);
__m128i o =
    _mm256_cvtps_ph(v, (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
return static_cast<std::uint16_t>(_mm_cvtsi128_si32(o));
```
Yes, the lower 16 bits. Copied from https://github.com/pytorch/FBGEMM/blob/main/src/QuantUtilsAvx2.cc#L1556.
What's the operation shown in the benchmark above? This patch has no dependency on the stack; I suggest moving it out as a single PR.
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing

Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
The operation is maxpool, compared with the results of #98819.
ghstack-source-id: f151192 Pull Request resolved: pytorch#101378
Move this PR to #102140.
Stack from ghstack (oldest at bottom):
Motivation
Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.
Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):
Single core:
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10