optimize Half performance for maxpool on CPU by CaoE · Pull Request #101379 · pytorch/pytorch

CaoE · 2023-05-15T02:58:37Z

Stack from ghstack (oldest at bottom):

Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than conversion between BFloat16 and Float. This PR uses scalar conversion with AVX instructions for Half to speed up maxpool on CPU for Half inputs.
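To make the cost difference concrete, here is a minimal pure-Python sketch of what a software scalar conversion has to do for each format. It is an illustration only and not the code in this PR: the function names `half_bits_to_float` and `bfloat16_bits_to_float` are hypothetical, and the actual kernels are C++ that relies on hardware conversion instructions. Half needs its exponent rebiased and its subnormal, inf, and NaN cases handled separately, whereas BFloat16 is the upper 16 bits of a float32 and widens with a single shift.

```python
import struct


def half_bits_to_float(h: int) -> float:
    """Decode IEEE 754 binary16 bits by hand (1 sign, 5 exponent, 10 mantissa bits)."""
    sign = -1.0 if (h >> 15) & 1 else 1.0
    exp = (h >> 10) & 0x1F
    frac = h & 0x3FF
    if exp == 0:
        # Zero or subnormal: no implicit leading 1, fixed scale of 2**-24.
        return sign * frac * 2.0 ** -24
    if exp == 0x1F:
        # All-ones exponent encodes infinity (frac == 0) or NaN.
        return sign * float("inf") if frac == 0 else float("nan")
    # Normal number: rebias the exponent from 15 and restore the implicit 1.
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)


def bfloat16_bits_to_float(b: int) -> float:
    """bfloat16 is the top half of a float32, so widening is one 16-bit shift."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    print(half_bits_to_float(0x3C00))      # Half encoding of 1.0
    print(bfloat16_bits_to_float(0x3F80))  # BFloat16 encoding of 1.0
```

The branches in the Half path are what a per-element software fallback pays for; a hardware conversion instruction replaces that work, which is the effect the motivation above describes.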

Testing

Single socket (28 cores):

shape	fp16 forward / ms	bf16 forward / ms	fp16 backward / ms	bf16 backward / ms	speedup ratio (fp16 forward)	speedup ratio (fp16 backward)
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig	4.774427	5.493402	0.59731	0.568485	1.459212	3.330214
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL	1.413417	1.351261	9.366035	10.57506	1.335862	3.897978
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig	25.7854	29.55482	3.817368	3.869653	1.469973	2.779821
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL	4.850936	4.793715	4.509115	5.024385	0.971149	3.263856
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig	13.64016	15.28637	1.67952	1.71365	1.323064	2.954249
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL	2.05289	2.05532	2.11025	2.33923	0.978913	3.416325
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig	0.421462	0.479748	0.065937	0.065155	2.097413	1.751505
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d	0.117073	0.112326	0.141022	0.145364	2.160446	3.780899

Single core:

shape	fp16 forward / ms	bf16 forward / ms	fp16 backward / ms	bf16 backward / ms	speedup ratio (fp16 forward)	speedup ratio (fp16 backward)
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig	111.3258	128.6513	10.9745	12.21541	1.468419	3.77314
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL	28.91686	29.80297	9.718766	11.25323	1.327646	3.804531
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig	586.6346	674.8961	61.85198	68.8631	1.458279	3.544892
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL	86.70579	89.43031	53.36975	62.60659	1.053431	3.822342
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig	290.023	331.8997	29.75552	32.78853	1.444988	3.601094
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL	31.60022	32.75628	25.69056	30.17796	1.061671	3.836374
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig	8.705227	9.937217	0.410647	0.434229	1.599131	3.294046
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d	2.456079	2.469444	0.353301	0.401361	1.371674	3.521077

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

[ghstack-poisoned]

pytorch-bot · 2023-05-15T02:58:40Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101379

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 263f398:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

### Testing

Single socket (28 cores):

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 3.959584 | 4.774427414 | 5.493402 | 0.557232 | 0.59731 | 0.568485
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 0.815511 | 1.413416862 | 1.351261 | 5.710506 | 9.366035 | 10.57506
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 21.28074 | 25.78539848 | 29.55482 | 6.921601 | 3.817368 | 3.869653
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 5.755067 | 4.850935936 | 4.793715 | 6.737041 | 4.509115 | 5.024385
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 10.63426 | 13.64016 | 15.28637 | 2.67656 | 1.67952 | 1.71365
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.63570 | 2.05289 | 2.05532 | 2.55452 | 2.11025 | 2.33923
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.375469 | 0.421462059 | 0.479748 | 0.066364 | 0.065937 | 0.065155
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.112197 | 0.117073059 | 0.112326 | 0.111697 | 0.141022 | 0.145364

Single core:

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 92.16582 | 111.3258338 | 128.6513 | 6.684325 | 10.9745 | 12.21541
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 10.14318 | 28.9168644 | 29.80297 | 7.350142 | 9.718766 | 11.25323
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 485.4677 | 586.6346097 | 674.8961 | 44.93304 | 61.85198 | 68.8631
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 81.45644 | 86.70579195 | 89.43031 | 46.44922 | 53.36975 | 62.60659
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 238.55453 | 290.02304 | 331.89967 | 19.694657 | 29.75552 | 32.78853
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 30.17079 | 31.60022 | 32.75628 | 22.44543 | 25.69056 | 30.17796
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 7.474389 | 8.705227375 | 9.937217 | 0.236015 | 0.410647 | 0.434229
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.318954 | 2.456078529 | 2.469444 | 0.262125 | 0.353301 | 0.401361

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ghstack-source-id: b2519cc Pull Request resolved: #101379

jgong5 · 2023-05-15T06:40:11Z

Please elaborate how the changes contribute to the performance in the PR description.


optimize Half performance for maxpool

cfc5e56

[ghstack-poisoned]

CaoE mentioned this pull request May 15, 2023

add channel last 3d support for batch_norm on CPU #97774

Closed

This was referenced May 15, 2023

add channel last 3d support for maxpool3d on CPU #97775

Closed

add Half support for maxpool on CPU #98819

Closed

add scalar conversion using avx instructions for Half on CPU #101378

Closed

github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 15, 2023

CaoE marked this pull request as draft May 15, 2023 02:59

pytorchbot added the open source label May 15, 2023

CaoE added the topic: not user facing topic category label May 15, 2023

CaoE changed the title ~~optimize Half performance for maxpool~~ optimize Half performance for maxpool on CPU May 15, 2023

CaoE requested a review from mingfeima May 15, 2023 06:17

CaoE added a commit that referenced this pull request May 15, 2023

optimize Half performance for maxpool

c5e3904

ghstack-source-id: b2519cc Pull Request resolved: #101379

CaoE added the ciflow/trunk (Trigger trunk jobs on your pull request) and ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR) labels May 15, 2023

CaoE requested a review from jgong5 May 16, 2023 02:43

jgong5 approved these changes May 16, 2023


CaoE added a commit that referenced this pull request May 17, 2023

optimize Half performance for maxpool

1452bf6

ghstack-source-id: c027674 Pull Request resolved: #101379

CaoE marked this pull request as ready for review May 18, 2023 00:52

CaoE marked this pull request as draft May 18, 2023 07:06

CaoE closed this May 19, 2023

facebook-github-bot deleted the gh/CaoE/25/head branch June 18, 2023 14:16

CaoE wants to merge 3 commits into gh/CaoE/25/base from gh/CaoE/25/head
3 participants