optimize Half performance for maxpool on CPU#101379

Closed
CaoE wants to merge 3 commits into gh/CaoE/25/base from gh/CaoE/25/head

Conversation

@CaoE
Collaborator

@CaoE CaoE commented May 15, 2023

Stack from ghstack (oldest at bottom):

Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than scalar conversion between BFloat16 and Float. This PR performs the scalar Half <-> Float conversion with AVX instructions to speed up maxpool for Half on CPU.
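To illustrate why a software Half -> Float conversion can cost more than BFloat16 -> Float, here is a minimal pure-Python sketch of both widenings at the bit level. It is not the code from this PR, and the function names `bf16_bits_to_float` and `fp16_bits_to_float` are hypothetical. It assumes IEEE-754 binary16 for Half and the usual "top 16 bits of float32" layout for BFloat16.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a bfloat16 bit pattern: it is the top half of a float32, so one shift suffices."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp16_bits_to_float(bits: int) -> float:
    """Widen an IEEE-754 binary16 bit pattern to float32 with explicit bit manipulation."""
    sign = (bits >> 15) & 0x1
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0:
        if frac == 0:
            out = sign << 31  # signed zero
        else:
            # Subnormal half: shift the fraction left until the implicit bit appears,
            # counting shifts so the float32 exponent can be adjusted to match.
            e = -1
            while (frac & 0x400) == 0:
                frac <<= 1
                e += 1
            frac &= 0x3FF
            out = (sign << 31) | ((127 - 15 - e) << 23) | (frac << 13)
    elif exp == 0x1F:
        out = (sign << 31) | (0xFF << 23) | (frac << 13)  # infinity or NaN
    else:
        # Normal number: re-bias the exponent from 15 to 127 and widen the fraction.
        out = (sign << 31) | ((exp - 15 + 127) << 23) | (frac << 13)
    return struct.unpack("<f", struct.pack("<I", out))[0]
```

The BFloat16 path is a single shift, while the Half path branches on zero, subnormal, normal, and infinity/NaN cases. A hardware conversion instruction does all of that work in one step, which is consistent with the motivation stated above.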

Testing

Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 4.774427 | 5.493402 | 0.59731 | 0.568485 | 1.459212 | 3.330214
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 1.413417 | 1.351261 | 9.366035 | 10.57506 | 1.335862 | 3.897978
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 25.7854 | 29.55482 | 3.817368 | 3.869653 | 1.469973 | 2.779821
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 4.850936 | 4.793715 | 4.509115 | 5.024385 | 0.971149 | 3.263856
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 13.64016 | 15.28637 | 1.67952 | 1.71365 | 1.323064 | 2.954249
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.05289 | 2.05532 | 2.11025 | 2.33923 | 0.978913 | 3.416325
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.421462 | 0.479748 | 0.065937 | 0.065155 | 2.097413 | 1.751505
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.117073 | 0.112326 | 0.141022 | 0.145364 | 2.160446 | 3.780899

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 111.3258 | 128.6513 | 10.9745 | 12.21541 | 1.468419 | 3.77314
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 28.91686 | 29.80297 | 9.718766 | 11.25323 | 1.327646 | 3.804531
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 586.6346 | 674.8961 | 61.85198 | 68.8631 | 1.458279 | 3.544892
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 86.70579 | 89.43031 | 53.36975 | 62.60659 | 1.053431 | 3.822342
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 290.023 | 331.8997 | 29.75552 | 32.78853 | 1.444988 | 3.601094
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 31.60022 | 32.75628 | 25.69056 | 30.17796 | 1.061671 | 3.836374
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 8.705227 | 9.937217 | 0.410647 | 0.434229 | 1.599131 | 3.294046
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.456079 | 2.469444 | 0.353301 | 0.401361 | 1.371674 | 3.521077

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot

pytorch-bot bot commented May 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/101379

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 263f398:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the "module: cpu" (CPU specific problem, e.g., perf, algorithm) label May 15, 2023
@CaoE CaoE marked this pull request as draft May 15, 2023 02:59
@CaoE CaoE added the "topic: not user facing" (topic category) label May 15, 2023
@CaoE CaoE changed the title from "optimize Half performance for maxpool" to "optimize Half performance for maxpool on CPU" May 15, 2023
@CaoE CaoE requested a review from mingfeima May 15, 2023 06:17
### Testing
Single socket (28 cores):

shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 3.959584 | 4.774427414 | 5.493402 | 0.557232 | 0.59731 | 0.568485
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 0.815511 | 1.413416862 | 1.351261 | 5.710506 | 9.366035 | 10.57506
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 21.28074 | 25.78539848 | 29.55482 | 6.921601 | 3.817368 | 3.869653
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 5.755067 | 4.850935936 | 4.793715 | 6.737041 | 4.509115 | 5.024385
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 10.63426 | 13.64016 | 15.28637 | 2.67656 | 1.67952 | 1.71365
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.63570 | 2.05289 | 2.05532 | 2.55452 | 2.11025 | 2.33923
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.375469 | 0.421462059 | 0.479748 | 0.066364 | 0.065937 | 0.065155
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.112197 | 0.117073059 | 0.112326 | 0.111697 | 0.141022 | 0.145364



Single core:


shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 92.16582 | 111.3258338 | 128.6513 | 6.684325 | 10.9745 | 12.21541
size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 10.14318 | 28.9168644 | 29.80297 | 7.350142 | 9.718766 | 11.25323
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 485.4677 | 586.6346097 | 674.8961 | 44.93304 | 61.85198 | 68.8631
size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 81.45644 | 86.70579195 | 89.43031 | 46.44922 | 53.36975 | 62.60659
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 238.55453 | 290.02304 | 331.89967 | 19.694657 | 29.75552 | 32.78853
size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 30.17079 | 31.60022 | 32.75628 | 22.44543 | 25.69056 | 30.17796
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 7.474389 | 8.705227375 | 9.937217 | 0.236015 | 0.410647 | 0.434229
size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.318954 | 2.456078529 | 2.469444 | 0.262125 | 0.353301 | 0.401361


cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

CaoE added a commit that referenced this pull request May 15, 2023
ghstack-source-id: b2519cc
Pull Request resolved: #101379
@CaoE CaoE added the "ciflow/trunk" (Trigger trunk jobs on your pull request) and "ciflow/periodic" (Trigger jobs ran periodically on master (periodic.yml) on the PR) labels May 15, 2023
@jgong5
Collaborator

jgong5 commented May 15, 2023

Please elaborate in the PR description on how the changes contribute to the performance.

@CaoE CaoE requested a review from jgong5 May 16, 2023 02:43
CaoE added a commit that referenced this pull request May 17, 2023
ghstack-source-id: c027674
Pull Request resolved: #101379
@CaoE CaoE marked this pull request as ready for review May 18, 2023 00:52
@CaoE CaoE marked this pull request as draft May 18, 2023 07:06
@CaoE CaoE closed this May 19, 2023
@facebook-github-bot facebook-github-bot deleted the gh/CaoE/25/head branch June 18, 2023 14:16

Labels

"ciflow/periodic" (Trigger jobs ran periodically on master (periodic.yml) on the PR); "ciflow/trunk" (Trigger trunk jobs on your pull request); "module: cpu" (CPU specific problem, e.g., perf, algorithm); "open source"; "topic: not user facing" (topic category)
