Add Half support for maxpool on CPU #98819
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98819
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit dd17e4e with merge base 50fa588.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
mingfeima left a comment:
LGTM, just remember to add benchmark numbers.
Thank you, I will add benchmark numbers.
### Motivation
Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.

Single socket (28 cores):

| shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward) |
| -- | -- | -- | -- | -- | -- | -- |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786 |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398 |

Single core:

| shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward) |
| -- | -- | -- | -- | -- | -- | -- |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844 |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061 |

Pull Request resolved: #102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
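To illustrate the motivation quoted above, here is a minimal Python sketch (not the code from #102140) of why widening BFloat16 to Float is cheaper than widening Half. A bfloat16 value is the upper 16 bits of an IEEE float32, so widening is a single shift, while an IEEE binary16 value has a different exponent width and bias and needs separate handling for subnormals, infinities and NaN. The function names are hypothetical and exist only for this illustration.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a bfloat16 bit pattern: it is the top half of a float32, so shift by 16."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp16_bits_to_float(bits: int) -> float:
    """Decode an IEEE binary16 bit pattern by hand: 1 sign, 5 exponent (bias 15), 10 fraction bits."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exp = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp == 0:
        # Zero or subnormal: no implicit leading one, fixed scale of 2**-24.
        return sign * frac * 2.0 ** -24
    if exp == 0x1F:
        # All-ones exponent encodes infinity (fraction 0) or NaN (fraction != 0).
        return sign * float("inf") if frac == 0 else float("nan")
    # Normal number: implicit leading one and re-biased exponent.
    return sign * (1 + frac / 1024) * 2.0 ** (exp - 15)
```

The branches in `fp16_bits_to_float` are the per-element work that a scalar software conversion has to do; a hardware conversion instruction avoids them.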
### Testing

Single socket (28 cores):

| shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms |
| -- | -- | -- | -- | -- | -- | -- |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 4.12895 | 6.9669 | 5.30297 | 0.55775 | 1.98917 | 0.72233 |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 0.85093 | 1.88813 | 1.38063 | 5.5742 | 36.5086 | 10.58552 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 22.37212 | 37.90383 | 30.94482 | 6.85868 | 10.6116 | 3.9993 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 5.41658 | 4.71098 | 4.66578 | 6.69875 | 14.7171 | 5.1167 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 10.69831 | 18.0468 | 13.71657 | 2.61192 | 4.96172 | 1.68635 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 2.52637 | 2.0096 | 2.0055 | 2.60314 | 7.2093 | 2.49843 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 0.47605 | 0.88398 | 0.65326 | 0.06525 | 0.115489 | 0.0674 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 0.10902 | 0.25293 | 0.157475 | 0.11386 | 0.53319 | 0.17836 |

Single core:

| shape | fp32 forward / ms | fp16 forward / ms | bf16 forward / ms | fp32 backward / ms | fp16 backward / ms | bf16 backward / ms |
| -- | -- | -- | -- | -- | -- | -- |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: contig | 90.9809 | 163.473 | 126.1276 | 6.57721 | 41.40833 | 11.82505 |
| size: (1, 56, 264, 264), kernel: 3, stride: 1, mem_format: CL | 9.88405 | 38.39137 | 29.62069 | 7.10636 | 36.97535 | 11.0525 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: contig | 476.782 | 855.4769 | 648.2248 | 46.6488 | 219.2586 | 67.10599 |
| size: (32, 16, 200, 200), kernel: 3, stride: 1, mem_format: CL | 80.29271 | 91.33854 | 87.80345 | 48.81692 | 203.9974 | 63.39004 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: contig | 235.2113 | 419.0799 | 315.4284 | 20.6049 | 107.1524 | 32.39169 |
| size: (32, 32, 100, 100), kernel: 3, stride: 1, mem_format: CL | 29.47653 | 33.54905 | 32.82823 | 22.59674 | 98.5586 | 30.05763 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: contig | 7.90684 | 13.9208 | 10.03272 | 0.23725 | 1.35269 | 0.41728 |
| size: (4, 19, 10, 16, 16), kernel: 3, stride: 1, mem_format: CL3d | 2.33638 | 3.36894 | 2.64635 | 0.26535 | 1.244 | 0.38895 |

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
@mikaylagawarecki For the memory format tests of maxpool defined in https://github.com/pytorch/pytorch/blob/main/test/nn/test_pooling.py#L878, it's hard to replace them by
@ngimel Could you please review this PR? Thanks.
cpu_grad_inputs = torch.autograd.grad(diff_cpu_out, diff_cpu_arg, grad_outputs=cpu_grad_outputs, allow_unused=True)
mps_grad_inputs = torch.autograd.grad(diff_mps_out, diff_mps_arg, grad_outputs=mps_grad_outputs, allow_unused=True)

if op.name in ["nn.functional.gelu", "nn.functional.glu"] and dtype == torch.float16:
Out of curiosity: why this change? Do gelu/glu use max_pool?
I just moved gelu and glu to FP16_LOW_PRECISION_LIST so that such cases are handled uniformly.
Oh, I see that you had added this path in a previous PR. Got it, thanks!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Stack from ghstack (oldest at bottom):
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10