Optimize GroupNorm on CUDA #28204
Conversation
Optimize GroupNorm on CUDA
Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
ghstack-source-id: 92059612
Pull Request resolved: #28204
Link to #28201.
cc @ppwwyyxx
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR:

```
GroupNorm forward: 11.333400868010358ms

Name                    Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
group_norm              23.06%            56.891ms        100.00%      246.741ms  246.741us     31.20%        11.369s     11.369ms       1000
batch_norm              8.59%             21.194ms        35.82%       88.382ms   88.382us      18.89%        6.884s      6.884ms        1000
_batch_norm_impl_index  6.50%             16.045ms        27.23%       67.189ms   67.189us      18.86%        6.872s      6.872ms        1000
native_batch_norm       20.73%            51.143ms        20.73%       51.143ms   51.143us      18.83%        6.861s      6.861ms        1000
addcmul                 15.17%            37.434ms        15.17%       37.434ms   37.434us      12.13%        4.419s      4.419ms        1000
view                    21.76%            53.695ms        21.76%       53.695ms   13.424us      0.06%         21.595ms    5.399us        4000
contiguous              4.19%             10.339ms        4.19%        10.339ms   10.339us      0.03%         9.650ms     9.650us        1000

Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms

Name                             Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
AddcmulBackward                  10.08%            69.875ms        36.22%       251.046ms  251.046us     24.64%        19.492s     19.492ms       1000
mul                              26.14%            181.171ms       26.14%       181.171ms  45.293us      24.60%        19.460s     4.865ms        4000
NativeBatchNormBackward          3.44%             23.832ms        9.82%        68.072ms   68.072us      14.23%        11.261s     11.261ms       1000
native_batch_norm_backward       6.38%             44.240ms        6.38%        44.240ms   44.240us      14.23%        11.255s     11.255ms       1000
torch::autograd::AccumulateGrad  7.14%             49.495ms        16.52%       114.485ms  38.162us      8.02%         6.343s      2.114ms        3000
add_                             9.38%             64.990ms        9.38%        64.990ms   21.663us      8.00%         6.326s      2.109ms        3000
sum                              11.71%            81.163ms        11.71%       81.163ms   40.581us      6.15%         4.863s      2.431ms        2000
ViewBackward                     9.72%             67.398ms        23.92%       165.801ms  41.450us      0.07%         57.930ms    14.482us       4000
reshape                          8.35%             57.903ms        14.20%       98.403ms   24.601us      0.04%         35.134ms    8.783us        4000
as_strided                       5.84%             40.500ms        5.84%        40.500ms   10.125us      0.01%         11.111ms    2.778us        4000
torch::autograd::GraphRoot       1.81%             12.526ms        1.81%        12.526ms   12.526us      0.01%         8.698ms     8.698us        1000

Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR:

```
GroupNorm forward: 6.693606367000029ms

Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
group_norm         21.65%            17.851ms        100.00%      82.451ms   82.451us      50.04%        6.627s      6.627ms        1000
native_group_norm  78.35%            64.601ms        78.35%       64.601ms   64.601us      49.96%        6.616s      6.616ms        1000

Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms

Name                                           Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
NativeGroupNormBackward                        12.91%            34.087ms        50.89%       134.411ms  134.411us     31.42%        10.733s     10.733ms       1000
non_differentiable_native_group_norm_backward  37.98%            100.324ms       37.98%       100.324ms  100.324us     31.36%        10.713s     10.713ms       1000
torch::autograd::AccumulateGrad                19.36%            51.140ms        44.70%       118.057ms  39.352us      18.62%        6.361s      2.120ms        3000
add_                                           25.34%            66.917ms        25.34%       66.917ms   22.306us      18.57%        6.345s      2.115ms        3000
torch::autograd::GraphRoot                     4.41%             11.656ms        4.41%        11.656ms   11.656us      0.02%         7.626ms     7.626us        1000

Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR makes Mask R-CNN with GroupNorm about 10% faster for both training and inference on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
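For context, the operation this PR fuses into a single `native_group_norm` kernel is per-group mean/variance normalization followed by a per-channel affine transform. Below is a minimal pure-Python sketch of that math for a single sample in channels-first layout; the function name, argument names, and flat-list layout are illustrative only, not the actual ATen implementation:

```python
import math

def group_norm_ref(x, num_groups, gamma, beta, eps=1e-5):
    """Reference GroupNorm for one sample (hypothetical helper, not ATen).

    x:     flat list of C * HW values, channel-major (channel 0's HW values
           first, then channel 1's, and so on).
    gamma: per-channel scale (length C); beta: per-channel shift (length C).

    Channels are split into `num_groups` contiguous groups; each group is
    normalized by its own mean and variance, then each element gets the
    affine transform of its channel.
    """
    C = len(gamma)
    HW = len(x) // C
    group_size = C // num_groups        # channels per group
    out = [0.0] * len(x)
    for g in range(num_groups):
        # Flat range covering channels [g*group_size, (g+1)*group_size).
        start = g * group_size * HW
        end = (g + 1) * group_size * HW
        vals = x[start:end]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(start, end):
            c = i // HW                  # channel index of element i
            out[i] = (x[i] - mean) * inv_std * gamma[c] + beta[c]
    return out
```

Doing all of this in one kernel (plus a dedicated backward) avoids the `view`/`contiguous`/`addcmul` round trips through global memory that show up in the "Before" profile above.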
Hi. Why hasn't this PR been merged yet?
FWIW, this PR still builds on current master (after resolving the conflicts) and gives roughly a 6-8% speedup for ResNet-50 training with torch.amp on a 2080 Ti compared to master (same batch size), and the maximum batch size is about 40% larger, which is a great win 👍. In terms of raw speed, BatchNorm is still much faster.
💊 CI failures summary (Dr. CI, as of commit 2c78ef2): the XLA job `pytorch_xla_linux_bionic_py3_6_clang9_test` is failing.
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR:

```
GroupNorm forward: 11.333ms

Name                    Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
----------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
group_norm              23.06%            56.891ms        100.00%      246.741ms  246.741us     31.20%        11.369s     11.369ms       1000
batch_norm              8.59%             21.194ms        35.82%       88.382ms   88.382us      18.89%        6.884s      6.884ms        1000
_batch_norm_impl_index  6.50%             16.045ms        27.23%       67.189ms   67.189us      18.86%        6.872s      6.872ms        1000
native_batch_norm       20.73%            51.143ms        20.73%       51.143ms   51.143us      18.83%        6.861s      6.861ms        1000
addcmul                 15.17%            37.434ms        15.17%       37.434ms   37.434us      12.13%        4.419s      4.419ms        1000
view                    21.76%            53.695ms        21.76%       53.695ms   13.424us      0.06%         21.595ms    5.399us        4000
contiguous              4.19%             10.339ms        4.19%        10.339ms   10.339us      0.03%         9.650ms     9.650us        1000
----------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.143ms

Name                             Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
-------------------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
AddcmulBackward                  10.08%            69.875ms        36.22%       251.046ms  251.046us     24.64%        19.492s     19.492ms       1000
mul                              26.14%            181.171ms       26.14%       181.171ms  45.293us      24.60%        19.460s     4.865ms        4000
NativeBatchNormBackward          3.44%             23.832ms        9.82%        68.072ms   68.072us      14.23%        11.261s     11.261ms       1000
native_batch_norm_backward       6.38%             44.240ms        6.38%        44.240ms   44.240us      14.23%        11.255s     11.255ms       1000
torch::autograd::AccumulateGrad  7.14%             49.495ms        16.52%       114.485ms  38.162us      8.02%         6.343s      2.114ms        3000
add_                             9.38%             64.990ms        9.38%        64.990ms   21.663us      8.00%         6.326s      2.109ms        3000
sum                              11.71%            81.163ms        11.71%       81.163ms   40.581us      6.15%         4.863s      2.431ms        2000
ViewBackward                     9.72%             67.398ms        23.92%       165.801ms  41.450us      0.07%         57.930ms    14.482us       4000
reshape                          8.35%             57.903ms        14.20%       98.403ms   24.601us      0.04%         35.134ms    8.783us        4000
as_strided                       5.84%             40.500ms        5.84%        40.500ms   10.125us      0.01%         11.111ms    2.778us        4000
torch::autograd::GraphRoot       1.81%             12.526ms        1.81%        12.526ms   12.526us      0.01%         8.698ms     8.698us        1000
-------------------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR:

```
GroupNorm forward: 6.694ms

Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
-----------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
group_norm         21.65%            17.851ms        100.00%      82.451ms   82.451us      50.04%        6.627s      6.627ms        1000
native_group_norm  78.35%            64.601ms        78.35%       64.601ms   64.601us      49.96%        6.616s      6.616ms        1000
-----------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.171ms

Name                                           Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
---------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
NativeGroupNormBackward                        12.91%            34.087ms        50.89%       134.411ms  134.411us     31.42%        10.733s     10.733ms       1000
non_differentiable_native_group_norm_backward  37.98%            100.324ms       37.98%       100.324ms  100.324us     31.36%        10.713s     10.713ms       1000
torch::autograd::AccumulateGrad                19.36%            51.140ms        44.70%       118.057ms  39.352us      18.62%        6.361s      2.120ms        3000
add_                                           25.34%            66.917ms        25.34%       66.917ms   22.306us      18.57%        6.345s      2.115ms        3000
torch::autograd::GraphRoot                     4.41%             11.656ms        4.41%        11.656ms   11.656us      0.02%         7.626ms     7.626us        1000
---------------------------------------------  ----------------  --------------  -----------  ---------  ------------  ------------  ----------  -------------  ---------------
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR makes Mask R-CNN with GroupNorm about 10% faster for both training and inference on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
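A hedged sketch (not the PR's actual benchmark script) of how per-op tables like the ones above can be collected with `torch.autograd.profiler`. The PR profiled shape [256, 512, 56, 56] with num_groups = 32 on a V100; the tensor below is scaled down so the sketch also runs on a CPU-only machine.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
gn = nn.GroupNorm(num_groups=32, num_channels=512).to(device)
x = torch.randn(4, 512, 14, 14, device=device, requires_grad=True)

with torch.autograd.profiler.profile(use_cuda=(device == "cuda")) as prof:
    y = gn(x)           # forward: dispatches to native_group_norm after this PR
    y.sum().backward()  # backward: NativeGroupNormBackward

# Aggregated per-op table, comparable to the profiler dumps quoted above.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The PR timed forward and backward separately over 1000 iterations; here both are captured in one profiling context purely for brevity.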
Pull Request resolved: #28204

Optimize GroupNorm on CUDA

ghstack-source-id: 104603817
Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Pull Request resolved: #28204

Optimize GroupNorm on CUDA

ghstack-source-id: 104616173
Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms

Name                     Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
group_norm               23.06%            56.891ms        100.00%      246.741ms  246.741us    31.20%        11.369s     11.369ms      1000
batch_norm               8.59%             21.194ms        35.82%       88.382ms   88.382us     18.89%        6.884s      6.884ms       1000
_batch_norm_impl_index   6.50%             16.045ms        27.23%       67.189ms   67.189us     18.86%        6.872s      6.872ms       1000
native_batch_norm        20.73%            51.143ms        20.73%       51.143ms   51.143us     18.83%        6.861s      6.861ms       1000
addcmul                  15.17%            37.434ms        15.17%       37.434ms   37.434us     12.13%        4.419s      4.419ms       1000
view                     21.76%            53.695ms        21.76%       53.695ms   13.424us     0.06%         21.595ms    5.399us       4000
contiguous               4.19%             10.339ms        4.19%        10.339ms   10.339us     0.03%         9.650ms     9.650us       1000

Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms

Name                             Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
AddcmulBackward                  10.08%            69.875ms        36.22%       251.046ms  251.046us    24.64%        19.492s     19.492ms      1000
mul                              26.14%            181.171ms       26.14%       181.171ms  45.293us     24.60%        19.460s     4.865ms       4000
NativeBatchNormBackward          3.44%             23.832ms        9.82%        68.072ms   68.072us     14.23%        11.261s     11.261ms      1000
native_batch_norm_backward       6.38%             44.240ms        6.38%        44.240ms   44.240us     14.23%        11.255s     11.255ms      1000
torch::autograd::AccumulateGrad  7.14%             49.495ms        16.52%       114.485ms  38.162us     8.02%         6.343s      2.114ms       3000
add_                             9.38%             64.990ms        9.38%        64.990ms   21.663us     8.00%         6.326s      2.109ms       3000
sum                              11.71%            81.163ms        11.71%       81.163ms   40.581us     6.15%         4.863s      2.431ms       2000
ViewBackward                     9.72%             67.398ms        23.92%       165.801ms  41.450us     0.07%         57.930ms    14.482us      4000
reshape                          8.35%             57.903ms        14.20%       98.403ms   24.601us     0.04%         35.134ms    8.783us       4000
as_strided                       5.84%             40.500ms        5.84%        40.500ms   10.125us     0.01%         11.111ms    2.778us       4000
torch::autograd::GraphRoot       1.81%             12.526ms        1.81%        12.526ms   12.526us     0.01%         8.698ms     8.698us       1000

Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms

Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
group_norm         21.65%            17.851ms        100.00%      82.451ms   82.451us     50.04%        6.627s      6.627ms       1000
native_group_norm  78.35%            64.601ms        78.35%       64.601ms   64.601us     49.96%        6.616s      6.616ms       1000

Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms

Name                                           Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
NativeGroupNormBackward                        12.91%            34.087ms        50.89%       134.411ms  134.411us    31.42%        10.733s     10.733ms      1000
non_differentiable_native_group_norm_backward  37.98%            100.324ms       37.98%       100.324ms  100.324us    31.36%        10.713s     10.713ms      1000
torch::autograd::AccumulateGrad                19.36%            51.140ms        44.70%       118.057ms  39.352us     18.62%        6.361s      2.120ms       3000
add_                                           25.34%            66.917ms        25.34%       66.917ms   22.306us     18.57%        6.345s      2.115ms       3000
torch::autograd::GraphRoot                     4.41%             11.656ms        4.41%        11.656ms   11.656us     0.02%         7.626ms     7.626us       1000

Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR makes Mask R-CNN with GroupNorm about 10% faster for both training and inference on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
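For readers new to the op: GroupNorm splits the C channels into num_groups groups and normalizes each (sample, group) slice by its own mean and variance before applying a per-channel affine transform. A minimal pure-Python sketch of that math (a reference for what these kernels compute, not the CUDA implementation in this PR):

```python
import math

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """x: nested list [N][C][HxW]; gamma, beta: length-C lists.

    Each (sample, group) slice is normalized by its own mean and
    variance, then scaled/shifted per channel.
    """
    N, C, HxW = len(x), len(x[0]), len(x[0][0])
    D = C // num_groups  # channels per group
    out = [[[0.0] * HxW for _ in range(C)] for _ in range(N)]
    for n in range(N):
        for g in range(num_groups):
            vals = [v for c in range(g * D, (g + 1) * D) for v in x[n][c]]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            rstd = 1.0 / math.sqrt(var + eps)
            for c in range(g * D, (g + 1) * D):
                for i in range(HxW):
                    out[n][c][i] = (x[n][c][i] - mean) * rstd * gamma[c] + beta[c]
    return out
```

The optimized kernels fuse the mean/rstd computation and the normalization, instead of dispatching to batch_norm plus addcmul as before.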
Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105051827 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
ngimel
left a comment
Looks good; I have a few small comments. Also, to be safe, I'd assert in native_group_norm_backward that the tensors you expect to be contiguous actually are contiguous. You ensure that is the case in derivatives.yaml, but if someone calls group_norm_backward directly, the contiguity of the inputs is not guaranteed.
```
const int wid = threadIdx.x / C10_WARP_SIZE;
val = WarpReduceSum(val);
if (lid == 0) {
  shared[wid] = val;
```
This is unsafe if BlockReduceSum is called twice in a row with the same shared parameter. You don't do that in your code, but if someone later reuses this helper it can lead to errors. Can you please synchronize before writing to shared?
I'm not sure I understand this correctly. I added one __syncthreads() before this line. Is that correct?
Oh, you are right, I'm blind. It's all good.
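For readers following this thread, the reduction pattern under discussion can be simulated in plain Python: each warp of 32 threads reduces its own values, lane 0 of each warp writes the partial sum to shared memory, a barrier separates those writes from the reads, and the first warp then reduces the partials. The helper names below mirror the kernel's, but this is only an illustrative sketch, not the CUDA code:

```python
WARP_SIZE = 32  # matches C10_WARP_SIZE on NVIDIA GPUs

def warp_reduce_sum(lane_vals):
    # Stands in for the shuffle-based WarpReduceSum: after it runs,
    # the warp's lanes hold the sum of all the warp's values.
    return sum(lane_vals)

def block_reduce_sum(block_vals):
    """Simulate BlockReduceSum: warp sums -> shared memory -> final warp reduce.

    In the real kernel a __syncthreads() separates the shared-memory
    writes from the reads by the first warp; reusing the same shared
    buffer for a second back-to-back reduction would need another
    barrier first, which is the hazard raised in the review.
    """
    shared = []
    for start in range(0, len(block_vals), WARP_SIZE):
        warp = block_vals[start:start + WARP_SIZE]
        val = warp_reduce_sum(warp)
        shared.append(val)  # the "lid == 0" lane writes the warp's partial sum
    # __syncthreads() would go here; then the first warp reduces the partials
    return warp_reduce_sum(shared)
```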
```
        N, C, G, mean_data, rstd_data, gamma_data, beta_data, a_data, b_data);
if (HxW < kCUDANumThreads) {
  B = (N * C * HxW + kCUDANumThreads - 1) / kCUDANumThreads;
  GroupNormForwardSimpleCUDAKernel<T><<<B, kCUDANumThreads, 0, cuda_stream>>>(
```
Yes. Since the feature maps shrink in the deeper layers, some layers have feature maps as small as 7 * 7 or 14 * 14; in those cases some of the threads would not be doing any work at all, so this path helps.
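The launch configuration in the excerpt above is a plain ceiling division, so there are enough blocks of kCUDANumThreads threads to cover every element even when the total is not a multiple of the block size. A quick sketch of the arithmetic (the block-size constant here is an assumption for illustration, not necessarily the PR's value):

```python
kCUDANumThreads = 256  # assumed block size for this sketch

def num_blocks(total, threads_per_block=kCUDANumThreads):
    # Ceiling division: the smallest B with B * threads_per_block >= total.
    return (total + threads_per_block - 1) // threads_per_block

# A 7x7 feature map with the N and C from the profiling setup above:
N, C, HxW = 256, 512, 7 * 7
B = num_blocks(N * C * HxW)
```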
```
const auto& gamma = weight.is_contiguous() ? weight : weight.contiguous();
const auto& beta = bias.is_contiguous() ? bias : bias.contiguous();
return std::get<0>(
    at::native_group_norm(X, gamma, beta, N, C, HxW, num_groups, eps));
```
You probably need to move the .contiguous() calls on X, gamma, and beta inside native_group_norm: native_group_norm cannot handle non-contiguous inputs, yet since it is exposed in native_functions.yaml, someone can call it directly, and it does not check the contiguity of its inputs.
```
int64_t C,
int64_t HxW,
int64_t group,
Tensor* dX,
```
Why are you passing pointers here instead of non-const references? I think in the rest of the codebase we use references.
Just to make it clear that pointers are for output arguments and references are for inputs. This is something we used to do in Caffe2. We can consider unifying the styles in a later PR.
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
Name                    Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
group_norm              23.06%            56.891ms        100.00%      246.741ms  246.741us     31.20%        11.369s     11.369ms       1000
batch_norm              8.59%             21.194ms        35.82%       88.382ms   88.382us      18.89%        6.884s      6.884ms        1000
_batch_norm_impl_index  6.50%             16.045ms        27.23%       67.189ms   67.189us      18.86%        6.872s      6.872ms        1000
native_batch_norm       20.73%            51.143ms        20.73%       51.143ms   51.143us      18.83%        6.861s      6.861ms        1000
addcmul                 15.17%            37.434ms        15.17%       37.434ms   37.434us      12.13%        4.419s      4.419ms        1000
view                    21.76%            53.695ms        21.76%       53.695ms   13.424us      0.06%         21.595ms    5.399us        4000
contiguous              4.19%             10.339ms        4.19%        10.339ms   10.339us      0.03%         9.650ms     9.650us        1000
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
Name                             Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
AddcmulBackward                  10.08%            69.875ms        36.22%       251.046ms   251.046us     24.64%        19.492s     19.492ms       1000
mul                              26.14%            181.171ms       26.14%       181.171ms   45.293us      24.60%        19.460s     4.865ms        4000
NativeBatchNormBackward          3.44%             23.832ms        9.82%        68.072ms    68.072us      14.23%        11.261s     11.261ms       1000
native_batch_norm_backward       6.38%             44.240ms        6.38%        44.240ms    44.240us      14.23%        11.255s     11.255ms       1000
torch::autograd::AccumulateGrad  7.14%             49.495ms        16.52%       114.485ms   38.162us      8.02%         6.343s      2.114ms        3000
add_                             9.38%             64.990ms        9.38%        64.990ms    21.663us      8.00%         6.326s      2.109ms        3000
sum                              11.71%            81.163ms        11.71%       81.163ms    40.581us      6.15%         4.863s      2.431ms        2000
ViewBackward                     9.72%             67.398ms        23.92%       165.801ms   41.450us      0.07%         57.930ms    14.482us       4000
reshape                          8.35%             57.903ms        14.20%       98.403ms    24.601us      0.04%         35.134ms    8.783us        4000
as_strided                       5.84%             40.500ms        5.84%        40.500ms    10.125us      0.01%         11.111ms    2.778us        4000
torch::autograd::GraphRoot       1.81%             12.526ms        1.81%        12.526ms    12.526us      0.01%         8.698ms     8.698us        1000
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
Name               Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
group_norm         21.65%            17.851ms        100.00%      82.451ms   82.451us      50.04%        6.627s      6.627ms        1000
native_group_norm  78.35%            64.601ms        78.35%       64.601ms   64.601us      49.96%        6.616s      6.616ms        1000
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
Name                                           Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  CUDA total %  CUDA total  CUDA time avg  Number of Calls
NativeGroupNormBackward                        12.91%            34.087ms        50.89%       134.411ms   134.411us     31.42%        10.733s     10.733ms       1000
non_differentiable_native_group_norm_backward  37.98%            100.324ms       37.98%       100.324ms   100.324us     31.36%        10.713s     10.713ms       1000
torch::autograd::AccumulateGrad                19.36%            51.140ms        44.70%       118.057ms   39.352us      18.62%        6.361s      2.120ms        3000
add_                                           25.34%            66.917ms        25.34%       66.917ms    22.306us      18.57%        6.345s      2.115ms        3000
torch::autograd::GraphRoot                     4.41%             11.656ms        4.41%        11.656ms    11.656us      0.02%         7.626ms     7.626us        1000
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2 shows that this PR makes Mask R-CNN with GroupNorm about 10% faster for both training and inference on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
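The per-call millisecond figures above are averages over 1000 iterations. A minimal, framework-agnostic sketch of such a harness is shown below; the placeholder workload and the function name `time_op_ms` are inventions for illustration, and a real CUDA measurement would additionally need `torch.cuda.synchronize()` around the timestamps:

```python
import time

def time_op_ms(fn, iters=1000, warmup=10):
    # Average wall-clock milliseconds per call, matching the shape of the
    # "GroupNorm forward: 11.33ms" summary lines above.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters

# Placeholder workload; in the PR this would be F.group_norm on a
# [256, 512, 56, 56] CUDA tensor with num_groups=32.
ms = time_op_ms(lambda: sum(range(1000)))
print(ms > 0.0)  # True
```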
Pull Request resolved: #28204

Optimize GroupNorm on CUDA

ghstack-source-id: 105388365

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
This pull request has been merged in 614dd03.

The XLA failure on the PR was real.

@ezyang We have just merged the fix on the XLA side. It should be back to green in 1-2 hrs. :D

OK, I will ninja unland the revert.
Summary: Pull Request resolved: pytorch#28204

Optimize GroupNorm on CUDA

ghstack-source-id: 105388365

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm"

Reviewed By: houseroad

Differential Revision: D17923732

fbshipit-source-id: 9afaf01288bd9d273eed89909bff77243df89e9f
Stack from ghstack:
Optimize GroupNorm on CUDA
Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.
Before this PR
After this PR
Testing with Detectron2 shows that this PR makes Mask R-CNN with GroupNorm about 10% faster for both training and inference on a V100 machine.
Differential Revision: D17923732