Optimize GroupNorm on CUDA by xiaomengy · Pull Request #28204 · pytorch/pytorch

xiaomengy · 2019-10-17T04:19:51Z

Stack from ghstack:

Optimize GroupNorm on CUDA #28204 Optimize GroupNorm on CUDA

Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s

After this PR

GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: D17923732

Optimize GroupNorm on CUDA Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Optimize GroupNorm on CUDA Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) ghstack-source-id: 92059612 Pull Request resolved: #28204

xiaomengy · 2019-10-17T04:23:48Z

link to #28201

xiaomengy · 2019-10-17T05:12:10Z

cc @ppwwyyxx

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 92091549 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

bonlime · 2020-04-24T15:37:12Z

Hi. Why this PR hasn't been merged yet?

lopuhin · 2020-05-23T20:37:41Z

FWIW this PR still builds on current master (after resolving the conflicts) and provides around 6-8% speedup for resnet50 training with torch.amp on 2080Ti compared to master(using the same batch size), and max batch size is about 40% larger which is a great win 👍 . In terms of speed batchnorm is still much faster.

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

dr-ci · 2020-05-23T23:08:31Z

💊 CI failures summary and remediations

As of commit 2c78ef2 (more details on the Dr. CI page):

1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.

This comment was automatically generated by Dr. CI (expand for details).

Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 94 times.

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 104603817 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 104616173 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105051827 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105117221 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105150481 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105365117 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

ngimel

Looks good, I have a few small comments. Also, to be safe, I'd assert in native_group_norm_backward that tensors you expect to be contiguous are contiguous - you make sure that is the case in derivatives.yaml, but if someone calls group_norm_backward directly the contiguity of inputs is not guaranteed.

ngimel · 2020-06-05T17:13:59Z

+  const int wid = threadIdx.x / C10_WARP_SIZE;
+  val = WarpReduceSum(val);
+  if (lid == 0) {
+    shared[wid] = val;


This is unsafe if you are calling BlockReduceSum in a row with the same shared parameter. You don't do it in your code, but if someone later reuses this it can lead to errors. Can you please synchronize before writing to shared?

Not sure if I correctly understand this. I added one __syncthreads() before here. Is that correct?

Oh, you are right, I'm blind. It's all good.

ngimel · 2020-06-05T23:30:34Z

+      N, C, G, mean_data, rstd_data, gamma_data, beta_data, a_data, b_data);
+  if (HxW < kCUDANumThreads) {
+    B = (N * C * HxW + kCUDANumThreads - 1) / kCUDANumThreads;
+    GroupNormForwardSimpleCUDAKernel<T><<<B, kCUDANumThreads, 0, cuda_stream>>>(


is it better for perf?

Yes, since when the feature map becomes smaller by layers, some layers only has feature map whose size is 7 * 7 or 14 * 14, in that case, some of the threads are not working at all. In these cases, this will help.

ngimel · 2020-06-05T23:44:59Z

+  const auto& gamma = weight.is_contiguous() ? weight : weight.contiguous();
+  const auto& beta = bias.is_contiguous() ? bias : bias.contiguous();
+  return std::get<0>(
+      at::native_group_norm(X, gamma, beta, N, C, HxW, num_groups, eps));


You probably need to move .contiguous() calls on X, gamma and beta inside native_group_norm, because native_group_norm cannot handle non-contiguous inputs, yet, since it is exposed in native_functions.yaml, someone can call it directly, and it does not check contiguity of inputs.

ngimel · 2020-06-05T23:45:52Z

+    int64_t C,
+    int64_t HxW,
+    int64_t group,
+    Tensor* dX,


Why are you passing pointers here instead of non-const references? I think in the rest on the codebase we use references.

Just to make it clear that pointers are for output arguments, and references are for inputs. This is something we used to do in Caffe2. We can consider to unify the styles in later PRs.

Optimize GroupNorm on CUDA Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine. Before this PR ``` GroupNorm forward: 11.333400868010358ms -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 23.06% 56.891ms 100.00% 246.741ms 246.741us 31.20% 11.369s 11.369ms 1000 batch_norm 8.59% 21.194ms 35.82% 88.382ms 88.382us 18.89% 6.884s 6.884ms 1000 _batch_norm_impl_index 6.50% 16.045ms 27.23% 67.189ms 67.189us 18.86% 6.872s 6.872ms 1000 native_batch_norm 20.73% 51.143ms 20.73% 51.143ms 51.143us 18.83% 6.861s 6.861ms 1000 addcmul 15.17% 37.434ms 15.17% 37.434ms 37.434us 12.13% 4.419s 4.419ms 1000 view 21.76% 53.695ms 21.76% 53.695ms 13.424us 0.06% 21.595ms 5.399us 4000 contiguous 4.19% 10.339ms 4.19% 10.339ms 10.339us 0.03% 9.650ms 9.650us 1000 -------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 246.741ms CUDA time total: 36.436s GroupNorm backward: 42.1425356430118ms ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- AddcmulBackward 10.08% 69.875ms 36.22% 251.046ms 251.046us 24.64% 19.492s 19.492ms 1000 mul 26.14% 181.171ms 26.14% 181.171ms 45.293us 24.60% 19.460s 4.865ms 4000 NativeBatchNormBackward 3.44% 23.832ms 9.82% 68.072ms 68.072us 14.23% 11.261s 11.261ms 1000 native_batch_norm_backward 6.38% 44.240ms 6.38% 44.240ms 44.240us 14.23% 11.255s 11.255ms 1000 torch::autograd::AccumulateGrad 7.14% 49.495ms 16.52% 114.485ms 38.162us 8.02% 6.343s 2.114ms 3000 add_ 9.38% 64.990ms 9.38% 64.990ms 21.663us 8.00% 6.326s 2.109ms 3000 sum 11.71% 81.163ms 11.71% 81.163ms 40.581us 6.15% 4.863s 2.431ms 2000 ViewBackward 9.72% 67.398ms 23.92% 165.801ms 41.450us 0.07% 57.930ms 14.482us 4000 reshape 8.35% 57.903ms 14.20% 98.403ms 24.601us 0.04% 35.134ms 8.783us 4000 as_strided 5.84% 40.500ms 5.84% 40.500ms 10.125us 0.01% 11.111ms 2.778us 4000 torch::autograd::GraphRoot 1.81% 12.526ms 1.81% 12.526ms 12.526us 0.01% 8.698ms 8.698us 1000 ----------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 693.093ms CUDA time total: 79.113s ``` After this PR ``` GroupNorm forward: 6.693606367000029ms --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- group_norm 21.65% 17.851ms 100.00% 82.451ms 82.451us 50.04% 6.627s 6.627ms 1000 native_group_norm 78.35% 64.601ms 78.35% 64.601ms 64.601us 49.96% 6.616s 6.616ms 1000 --------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 82.451ms CUDA time total: 13.244s GroupNorm backward: 17.170999962007045ms ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- NativeGroupNormBackward 12.91% 34.087ms 50.89% 134.411ms 134.411us 31.42% 10.733s 10.733ms 1000 non_differentiable_native_group_norm_backward 37.98% 100.324ms 37.98% 100.324ms 100.324us 31.36% 10.713s 10.713ms 1000 torch::autograd::AccumulateGrad 19.36% 51.140ms 44.70% 118.057ms 39.352us 18.62% 6.361s 2.120ms 3000 add_ 25.34% 66.917ms 25.34% 66.917ms 22.306us 18.57% 6.345s 2.115ms 3000 torch::autograd::GraphRoot 4.41% 11.656ms 4.41% 11.656ms 11.656us 0.02% 7.626ms 7.626us 1000 ------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- Self CPU time total: 264.124ms CUDA time total: 34.159s ``` Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine. Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105388365 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

facebook-github-bot · 2020-06-07T22:11:16Z

This pull request has been merged in 614dd03.

ezyang · 2020-06-08T15:10:44Z

XLA failure on PR was real.

ailzhang · 2020-06-08T15:35:20Z

@ezyang We have just merged the fix on xla side. Should be back green in1-2 hrs. :D

ezyang · 2020-06-08T15:45:29Z

OK, I will ninja unland the revert

Summary: Pull Request resolved: pytorch#28204 Optimize GroupNorm on CUDA ghstack-source-id: 105388365 Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm" Reviewed By: houseroad Differential Revision: D17923732 fbshipit-source-id: 9afaf01288bd9d273eed89909bff77243df89e9f

Optimize GroupNorm on CUDA

abb6440

Optimize GroupNorm on CUDA Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) [ghstack-poisoned]

xiaomengy mentioned this pull request Oct 17, 2019

Optimize GroupNorm on CPU #28203

Closed

xiaomengy added a commit that referenced this pull request Oct 17, 2019

Optimize GroupNorm on CUDA

5ff7c78

Optimize GroupNorm on CUDA Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/) ghstack-source-id: 92059612 Pull Request resolved: #28204

xiaomengy requested review from dzhulgakov, houseroad and ngimel October 17, 2019 04:30

xiaomengy added a commit that referenced this pull request Oct 17, 2019

Optimize GroupNorm on CUDA

ae1f57d

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 92091549 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

xiaomengy added a commit that referenced this pull request May 24, 2020

Optimize GroupNorm on CUDA

60ada01

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 104603817 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

xiaomengy added a commit that referenced this pull request May 25, 2020

Optimize GroupNorm on CUDA

939559f

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 104616173 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

xiaomengy mentioned this pull request May 25, 2020

[PT_BREAK] Improve CUDA performance of GroupNorm pytorch/xla#2126

Closed

ailzhang mentioned this pull request May 30, 2020

Lower group_norm and its backward. pytorch/xla#2150

Merged

xiaomengy added a commit that referenced this pull request Jun 1, 2020

Optimize GroupNorm on CUDA

e16a910

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105051827 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

xiaomengy added a commit that referenced this pull request Jun 2, 2020

Optimize GroupNorm on CUDA

d00d76f

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105117221 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

xiaomengy added a commit that referenced this pull request Jun 3, 2020

Optimize GroupNorm on CUDA

f3b3eff

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105150481 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

xiaomengy added a commit that referenced this pull request Jun 5, 2020

Optimize GroupNorm on CUDA

340c075

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105365117 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

ngimel approved these changes Jun 5, 2020

View reviewed changes

xiaomengy added a commit that referenced this pull request Jun 6, 2020

Optimize GroupNorm on CUDA

4d7a075

Pull Request resolved: #28204 Optimize GroupNorm on CUDA ghstack-source-id: 105388365 Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

facebook-github-bot closed this in 614dd03 Jun 7, 2020

facebook-github-bot added the merged label Jun 7, 2020

facebook-github-bot deleted the gh/BIT-silence/2/head branch June 11, 2020 14:18

xiaomengy mentioned this pull request Oct 11, 2020

GroupNorm is slow stragely when calling backward() in pytorch 1.6.0 #46086

Closed

mruberry added the Merged label Oct 28, 2020

Conversation

xiaomengy commented Oct 17, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xiaomengy commented Oct 17, 2019

Uh oh!

xiaomengy commented Oct 17, 2019

Uh oh!

bonlime commented Apr 24, 2020

Uh oh!

lopuhin commented May 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dr-ci Bot commented May 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💊 CI failures summary and remediations

XLA failure

Uh oh!

ngimel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

facebook-github-bot commented Jun 7, 2020

Uh oh!

ezyang commented Jun 8, 2020

Uh oh!

ailzhang commented Jun 8, 2020

Uh oh!

ezyang commented Jun 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

xiaomengy commented Oct 17, 2019 •

edited

Loading

lopuhin commented May 23, 2020 •

edited

Loading

dr-ci Bot commented May 23, 2020 •

edited

Loading