Skip to content

Optimize GroupNorm on CUDA#28204

Closed
xiaomengy wants to merge 23 commits intogh/BIT-silence/2/basefrom
gh/BIT-silence/2/head
Closed

Optimize GroupNorm on CUDA#28204
xiaomengy wants to merge 23 commits intogh/BIT-silence/2/basefrom
gh/BIT-silence/2/head

Conversation

@xiaomengy
Copy link
Copy Markdown
Contributor

@xiaomengy xiaomengy commented Oct 17, 2019

Stack from ghstack:

Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s

After this PR

GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: D17923732

Optimize GroupNorm on CUDA

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Oct 17, 2019
Optimize GroupNorm on CUDA

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

ghstack-source-id: 92059612
Pull Request resolved: #28204
@xiaomengy
Copy link
Copy Markdown
Contributor Author

link to #28201

@xiaomengy
Copy link
Copy Markdown
Contributor Author

cc @ppwwyyxx

Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Oct 17, 2019
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 92091549

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
@bonlime
Copy link
Copy Markdown

bonlime commented Apr 24, 2020

Hi. Why this PR hasn't been merged yet?

@lopuhin
Copy link
Copy Markdown
Contributor

lopuhin commented May 23, 2020

FWIW this PR still builds on current master (after resolving the conflicts) and provides around 6-8% speedup for resnet50 training with torch.amp on 2080Ti compared to master(using the same batch size), and max batch size is about 40% larger which is a great win 👍 . In terms of speed batchnorm is still much faster.

Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
@dr-ci
Copy link
Copy Markdown

dr-ci Bot commented May 23, 2020

💊 CI failures summary and remediations

As of commit 2c78ef2 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with title prefixed by [PT_BREAK] in pytorch/xla and link to to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 94 times.

Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request May 24, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 104603817

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request May 25, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 104616173

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Jun 1, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105051827

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Jun 2, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105117221

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Jun 3, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105150481

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Jun 5, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105365117

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
Copy link
Copy Markdown
Collaborator

@ngimel ngimel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I have a few small comments. Also, to be safe, I'd assert in native_group_norm_backward that tensors you expect to be contiguous are contiguous - you make sure that is the case in derivatives.yaml, but if someone calls group_norm_backward directly the contiguity of inputs is not guaranteed.

const int wid = threadIdx.x / C10_WARP_SIZE;
val = WarpReduceSum(val);
if (lid == 0) {
shared[wid] = val;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe if you are calling BlockReduceSum in a row with the same shared parameter. You don't do it in your code, but if someone later reuses this it can lead to errors. Can you please synchronize before writing to shared?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I correctly understand this. I added one __syncthreads() before here. Is that correct?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you are right, I'm blind. It's all good.

N, C, G, mean_data, rstd_data, gamma_data, beta_data, a_data, b_data);
if (HxW < kCUDANumThreads) {
B = (N * C * HxW + kCUDANumThreads - 1) / kCUDANumThreads;
GroupNormForwardSimpleCUDAKernel<T><<<B, kCUDANumThreads, 0, cuda_stream>>>(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better for perf?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since when the feature map becomes smaller by layers, some layers only has feature map whose size is 7 * 7 or 14 * 14, in that case, some of the threads are not working at all. In these cases, this will help.

const auto& gamma = weight.is_contiguous() ? weight : weight.contiguous();
const auto& beta = bias.is_contiguous() ? bias : bias.contiguous();
return std::get<0>(
at::native_group_norm(X, gamma, beta, N, C, HxW, num_groups, eps));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably need to move .contiguous() calls on X, gamma and beta inside native_group_norm, because native_group_norm cannot handle non-contiguous inputs, yet, since it is exposed in native_functions.yaml, someone can call it directly, and it does not check contiguity of inputs.

int64_t C,
int64_t HxW,
int64_t group,
Tensor* dX,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you passing pointers here instead of non-const references? I think in the rest on the codebase we use references.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it clear that pointers are for output arguments, and references are for inputs. This is something we used to do in Caffe2. We can consider to unify the styles in later PRs.

Optimize GroupNorm on CUDA

Profiling result for input shape = [256, 512, 56, 56], num_groups = 32 on a V100 machine.

Before this PR

```
GroupNorm forward: 11.333400868010358ms
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm                  23.06%           56.891ms         100.00%          246.741ms        246.741us        31.20%           11.369s          11.369ms         1000             
batch_norm                  8.59%            21.194ms         35.82%           88.382ms         88.382us         18.89%           6.884s           6.884ms          1000             
_batch_norm_impl_index      6.50%            16.045ms         27.23%           67.189ms         67.189us         18.86%           6.872s           6.872ms          1000             
native_batch_norm           20.73%           51.143ms         20.73%           51.143ms         51.143us         18.83%           6.861s           6.861ms          1000             
addcmul                     15.17%           37.434ms         15.17%           37.434ms         37.434us         12.13%           4.419s           4.419ms          1000             
view                        21.76%           53.695ms         21.76%           53.695ms         13.424us         0.06%            21.595ms         5.399us          4000             
contiguous                  4.19%            10.339ms         4.19%            10.339ms         10.339us         0.03%            9.650ms          9.650us          1000             
--------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 246.741ms
CUDA time total: 36.436s

GroupNorm backward: 42.1425356430118ms
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                 Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
AddcmulBackward                      10.08%           69.875ms         36.22%           251.046ms        251.046us        24.64%           19.492s          19.492ms         1000             
mul                                  26.14%           181.171ms        26.14%           181.171ms        45.293us         24.60%           19.460s          4.865ms          4000             
NativeBatchNormBackward              3.44%            23.832ms         9.82%            68.072ms         68.072us         14.23%           11.261s          11.261ms         1000             
native_batch_norm_backward           6.38%            44.240ms         6.38%            44.240ms         44.240us         14.23%           11.255s          11.255ms         1000             
torch::autograd::AccumulateGrad      7.14%            49.495ms         16.52%           114.485ms        38.162us         8.02%            6.343s           2.114ms          3000             
add_                                 9.38%            64.990ms         9.38%            64.990ms         21.663us         8.00%            6.326s           2.109ms          3000             
sum                                  11.71%           81.163ms         11.71%           81.163ms         40.581us         6.15%            4.863s           2.431ms          2000             
ViewBackward                         9.72%            67.398ms         23.92%           165.801ms        41.450us         0.07%            57.930ms         14.482us         4000             
reshape                              8.35%            57.903ms         14.20%           98.403ms         24.601us         0.04%            35.134ms         8.783us          4000             
as_strided                           5.84%            40.500ms         5.84%            40.500ms         10.125us         0.01%            11.111ms         2.778us          4000             
torch::autograd::GraphRoot           1.81%            12.526ms         1.81%            12.526ms         12.526us         0.01%            8.698ms          8.698us          1000             
-----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 693.093ms
CUDA time total: 79.113s
```

After this PR

```
GroupNorm forward: 6.693606367000029ms
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
group_norm             21.65%           17.851ms         100.00%          82.451ms         82.451us         50.04%           6.627s           6.627ms          1000             
native_group_norm      78.35%           64.601ms         78.35%           64.601ms         64.601us         49.96%           6.616s           6.616ms          1000             
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 82.451ms
CUDA time total: 13.244s

GroupNorm backward: 17.170999962007045ms
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Name                                               Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
NativeGroupNormBackward                            12.91%           34.087ms         50.89%           134.411ms        134.411us        31.42%           10.733s          10.733ms         1000             
non_differentiable_native_group_norm_backward      37.98%           100.324ms        37.98%           100.324ms        100.324us        31.36%           10.713s          10.713ms         1000             
torch::autograd::AccumulateGrad                    19.36%           51.140ms         44.70%           118.057ms        39.352us         18.62%           6.361s           2.120ms          3000             
add_                                               25.34%           66.917ms         25.34%           66.917ms         22.306us         18.57%           6.345s           2.115ms          3000             
torch::autograd::GraphRoot                         4.41%            11.656ms         4.41%            11.656ms         11.656us         0.02%            7.626ms          7.626us          1000             
-------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  
Self CPU time total: 264.124ms
CUDA time total: 34.159s
```

Testing with Detectron2, this PR can make maskrcnn with GroupNorm for both training and inference 10% faster on a V100 machine.

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)

[ghstack-poisoned]
xiaomengy added a commit that referenced this pull request Jun 6, 2020
Pull Request resolved: #28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105388365

Differential Revision: [D17923732](https://our.internmc.facebook.com/intern/diff/D17923732/)
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 614dd03.

@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Jun 8, 2020

XLA failure on PR was real.

@ailzhang
Copy link
Copy Markdown
Contributor

ailzhang commented Jun 8, 2020

@ezyang We have just merged the fix on xla side. Should be back green in1-2 hrs. :D

@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented Jun 8, 2020

OK, I will ninja unland the revert

@facebook-github-bot facebook-github-bot deleted the gh/BIT-silence/2/head branch June 11, 2020 14:18
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#28204

Optimize GroupNorm on CUDA
ghstack-source-id: 105388365

Test Plan: buck test mode/dev-nosan caffe2/test:nn -- "GroupNorm"

Reviewed By: houseroad

Differential Revision: D17923732

fbshipit-source-id: 9afaf01288bd9d273eed89909bff77243df89e9f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants