🚀 Feature
Improve the performance of the GroupNorm operator.
Motivation
Similar to #27633, the current GroupNorm implementation reshapes the input and runs BatchNorm to compute the moments of the input, then uses addcmul for the affine transform. This implementation is inefficient on both CPU and CUDA.
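For reference, the decomposed path looks roughly like the following Python sketch. This is a simplified illustration of the approach described above, not the actual ATen code; the function name and reshapes here are chosen for illustration.

```python
import torch
import torch.nn.functional as F

def group_norm_via_batch_norm(x, num_groups, weight, bias, eps=1e-5):
    # Sketch of the decomposed path: view the input as (1, N * G, -1) and let
    # BatchNorm compute the per-group moments, then apply the per-channel
    # affine transform with an addcmul.
    N, C, H, W = x.shape
    x_reshaped = x.view(1, N * num_groups, -1)
    out = F.batch_norm(x_reshaped, None, None, training=True, eps=eps)
    out = out.view(N, C, H, W)
    # Per-channel affine: bias + out * weight
    return torch.addcmul(bias.view(1, C, 1, 1), out, weight.view(1, C, 1, 1))
```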
The performance benchmark on CPU for input shape = [128, 256, 28, 28], num_groups = 32 is shown below.
GroupNorm forward: 210.979 ms

| Name | Self CPU total % | Self CPU total | CPU total % | CPU total | CPU time avg | CUDA total % | CUDA total | CUDA time avg | Number of Calls |
|---|---|---|---|---|---|---|---|---|---|
| `native_batch_norm` | 68.00% | 135.920s | 68.00% | 135.920s | 135.920ms | NaN | 0.000us | 0.000us | 1000 |
| `addcmul` | 31.94% | 63.841s | 31.94% | 63.841s | 63.841ms | NaN | 0.000us | 0.000us | 1000 |
| `view` | 0.03% | 69.572ms | 0.03% | 69.572ms | 17.393us | NaN | 0.000us | 0.000us | 4000 |
| `group_norm` | 0.01% | 29.095ms | 100.00% | 199.870s | 199.870ms | NaN | 0.000us | 0.000us | 1000 |
| `_batch_norm_impl_index` | 0.00% | 5.179ms | 68.01% | 135.925s | 135.925ms | NaN | 0.000us | 0.000us | 1000 |
| `batch_norm` | 0.00% | 4.594ms | 68.01% | 135.929s | 135.929ms | NaN | 0.000us | 0.000us | 1000 |
| `contiguous` | 0.00% | 1.815ms | 0.00% | 1.815ms | 1.815us | NaN | 0.000us | 0.000us | 1000 |

Self CPU time total: 199.870s
CUDA time total: 0.000us
GroupNorm backward: 498.555 ms

| Name | Self CPU total % | Self CPU total | CPU total % | CPU total | CPU time avg | CUDA total % | CUDA total | CUDA time avg | Number of Calls |
|---|---|---|---|---|---|---|---|---|---|
| `mul` | 46.02% | 224.071s | 46.02% | 224.071s | 56.018ms | NaN | 0.000us | 0.000us | 4000 |
| `native_batch_norm_backward` | 43.86% | 213.559s | 43.86% | 213.559s | 213.559ms | NaN | 0.000us | 0.000us | 1000 |
| `sum` | 3.91% | 19.040s | 3.91% | 19.040s | 9.520ms | NaN | 0.000us | 0.000us | 2000 |
| `add_` | 3.28% | 15.951s | 3.28% | 15.951s | 5.317ms | NaN | 0.000us | 0.000us | 3000 |
| `AddcmulBackward` | 1.95% | 9.501s | 47.97% | 233.571s | 233.571ms | NaN | 0.000us | 0.000us | 1000 |
| `torch::autograd::AccumulateGrad` | 0.97% | 4.737s | 4.25% | 20.687s | 6.896ms | NaN | 0.000us | 0.000us | 3000 |
| `as_strided` | 0.00% | 20.049ms | 0.00% | 20.049ms | 5.012us | NaN | 0.000us | 0.000us | 4000 |
| `NativeBatchNormBackward` | 0.00% | 13.951ms | 43.86% | 213.573s | 213.573ms | NaN | 0.000us | 0.000us | 1000 |
| `reshape` | 0.00% | 13.034ms | 0.01% | 33.083ms | 8.271us | NaN | 0.000us | 0.000us | 4000 |
| `ViewBackward` | 0.00% | 9.419ms | 0.01% | 42.502ms | 10.626us | NaN | 0.000us | 0.000us | 4000 |
| `torch::autograd::GraphRoot` | 0.00% | 1.189ms | 0.00% | 1.189ms | 1.189us | NaN | 0.000us | 0.000us | 1000 |

Self CPU time total: 486.916s
CUDA time total: 0.000us
And the performance benchmark on GPU for input shape = [256, 512, 56, 56], num_groups = 32 is shown below.
GroupNorm forward: 11.333 ms

| Name | Self CPU total % | Self CPU total | CPU total % | CPU total | CPU time avg | CUDA total % | CUDA total | CUDA time avg | Number of Calls |
|---|---|---|---|---|---|---|---|---|---|
| `group_norm` | 23.06% | 56.891ms | 100.00% | 246.741ms | 246.741us | 31.20% | 11.369s | 11.369ms | 1000 |
| `batch_norm` | 8.59% | 21.194ms | 35.82% | 88.382ms | 88.382us | 18.89% | 6.884s | 6.884ms | 1000 |
| `_batch_norm_impl_index` | 6.50% | 16.045ms | 27.23% | 67.189ms | 67.189us | 18.86% | 6.872s | 6.872ms | 1000 |
| `native_batch_norm` | 20.73% | 51.143ms | 20.73% | 51.143ms | 51.143us | 18.83% | 6.861s | 6.861ms | 1000 |
| `addcmul` | 15.17% | 37.434ms | 15.17% | 37.434ms | 37.434us | 12.13% | 4.419s | 4.419ms | 1000 |
| `view` | 21.76% | 53.695ms | 21.76% | 53.695ms | 13.424us | 0.06% | 21.595ms | 5.399us | 4000 |
| `contiguous` | 4.19% | 10.339ms | 4.19% | 10.339ms | 10.339us | 0.03% | 9.650ms | 9.650us | 1000 |

Self CPU time total: 246.741ms
CUDA time total: 36.436s
GroupNorm backward: 42.143 ms

| Name | Self CPU total % | Self CPU total | CPU total % | CPU total | CPU time avg | CUDA total % | CUDA total | CUDA time avg | Number of Calls |
|---|---|---|---|---|---|---|---|---|---|
| `AddcmulBackward` | 10.08% | 69.875ms | 36.22% | 251.046ms | 251.046us | 24.64% | 19.492s | 19.492ms | 1000 |
| `mul` | 26.14% | 181.171ms | 26.14% | 181.171ms | 45.293us | 24.60% | 19.460s | 4.865ms | 4000 |
| `NativeBatchNormBackward` | 3.44% | 23.832ms | 9.82% | 68.072ms | 68.072us | 14.23% | 11.261s | 11.261ms | 1000 |
| `native_batch_norm_backward` | 6.38% | 44.240ms | 6.38% | 44.240ms | 44.240us | 14.23% | 11.255s | 11.255ms | 1000 |
| `torch::autograd::AccumulateGrad` | 7.14% | 49.495ms | 16.52% | 114.485ms | 38.162us | 8.02% | 6.343s | 2.114ms | 3000 |
| `add_` | 9.38% | 64.990ms | 9.38% | 64.990ms | 21.663us | 8.00% | 6.326s | 2.109ms | 3000 |
| `sum` | 11.71% | 81.163ms | 11.71% | 81.163ms | 40.581us | 6.15% | 4.863s | 2.431ms | 2000 |
| `ViewBackward` | 9.72% | 67.398ms | 23.92% | 165.801ms | 41.450us | 0.07% | 57.930ms | 14.482us | 4000 |
| `reshape` | 8.35% | 57.903ms | 14.20% | 98.403ms | 24.601us | 0.04% | 35.134ms | 8.783us | 4000 |
| `as_strided` | 5.84% | 40.500ms | 5.84% | 40.500ms | 10.125us | 0.01% | 11.111ms | 2.778us | 4000 |
| `torch::autograd::GraphRoot` | 1.81% | 12.526ms | 1.81% | 12.526ms | 12.526us | 0.01% | 8.698ms | 8.698us | 1000 |

Self CPU time total: 693.093ms
CUDA time total: 79.113s
We can see that for both the CPU and the GPU version of GroupNorm, implementing the op via BatchNorm plus addcmul makes things slow, especially in the backward pass. Moreover, on the CPU side, BatchNorm at inference time is just an affine function that can be fused with Conv, so routing GroupNorm through BatchNorm makes CPU inference very slow.
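For reference, profiles like the ones above can be collected with torch.autograd.profiler. The sketch below is only an assumed harness, not the exact script used for these numbers; the shapes match the second benchmark, and the iteration count matches the call counts in the tables.

```python
import torch
import torch.nn.functional as F

# Shape/device of the second (GPU) benchmark; use [128, 256, 28, 28] on CPU
# to approximate the first one. Warm-up and timing details are assumptions.
N, C, H, W, G = 256, 512, 56, 56, 32
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(N, C, H, W, device=device, requires_grad=True)
weight = torch.randn(C, device=device, requires_grad=True)
bias = torch.randn(C, device=device, requires_grad=True)

with torch.autograd.profiler.profile(use_cuda=(device == "cuda")) as prof:
    for _ in range(1000):
        y = F.group_norm(x, G, weight, bias)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

# The backward pass can be profiled the same way, e.g. by also calling
# y.backward(torch.ones_like(y), retain_graph=True) inside a profiled loop.
```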
Pitch
Implement an optimized version of GroupNorm that fuses everything together, instead of dispatching to BatchNorm and addcmul. See the reference sketch below.
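As a rough illustration of the math a fused implementation would compute in a single reduction pass plus a single elementwise pass, here is a Python reference sketch; the actual work would be done in native CPU/CUDA kernels, and the function name here is only illustrative.

```python
import torch

def fused_group_norm_reference(x, num_groups, weight, bias, eps=1e-5):
    # Per-(sample, group) mean/variance computed directly from the input,
    # followed by normalization and the per-channel affine in one expression.
    N, C, H, W = x.shape
    x_g = x.view(N, num_groups, -1)
    mean = x_g.mean(dim=2, keepdim=True)
    var = x_g.var(dim=2, unbiased=False, keepdim=True)
    x_hat = ((x_g - mean) / torch.sqrt(var + eps)).view(N, C, H, W)
    return x_hat * weight.view(1, C, 1, 1) + bias.view(1, C, 1, 1)
```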
Alternatives
Additional context
cc @dzhulgakov @ngimel @ppwwyyxx