
Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) #96848

Closed
vfdev-5 wants to merge 10 commits into gh/vfdev-5/2/base from gh/vfdev-5/2/head

Conversation

@vfdev-5
Contributor

@vfdev-5 vfdev-5 commented Mar 15, 2023

Stack from ghstack (oldest at bottom):

Description

Results

  • Pillow (9.0.0.post1) == Pillow-SIMD
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

Note: there is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

Source

Context

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @datumbox @pmeier

- Based on #96651
- Fixed mem pointer alignment

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Mar 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96848

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit caaf0a5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

## Description

- Based on #96651
  - Improved performance of vectorized interpolate for the uint8 RGB case
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)
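The operation being benchmarked can be reproduced with a short snippet (a minimal sketch; the sizes are taken from the tables below and are not part of the PR itself):

```python
import torch
import torch.nn.functional as F

# A uint8 RGB image in channels_last memory format -- the case this PR optimizes.
img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

# Bilinear resize with and without antialiasing, as in the benchmark rows.
out_aa = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=True)
out = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=False)
print(out_aa.shape, out_aa.dtype)  # torch.Size([1, 3, 224, 224]) torch.uint8
```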

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitcc42a3f) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          38.8          |                56.0             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                37.5             |                 112.8                |            3.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.7          |               157.0             |                 305.4                |            1.9
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               146.4             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.4          |               215.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               212.5             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               127.9             |                 464.8                |            3.6
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                56.8             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               325.2             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               239.1             |                 593.5                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.2          |               200.7             |                 833.8                |            4.2
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.2             |                 651.4                |            8.7
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.0          |               444.5             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               309.3             |                 917.6                |            3.0
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-144416-pr_vs_nightly_speedup-md)
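Single-threaded per-op timings like those in the table can be collected with `torch.utils.benchmark`; a minimal sketch (the actual harness lives in the linked Source gist):

```python
import torch
import torch.utils.benchmark as benchmark

# One of the benchmarked inputs: uint8 RGB, channels_last.
img = torch.randint(0, 256, (1, 3, 520, 520), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

# num_threads=1 matches the "1 threads:" rows in the tables above.
timer = benchmark.Timer(
    stmt="F.interpolate(img, size=(224, 224), mode='bilinear', antialias=True)",
    setup="import torch.nn.functional as F",
    globals={"img": img},
    num_threads=1,
)
measurement = timer.blocked_autorange(min_run_time=1)
print(f"median: {measurement.median * 1e6:.1f} us")
```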


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 15, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: d961810
Pull Request resolved: #96848
## Description

- Based on #96651
  - Improved performance of vectorized interpolate for the uint8 RGB case
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)
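For context, the scalar logic that the vectorized kernel implements follows Pillow's fixed-point resampling scheme: per-output-pixel filter weights quantized to integers scaled by a power of two, int32 accumulation, a rounding term, then a shift and a clip back to uint8. A rough NumPy sketch of one horizontal pass (function name, `PRECISION` value, and filter details are illustrative, not the PR's exact code):

```python
import numpy as np

PRECISION = 15  # hypothetical fixed-point precision; real kernels pick it to fit their SIMD width

def resize_h_uint8(row, out_w):
    """Scalar sketch of one horizontal antialiased bilinear pass on a uint8 RGB row.

    row: (w, 3) uint8 pixels. Returns (out_w, 3) uint8.
    """
    w = row.shape[0]
    scale = w / out_w
    support = max(scale, 1.0)  # antialiasing widens the filter when downscaling
    out = np.empty((out_w, 3), dtype=np.uint8)
    for xo in range(out_w):
        center = (xo + 0.5) * scale
        lo = max(int(center - support), 0)
        hi = min(int(center + support + 1), w)
        xs = np.arange(lo, hi)
        # triangle (bilinear) filter, normalized, then quantized to fixed point
        fw = np.clip(1.0 - np.abs((xs + 0.5 - center) / support), 0.0, None)
        fw /= fw.sum()
        iw = np.round(fw * (1 << PRECISION)).astype(np.int32)
        # integer accumulate, add rounding term, shift back, clip to uint8 range
        acc = (row[lo:hi].astype(np.int32) * iw[:, None]).sum(axis=0)
        acc += 1 << (PRECISION - 1)
        out[xo] = np.clip(acc >> PRECISION, 0, 255).astype(np.uint8)
    return out

row = np.full((8, 3), 128, dtype=np.uint8)
print(resize_h_uint8(row, 4)[0])  # a constant row stays constant: [128 128 128]
```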

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 4ee5e45
Pull Request resolved: #96848
Member

@NicolasHug NicolasHug left a comment


Thanks a lot for working on this @vfdev-5 ! I mostly just have questions below, for my own understanding.

For future reference, it might be worth clarifying in the PR description that these improvements concern only:

  • the bilinear mode
  • channels_last RGB CPU tensors

Regarding the benchmarks, could you please clarify that we're comparing against Pillow-SIMD? The current table only shows Pillow (9.0.0.post1). Also, it'd be interesting to look at more upscaling results; right now mostly downscaling cases are reported.

Finally, what is the plan w.r.t. testing the correctness of this new implementation?

Six review comment threads on aten/src/ATen/native/cpu/UpSampleKernelAVXAntialias.h (three outdated)
@vfdev-5 vfdev-5 changed the title Improved perfs for vectorized interpolate cpu uint8 RGB-case Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) Mar 20, 2023
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of vectorized bilinear interpolate for the uint8 RGB case, channels last
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (`Pillow (9.0.0.post1)`)
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: #96848
Three outdated review comment threads on aten/src/ATen/native/cpu/UpSampleKernelAVXAntialias.h
vfdev-5 added a commit to vfdev-5/pytorch that referenced this pull request Mar 21, 2023
- Based on pytorch#96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: pytorch#96848
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitc005105) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.670 (+-0.445)    |         57.366 (+-0.799)        |          132.147 (+-1.236)           |      2.304 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         37.825 (+-0.417)        |          111.789 (+-1.175)           |      2.955 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.898 (+-1.335)    |        153.081 (+-2.346)        |          302.518 (+-2.632)           |      1.976 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        141.695 (+-1.415)        |          286.663 (+-2.494)           |      2.023 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.735 (+-2.054)    |        210.613 (+-3.116)        |          439.375 (+-4.014)           |      2.086 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        207.601 (+-1.639)        |          438.537 (+-4.143)           |      2.112 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.679 (+-1.321)    |        130.863 (+-1.987)        |          446.804 (+-3.283)           |      3.414 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         57.968 (+-0.270)        |          374.244 (+-13.598)          |      6.456 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.398 (+-3.485)    |        322.986 (+-1.947)        |          720.197 (+-3.467)           |      2.230 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        231.625 (+-2.006)        |          592.834 (+-3.903)           |      2.559 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.711 (+-1.666)    |        201.069 (+-2.182)        |          787.868 (+-3.648)           |      3.918 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.975 (+-0.696)        |          651.016 (+-3.926)           |      8.569 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.236 (+-6.021)    |        451.486 (+-3.939)        |         1123.923 (+-14.988)          |      2.489 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        299.597 (+-1.887)        |          915.347 (+-4.486)           |      3.055 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.751 (+-0.285)    |         78.538 (+-1.282)        |          170.465 (+-1.830)           |      2.170 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.619 (+-2.035)    |        159.614 (+-1.587)        |          330.971 (+-3.249)           |      2.074 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   950.243 (+-10.641)   |        891.369 (+-17.946)       |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.771 (+-0.961)    |         72.253 (+-1.020)        |          135.933 (+-1.625)           |      1.881 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.107 (+-2.143)    |        165.844 (+-2.177)        |          321.112 (+-2.904)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   691.470 (+-9.566)    |        764.942 (+-11.192)       |         2050.880 (+-22.188)          |      2.681 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.375 (+-1.345)        |          169.646 (+-1.640)           |      2.193 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.115 (+-3.935)        |          329.754 (+-2.590)           |      2.072 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        877.248 (+-5.736)        |         2815.870 (+-22.589)          |      3.210 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         53.120 (+-0.316)        |          112.024 (+-1.225)           |      2.109 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        147.330 (+-1.871)        |          299.152 (+-3.353)           |      2.030 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        472.182 (+-10.785)       |         1698.601 (+-16.785)          |      3.597 (+-0.000)    
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: f807362
Pull Request resolved: #96848
@vfdev-5 vfdev-5 requested a review from peterbell10 March 21, 2023 14:04
Collaborator

@peterbell10 peterbell10 left a comment


Code looks reasonable but I'll wait for a response to NicolasHug's questions on testing and more benchmarks for upsampling. Also, some benchmarks showing 4 channels haven't regressed would be nice.

Comment thread aten/src/ATen/native/cpu/UpSampleKernelAVXAntialias.h Outdated
@vfdev-5
Contributor Author

vfdev-5 commented Mar 21, 2023

@peterbell10 I already added more upsampling benchmarks, see the description after the line "# More test-cases from #90771". @NicolasHug can you confirm that those benchmarks are sufficient? Thanks

…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git8d955df) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.649 (+-0.306)    |         55.828 (+-0.370)        |          132.147 (+-1.236)           |      2.367 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         36.826 (+-0.229)        |          111.789 (+-1.175)           |      3.036 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.233 (+-1.313)    |        153.827 (+-1.229)        |          302.518 (+-2.632)           |      1.967 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        143.886 (+-1.409)        |          286.663 (+-2.494)           |      1.992 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.504 (+-1.825)    |        211.569 (+-1.336)        |          439.375 (+-4.014)           |      2.077 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        209.888 (+-1.443)        |          438.537 (+-4.143)           |      2.089 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.891 (+-1.118)    |        129.373 (+-1.396)        |          446.804 (+-3.283)           |      3.454 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         56.858 (+-0.227)        |          374.244 (+-13.598)          |      6.582 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.917 (+-2.992)    |        324.378 (+-1.694)        |          720.197 (+-3.467)           |      2.220 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        236.078 (+-1.679)        |          592.834 (+-3.903)           |      2.511 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.595 (+-1.633)    |        202.000 (+-1.920)        |          787.868 (+-3.648)           |      3.900 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.421 (+-0.512)        |          651.016 (+-3.926)           |      8.632 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   409.691 (+-2.735)    |        449.927 (+-2.500)        |         1123.923 (+-14.988)          |      2.498 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        306.691 (+-2.095)        |          915.347 (+-4.486)           |      2.985 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.740 (+-0.278)    |         78.745 (+-0.286)        |          170.465 (+-1.830)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.029 (+-1.619)    |        162.393 (+-1.289)        |          330.971 (+-3.249)           |      2.038 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.849 (+-2.749)    |        896.127 (+-3.696)        |         2805.510 (+-25.503)          |      3.131 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.505 (+-0.319)    |         70.617 (+-0.344)        |          135.933 (+-1.625)           |      1.925 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.671 (+-1.953)    |        165.638 (+-1.473)        |          321.112 (+-2.904)           |      1.939 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.492 (+-2.917)    |        758.162 (+-3.719)        |         2050.880 (+-22.188)          |      2.705 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.300 (+-0.307)        |          169.646 (+-1.640)           |      2.195 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.525 (+-1.225)        |          329.754 (+-2.590)           |      2.067 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        890.106 (+-3.358)        |         2815.870 (+-22.589)          |      3.164 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.399 (+-0.314)        |          112.024 (+-1.225)           |      2.138 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        148.780 (+-1.282)        |          299.152 (+-3.353)           |      2.011 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        479.273 (+-3.432)        |         1698.601 (+-16.785)          |      3.544 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)
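The `aa=True` / `aa=False` rows above differ in how the interpolation weights are built: with antialiasing, the triangle (bilinear) filter's support is scaled by the downsampling factor, as in Pillow; without it, only the two nearest source pixels contribute. A hedged sketch of that weight computation (the function name and structure are my own illustration, not the ATen code):

```python
import math

def triangle_weights(center, support):
    """Normalized triangle-filter weights around float position `center`.

    `support` is the filter half-width: 1.0 for plain bilinear (aa=False),
    or scaled up by the downscale factor for antialiased resizing (aa=True).
    Returns (first_index, [weights]) over consecutive source indices.
    """
    lo = int(math.floor(center - support + 0.5))
    hi = int(math.floor(center + support + 0.5))
    weights = [max(0.0, 1.0 - abs(i + 0.5 - center) / support)
               for i in range(lo, hi)]
    total = sum(weights) or 1.0
    return lo, [w / total for w in weights]
```

For a 2x downscale, `support=2.0` spreads each output pixel over four source pixels instead of two, which is why the `aa=True` rows cost more per pixel but avoid aliasing.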


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6132906
Pull Request resolved: #96848
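The "mem pointer alignment" fix mentioned above is about starting vectorized loads at addresses that are multiples of the SIMD register width. The arithmetic involved is the standard modular idiom, sketched here in Python for illustration (the actual fix lives in the C++ kernel):

```python
def align_offset(addr: int, alignment: int = 32) -> int:
    """Bytes to advance `addr` so it becomes `alignment`-byte aligned.

    32 bytes corresponds to an AVX2 register; unaligned vector loads are
    legal on modern x86, but aligned ones avoid cache-line splits.
    """
    assert alignment > 0 and alignment & (alignment - 1) == 0, "power of two"
    return (-addr) % alignment
```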
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)
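The bullets above hinge on the channels-last memory layout, where the 3 channel values of each pixel sit contiguously. To make that concrete, here is a minimal pure-Python sketch of a single bilinear sample over a flat channels-last uint8 RGB buffer (my own illustration, not the vectorized ATen kernel):

```python
def bilinear_sample(img, h, w, y, x):
    """Bilinearly interpolate the RGB value at float coords (y, x).

    img is a flat list of uint8 values in channels-last (H, W, 3) order,
    i.e. each pixel's RGB triplet is contiguous in memory -- the layout
    the vectorized kernel operates on.
    """
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0

    def px(r, c):
        base = (r * w + c) * 3  # channels-last: RGB triplet starts here
        return img[base:base + 3]

    out = []
    for ch in range(3):
        top = px(y0, x0)[ch] * (1 - dx) + px(y0, x1)[ch] * dx
        bot = px(y1, x0)[ch] * (1 - dx) + px(y1, x1)[ch] * dx
        out.append(int(top * (1 - dy) + bot * dy + 0.5))  # round to uint8
    return out
```

The vectorized version processes several such triplets per SIMD register, which is why keeping RGB as-is, instead of widening it to RGBA (a third more data plus an extra copy), pays off.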

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitce4be01) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.548 (+-0.280)    |         57.536 (+-0.210)        |          132.147 (+-1.236)           |      2.297 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         38.532 (+-0.219)        |          111.789 (+-1.175)           |      2.901 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.689 (+-1.348)    |        156.262 (+-1.213)        |          302.518 (+-2.632)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        145.483 (+-1.077)        |          286.663 (+-2.494)           |      1.970 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   178.117 (+-1.956)    |        215.053 (+-1.470)        |          439.375 (+-4.014)           |      2.043 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        211.340 (+-2.239)        |          438.537 (+-4.143)           |      2.075 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.593 (+-1.266)    |        130.414 (+-1.633)        |          446.804 (+-3.283)           |      3.426 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         58.767 (+-0.203)        |          374.244 (+-13.598)          |      6.368 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.210 (+-2.937)    |        324.157 (+-1.895)        |          720.197 (+-3.467)           |      2.222 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        239.800 (+-2.492)        |          592.834 (+-3.903)           |      2.472 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.255 (+-1.629)    |        204.834 (+-1.496)        |          787.868 (+-3.648)           |      3.846 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         77.335 (+-0.341)        |          651.016 (+-3.926)           |      8.418 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.286 (+-2.439)    |        443.934 (+-2.899)        |         1123.923 (+-14.988)          |      2.532 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        312.220 (+-2.307)        |          915.347 (+-4.486)           |      2.932 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 93bd276
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 8b90000
Pull Request resolved: #96848
@vfdev-5
Contributor Author

vfdev-5 commented Mar 29, 2023

> From your benchmarks: ....
> Are these swapped? 2.1x speedup for 4 channels, 1.1x speedup for 3 channels.

@peterbell10 @NicolasHug it turned out that this is due to noisy measurements on my machine for the "channels first (1024, 1024) -> (256, 256)" cases. I measured just these cases on nightly and on this PR and could not get reliable results. Overall the numbers look similar, so I do not expect any improvement for these cases.

@vfdev-5 vfdev-5 closed this Mar 29, 2023
@vfdev-5 vfdev-5 reopened this Mar 29, 2023
@vfdev-5 vfdev-5 requested a review from peterbell10 March 29, 2023 21:14
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 30, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@vfdev-5 vfdev-5 added the release notes: nn release notes category label Mar 30, 2023
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud


Failing merge rule: Core Maintainers

@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/vfdev-5/2/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96848)

pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6c30da9
Pull Request resolved: #96848
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@vfdev-5 vfdev-5 deleted the gh/vfdev-5/2/head branch March 30, 2023 11:52
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
… (channels last) (pytorch#96848)

## Description

- Based on pytorch#96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from pytorch#96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)

```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)
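As a rough illustration, one row of the table above (3-channel uint8, channels-last, bilinear `(256, 256) -> (224, 224)`, `aa=True`) can be reproduced with a sketch like the following. This is not the benchmark script used for the numbers above; it assumes a torch build (nightly / >= 2.1) where the uint8 CPU bilinear path from this PR stack is available.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

# 3-channel uint8 image in channels-last memory format, as in the table rows.
x = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
x = x.contiguous(memory_format=torch.channels_last)

# The resize being measured: bilinear, antialias=True, (256, 256) -> (224, 224).
out = F.interpolate(x, size=(224, 224), mode="bilinear", antialias=True)
print(out.shape, out.dtype)  # torch.Size([1, 3, 224, 224]) torch.uint8

# Timing with torch.utils.benchmark, single thread, as in the table header.
t = benchmark.Timer(
    stmt="F.interpolate(x, size=(224, 224), mode='bilinear', antialias=True)",
    globals={"F": F, "x": x},
    num_threads=1,
)
print(t.blocked_autorange(min_run_time=1))
```

The reported medians in the table are in microseconds; `blocked_autorange` prints times in whatever unit fits the measurement, so only the relative PR-vs-nightly ratios are directly comparable.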

## Context

- pytorch#90771

Pull Request resolved: pytorch#96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10