
Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) #96848

Closed
vfdev-5 wants to merge 10 commits into gh/vfdev-5/2/base from gh/vfdev-5/2/head

Conversation

@vfdev-5
Contributor

@vfdev-5 vfdev-5 commented Mar 15, 2023

Stack from ghstack (oldest at bottom):

Description

Results

  • Pillow (9.0.0.post1) == Pillow-SIMD
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

Note: there is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

Source

Context

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @datumbox @pmeier

- Based on #96651
- Fixed mem pointer alignment

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Mar 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96848

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit caaf0a5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

## Description

- Based on #96651
  - Improved performance of vectorized interpolate for the uint8 RGB case
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)
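The operation being benchmarked can be reproduced with a short snippet (a minimal sketch; the sizes are taken from the tables below and are not part of the PR itself):

```python
import torch
import torch.nn.functional as F

# A uint8 RGB image in channels_last memory format -- the case this PR optimizes.
img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

# Bilinear resize with and without antialiasing, as in the benchmark rows.
out_aa = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=True)
out = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=False)
print(out_aa.shape, out_aa.dtype)  # torch.Size([1, 3, 224, 224]) torch.uint8
```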

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitcc42a3f) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          38.8          |                56.0             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                37.5             |                 112.8                |            3.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.7          |               157.0             |                 305.4                |            1.9
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               146.4             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.4          |               215.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               212.5             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               127.9             |                 464.8                |            3.6
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                56.8             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               325.2             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               239.1             |                 593.5                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.2          |               200.7             |                 833.8                |            4.2
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.2             |                 651.4                |            8.7
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.0          |               444.5             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               309.3             |                 917.6                |            3.0
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-144416-pr_vs_nightly_speedup-md)
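Single-threaded per-op timings like those in the table can be collected with `torch.utils.benchmark`; a minimal sketch (the actual harness lives in the linked Source gist):

```python
import torch
import torch.utils.benchmark as benchmark

# One of the benchmarked inputs: uint8 RGB, channels_last.
img = torch.randint(0, 256, (1, 3, 520, 520), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

# num_threads=1 matches the "1 threads:" rows in the tables above.
timer = benchmark.Timer(
    stmt="F.interpolate(img, size=(224, 224), mode='bilinear', antialias=True)",
    setup="import torch.nn.functional as F",
    globals={"img": img},
    num_threads=1,
)
measurement = timer.blocked_autorange(min_run_time=1)
print(f"median: {measurement.median * 1e6:.1f} us")
```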


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 15, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: d961810
Pull Request resolved: #96848
## Description

- Based on #96651
  - Improved performance of vectorized interpolate for the uint8 RGB case
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)
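For context, the scalar logic that the vectorized kernel implements follows Pillow's fixed-point resampling scheme: per-output-pixel filter weights quantized to integers scaled by a power of two, int32 accumulation, a rounding term, then a shift and a clip back to uint8. A rough NumPy sketch of one horizontal pass (function name, `PRECISION` value, and filter details are illustrative, not the PR's exact code):

```python
import numpy as np

PRECISION = 15  # hypothetical fixed-point precision; real kernels pick it to fit their SIMD width

def resize_h_uint8(row, out_w):
    """Scalar sketch of one horizontal antialiased bilinear pass on a uint8 RGB row.

    row: (w, 3) uint8 pixels. Returns (out_w, 3) uint8.
    """
    w = row.shape[0]
    scale = w / out_w
    support = max(scale, 1.0)  # antialiasing widens the filter when downscaling
    out = np.empty((out_w, 3), dtype=np.uint8)
    for xo in range(out_w):
        center = (xo + 0.5) * scale
        lo = max(int(center - support), 0)
        hi = min(int(center + support + 1), w)
        xs = np.arange(lo, hi)
        # triangle (bilinear) filter, normalized, then quantized to fixed point
        fw = np.clip(1.0 - np.abs((xs + 0.5 - center) / support), 0.0, None)
        fw /= fw.sum()
        iw = np.round(fw * (1 << PRECISION)).astype(np.int32)
        # integer accumulate, add rounding term, shift back, clip to uint8 range
        acc = (row[lo:hi].astype(np.int32) * iw[:, None]).sum(axis=0)
        acc += 1 << (PRECISION - 1)
        out[xo] = np.clip(acc >> PRECISION, 0, 255).astype(np.uint8)
    return out

row = np.full((8, 3), 128, dtype=np.uint8)
print(resize_h_uint8(row, 4)[0])  # a constant row stays constant: [128 128 128]
```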

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 17, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 4ee5e45
Pull Request resolved: #96848
Member

@NicolasHug NicolasHug left a comment


Thanks a lot for working on this @vfdev-5 ! I mostly just have questions below, for my own understanding.

For future reference, it might be worth clarifying in the PR description that these improvements concern only:

  • the bilinear mode
  • channels_last RGB CPU tensors

Regarding the benchmarks, could you please clarify that we're comparing against Pillow-SIMD? The current table only shows Pillow (9.0.0.post1). Also, it'd be interesting to look at more upscaling results; right now mostly downscaling cases are reported.

Finally, what is the plan w.r.t. testing the correctness of this new implementation?

Six review comment threads on aten/src/ATen/native/cpu/UpSampleKernelAVXAntialias.h (three outdated)
@vfdev-5 vfdev-5 changed the title Improved perfs for vectorized interpolate cpu uint8 RGB-case Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) Mar 20, 2023
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of vectorized bilinear interpolate for the uint8 RGB case, channels last
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (`Pillow (9.0.0.post1)`)
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

```
[------------------------------------------------------------------------------------------ Resize -----------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git0968a5d) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=True     |          39.0          |                56.6             |                 133.2                |            2.4
      3 torch.uint8 channels_last bilinear 256 -> 32 aa=False    |                        |                36.9             |                 112.8                |            3.1
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=True    |         128.1          |               152.5             |                 305.4                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 224 aa=False   |                        |               141.1             |                 288.7                |            2.0
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=True    |         179.6          |               208.8             |                 442.5                |            2.1
      3 torch.uint8 channels_last bilinear 256 -> 320 aa=False   |                        |               206.4             |                 436.9                |            2.1
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=True     |         113.3          |               132.1             |                 464.8                |            3.5
      3 torch.uint8 channels_last bilinear 520 -> 32 aa=False    |                        |                57.2             |                 365.5                |            6.4
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=True    |         281.7          |               327.4             |                 722.4                |            2.2
      3 torch.uint8 channels_last bilinear 520 -> 224 aa=False   |                        |               230.2             |                 593.5                |            2.6
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=True     |         186.9          |               210.5             |                 833.8                |            4.0
      3 torch.uint8 channels_last bilinear 712 -> 32 aa=False    |                        |                75.6             |                 651.4                |            8.6
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=True    |         410.3          |               450.9             |                1128.4                |            2.5
      3 torch.uint8 channels_last bilinear 712 -> 224 aa=False   |                        |               298.7             |                 917.6                |            3.1

```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-162238-pr_vs_nightly_speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 20, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: #96848
Three outdated review comment threads on aten/src/ATen/native/cpu/UpSampleKernelAVXAntialias.h
vfdev-5 added a commit to vfdev-5/pytorch that referenced this pull request Mar 21, 2023
- Based on pytorch#96651
- Fixed mem pointer alignment

ghstack-source-id: c82a73d
Pull Request resolved: pytorch#96848
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitc005105) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.670 (+-0.445)    |         57.366 (+-0.799)        |          132.147 (+-1.236)           |      2.304 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         37.825 (+-0.417)        |          111.789 (+-1.175)           |      2.955 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.898 (+-1.335)    |        153.081 (+-2.346)        |          302.518 (+-2.632)           |      1.976 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        141.695 (+-1.415)        |          286.663 (+-2.494)           |      2.023 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.735 (+-2.054)    |        210.613 (+-3.116)        |          439.375 (+-4.014)           |      2.086 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        207.601 (+-1.639)        |          438.537 (+-4.143)           |      2.112 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.679 (+-1.321)    |        130.863 (+-1.987)        |          446.804 (+-3.283)           |      3.414 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         57.968 (+-0.270)        |          374.244 (+-13.598)          |      6.456 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.398 (+-3.485)    |        322.986 (+-1.947)        |          720.197 (+-3.467)           |      2.230 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        231.625 (+-2.006)        |          592.834 (+-3.903)           |      2.559 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.711 (+-1.666)    |        201.069 (+-2.182)        |          787.868 (+-3.648)           |      3.918 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.975 (+-0.696)        |          651.016 (+-3.926)           |      8.569 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.236 (+-6.021)    |        451.486 (+-3.939)        |         1123.923 (+-14.988)          |      2.489 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        299.597 (+-1.887)        |          915.347 (+-4.486)           |      3.055 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.751 (+-0.285)    |         78.538 (+-1.282)        |          170.465 (+-1.830)           |      2.170 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.619 (+-2.035)    |        159.614 (+-1.587)        |          330.971 (+-3.249)           |      2.074 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   950.243 (+-10.641)   |        891.369 (+-17.946)       |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.771 (+-0.961)    |         72.253 (+-1.020)        |          135.933 (+-1.625)           |      1.881 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.107 (+-2.143)    |        165.844 (+-2.177)        |          321.112 (+-2.904)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   691.470 (+-9.566)    |        764.942 (+-11.192)       |         2050.880 (+-22.188)          |      2.681 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.375 (+-1.345)        |          169.646 (+-1.640)           |      2.193 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.115 (+-3.935)        |          329.754 (+-2.590)           |      2.072 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        877.248 (+-5.736)        |         2815.870 (+-22.589)          |      3.210 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         53.120 (+-0.316)        |          112.024 (+-1.225)           |      2.109 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        147.330 (+-1.871)        |          299.152 (+-3.353)           |      2.030 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        472.182 (+-10.785)       |         1698.601 (+-16.785)          |      3.597 (+-0.000)    
```

Note: for other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230320-160044-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 21, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: f807362
Pull Request resolved: #96848
@vfdev-5 vfdev-5 requested a review from peterbell10 March 21, 2023 14:04
Collaborator

@peterbell10 peterbell10 left a comment


Code looks reasonable but I'll wait for a response to NicolasHug's questions on testing and more benchmarks for upsampling. Also, some benchmarks showing 4 channels haven't regressed would be nice.

Comment thread aten/src/ATen/native/cpu/UpSampleKernelAVXAntialias.h Outdated
@vfdev-5
Contributor Author

vfdev-5 commented Mar 21, 2023

@peterbell10 I already added more upsampling benchmarks, see the description after the line "# More test-cases from #90771". @NicolasHug can you confirm that those benchmarks are sufficient? Thanks

…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
  - Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA performance is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git8d955df) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.649 (+-0.306)    |         55.828 (+-0.370)        |          132.147 (+-1.236)           |      2.367 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         36.826 (+-0.229)        |          111.789 (+-1.175)           |      3.036 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.233 (+-1.313)    |        153.827 (+-1.229)        |          302.518 (+-2.632)           |      1.967 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        143.886 (+-1.409)        |          286.663 (+-2.494)           |      1.992 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   179.504 (+-1.825)    |        211.569 (+-1.336)        |          439.375 (+-4.014)           |      2.077 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        209.888 (+-1.443)        |          438.537 (+-4.143)           |      2.089 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.891 (+-1.118)    |        129.373 (+-1.396)        |          446.804 (+-3.283)           |      3.454 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         56.858 (+-0.227)        |          374.244 (+-13.598)          |      6.582 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   282.917 (+-2.992)    |        324.378 (+-1.694)        |          720.197 (+-3.467)           |      2.220 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        236.078 (+-1.679)        |          592.834 (+-3.903)           |      2.511 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   185.595 (+-1.633)    |        202.000 (+-1.920)        |          787.868 (+-3.648)           |      3.900 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         75.421 (+-0.512)        |          651.016 (+-3.926)           |      8.632 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   409.691 (+-2.735)    |        449.927 (+-2.500)        |         1123.923 (+-14.988)          |      2.498 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        306.691 (+-2.095)        |          915.347 (+-4.486)           |      2.985 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.740 (+-0.278)    |         78.745 (+-0.286)        |          170.465 (+-1.830)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   133.029 (+-1.619)    |        162.393 (+-1.289)        |          330.971 (+-3.249)           |      2.038 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.849 (+-2.749)    |        896.127 (+-3.696)        |         2805.510 (+-25.503)          |      3.131 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.505 (+-0.319)    |         70.617 (+-0.344)        |          135.933 (+-1.625)           |      1.925 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.671 (+-1.953)    |        165.638 (+-1.473)        |          321.112 (+-2.904)           |      1.939 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.492 (+-2.917)    |        758.162 (+-3.719)        |         2050.880 (+-22.188)          |      2.705 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         77.300 (+-0.307)        |          169.646 (+-1.640)           |      2.195 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        159.525 (+-1.225)        |          329.754 (+-2.590)           |      2.067 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        890.106 (+-3.358)        |         2815.870 (+-22.589)          |      3.164 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.399 (+-0.314)        |          112.024 (+-1.225)           |      2.138 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        148.780 (+-1.282)        |          299.152 (+-3.353)           |      2.011 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        479.273 (+-3.432)        |         1698.601 (+-16.785)          |      3.544 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)
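The `aa=True` / `aa=False` rows above differ in how the interpolation weights are built: with antialiasing, the triangle (bilinear) filter's support is scaled by the downsampling factor, as in Pillow; without it, only the two nearest source pixels contribute. A hedged sketch of that weight computation (the function name and structure are my own illustration, not the ATen code):

```python
import math

def triangle_weights(center, support):
    """Normalized triangle-filter weights around float position `center`.

    `support` is the filter half-width: 1.0 for plain bilinear (aa=False),
    or scaled up by the downscale factor for antialiased resizing (aa=True).
    Returns (first_index, [weights]) over consecutive source indices.
    """
    lo = int(math.floor(center - support + 0.5))
    hi = int(math.floor(center + support + 0.5))
    weights = [max(0.0, 1.0 - abs(i + 0.5 - center) / support)
               for i in range(lo, hi)]
    total = sum(weights) or 1.0
    return lo, [w / total for w in weights]
```

For a 2x downscale, `support=2.0` spreads each output pixel over four source pixels instead of two, which is why the `aa=True` rows cost more per pixel but avoid aliasing.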


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 22, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6132906
Pull Request resolved: #96848
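The "mem pointer alignment" fix mentioned above is about starting vectorized loads at addresses that are multiples of the SIMD register width. The arithmetic involved is the standard modular idiom, sketched here in Python for illustration (the actual fix lives in the C++ kernel):

```python
def align_offset(addr: int, alignment: int = 32) -> int:
    """Bytes to advance `addr` so it becomes `alignment`-byte aligned.

    32 bytes corresponds to an AVX2 register; unaligned vector loads are
    legal on modern x86, but aligned ones avoid cache-line splits.
    """
    assert alignment > 0 and alignment & (alignment - 1) == 0, "power of two"
    return (-addr) % alignment
```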
…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)
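The bullets above hinge on the channels-last memory layout, where the 3 channel values of each pixel sit contiguously. To make that concrete, here is a minimal pure-Python sketch of a single bilinear sample over a flat channels-last uint8 RGB buffer (my own illustration, not the vectorized ATen kernel):

```python
def bilinear_sample(img, h, w, y, x):
    """Bilinearly interpolate the RGB value at float coords (y, x).

    img is a flat list of uint8 values in channels-last (H, W, 3) order,
    i.e. each pixel's RGB triplet is contiguous in memory -- the layout
    the vectorized kernel operates on.
    """
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0

    def px(r, c):
        base = (r * w + c) * 3  # channels-last: RGB triplet starts here
        return img[base:base + 3]

    out = []
    for ch in range(3):
        top = px(y0, x0)[ch] * (1 - dx) + px(y0, x1)[ch] * dx
        bot = px(y1, x0)[ch] * (1 - dx) + px(y1, x1)[ch] * dx
        out.append(int(top * (1 - dy) + bot * dy + 0.5))  # round to uint8
    return out
```

The vectorized version processes several such triplets per SIMD register, which is why keeping RGB as-is, instead of widening it to RGBA (a third more data plus an extra copy), pays off.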

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitce4be01) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.548 (+-0.280)    |         57.536 (+-0.210)        |          132.147 (+-1.236)           |      2.297 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         38.532 (+-0.219)        |          111.789 (+-1.175)           |      2.901 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   127.689 (+-1.348)    |        156.262 (+-1.213)        |          302.518 (+-2.632)           |      1.936 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        145.483 (+-1.077)        |          286.663 (+-2.494)           |      1.970 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   178.117 (+-1.956)    |        215.053 (+-1.470)        |          439.375 (+-4.014)           |      2.043 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        211.340 (+-2.239)        |          438.537 (+-4.143)           |      2.075 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   112.593 (+-1.266)    |        130.414 (+-1.633)        |          446.804 (+-3.283)           |      3.426 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         58.767 (+-0.203)        |          374.244 (+-13.598)          |      6.368 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.210 (+-2.937)    |        324.157 (+-1.895)        |          720.197 (+-3.467)           |      2.222 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        239.800 (+-2.492)        |          592.834 (+-3.903)           |      2.472 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.255 (+-1.629)    |        204.834 (+-1.496)        |          787.868 (+-3.648)           |      3.846 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         77.335 (+-0.341)        |          651.016 (+-3.926)           |      8.418 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   410.286 (+-2.439)    |        443.934 (+-2.899)        |         1123.923 (+-14.988)          |      2.532 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        312.220 (+-2.307)        |          915.347 (+-4.486)           |      2.932 (+-0.000)    

      # More test-cases from #90771
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    60.611 (+-0.337)    |         80.849 (+-1.780)        |          170.465 (+-1.830)           |      2.108 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   132.971 (+-1.624)    |        164.892 (+-1.426)        |          330.971 (+-3.249)           |      2.007 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |   948.467 (+-3.179)    |        891.414 (+-5.282)        |         2805.510 (+-25.503)          |      3.147 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.539 (+-0.327)    |         72.471 (+-0.367)        |          135.933 (+-1.625)           |      1.876 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   138.669 (+-1.867)    |        168.628 (+-1.213)        |          321.112 (+-2.904)           |      1.904 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   689.933 (+-3.175)    |        746.911 (+-2.985)        |         2050.880 (+-22.188)          |      2.746 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.347 (+-0.338)        |          169.646 (+-1.640)           |      2.165 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        162.194 (+-1.089)        |          329.754 (+-2.590)           |      2.033 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        894.476 (+-2.738)        |         2815.870 (+-22.589)          |      3.148 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         52.728 (+-0.406)        |          112.024 (+-1.225)           |      2.125 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        151.560 (+-1.128)        |          299.152 (+-3.353)           |      1.974 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        500.053 (+-4.288)        |         1698.601 (+-16.785)          |      3.397 (+-0.000)    
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Mar 23, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 93bd276
Pull Request resolved: #96848
vfdev-5 added a commit that referenced this pull request Mar 29, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 8b90000
Pull Request resolved: #96848
@vfdev-5
Contributor Author

vfdev-5 commented Mar 29, 2023

> From your benchmarks: ....
> Are these swapped? 2.1x speedup for 4 channels, 1.1x speedup for 3 channels.

@peterbell10 @NicolasHug it turned out that this is due to noisy measurements on my machine for the "channels first (1024, 1024) -> (256, 256)" cases. I measured just these cases on nightly and on this PR and could not get reliable results. Overall the numbers look similar, so I do not expect any improvement for these cases.

@vfdev-5 vfdev-5 closed this Mar 29, 2023
@vfdev-5 vfdev-5 reopened this Mar 29, 2023
@vfdev-5 vfdev-5 requested a review from peterbell10 March 29, 2023 21:14
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 30, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@vfdev-5 vfdev-5 added the release notes: nn release notes category label Mar 30, 2023
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud


Failing merge rule: Core Maintainers

@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

…t8 RGB-case (channels last)"


## Description

- Based on #96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)    
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)    
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)    

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)    
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)    
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)    
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)    
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)    
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)    

```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which can be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)


## Context

- #90771



cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Successfully rebased gh/vfdev-5/2/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/96848)

pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
- Based on #96651
- Fixed mem pointer alignment

ghstack-source-id: 6c30da9
Pull Request resolved: #96848
@vfdev-5
Contributor Author

vfdev-5 commented Mar 30, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@vfdev-5 vfdev-5 deleted the gh/vfdev-5/2/head branch March 30, 2023 11:52
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
… (channels last) (pytorch#96848)

## Description

- Based on pytorch#96651
  - Improved performance of the vectorized **bilinear** interpolation for the uint8 RGB case, **channels last**
    - unified the RGB and RGBA processing code so that RGB input is no longer copied into an RGBA buffer
  - Performance is now closer to Pillow-SIMD (labeled `Pillow (9.0.0.post1)` in the results)
  - RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (addressing reviews from pytorch#96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)

```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)
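As a rough illustration, one row of the table above (3-channel uint8, channels-last, bilinear `(256, 256) -> (224, 224)`, `aa=True`) can be reproduced with a sketch like the following. This is not the benchmark script used for the numbers above; it assumes a torch build (nightly / >= 2.1) where the uint8 CPU bilinear path from this PR stack is available.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

# 3-channel uint8 image in channels-last memory format, as in the table rows.
x = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
x = x.contiguous(memory_format=torch.channels_last)

# The resize being measured: bilinear, antialias=True, (256, 256) -> (224, 224).
out = F.interpolate(x, size=(224, 224), mode="bilinear", antialias=True)
print(out.shape, out.dtype)  # torch.Size([1, 3, 224, 224]) torch.uint8

# Timing with torch.utils.benchmark, single thread, as in the table header.
t = benchmark.Timer(
    stmt="F.interpolate(x, size=(224, 224), mode='bilinear', antialias=True)",
    globals={"F": F, "x": x},
    num_threads=1,
)
print(t.blocked_autorange(min_run_time=1))
```

The reported medians in the table are in microseconds; `blocked_autorange` prints times in whatever unit fits the measurement, so only the relative PR-vs-nightly ratios are directly comparable.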

## Context

- pytorch#90771

Pull Request resolved: pytorch#96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10