Add uint8 support for interpolate for CPU images #90771

NicolasHug wants to merge 46 commits into master from …
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/90771 (links to docs will display an error until the docs builds have completed).

✅ No failures as of commit 35443d0. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Maybe support for bool + nearest can also be enabled without much hassle, especially given that uint8 + nearest is already supported. I also wonder: for bool specifically, are all other modes equivalent to nearest? Bool interpolation is useful when working with segmentation masks.
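Nearest-mode resizing only ever copies existing source values, which is why it is safe for bool masks, while linear modes blend neighbors and would produce fractional values. A pure-Python sketch of the idea (illustrative only, not torch's implementation):

```python
def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D list-of-lists image.

    Works for any element type, including bool, because every output
    element is a copy of some input element (no arithmetic is done).
    """
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]

mask = [[True, False], [False, True]]
big = resize_nearest(mask, 4, 4)
# The upscaled mask is still strictly boolean.
assert all(isinstance(v, bool) for row in big for v in row)
```

Modes that blend values (bilinear, bicubic) would need a threshold step afterwards to get a binary mask back, so they are not generally equivalent to nearest for bool inputs.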
@NicolasHug There are a bunch of perf benchmarks of basic transforms in the Albumentations repo: https://github.com/albumentations-team/albumentations#benchmarking-results It would be interesting to have such benchmarks run by PyTorch itself (to ensure the correct version of torchvision is used, etc.) and to have the results published. Maybe also contribute to the original Albumentations benchmarking code: https://github.com/albumentations-team/albumentations/blob/master/benchmark/README.md It would be interesting to see the basic transforms for which core torch / torchvision is slower than Albumentations.
@vadimkantorov Albumentations uses OpenCV for the majority of its transforms, and OpenCV is highly optimized. On the other hand, if we can achieve Pillow-SIMD runtimes in PyTorch, that would be good.
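A cross-library comparison like the Albumentations benchmark boils down to timing the same transform in each library. A minimal stdlib timing harness sketch (hypothetical helper name; the tables in this PR were produced with PyTorch's own benchmarking tooling, which also reports the (+-) spread):

```python
import time
import statistics

def bench_ms(fn, *args, repeat=5, number=100):
    """Median per-call wall-clock time in milliseconds.

    Runs `number` calls per round and `repeat` rounds, then takes the
    median round to damp scheduler noise.
    """
    rounds = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        for _ in range(number):
            fn(*args)
        rounds.append((time.perf_counter() - t0) / number * 1e3)
    return statistics.median(rounds)

# Example: compare two implementations of the same operation.
fast = bench_ms(lambda: sum(range(1000)))
slow = bench_ms(lambda: sum([i for i in range(1000)]))
print(f"speed-up: {slow / fast:.1f}x")
```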
## Description

- Based on #96651
- Improved performance of the vectorized interpolate uint8 RGB case
- Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
- Performance is now closer to Pillow-SIMD
- RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (review feedback from #96651)

## Results

```
[------------------------------------------------------ Resize ------------------------------------------------------]
                                                        |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitcc42a3f) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ------------------------------------------------------------------------------------------------------------
  3 torch.uint8 channels_last bilinear 256 -> 32  aa=True   |   38.8  |   56.0  |   133.2  |  2.4
  3 torch.uint8 channels_last bilinear 256 -> 32  aa=False  |         |   37.5  |   112.8  |  3.0
  3 torch.uint8 channels_last bilinear 256 -> 224 aa=True   |  128.7  |  157.0  |   305.4  |  1.9
  3 torch.uint8 channels_last bilinear 256 -> 224 aa=False  |         |  146.4  |   288.7  |  2.0
  3 torch.uint8 channels_last bilinear 256 -> 320 aa=True   |  179.4  |  215.8  |   442.5  |  2.1
  3 torch.uint8 channels_last bilinear 256 -> 320 aa=False  |         |  212.5  |   436.9  |  2.1
  3 torch.uint8 channels_last bilinear 520 -> 32  aa=True   |  113.3  |  127.9  |   464.8  |  3.6
  3 torch.uint8 channels_last bilinear 520 -> 32  aa=False  |         |   56.8  |   365.5  |  6.4
  3 torch.uint8 channels_last bilinear 520 -> 224 aa=True   |  281.7  |  325.2  |   722.4  |  2.2
  3 torch.uint8 channels_last bilinear 520 -> 224 aa=False  |         |  239.1  |   593.5  |  2.5
  3 torch.uint8 channels_last bilinear 712 -> 32  aa=True   |  186.2  |  200.7  |   833.8  |  4.2
  3 torch.uint8 channels_last bilinear 712 -> 32  aa=False  |         |   75.2  |   651.4  |  8.7
  3 torch.uint8 channels_last bilinear 712 -> 224 aa=True   |  410.0  |  444.5  |  1128.4  |  2.5
  3 torch.uint8 channels_last bilinear 712 -> 224 aa=False  |         |  309.3  |   917.6  |  3.0
```

Note: for the other cases (see Source below) the speed-up is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230315-144416-pr_vs_nightly_speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
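Part of how vectorized uint8 resize kernels stay fast is doing the interpolation in integer arithmetic rather than float. A scalar sketch of fixed-point linear weighting (hypothetical helper, not the PR's actual kernel code):

```python
def lerp_u8(a, b, w_num, w_den=256):
    """Fixed-point linear interpolation between two uint8 values.

    w_num / w_den is the fractional weight of `b`. Integer weights that
    sum to a power of two, plus a rounding term before the shift, avoid
    float math and map well onto SIMD integer instructions.
    """
    w_a = w_den - w_num
    return (a * w_a + b * w_num + w_den // 2) // w_den

# Halfway between 100 and 200 rounds to 150.
print(lerp_u8(100, 200, 128))
```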
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved performance of the vectorized **bilinear** interpolate uint8 RGB case, **channels last**
- Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case performance is unchanged after the refactoring (see Source link below)
- Fixed memory pointer alignment and added more comments (review feedback from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------- Resize --------------------------------------------------------------]
                                                                  |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+git8d955df) PR  |  torch (2.1.0a0+git5309c44) nightly  |  Speed-up: PR vs nightly
1 threads: ----------------------------------------------------------------------------------------------------------------------------
  3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32)     aa=True   |   38.649 (+-0.306)  |   55.828 (+-0.370)  |   132.147 (+-1.236)   |  2.367 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32)     aa=False  |                     |   36.826 (+-0.229)  |   111.789 (+-1.175)   |  3.036 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224)   aa=True   |  128.233 (+-1.313)  |  153.827 (+-1.229)  |   302.518 (+-2.632)   |  1.967 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224)   aa=False  |                     |  143.886 (+-1.409)  |   286.663 (+-2.494)   |  1.992 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320)   aa=True   |  179.504 (+-1.825)  |  211.569 (+-1.336)  |   439.375 (+-4.014)   |  2.077 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320)   aa=False  |                     |  209.888 (+-1.443)  |   438.537 (+-4.143)   |  2.089 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32)     aa=True   |  112.891 (+-1.118)  |  129.373 (+-1.396)  |   446.804 (+-3.283)   |  3.454 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32)     aa=False  |                     |   56.858 (+-0.227)  |   374.244 (+-13.598)  |  6.582 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224)   aa=True   |  282.917 (+-2.992)  |  324.378 (+-1.694)  |   720.197 (+-3.467)   |  2.220 (+-0.000)
  3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224)   aa=False  |                     |  236.078 (+-1.679)  |   592.834 (+-3.903)   |  2.511 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32)     aa=True   |  185.595 (+-1.633)  |  202.000 (+-1.920)  |   787.868 (+-3.648)   |  3.900 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32)     aa=False  |                     |   75.421 (+-0.512)  |   651.016 (+-3.926)   |  8.632 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224)   aa=True   |  409.691 (+-2.735)  |  449.927 (+-2.500)  |  1123.923 (+-14.988)  |  2.498 (+-0.000)
  3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224)   aa=False  |                     |  306.691 (+-2.095)  |   915.347 (+-4.486)   |  2.985 (+-0.000)

  # More test-cases from #90771
  3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224)     aa=True   |   60.740 (+-0.278)  |   78.745 (+-0.286)  |   170.465 (+-1.830)   |  2.165 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268)   aa=True   |  133.029 (+-1.619)  |  162.393 (+-1.289)  |   330.971 (+-3.249)   |  2.038 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True   |  948.849 (+-2.749)  |  896.127 (+-3.696)  |  2805.510 (+-25.503)  |  3.131 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64)     aa=True   |   52.505 (+-0.319)  |   70.617 (+-0.344)  |   135.933 (+-1.625)   |  1.925 (+-0.000)
  3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224)   aa=True   |  138.671 (+-1.953)  |  165.638 (+-1.473)  |   321.112 (+-2.904)   |  1.939 (+-0.000)
  3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True   |  689.492 (+-2.917)  |  758.162 (+-3.719)  |  2050.880 (+-22.188)  |  2.705 (+-0.000)
  3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224)     aa=False  |                     |   77.300 (+-0.307)  |   169.646 (+-1.640)   |  2.195 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268)   aa=False  |                     |  159.525 (+-1.225)  |   329.754 (+-2.590)   |  2.067 (+-0.000)
  3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False  |                     |  890.106 (+-3.358)  |  2815.870 (+-22.589)  |  3.164 (+-0.000)
  3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64)     aa=False  |                     |   52.399 (+-0.314)  |   112.024 (+-1.225)   |  2.138 (+-0.000)
  3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224)   aa=False  |                     |  148.780 (+-1.282)  |   299.152 (+-3.353)   |  2.011 (+-0.000)
  3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False  |                     |  479.273 (+-3.432)  |  1698.601 (+-16.785)  |  3.544 (+-0.000)
```

Note: there is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230321-145513-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
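In the tables above the aa=True rows cost more than aa=False because antialiased downscaling averages over the whole source support of each output pixel, instead of reading only a couple of neighbors. A 1-D pure-Python sketch of the difference (illustrative only, not the actual kernel):

```python
def downscale_1d(src, out_len, antialias=True):
    """Downscale a 1-D sequence of uint8-like values.

    With antialias=True, each output value averages its full source
    window (more reads per output, fewer aliasing artifacts). With
    antialias=False, it samples a single nearby source element.
    """
    scale = len(src) / out_len
    out = []
    for i in range(out_len):
        if antialias:
            lo = int(i * scale)
            hi = max(lo + 1, int((i + 1) * scale))
            window = src[lo:hi]
            out.append(round(sum(window) / len(window)))
        else:
            out.append(src[min(int((i + 0.5) * scale), len(src) - 1)])
    return out

# An alternating signal keeps its average with antialias,
# but collapses onto one phase without it.
print(downscale_1d([0, 255, 0, 255], 2, antialias=True))
print(downscale_1d([0, 255, 0, 255], 2, antialias=False))
```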
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved perf of the vectorized **bilinear** interpolate for the uint8 RGB case, **channels last**
- Unified the RGB and RGBA processing code so that RGB input is no longer copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case perf is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                         | Pillow (9.0.0.post1) | torch (2.1.0a0+gitce4be01) PR | torch (2.1.0a0+git5309c44) nightly | Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True      | 38.548 (+-0.280)     | 57.536 (+-0.210)              | 132.147 (+-1.236)                  | 2.297 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False     |                      | 38.532 (+-0.219)              | 111.789 (+-1.175)                  | 2.901 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True    | 127.689 (+-1.348)    | 156.262 (+-1.213)             | 302.518 (+-2.632)                  | 1.936 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False   |                      | 145.483 (+-1.077)             | 286.663 (+-2.494)                  | 1.970 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True    | 178.117 (+-1.956)    | 215.053 (+-1.470)             | 439.375 (+-4.014)                  | 2.043 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False   |                      | 211.340 (+-2.239)             | 438.537 (+-4.143)                  | 2.075 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True      | 112.593 (+-1.266)    | 130.414 (+-1.633)             | 446.804 (+-3.283)                  | 3.426 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False     |                      | 58.767 (+-0.203)              | 374.244 (+-13.598)                 | 6.368 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True    | 283.210 (+-2.937)    | 324.157 (+-1.895)             | 720.197 (+-3.467)                  | 2.222 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False   |                      | 239.800 (+-2.492)             | 592.834 (+-3.903)                  | 2.472 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True      | 186.255 (+-1.629)    | 204.834 (+-1.496)             | 787.868 (+-3.648)                  | 3.846 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False     |                      | 77.335 (+-0.341)              | 651.016 (+-3.926)                  | 8.418 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True    | 410.286 (+-2.439)    | 443.934 (+-2.899)             | 1123.923 (+-14.988)                | 2.532 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False   |                      | 312.220 (+-2.307)             | 915.347 (+-4.486)                  | 2.932 (+-0.000)

# More test-cases from #90771
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True      | 60.611 (+-0.337)     | 80.849 (+-1.780)              | 170.465 (+-1.830)                  | 2.108 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True    | 132.971 (+-1.624)    | 164.892 (+-1.426)             | 330.971 (+-3.249)                  | 2.007 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True  | 948.467 (+-3.179)    | 891.414 (+-5.282)             | 2805.510 (+-25.503)                | 3.147 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True      | 52.539 (+-0.327)     | 72.471 (+-0.367)              | 135.933 (+-1.625)                  | 1.876 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True    | 138.669 (+-1.867)    | 168.628 (+-1.213)             | 321.112 (+-2.904)                  | 1.904 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True  | 689.933 (+-3.175)    | 746.911 (+-2.985)             | 2050.880 (+-22.188)                | 2.746 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False     |                      | 78.347 (+-0.338)              | 169.646 (+-1.640)                  | 2.165 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False   |                      | 162.194 (+-1.089)             | 329.754 (+-2.590)                  | 2.033 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False |                      | 894.476 (+-2.738)             | 2815.870 (+-22.589)                | 3.148 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False     |                      | 52.728 (+-0.406)              | 112.024 (+-1.225)                  | 2.125 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False   |                      | 151.560 (+-1.128)             | 299.152 (+-3.353)                  | 1.974 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False |                      | 500.053 (+-4.288)             | 1698.601 (+-16.785)                | 3.397 (+-0.000)
```

Note: there is no perf regression for the other cases. A few cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230322-132441-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
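A single row of the table above can be reproduced in spirit with a short snippet. This is a minimal sketch, assuming a torch build that already includes this PR (i.e. uint8 bilinear `interpolate` support on CPU); it times the `(256, 256) -> (224, 224)` channels-last case with `torch.utils.benchmark.Timer`:

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

# A uint8 RGB image in NCHW layout, converted to the channels_last
# memory format — the case this PR optimizes.
img = torch.randint(0, 256, (1, 3, 256, 256), dtype=torch.uint8)
img = img.contiguous(memory_format=torch.channels_last)

# uint8 goes in, uint8 comes out — no manual float round-trip needed.
out = F.interpolate(img, size=(224, 224), mode="bilinear", antialias=True)
assert out.dtype == torch.uint8
assert tuple(out.shape) == (1, 3, 224, 224)

# Time the call on a single thread, as in the benchmark table.
t = benchmark.Timer(
    stmt="F.interpolate(img, size=(224, 224), mode='bilinear', antialias=True)",
    globals={"F": F, "img": img},
    num_threads=1,
)
print(t.timeit(100))  # prints a Measurement with mean/median per-call time
```

The measured times will of course differ from the table depending on hardware and build flags; the point is only the shape of the benchmark setup.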
…erpolate cpu uint8 RGB-case (channels last)"

## Description

- Based on #96651
- Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code so that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case perfs are unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True | 38.674 (+-0.323) | 57.591 (+-0.244) | 131.033 (+-1.448) | 2.275 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False | | 39.471 (+-0.166) | 113.911 (+-1.736) | 2.886 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True | 128.512 (+-1.916) | 161.592 (+-1.242) | 299.679 (+-2.099) | 1.855 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False | | 150.994 (+-1.180) | 285.331 (+-1.919) | 1.890 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True | 180.045 (+-2.223) | 220.581 (+-1.363) | 431.057 (+-3.536) | 1.954 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False | | 219.391 (+-1.409) | 429.410 (+-3.620) | 1.957 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True | 113.911 (+-1.024) | 129.457 (+-1.295) | 459.610 (+-13.322) | 3.550 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False | | 59.800 (+-0.199) | 400.015 (+-11.815) | 6.689 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True | 283.050 (+-2.664) | 339.143 (+-1.209) | 683.555 (+-4.466) | 2.016 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False | | 250.601 (+-1.236) | 603.545 (+-2.644) | 2.408 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True | 186.723 (+-2.213) | 199.960 (+-1.343) | 860.867 (+-21.763) | 4.305 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False | | 79.188 (+-0.261) | 703.019 (+-25.805) | 8.878 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True | 412.353 (+-4.476) | 462.230 (+-1.983) | 1101.673 (+-49.299) | 2.383 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False | | 327.973 (+-1.852) | 941.062 (+-5.549) | 2.869 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True | 61.191 (+-0.926) | 80.795 (+-0.518) | 160.853 (+-1.506) | 1.991 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True | 134.488 (+-2.129) | 169.147 (+-1.324) | 327.343 (+-2.846) | 1.935 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True | 1037.045 (+-24.982) | 938.623 (+-9.010) | 2603.360 (+-20.530) | 2.774 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True | 52.792 (+-0.613) | 73.692 (+-0.264) | 131.829 (+-1.333) | 1.789 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True | 139.596 (+-1.944) | 173.778 (+-1.039) | 320.063 (+-2.562) | 1.842 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True | 690.132 (+-10.946) | 772.758 (+-2.864) | 2036.860 (+-36.109) | 2.636 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False | | 78.747 (+-0.799) | 158.479 (+-1.702) | 2.013 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False | | 167.046 (+-1.077) | 322.104 (+-2.764) | 1.928 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False | | 918.967 (+-5.251) | 2611.388 (+-29.917) | 2.842 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False | | 55.336 (+-0.251) | 113.869 (+-1.243) | 2.058 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False | | 156.505 (+-1.095) | 299.861 (+-2.710) | 1.916 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False | | 514.344 (+-1.905) | 1776.796 (+-19.660) | 3.454 (+-0.000)
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- #90771

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 datumbox pmeier

[ghstack-poisoned]
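The vectorized uint8 kernels being benchmarked keep the resampling arithmetic in integers: the normalized filter weights are pre-scaled to fixed point, and the weighted sum is shifted back down with rounding. A minimal pure-Python sketch of that fixed-point accumulation; the `PRECISION` value and both helpers are illustrative, not this PR's actual constants or code:

```python
PRECISION = 15  # illustrative fixed-point precision, in bits


def fixed_point_weights(float_weights):
    # Scale normalized float weights up to integers (their sum stays near 2**PRECISION).
    return [int(round(w * (1 << PRECISION))) for w in float_weights]


def horizontal_sample(row, lo, int_weights):
    # Weighted sum of uint8 pixels, with a rounding term added before shifting down.
    acc = 1 << (PRECISION - 1)
    for k, w in enumerate(int_weights):
        acc += row[lo + k] * w
    return min(max(acc >> PRECISION, 0), 255)  # clamp back to uint8 range


row = [10, 20, 30, 40]
w = fixed_point_weights([0.25, 0.5, 0.25])
print(horizontal_sample(row, 0, w))  # 20, the weighted average of 10, 20, 30
```

Working in int16/int32 like this is what lets the kernel use SIMD multiply-accumulate instructions instead of converting every pixel to float.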
… (channels last) (#96848)

## Description

- Based on #96651
- Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
- Unified RGB and RGBA processing code so that RGB input is not copied into RGBA
- Performance is now closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA-case perfs are unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD
- Same measurements as in the ghstack update above.

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- #90771

Pull Request resolved: #96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10
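The `aa=True` rows above use antialiased bilinear resampling, where the triangle (linear) filter is widened by the downscale factor so that every input pixel can contribute to the output. A pure-Python sketch of that weight computation in the Pillow style; `bilinear_aa_weights` is an illustrative helper, not the vectorized kernel from this PR:

```python
def bilinear_aa_weights(in_size, out_size, out_idx):
    """Return (first_input_index, normalized_weights) for one output pixel."""
    scale = in_size / out_size
    support = max(scale, 1.0)  # antialias: filter spans ~scale input pixels when downscaling
    center = (out_idx + 0.5) * scale
    lo = max(int(center - support + 0.5), 0)
    hi = min(int(center + support + 0.5), in_size)
    # Triangle filter evaluated at input pixel centers, then normalized to sum to 1.
    weights = [max(0.0, 1.0 - abs((i + 0.5 - center) / support)) for i in range(lo, hi)]
    total = sum(weights)
    return lo, [w / total for w in weights]


lo, w = bilinear_aa_weights(520, 32, 0)  # 520 -> 32 downscale, first output pixel
# With aa=False the support would stay at 1.0 and most input pixels would be skipped,
# which is why the aa=False rows are so much cheaper at large downscale factors.
```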
Description:

- Now that pytorch/pytorch#90771 is merged, let Resize() rely on interpolate()'s native uint8 handling instead of converting to and from float.
- uint8 input is no longer cast to f32 for nearest mode, nor for bilinear mode when AVX2 is available.

Context: pytorch#7217

Benchmarks:

```
[----------- Resize cpu torch.uint8 InterpolationMode.NEAREST -----------]
                     |  resize v2  |  resize stable  |  resize nightly
1 threads: ---------------------------------------------------------------
  (3, 400, 400)      |     457     |      461        |       480
  (16, 3, 400, 400)  |    6870     |     6850        |     10100

Times are in microseconds (us).

[---------- Resize cpu torch.uint8 InterpolationMode.BILINEAR -----------]
                     |  resize v2  |  resize stable  |  resize nightly
1 threads: ---------------------------------------------------------------
  (3, 400, 400)      |     326     |      329        |       844
  (16, 3, 400, 400)  |    4380     |     4390        |     14800

Times are in microseconds (us).
```

[Source](https://gist.github.com/vfdev-5/a2e30ed50b5996807c9b09d5d33d8bc2)
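The dispatch described above (stay in uint8 end-to-end for nearest, and for bilinear only when AVX2 is available, otherwise fall back to the float round-trip) can be sketched as a small helper; the function name and its return labels are illustrative, not torchvision's actual code:

```python
def pick_resize_path(mode: str, has_avx2: bool) -> str:
    """Illustrative dispatch: when can Resize() keep uint8 input end-to-end?"""
    if mode in ("nearest", "nearest-exact"):
        return "native-uint8"   # nearest never needed the float round-trip
    if mode == "bilinear" and has_avx2:
        return "native-uint8"   # vectorized uint8 bilinear kernel
    return "float-fallback"     # uint8 -> f32 -> interpolate -> round/clamp -> uint8


print(pick_resize_path("bilinear", has_avx2=True))   # native-uint8
print(pick_resize_path("bilinear", has_avx2=False))  # float-fallback
```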
Joint work with @vfdev-5

This PR introduces native uint8 support for `interpolate()`, for `bilinear` ~and `bicubic`~ modes for CPU images (`mode=nearest[_exact]` was already supported). On a typical torchvision training job on ImageNet, the speed-up is ~4X when AVX2 is supported, comparing the native uint8 path (this PR) against torchvision's current `Resize()`:

```
AA = antialias
float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does)

input_size        output_size    channels_last  AA     mode      num_threads    speed-up  float vs uint8 (this PR)
(1, 3, 270, 268) -> (224, 224)   True   True   bilinear  num_threads=1  4X     2.6ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)   True   False  bilinear  num_threads=1  2.1X   1.3ms vs 0.6ms
(1, 3, 270, 268) -> (224, 224)   False  True   bilinear  num_threads=1  3X     2.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)   False  False  bilinear  num_threads=1  4X     2.4ms vs 0.6ms

(Note: we removed bicubic support for now)
(1, 3, 270, 268) -> (224, 224)   True   True   bicubic   num_threads=1  4X     2.9ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)   True   False  bicubic   num_threads=1  5X     3.1ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)   False  True   bicubic   num_threads=1  3X     2.4ms vs 0.7ms
(1, 3, 270, 268) -> (224, 224)   False  False  bicubic   num_threads=1  4X     2.8ms vs 0.7ms
```

There is still room for further speed-ups (see TODOs in the code).

#### More benchmark details

with AVX2 support - speed-ups typically range from 1.5X to 10X. A few edge cases are slower; worth investigating why.
<details> ``` AA = antialias float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does) input_size output_size channels_last AA mode num_threads speed-up float vs uint8 (this PR) (1, 3, 64, 64) -> (224, 224) True True bilinear num_threads=1 5X 1.1ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True False bilinear num_threads=1 5X 1.2ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False True bilinear num_threads=1 2.8X 0.6ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False False bilinear num_threads=1 7X 1.6ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True True bicubic num_threads=1 5X 1.2ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True False bicubic num_threads=1 12X 2.9ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False True bicubic num_threads=1 3X 0.8ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False False bicubic num_threads=1 7X 1.8ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True True bilinear num_threads=2 2.6X 0.6ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True False bilinear num_threads=2 2.8X 0.6ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False True bilinear num_threads=2 1.7X 0.4ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False False bilinear num_threads=2 1.4X 0.3ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True True bicubic num_threads=2 2.7X 0.7ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) True False bicubic num_threads=2 7X 1.6ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False True bicubic num_threads=2 1.8X 0.4ms vs 0.2ms (1, 3, 64, 64) -> (224, 224) False False bicubic num_threads=2 4X 1.0ms vs 0.2ms (1, 3, 224, 224) -> (270, 268) True True bilinear num_threads=1 4X 2.5ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True False bilinear num_threads=1 3.0X 1.8ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False True bilinear num_threads=1 3X 1.8ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False False bilinear num_threads=1 4X 2.3ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True True bicubic num_threads=1 4X 2.7ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True False bicubic num_threads=1 7X 
4.3ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False True bicubic num_threads=1 3X 2.1ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False False bicubic num_threads=1 4X 2.6ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True True bilinear num_threads=2 2.7X 1.6ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True False bilinear num_threads=2 2.6X 1.5ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False True bilinear num_threads=2 2.1X 1.2ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False False bilinear num_threads=2 1.6X 0.9ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True True bicubic num_threads=2 2.8X 1.7ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) True False bicubic num_threads=2 5X 2.8ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False True bicubic num_threads=2 2.3X 1.4ms vs 0.6ms (1, 3, 224, 224) -> (270, 268) False False bicubic num_threads=2 3X 1.9ms vs 0.6ms (1, 3, 256, 256) -> (1024, 1024) True True bilinear num_threads=1 4X 26.6ms vs 6.7ms (1, 3, 256, 256) -> (1024, 1024) True False bilinear num_threads=1 4X 23.9ms vs 6.8ms (1, 3, 256, 256) -> (1024, 1024) False True bilinear num_threads=1 2.5X 16.8ms vs 6.8ms (1, 3, 256, 256) -> (1024, 1024) False False bilinear num_threads=1 5X 33.1ms vs 6.8ms (1, 3, 256, 256) -> (1024, 1024) True True bicubic num_threads=1 4X 25.9ms vs 7.3ms (1, 3, 256, 256) -> (1024, 1024) True False bicubic num_threads=1 8X 59.6ms vs 7.3ms (1, 3, 256, 256) -> (1024, 1024) False True bicubic num_threads=1 1.9X 14.3ms vs 7.4ms (1, 3, 256, 256) -> (1024, 1024) False False bicubic num_threads=1 5X 35.4ms vs 7.3ms (1, 3, 256, 256) -> (1024, 1024) True True bilinear num_threads=2 2.0X 13.6ms vs 6.8ms (1, 3, 256, 256) -> (1024, 1024) True False bilinear num_threads=2 2.2X 14.8ms vs 6.7ms (1, 3, 256, 256) -> (1024, 1024) False True bilinear num_threads=2 1.3X 8.8ms vs 6.9ms (1, 3, 256, 256) -> (1024, 1024) False False bilinear num_threads=2 1.2X 8.4ms vs 6.8ms (1, 3, 256, 256) -> (1024, 1024) True True bicubic num_threads=2 1.8X 12.8ms vs 7.3ms (1, 3, 256, 256) -> 
(1024, 1024) True False bicubic num_threads=2 4X 32.1ms vs 7.2ms (1, 3, 256, 256) -> (1024, 1024) False True bicubic num_threads=2 1.4X 10.1ms vs 7.3ms (1, 3, 256, 256) -> (1024, 1024) False False bicubic num_threads=2 2.9X 20.9ms vs 7.3ms (1, 3, 224, 224) -> (64, 64) True True bilinear num_threads=1 1.4X 0.5ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True False bilinear num_threads=1 0.7X 0.2ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False True bilinear num_threads=1 1.3X 0.4ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False False bilinear num_threads=1 1.4X 0.4ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True True bicubic num_threads=1 2.1X 0.7ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True False bicubic num_threads=1 1.3X 0.4ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False True bicubic num_threads=1 1.9X 0.6ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False False bicubic num_threads=1 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True True bilinear num_threads=2 1.0X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True False bilinear num_threads=2 0.6X 0.2ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False True bilinear num_threads=2 0.8X 0.3ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False False bilinear num_threads=2 1.4X 0.4ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True True bicubic num_threads=2 1.4X 0.5ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) True False bicubic num_threads=2 1.2X 0.4ms vs 0.3ms (1, 3, 224, 224) -> (64, 64) False True bicubic num_threads=2 1.2X 0.4ms vs 0.4ms (1, 3, 224, 224) -> (64, 64) False False bicubic num_threads=2 0.9X 0.3ms vs 0.3ms (1, 3, 270, 268) -> (224, 224) True True bilinear num_threads=1 4X 2.6ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) True False bilinear num_threads=1 2.1X 1.3ms vs 0.6ms (1, 3, 270, 268) -> (224, 224) False True bilinear num_threads=1 3X 2.1ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) False False bilinear num_threads=1 4X 2.4ms vs 0.6ms (1, 3, 270, 268) -> (224, 224) True True bicubic num_threads=1 4X 2.9ms vs 0.7ms (1, 3, 270, 268) -> 
(224, 224) True False bicubic num_threads=1 5X 3.1ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) False True bicubic num_threads=1 3X 2.4ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) False False bicubic num_threads=1 4X 2.8ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) True True bilinear num_threads=2 1.5X 1.0ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) True False bilinear num_threads=2 1.2X 0.8ms vs 0.6ms (1, 3, 270, 268) -> (224, 224) False True bilinear num_threads=2 2.3X 1.5ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) False False bilinear num_threads=2 1.9X 1.2ms vs 0.6ms (1, 3, 270, 268) -> (224, 224) True True bicubic num_threads=2 1.6X 1.2ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) True False bicubic num_threads=2 4X 2.4ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) False True bicubic num_threads=2 2.4X 1.6ms vs 0.7ms (1, 3, 270, 268) -> (224, 224) False False bicubic num_threads=2 2.8X 1.8ms vs 0.6ms (1, 3, 1024, 1024) -> (256, 256) True True bilinear num_threads=1 2.1X 12.8ms vs 6.1ms (1, 3, 1024, 1024) -> (256, 256) True False bilinear num_threads=1 0.6X 3.8ms vs 5.9ms (1, 3, 1024, 1024) -> (256, 256) False True bilinear num_threads=1 1.2X 7.1ms vs 6.1ms (1, 3, 1024, 1024) -> (256, 256) False False bilinear num_threads=1 1.9X 11.0ms vs 5.9ms (1, 3, 1024, 1024) -> (256, 256) True True bicubic num_threads=1 2.0X 12.6ms vs 6.4ms (1, 3, 1024, 1024) -> (256, 256) True False bicubic num_threads=1 1.0X 6.1ms vs 6.0ms (1, 3, 1024, 1024) -> (256, 256) False True bicubic num_threads=1 1.8X 11.3ms vs 6.4ms (1, 3, 1024, 1024) -> (256, 256) False False bicubic num_threads=1 0.8X 4.6ms vs 6.0ms (1, 3, 1024, 1024) -> (256, 256) True True bilinear num_threads=2 1.6X 9.3ms vs 6.0ms (1, 3, 1024, 1024) -> (256, 256) True False bilinear num_threads=2 0.3X 2.0ms vs 5.8ms (1, 3, 1024, 1024) -> (256, 256) False True bilinear num_threads=2 1.2X 7.2ms vs 6.0ms (1, 3, 1024, 1024) -> (256, 256) False False bilinear num_threads=2 0.3X 1.6ms vs 5.8ms (1, 3, 1024, 1024) -> (256, 256) True True bicubic 
num_threads=2 1.1X 7.1ms vs 6.5ms (1, 3, 1024, 1024) -> (256, 256) True False bicubic num_threads=2 0.6X 3.3ms vs 5.9ms (1, 3, 1024, 1024) -> (256, 256) False True bicubic num_threads=2 0.9X 5.9ms vs 6.3ms (1, 3, 1024, 1024) -> (256, 256) False False bicubic num_threads=2 0.4X 2.4ms vs 5.9ms ``` </details> without AVX2 support - no significant speed-up, but there are various possible improvements (see TODOs) <details> ``` AA = antialias float = uint8->float->interpolate()->round()->clamp()->uint8 (what Resize() currently does) input_size output_size channels_last AA mode num_threads speed-up float vs uint8 (this PR) (1, 3, 64, 64) -> (224, 224) True True bilinear num_threads=1 0.9X 1.5ms vs 1.6ms (1, 3, 64, 64) -> (224, 224) True False bilinear num_threads=1 0.9X 1.5ms vs 1.6ms (1, 3, 64, 64) -> (224, 224) False True bilinear num_threads=1 0.8X 0.9ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) False False bilinear num_threads=1 1.5X 1.7ms vs 1.1ms (1, 3, 64, 64) -> (224, 224) True True bicubic num_threads=1 0.9X 1.6ms vs 1.8ms (1, 3, 64, 64) -> (224, 224) True False bicubic num_threads=1 2.1X 3.9ms vs 1.9ms (1, 3, 64, 64) -> (224, 224) False True bicubic num_threads=1 0.8X 1.1ms vs 1.4ms (1, 3, 64, 64) -> (224, 224) False False bicubic num_threads=1 1.7X 2.4ms vs 1.5ms (1, 3, 64, 64) -> (224, 224) True True bilinear num_threads=2 0.9X 0.8ms vs 0.8ms (1, 3, 64, 64) -> (224, 224) True False bilinear num_threads=2 0.9X 0.8ms vs 0.8ms (1, 3, 64, 64) -> (224, 224) False True bilinear num_threads=2 0.9X 0.5ms vs 0.6ms (1, 3, 64, 64) -> (224, 224) False False bilinear num_threads=2 0.7X 0.5ms vs 0.7ms (1, 3, 64, 64) -> (224, 224) True True bicubic num_threads=2 0.9X 0.9ms vs 1.0ms (1, 3, 64, 64) -> (224, 224) True False bicubic num_threads=2 2.1X 2.0ms vs 1.0ms (1, 3, 64, 64) -> (224, 224) False True bicubic num_threads=2 0.8X 0.6ms vs 0.8ms (1, 3, 64, 64) -> (224, 224) False False bicubic num_threads=2 1.7X 1.3ms vs 0.8ms (1, 3, 224, 224) -> (270, 268) True True bilinear 
num_threads=1 1.0X 3.0ms vs 3.0ms
(1, 3, 224, 224) -> (270, 268)    True   False  bilinear  num_threads=1  1.0X  2.8ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)    False  True   bilinear  num_threads=1  1.0X  2.3ms vs 2.2ms
(1, 3, 224, 224) -> (270, 268)    False  False  bilinear  num_threads=1  1.4X  3.3ms vs 2.3ms
(1, 3, 224, 224) -> (270, 268)    True   True   bicubic   num_threads=1  1.0X  3.5ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)    True   False  bicubic   num_threads=1  1.7X  6.1ms vs 3.5ms
(1, 3, 224, 224) -> (270, 268)    False  True   bicubic   num_threads=1  0.9X  2.6ms vs 2.9ms
(1, 3, 224, 224) -> (270, 268)    False  False  bicubic   num_threads=1  1.4X  4.2ms vs 2.9ms

(1, 3, 224, 224) -> (270, 268)    True   True   bilinear  num_threads=2  1.0X  1.7ms vs 1.7ms
(1, 3, 224, 224) -> (270, 268)    True   False  bilinear  num_threads=2  0.9X  1.6ms vs 1.8ms
(1, 3, 224, 224) -> (270, 268)    False  True   bilinear  num_threads=2  0.9X  1.3ms vs 1.4ms
(1, 3, 224, 224) -> (270, 268)    False  False  bilinear  num_threads=2  0.7X  1.1ms vs 1.6ms
(1, 3, 224, 224) -> (270, 268)    True   True   bicubic   num_threads=2  1.0X  2.0ms vs 2.0ms
(1, 3, 224, 224) -> (270, 268)    True   False  bicubic   num_threads=2  1.7X  3.2ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)    False  True   bicubic   num_threads=2  0.8X  1.5ms vs 1.9ms
(1, 3, 224, 224) -> (270, 268)    False  False  bicubic   num_threads=2  1.2X  2.3ms vs 1.9ms

(1, 3, 256, 256) -> (1024, 1024)  True   True   bilinear  num_threads=1  1.1X  34.7ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)  True   False  bilinear  num_threads=1  1.0X  31.2ms vs 32.4ms
(1, 3, 256, 256) -> (1024, 1024)  False  True   bilinear  num_threads=1  1.0X  23.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)  False  False  bilinear  num_threads=1  1.9X  42.5ms vs 22.7ms
(1, 3, 256, 256) -> (1024, 1024)  True   True   bicubic   num_threads=1  0.9X  33.9ms vs 37.4ms
(1, 3, 256, 256) -> (1024, 1024)  True   False  bicubic   num_threads=1  2.2X  84.0ms vs 37.5ms
(1, 3, 256, 256) -> (1024, 1024)  False  True   bicubic   num_threads=1  1.0X  28.4ms vs 28.8ms
(1, 3, 256, 256) -> (1024, 1024)  False  False  bicubic   num_threads=1  2.0X  56.7ms vs 28.8ms

(1, 3, 256, 256) -> (1024, 1024)  True   True   bilinear  num_threads=2  1.1X  17.5ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)  True   False  bilinear  num_threads=2  1.1X  17.7ms vs 16.4ms
(1, 3, 256, 256) -> (1024, 1024)  False  True   bilinear  num_threads=2  0.8X  8.8ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)  False  False  bilinear  num_threads=2  1.0X  11.1ms vs 11.4ms
(1, 3, 256, 256) -> (1024, 1024)  True   True   bicubic   num_threads=2  1.1X  19.9ms vs 18.8ms
(1, 3, 256, 256) -> (1024, 1024)  True   False  bicubic   num_threads=2  2.3X  42.5ms vs 18.7ms
(1, 3, 256, 256) -> (1024, 1024)  False  True   bicubic   num_threads=2  1.0X  14.1ms vs 14.5ms
(1, 3, 256, 256) -> (1024, 1024)  False  False  bicubic   num_threads=2  2.0X  28.4ms vs 14.5ms

(1, 3, 224, 224) -> (64, 64)      True   True   bilinear  num_threads=1  1.0X  0.6ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)      True   False  bilinear  num_threads=1  0.7X  0.3ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)      False  True   bilinear  num_threads=1  0.9X  0.5ms vs 0.6ms
(1, 3, 224, 224) -> (64, 64)      False  False  bilinear  num_threads=1  1.7X  0.6ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)      True   True   bicubic   num_threads=1  1.0X  0.8ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)      True   False  bicubic   num_threads=1  1.1X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)      False  True   bicubic   num_threads=1  0.9X  0.7ms vs 0.8ms
(1, 3, 224, 224) -> (64, 64)      False  False  bicubic   num_threads=1  0.9X  0.4ms vs 0.4ms

(1, 3, 224, 224) -> (64, 64)      True   True   bilinear  num_threads=2  1.0X  0.4ms vs 0.4ms
(1, 3, 224, 224) -> (64, 64)      True   False  bilinear  num_threads=2  0.8X  0.2ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)      False  True   bilinear  num_threads=2  0.9X  0.3ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)      False  False  bilinear  num_threads=2  1.3X  0.3ms vs 0.2ms
(1, 3, 224, 224) -> (64, 64)      True   True   bicubic   num_threads=2  1.0X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)      True   False  bicubic   num_threads=2  1.3X  0.4ms vs 0.3ms
(1, 3, 224, 224) -> (64, 64)      False  True   bicubic   num_threads=2  0.9X  0.5ms vs 0.5ms
(1, 3, 224, 224) -> (64, 64)      False  False  bicubic   num_threads=2  1.2X  0.3ms vs 0.3ms

(1, 3, 270, 268) -> (224, 224)    True   True   bilinear  num_threads=1  0.8X  2.1ms vs 2.5ms
(1, 3, 270, 268) -> (224, 224)    True   False  bilinear  num_threads=1  0.7X  1.6ms vs 2.4ms
(1, 3, 270, 268) -> (224, 224)    False  True   bilinear  num_threads=1  1.2X  2.4ms vs 2.1ms
(1, 3, 270, 268) -> (224, 224)    False  False  bilinear  num_threads=1  1.3X  2.6ms vs 2.0ms
(1, 3, 270, 268) -> (224, 224)    True   True   bicubic   num_threads=1  1.1X  3.4ms vs 3.0ms
(1, 3, 270, 268) -> (224, 224)    True   False  bicubic   num_threads=1  1.7X  4.8ms vs 2.8ms
(1, 3, 270, 268) -> (224, 224)    False  True   bicubic   num_threads=1  1.1X  2.9ms vs 2.7ms
(1, 3, 270, 268) -> (224, 224)    False  False  bicubic   num_threads=1  1.4X  3.5ms vs 2.4ms

(1, 3, 270, 268) -> (224, 224)    True   True   bilinear  num_threads=2  0.9X  1.2ms vs 1.3ms
(1, 3, 270, 268) -> (224, 224)    True   False  bilinear  num_threads=2  1.3X  1.6ms vs 1.2ms
(1, 3, 270, 268) -> (224, 224)    False  True   bilinear  num_threads=2  0.8X  0.9ms vs 1.1ms
(1, 3, 270, 268) -> (224, 224)    False  False  bilinear  num_threads=2  1.3X  1.3ms vs 1.0ms
(1, 3, 270, 268) -> (224, 224)    True   True   bicubic   num_threads=2  1.4X  2.2ms vs 1.6ms
(1, 3, 270, 268) -> (224, 224)    True   False  bicubic   num_threads=2  1.9X  2.8ms vs 1.5ms
(1, 3, 270, 268) -> (224, 224)    False  True   bicubic   num_threads=2  0.8X  1.1ms vs 1.4ms
(1, 3, 270, 268) -> (224, 224)    False  False  bicubic   num_threads=2  1.7X  2.1ms vs 1.3ms

(1, 3, 1024, 1024) -> (256, 256)  True   True   bilinear  num_threads=1  1.0X  10.0ms vs 9.9ms
(1, 3, 1024, 1024) -> (256, 256)  True   False  bilinear  num_threads=1  0.7X  4.6ms vs 6.2ms
(1, 3, 1024, 1024) -> (256, 256)  False  True   bilinear  num_threads=1  0.9X  9.1ms vs 9.8ms
(1, 3, 1024, 1024) -> (256, 256)  False  False  bilinear  num_threads=1  1.7X  9.4ms vs 5.7ms
(1, 3, 1024, 1024) -> (256, 256)  True   True   bicubic   num_threads=1  1.0X  15.2ms vs 14.8ms
(1, 3, 1024, 1024) -> (256, 256)  True   False  bicubic   num_threads=1  1.0X  7.6ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)  False  True   bicubic   num_threads=1  0.9X  13.3ms vs 14.4ms
(1, 3, 1024, 1024) -> (256, 256)  False  False  bicubic   num_threads=1  0.8X  5.9ms vs 7.0ms

(1, 3, 1024, 1024) -> (256, 256)  True   True   bilinear  num_threads=2  1.2X  6.0ms vs 5.2ms
(1, 3, 1024, 1024) -> (256, 256)  True   False  bilinear  num_threads=2  0.7X  2.3ms vs 3.2ms
(1, 3, 1024, 1024) -> (256, 256)  False  True   bilinear  num_threads=2  1.0X  4.8ms vs 5.0ms
(1, 3, 1024, 1024) -> (256, 256)  False  False  bilinear  num_threads=2  0.7X  1.9ms vs 2.9ms
(1, 3, 1024, 1024) -> (256, 256)  True   True   bicubic   num_threads=2  1.6X  12.3ms vs 7.5ms
(1, 3, 1024, 1024) -> (256, 256)  True   False  bicubic   num_threads=2  1.0X  3.9ms vs 3.9ms
(1, 3, 1024, 1024) -> (256, 256)  False  True   bicubic   num_threads=2  1.0X  7.0ms vs 7.3ms
(1, 3, 1024, 1024) -> (256, 256)  False  False  bicubic   num_threads=2  0.9X  3.0ms vs 3.5ms
```
</details>

Benchmark code

<details>

```py
import operator_benchmark as op_bench
import torch

"""Microbenchmarks for interpolate operator."""

class InterpolateBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, input_size, output_size, channels_last=False, mode='linear', antialias=False, dtype=torch.float):
        input_image = torch.randint(0, 256, size=input_size, dtype=torch.uint8, device='cpu')
        if channels_last:
            input_image = input_image.contiguous(memory_format=torch.channels_last)

        self.inputs = {
            "input_image": input_image,
            "output_size": output_size,
            "mode": mode,
            "antialias": antialias,
            "dtype": dtype,
        }
        self.set_module_name("interpolate")

    def forward(self, input_image, output_size, mode, antialias, dtype):
        if dtype == torch.float:
            input_image = input_image.float()
        out = torch.nn.functional.interpolate(input_image, size=output_size, mode=mode, align_corners=False, antialias=antialias)
        if dtype == torch.float:
            out = out.round().clamp(min=0, max=256).to(torch.uint8)

def make_config():
    sizes = (
        ((224, 224), (64, 64)),
        ((270, 268), (224, 224)),
        ((256, 256), (1024, 1024)),
    )

    attrs = []
    for (HW1, HW2) in sizes:
        attrs.append([(1, 3, *HW1), HW2])  # 3 channels
        # attrs.append([(1, 1, *HW1), HW2])  # 1 channel
        attrs.append([(1, 3, *HW2), HW1])  # 3 channels
        # attrs.append([(1, 1, *HW2), HW1])  # 1 channel

    config = op_bench.config_list(
        attr_names=["input_size", "output_size"],
        attrs=attrs,
        cross_product_configs={
            'channels_last': [True, False],
            'mode': ["bilinear", "bicubic"],
            'antialias': [True, False],
            # 'dtype': [torch.float, torch.uint8]
            # 'dtype': [torch.uint8]
            'dtype': [torch.float]
        },
        tags=["short"],
    )
    return config

config = make_config()
op_bench.generate_pt_test(config, InterpolateBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```

```py
import re
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("f1", nargs="?", default="main")
parser.add_argument("f2", nargs="?", default="new")
args = parser.parse_args()

with open(args.f1) as f:
    main = f.readlines()
with open(args.f2) as f:
    new = f.readlines()

out = []
for main_line, new_line in zip(main, new):
    # num_threads=1  # TODO: remove
    if main_line.startswith("num_threads="):
        num_threads = int(main_line.split("=")[-1])
    if main_line.startswith("# Input"):
        deets = f"{main_line.strip()}, {num_threads=}"
    if main_line.startswith("Forward"):
        main_time = float(main_line.split()[-1])
        new_time = float(new_line.split()[-1])
        ratio = main_time / new_time
        fmt = ".1f" if ratio < 3 else ".0f"
        improv = f"{ratio:{fmt}}X"
        time_fmt = ",.3f" if new_time < 100 else ",.1f"
        deets = deets.strip().replace("# Input: ", "")
        deets = deets.replace(": ", "=")
        deets = deets.replace("input_size=", "")
        deets = deets.replace(", output_size=", " -> ")
        deets = deets.replace("dtype=torch.", "")
        deets = deets.replace("mode=", "")
        deets = deets.replace("antialias=", "")
        deets = deets.replace("channels_last=", "")
        # deets = deets.replace("channels_last=True, ", "")
        split = deets.split(",")
        # size = ','.join(split[:-3])
        # mode, dtype, threads = split[-3:]
        # deets = f"{size:<30} {mode:<15} {dtype:<10} {threads:<15}"
        size = ','.join(split[:-5])
        channels_last, mode, antialias, dtype, threads = split[-5:]
        deets = f"{size:<33} {channels_last:<7} {antialias:<7} {mode:<10} {threads:<15}"
        l = f"{deets} {improv:<5} {main_time / 1000:{time_fmt}}ms vs {new_time / 1000:{time_fmt}}ms"
        out.append(l)

def key(s):
    # s = ''.join(s.split()[1:])  # remove "N.nX" part
    num_threads = (int(re.findall(r"num_threads=(\d+)", s)[0]),)
    input_shape, output_shape = re.findall(r"\(.*?\)", s)
    input_shape = input_shape[1:-1]  # remove parenthesis
    input_HW = tuple(int(x) for x in input_shape.split(",")[-2:])
    input_C = (-int(input_shape.split(",")[1]),)
    output_HW = tuple(int(x) for x in output_shape[1:-1].split(","))
    is_downsample = (output_HW[0] < input_HW[0],)
    if "linear" in s:
        mode = "linear"
    elif "nearest-exact" in s:
        mode = "nearest-exact"
    else:
        # assert "nearest" in s
        mode = "nearest"
    mode = (mode,)
    return is_downsample + input_HW + output_HW + num_threads + input_C + mode

for i, l in enumerate(sorted(out, key=key)):
    if i % 8 == 0:
        print()
    # if i % 10 == 0 and i % 40 != 0:
    #     print()
    # if i % 40 == 0:
    #     print("-" * 100)
    print(l)
```

</details>

Pull Request resolved: pytorch#90771
Approved by: https://github.com/peterbell10, https://github.com/ngimel
… (channels last) (pytorch#96848)

## Description

- Based on pytorch#96651
- Improved perf for the vectorized **bilinear** uint8 RGB case, **channels last**
- Unified the RGB and RGBA processing code such that RGB input is no longer copied into RGBA
- Performance is closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
- RGBA case perf is unchanged after the refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from pytorch#96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[------------------------------------------------------------------- Resize -------------------------------------------------------------------]
                                                                       | Pillow (9.0.0.post1) | torch (2.1.0a0+gitd6e220c) PR | torch (2.1.0a0+git2b75955) nightly | Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True    |  38.674 (+-0.323)    |  57.591 (+-0.244)             |  131.033 (+-1.448)                 |  2.275 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False   |                      |  39.471 (+-0.166)             |  113.911 (+-1.736)                 |  2.886 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True  |  128.512 (+-1.916)   |  161.592 (+-1.242)            |  299.679 (+-2.099)                 |  1.855 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False |                      |  150.994 (+-1.180)            |  285.331 (+-1.919)                 |  1.890 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True  |  180.045 (+-2.223)   |  220.581 (+-1.363)            |  431.057 (+-3.536)                 |  1.954 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False |                      |  219.391 (+-1.409)            |  429.410 (+-3.620)                 |  1.957 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True    |  113.911 (+-1.024)   |  129.457 (+-1.295)            |  459.610 (+-13.322)                |  3.550 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False   |                      |  59.800 (+-0.199)             |  400.015 (+-11.815)                |  6.689 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True  |  283.050 (+-2.664)   |  339.143 (+-1.209)            |  683.555 (+-4.466)                 |  2.016 (+-0.000)
3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False |                      |  250.601 (+-1.236)            |  603.545 (+-2.644)                 |  2.408 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True    |  186.723 (+-2.213)   |  199.960 (+-1.343)            |  860.867 (+-21.763)                |  4.305 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False   |                      |  79.188 (+-0.261)             |  703.019 (+-25.805)                |  8.878 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True  |  412.353 (+-4.476)   |  462.230 (+-1.983)            |  1101.673 (+-49.299)               |  2.383 (+-0.000)
3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False |                      |  327.973 (+-1.852)            |  941.062 (+-5.549)                 |  2.869 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True    |  61.191 (+-0.926)    |  80.795 (+-0.518)             |  160.853 (+-1.506)                 |  1.991 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True  |  134.488 (+-2.129)   |  169.147 (+-1.324)            |  327.343 (+-2.846)                 |  1.935 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True |  1037.045 (+-24.982) |  938.623 (+-9.010)           |  2603.360 (+-20.530)               |  2.774 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True    |  52.792 (+-0.613)    |  73.692 (+-0.264)             |  131.829 (+-1.333)                 |  1.789 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True  |  139.596 (+-1.944)   |  173.778 (+-1.039)            |  320.063 (+-2.562)                 |  1.842 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True |  690.132 (+-10.946) |  772.758 (+-2.864)            |  2036.860 (+-36.109)               |  2.636 (+-0.000)
3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False   |                      |  78.747 (+-0.799)             |  158.479 (+-1.702)                 |  2.013 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False |                      |  167.046 (+-1.077)            |  322.104 (+-2.764)                 |  1.928 (+-0.000)
3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False |                    |  918.967 (+-5.251)            |  2611.388 (+-29.917)               |  2.842 (+-0.000)
3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False   |                      |  55.336 (+-0.251)             |  113.869 (+-1.243)                 |  2.058 (+-0.000)
3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False |                      |  156.505 (+-1.095)            |  299.861 (+-2.710)                 |  1.916 (+-0.000)
3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False |                    |  514.344 (+-1.905)            |  1776.796 (+-19.660)               |  3.454 (+-0.000)
```

Note: There is no perf regression for the other cases. Some cases (see Source below) show small speed-ups; for the rest the ratio is roughly 1.0 +/- 0.1, which may be attributed to measurement noise.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- pytorch#90771

Pull Request resolved: pytorch#96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10
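Much of the channels-last speed-up comes from memory layout: in NHWC (channels last) the R, G, B values of one pixel are adjacent in memory, so a horizontal interpolation pass can load and weight a whole pixel at a time, whereas in NCHW the channels of a pixel are a full `H*W` plane apart. A small sketch of the offset arithmetic behind the two layouts (the helper names here are illustrative, not PyTorch APIs):

```python
def offset_nchw(n, c, h, w, C, H, W):
    # Contiguous (channels-first) layout: channel planes are far apart.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    # Channels-last layout: the C values of one pixel are adjacent, which is
    # what lets a vectorized kernel process a whole RGB(A) pixel per load.
    return ((n * H + h) * W + w) * C + c

C, H, W = 3, 4, 4
# In NHWC the three channels of pixel (h=1, w=2) are consecutive offsets:
nhwc = [offset_nhwc(0, c, 1, 2, C, H, W) for c in range(C)]
# In NCHW they are H*W = 16 elements apart:
nchw = [offset_nchw(0, c, 1, 2, C, H, W) for c in range(C)]
```

Running this gives `nhwc == [18, 19, 20]` (consecutive) and `nchw == [6, 22, 38]` (strided), which is why the channels-last path can avoid the gather/scatter work the contiguous path needs.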
Joint work with @vfdev-5
This PR introduces native uint8 support for `interpolate()`, for the `bilinear` and `bicubic` modes, for CPU images (`mode=nearest[_exact]` was already supported). On a typical torchvision training job on ImageNet, the speedup is ~4X when AVX2 is supported, comparing the native uint8 path (this PR) vs torchvision's current `Resize()`. There is still room for further speed-ups (see TODOs in the code).
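Without native uint8 support, the usual workaround is a uint8 → float → interpolate → round/clamp → uint8 round-trip (this is also what the benchmark's `forward` does for the float configurations). A minimal pure-Python sketch of that round-trip for a single-channel image, using plain (non-antialiased) bilinear sampling with `align_corners=False` semantics; the helper names are illustrative, not part of the PR:

```python
def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of a 2D float image, align_corners=False convention."""
    in_h, in_w = len(img), len(img[0])
    scale_h, scale_w = in_h / out_h, in_w / out_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for oy in range(out_h):
        # Map the output pixel center back to input coordinates.
        fy = max((oy + 0.5) * scale_h - 0.5, 0.0)
        y0 = min(int(fy), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        for ox in range(out_w):
            fx = max((ox + 0.5) * scale_w - 0.5, 0.0)
            x0 = min(int(fx), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bot = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            out[oy][ox] = top * (1 - wy) + bot * wy
    return out

def resize_uint8_via_float(img_u8, out_h, out_w):
    """The uint8 -> float -> bilinear -> round -> clamp -> uint8 round-trip."""
    out_f = bilinear_resize([[float(v) for v in row] for row in img_u8], out_h, out_w)
    return [[min(max(int(round(v)), 0), 255) for v in row] for row in out_f]

img = [[0, 255], [255, 0]]
res = resize_uint8_via_float(img, 4, 4)
```

The native uint8 path this PR adds skips the two dtype conversions and the intermediate float buffer, which is where much of the speedup comes from.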
More benchmark details
with AVX2 support - speedups typically range from 1.5X to 10X. A few edge cases are slower; it would be worth investigating why.
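A common way to get uint8 interpolation fast enough for AVX2 integer instructions to pay off is fixed-point arithmetic: quantize the interpolation weights to integers and replace the float multiply-adds with integer multiplies plus a rounding shift. The scalar sketch below illustrates that general technique only; the constant and helper names are assumptions for illustration, not the actual kernel in this PR:

```python
PRECISION = 15  # number of fractional bits for the quantized weights

def to_fixed(w):
    # Quantize a float weight in [0, 1] to an integer with PRECISION fractional bits.
    return int(round(w * (1 << PRECISION)))

def lerp_u8(a, b, w):
    """Blend two uint8 values with float weight w using integer-only math."""
    wi = to_fixed(w)
    # Weighted sum in integers, with half added before the shift for rounding.
    acc = a * ((1 << PRECISION) - wi) + b * wi + (1 << (PRECISION - 1))
    return min(max(acc >> PRECISION, 0), 255)
```

For example, `lerp_u8(0, 255, 0.5)` yields 128 with no float operation in the inner expression; a SIMD kernel would do the same multiply-add-shift on 16 or 32 lanes at once.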
Details
without AVX2 support - no significant speed-up, but there are various possible improvements (see TODOs)
Details
Benchmark code
Details
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @datumbox @vfdev-5 @pmeier