Update U8 processing for non-bitexact linear resize #1
terfendail wants to merge 3 commits into pmur:resize
HResizeLinear appears to be unrolled 2x against k, but k is only incremented by 1 on each iteration of the unrolled loop. This results in k - 1 duplicate passes when k > 1. Likewise, the final pass may not respect the work already done by the vector loop; start it at the offset returned by the vector op, if one is implemented. Note that no vector ops are implemented today. The effect is most noticeable on a linear downscale. A set of performance tests is added to characterize this. The performance improvement is 10-50%, depending on the scaling.
Performance is mostly gated by the gather operations for the x inputs. Likewise, provide a 2x unroll against k; this halves the number of alpha gathers for larger k. While not a 4x improvement, it still performs substantially better on P9, for a 1.4x improvement. The P8 baseline improves by only 1.05-1.10x due to its reduced VSX instruction set. Likewise, for float types, this results in a more modest 1.2x improvement.
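The loop structure described above can be sketched in scalar C++ as follows. This is only an illustration under stated assumptions, not the code in this PR: `hresizeLinearSketch`, its parameters, and the use of `std::vector<float>` rows are hypothetical, and `dx0` stands in for the offset a vector op would return. Rows are taken two at a time with `k += 2`, the weights are loaded once per destination element and reused for both rows, and the tail row also starts at `dx0`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified horizontal linear pass (not the OpenCV source).
// src/dst: one vector per row; each dst row must hold xofs.size() elements.
// xofs:    source offset for each destination element.
// alpha:   two interpolation weights per destination element.
// dx0:     number of destination elements already written by a vector pre-pass.
void hresizeLinearSketch(const std::vector<std::vector<float>>& src,
                         std::vector<std::vector<float>>& dst,
                         const std::vector<int>& xofs,
                         const std::vector<float>& alpha,
                         int cn, int dx0)
{
    const int count = static_cast<int>(src.size());
    const int dwidth = static_cast<int>(xofs.size());
    assert(dst.size() == src.size());
    assert(alpha.size() == xofs.size() * 2);
    assert(dx0 >= 0 && dx0 <= dwidth);

    int k = 0;
    // Two rows per iteration: advance k by 2 so that no row is processed twice.
    for (; k + 1 < count; k += 2)
    {
        const float* S0 = src[k].data();
        const float* S1 = src[k + 1].data();
        float* D0 = dst[k].data();
        float* D1 = dst[k + 1].data();
        for (int dx = dx0; dx < dwidth; dx++)
        {
            const int sx = xofs[dx];
            // One load of the two weights serves both rows.
            const float a0 = alpha[dx * 2], a1 = alpha[dx * 2 + 1];
            D0[dx] = S0[sx] * a0 + S0[sx + cn] * a1;
            D1[dx] = S1[sx] * a0 + S1[sx + cn] * a1;
        }
    }
    // Tail: at most one row is left; it also skips the first dx0 elements.
    for (; k < count; k++)
    {
        const float* S = src[k].data();
        float* D = dst[k].data();
        for (int dx = dx0; dx < dwidth; dx++)
        {
            const int sx = xofs[dx];
            D[dx] = S[sx] * alpha[dx * 2] + S[sx + cn] * alpha[dx * 2 + 1];
        }
    }
}
```

With an odd row count the last row goes through the tail loop, so every row is written exactly once regardless of `count`.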
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
pmur left a comment:
The 4-channel numbers look good. I see a minor regression on 8u3 against the parent commit, but that may still be faster than with neither change applied.
Geometric mean (ms)
Name of Test resize resize resize vs resize (x-factor)
resizeDownLinearNonExact::MatInfo_SizePair::(8UC1, (640x480, 320x240)) 0.081 0.084 0.97
resizeDownLinearNonExact::MatInfo_SizePair::(8UC1, (960x540, 640x480)) 1.799 1.372 1.31
resizeDownLinearNonExact::MatInfo_SizePair::(8UC1, (1280x720, 213x120)) 0.215 0.183 1.17
resizeDownLinearNonExact::MatInfo_SizePair::(8UC1, (1280x720, 320x240)) 0.597 0.495 1.20
resizeDownLinearNonExact::MatInfo_SizePair::(8UC1, (1280x720, 640x480)) 2.026 1.592 1.27
resizeDownLinearNonExact::MatInfo_SizePair::(16UC1, (640x480, 320x240)) 0.149 0.151 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC1, (960x540, 640x480)) 2.180 2.230 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC1, (1280x720, 213x120)) 0.262 0.267 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC1, (1280x720, 320x240)) 0.755 0.769 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC1, (1280x720, 640x480)) 2.505 2.556 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(32FC1, (640x480, 320x240)) 0.179 0.180 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(32FC1, (960x540, 640x480)) 2.140 2.087 1.03
resizeDownLinearNonExact::MatInfo_SizePair::(32FC1, (1280x720, 213x120)) 0.258 0.255 1.01
resizeDownLinearNonExact::MatInfo_SizePair::(32FC1, (1280x720, 320x240)) 0.736 0.702 1.05
resizeDownLinearNonExact::MatInfo_SizePair::(32FC1, (1280x720, 640x480)) 2.450 2.379 1.03
resizeDownLinearNonExact::MatInfo_SizePair::(8UC2, (640x480, 320x240)) 2.368 2.493 0.95
resizeDownLinearNonExact::MatInfo_SizePair::(8UC2, (960x540, 640x480)) 3.497 2.039 1.71
resizeDownLinearNonExact::MatInfo_SizePair::(8UC2, (1280x720, 213x120)) 0.404 0.296 1.37
resizeDownLinearNonExact::MatInfo_SizePair::(8UC2, (1280x720, 320x240)) 1.170 0.850 1.38
resizeDownLinearNonExact::MatInfo_SizePair::(8UC2, (1280x720, 640x480)) 3.952 2.560 1.54
resizeDownLinearNonExact::MatInfo_SizePair::(16UC2, (640x480, 320x240)) 2.233 2.284 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC2, (960x540, 640x480)) 4.287 4.303 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC2, (1280x720, 213x120)) 0.513 0.512 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC2, (1280x720, 320x240)) 1.469 1.495 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC2, (1280x720, 640x480)) 5.024 5.031 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(32FC2, (640x480, 320x240)) 2.175 2.154 1.01
resizeDownLinearNonExact::MatInfo_SizePair::(32FC2, (960x540, 640x480)) 4.116 4.100 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(32FC2, (1280x720, 213x120)) 0.482 0.484 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(32FC2, (1280x720, 320x240)) 1.381 1.395 0.99
resizeDownLinearNonExact::MatInfo_SizePair::(32FC2, (1280x720, 640x480)) 4.743 4.711 1.01
resizeDownLinearNonExact::MatInfo_SizePair::(8UC3, (640x480, 320x240)) 0.394 0.394 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(8UC3, (960x540, 640x480)) 5.204 5.379 0.97
resizeDownLinearNonExact::MatInfo_SizePair::(8UC3, (1280x720, 213x120)) 0.594 0.691 0.86
resizeDownLinearNonExact::MatInfo_SizePair::(8UC3, (1280x720, 320x240)) 1.726 2.019 0.86
resizeDownLinearNonExact::MatInfo_SizePair::(8UC3, (1280x720, 640x480)) 5.839 6.475 0.90
resizeDownLinearNonExact::MatInfo_SizePair::(16UC3, (640x480, 320x240)) 0.923 0.923 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC3, (960x540, 640x480)) 6.432 6.459 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC3, (1280x720, 213x120)) 0.753 0.749 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC3, (1280x720, 320x240)) 2.229 2.225 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC3, (1280x720, 640x480)) 7.483 7.412 1.01
resizeDownLinearNonExact::MatInfo_SizePair::(32FC3, (640x480, 320x240)) 3.149 3.264 0.96
resizeDownLinearNonExact::MatInfo_SizePair::(32FC3, (960x540, 640x480)) 6.221 6.042 1.03
resizeDownLinearNonExact::MatInfo_SizePair::(32FC3, (1280x720, 213x120)) 0.715 0.715 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(32FC3, (1280x720, 320x240)) 2.116 2.069 1.02
resizeDownLinearNonExact::MatInfo_SizePair::(32FC3, (1280x720, 640x480)) 7.092 7.045 1.01
resizeDownLinearNonExact::MatInfo_SizePair::(8UC4, (640x480, 320x240)) 0.304 0.304 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(8UC4, (960x540, 640x480)) 6.821 3.707 1.84
resizeDownLinearNonExact::MatInfo_SizePair::(8UC4, (1280x720, 213x120)) 0.767 0.420 1.83
resizeDownLinearNonExact::MatInfo_SizePair::(8UC4, (1280x720, 320x240)) 2.295 1.217 1.89
resizeDownLinearNonExact::MatInfo_SizePair::(8UC4, (1280x720, 640x480)) 7.738 4.081 1.90
resizeDownLinearNonExact::MatInfo_SizePair::(16UC4, (640x480, 320x240)) 0.561 0.561 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC4, (960x540, 640x480)) 8.522 8.668 0.98
resizeDownLinearNonExact::MatInfo_SizePair::(16UC4, (1280x720, 213x120)) 1.001 0.999 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(16UC4, (1280x720, 320x240)) 2.913 2.882 1.01
resizeDownLinearNonExact::MatInfo_SizePair::(16UC4, (1280x720, 640x480)) 9.792 9.931 0.99
resizeDownLinearNonExact::MatInfo_SizePair::(32FC4, (640x480, 320x240)) 0.399 0.391 1.02
resizeDownLinearNonExact::MatInfo_SizePair::(32FC4, (960x540, 640x480)) 8.231 7.931 1.04
resizeDownLinearNonExact::MatInfo_SizePair::(32FC4, (1280x720, 213x120)) 0.941 0.941 1.00
resizeDownLinearNonExact::MatInfo_SizePair::(32FC4, (1280x720, 320x240)) 2.850 2.769 1.03
resizeDownLinearNonExact::MatInfo_SizePair::(32FC4, (1280x720, 640x480)) 9.594 9.336 1.03
resizeUpLinearNonExact::MatInfo_SizePair::(8UC1, (640x480, 960x540)) 2.658 2.028 1.31
resizeUpLinearNonExact::MatInfo_SizePair::(8UC1, (640x480, 1280x720)) 3.853 2.944 1.31
resizeUpLinearNonExact::MatInfo_SizePair::(16UC1, (640x480, 960x540)) 3.204 3.226 0.99
resizeUpLinearNonExact::MatInfo_SizePair::(16UC1, (640x480, 1280x720)) 4.495 4.549 0.99
resizeUpLinearNonExact::MatInfo_SizePair::(32FC1, (640x480, 960x540)) 3.036 2.955 1.03
resizeUpLinearNonExact::MatInfo_SizePair::(32FC1, (640x480, 1280x720)) 4.242 4.193 1.01
resizeUpLinearNonExact::MatInfo_SizePair::(8UC2, (640x480, 960x540)) 5.109 3.003 1.70
resizeUpLinearNonExact::MatInfo_SizePair::(8UC2, (640x480, 1280x720)) 7.289 4.574 1.59
resizeUpLinearNonExact::MatInfo_SizePair::(16UC2, (640x480, 960x540)) 6.172 6.320 0.98
resizeUpLinearNonExact::MatInfo_SizePair::(16UC2, (640x480, 1280x720)) 8.790 8.962 0.98
resizeUpLinearNonExact::MatInfo_SizePair::(32FC2, (640x480, 960x540)) 5.985 5.899 1.01
resizeUpLinearNonExact::MatInfo_SizePair::(32FC2, (640x480, 1280x720)) 8.385 8.210 1.02
resizeUpLinearNonExact::MatInfo_SizePair::(8UC3, (640x480, 960x540)) 7.625 7.836 0.97
resizeUpLinearNonExact::MatInfo_SizePair::(8UC3, (640x480, 1280x720)) 11.091 11.137 1.00
resizeUpLinearNonExact::MatInfo_SizePair::(16UC3, (640x480, 960x540)) 9.243 9.463 0.98
resizeUpLinearNonExact::MatInfo_SizePair::(16UC3, (640x480, 1280x720)) 13.014 13.376 0.97
resizeUpLinearNonExact::MatInfo_SizePair::(32FC3, (640x480, 960x540)) 8.932 8.717 1.02
resizeUpLinearNonExact::MatInfo_SizePair::(32FC3, (640x480, 1280x720)) 12.762 12.150 1.05
resizeUpLinearNonExact::MatInfo_SizePair::(8UC4, (640x480, 960x540)) 10.124 5.679 1.78
resizeUpLinearNonExact::MatInfo_SizePair::(8UC4, (640x480, 1280x720)) 14.647 8.679 1.69
resizeUpLinearNonExact::MatInfo_SizePair::(16UC4, (640x480, 960x540)) 12.291 12.495 0.98
resizeUpLinearNonExact::MatInfo_SizePair::(16UC4, (640x480, 1280x720)) 17.474 17.744 0.98
resizeUpLinearNonExact::MatInfo_SizePair::(32FC4, (640x480, 960x540)) 11.949 11.492 1.04
resizeUpLinearNonExact::MatInfo_SizePair::(32FC4, (640x480, 1280x720)) 16.816 16.245 1.04
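For reading the table, the x-factor column appears consistent with dividing the first timing column by the second, so values above 1 mean the second build is faster. The sketch below shows that arithmetic along with a geometric mean over positive samples. The function names `geometricMean` and `xFactor` are hypothetical, and the interpretation of the columns is inferred from the numbers rather than stated in the thread. Because the table shows timings rounded to three decimals, a few rows (for example 0.081 vs 0.084) do not reproduce the printed ratio exactly.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive timing samples in milliseconds.
double geometricMean(const std::vector<double>& samples)
{
    assert(!samples.empty());
    double logSum = 0.0;
    for (double s : samples)
        logSum += std::log(s);  // summing logs avoids overflow of a raw product
    return std::exp(logSum / static_cast<double>(samples.size()));
}

// Assumed meaning of the x-factor column: first timing divided by second.
double xFactor(double firstMs, double secondMs)
{
    return firstMs / secondMs;
}
```

For the 8UC1 960x540 to 640x480 row, `xFactor(1.799, 1.372)` gives roughly 1.31, which matches the value in the table.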
The regression is an artifact of the suboptimal v_load_expand_q on PPC. I pulled in my copymask PR, which rewrites this, and the regression is gone. Thanks! @terfendail what is the preferred path to get this patch into the PR? I hold no strong opinions. I can keep the HAL improvement in PR 15596 or move it to a separate PR.
It looks like there is a reasonable architecture/API discussion related to universal intrinsics in PR#15596. Improving a suboptimal intrinsic is certainly a must-have, so I would prefer to extract the improvement into a separate PR that can be merged easily and quickly, so that all PPC users benefit ASAP.
8706162 to 3e14ba5
relates opencv#15257
This pull request changes
I've investigated the resize performance degradation for the SSE2/SSE3 baselines, and it looks like an issue with gathering the source data into vectors. I've updated U8 processing with channel-number-specific branches, which provides a performance improvement of about 1.5x for SSE2/SSE3.
Could you please check whether this change works for VSX?
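As a rough scalar illustration of what channel-number-specific branches can look like, the sketch below dispatches on `cn` to a template instantiated with a compile-time channel count. It is not the SIMD code from this PR: `hresizeU8Pixels`, `hresizeU8`, the per-pixel offset array `xofsPix`, and the fixed-point `short` weights (scale 2048 in the example) are all assumptions made for the example. The point is that with a fixed channel count, all channels of one source pixel are contiguous and can be fetched together, and the weights are read once per pixel instead of once per element.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: CN is a compile-time channel count, so the inner
// per-channel loop has a fixed trip count that the compiler can unroll.
template <int CN>
void hresizeU8Pixels(const std::uint8_t* S, int* D, const int* xofsPix,
                     const short* alpha, int dpixels)
{
    assert(dpixels >= 0);
    for (int px = 0; px < dpixels; px++)
    {
        // Left source pixel; its right neighbor starts CN bytes later.
        const std::uint8_t* p = S + xofsPix[px] * CN;
        // Weights are shared by every channel of this destination pixel.
        const int a0 = alpha[px * 2], a1 = alpha[px * 2 + 1];
        for (int c = 0; c < CN; c++)
            D[px * CN + c] = p[c] * a0 + p[c + CN] * a1;
    }
}

// Runtime dispatch to a channel-specific branch.
// Returns false when cn has no specialized branch in this sketch.
bool hresizeU8(const std::uint8_t* S, int* D, const int* xofsPix,
               const short* alpha, int dpixels, int cn)
{
    switch (cn)
    {
    case 1: hresizeU8Pixels<1>(S, D, xofsPix, alpha, dpixels); return true;
    case 2: hresizeU8Pixels<2>(S, D, xofsPix, alpha, dpixels); return true;
    case 3: hresizeU8Pixels<3>(S, D, xofsPix, alpha, dpixels); return true;
    case 4: hresizeU8Pixels<4>(S, D, xofsPix, alpha, dpixels); return true;
    default: return false;
    }
}
```

A caller would fall back to a generic per-element path whenever `hresizeU8` returns false.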