Convert ImgWarp from SSE SIMD to HAL - 2.8x faster on Power (VSX) and…#15358
Convert ImgWarp from SSE SIMD to HAL - 2.8x faster on Power (VSX) and…#15358alalek merged 5 commits intoopencv:3.4from ChipKerchner:imgwarpToHal
Conversation
|
@terfendail Could you please check performance on x86 (SSE2, SSE4.2, AVX2 CPU baselines)? |
Performance for SSE2 baseline
[94/1970]
Performance for SSE3 baseline
[40/1970]
Performance for SSE4_2 baseline
Performance for AVX2 baseline
|
|
|
||
| #if CV_TRY_SSE4_1 | ||
| Ptr<opt_SSE4_1::WarpPerspectiveLine_SSE4> pwarp_impl_sse4; | ||
| if(CV_CPU_HAS_SUPPORT_SSE4_1) |
There was a problem hiding this comment.
This part is necessary to dynamically call to specific implementation in case SSE4_1-instructions are available.
It looks like existing performance for SSE4 is better than UI-based performance for SSE2 for linear interpolation(at least for some cases) so IMO it make sense to retain dynamic dispatching of opt_SSE4_1::WarpPerspectiveLine_SSE4 here.
@alalek What's your opinion on this?
There was a problem hiding this comment.
Lets keep SSE4_1 optimization here.
What is about performance difference with AVX2 baseline?
There was a problem hiding this comment.
AVX2 baseline is better than SSE4_2 baseline both before and after UI implementation. AVX2 baseline also provide better performance improvement. However AVX2 baseline before UI implementation provide worse performance than SSE4_2 after. So IMO it make sense to implement UI-related dynamic dispatching for the module.
There was a problem hiding this comment.
I changed it from class function and dispatching to static functions. Hopefully this is what you are suggesting.
Do you want me to re-add in the line(s) ?
if(CV_CPU_HAS_SUPPORT_SSE4_1)
There was a problem hiding this comment.
Yes. Checking for CV_CPU_HAS_SUPPORT_SSE4_1 allows to decide on SSE4 availability at runtime and dispatch execution accordingly. Preprocessor check for CV_TRY_SSE4_1 is performed to decide whether dynamic dispatching of the code is possible and necessary(AFAIK this feature is implemented for x86/64 only and could be completely disabled by the user for some reason)
There was a problem hiding this comment.
I think it would be better to retain SSE4_1 implementation the way it was to make this PR cleaner
There was a problem hiding this comment.
Re-add dynamic runtime and dispatch execution.
modules/imgproc/src/imgwarp.cpp
Outdated
| if (pwarp_impl_sse4) | ||
| pwarp_impl_sse4->processNN(M, xy, X0, Y0, W0, bw); | ||
| #if CV_SIMD128_64F | ||
| if (pwarp_impl_CV_SIMD) |
There was a problem hiding this comment.
I think this check is redundant, because the pointer couldn't be empty. I also think that creation of the class is redundant too, because it stores no additional information. IMO it would be better to just call to static functions processNN and process.
modules/imgproc/src/imgwarp.hpp
Outdated
| } | ||
|
|
||
| #if CV_TRY_SSE4_1 | ||
| void WarpPerspectiveLine_ProcessNN_SSE41(const double *M, short* xy, double X0, double Y0, double W0, int bw); |
There was a problem hiding this comment.
I think it would be better to retain SSE4_1 optimizations inside opt_SSE4_1 namespace to avoid possible name conflicts
Convert ImgWarp from SSE SIMD to HAL - 2.8x faster on Power (VSX) and 15% speedup on x86.