HResizeLinear: reduce duplicate work and add partial VSX optimization ... and vectorize more copyMask variants (#15257)
Conversation
modules/imgproc/src/resize.cpp
Outdated
| A1 ... A4 |
| B1 ... B4 |
| */ |
| void inline v_load_expand_deinterlace(const short int*_in, v_int32x4 &out_even, v_int32x4 &out_odd) |
You should use `const short* ptr` as the first parameter to be consistent.
Force-pushed from f3189f9 to efa9dd7.
modules/imgproc/src/resize.cpp
Outdated
| const ST *S1 = src[k+1]; |
| DT *D1 = dst[k+1]; |
| for( dx = 0; dx < (xmax - nlanes); dx+=nlanes ) |
It would be better to use dx <= (xmax - nlanes) here to process one more vector.
The same at L#1558
| #else |
| /* This is compiler-agnostic, but will introduce an unneeded splat on the critical path. */ |
| #define _LXSIWZX(out, ptr, T) out = (T)vec_udword2_sp(*(uint32_t*)(ptr)); |
| #endif |
@seiko2plus thank you for helping fix the llvm HAL bugs I created. Would you be able to review these changes?
@pmur Could you please move the changes related to copyMask to a separate PR?
Force-pushed from 22fb1b5 to 5908f17.
| inline void v_cleanup() {} |
| #if CV_SIMD128_PERMUTE |
| CV_CPU_OPTIMIZATION_HAL_NAMESPACE_END |
| /* Permute 32B into a 16B vector using a control vector. Any control byte value greater than the number of |
| bytes in the aggregate vector results in undefined behavior (e.g. PPC will ignore those bits) */ |
| inline v_uint8x16 v_permute_2(const v_uint8x16 &ctrl, const v_uint8x16 &in_lo, const v_uint8x16 &in_hi) |
I'd prefer the uniform name v_permute for the intrinsic. The number of permuted vectors can be inferred from the parameter list.
| OPENCV_HAL_IMPL_VSX_TRANSPOSE4x4(v_float32x4, vec_float4) |
| CV_CPU_OPTIMIZATION_HAL_NAMESPACE_END |
| /* Permute 32B into a 16B vector using a control vector. Any control byte value greater than the number of |
There should be a plain scalar implementation for all intrinsics in the intrin_cpp.hpp file. Intrinsic documentation should be added to that file as well.
I think I may be overlooking it, but how would I build to include the simulated intrinsics?
The simulated intrinsics file is used to produce the doxygen documentation (the BUILD_DOCS cmake flag leads to CV_DOXYGEN being defined to 1 and non-simulated intrinsics being excluded). Another way to use the simulated intrinsics is the CPU_BASELINE cmake flag: setting it to an empty value prohibits use of any platform-specific features, which disables the non-simulated intrinsics on any platform as well.
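As a concrete sketch of the second option (CPU_BASELINE and CPU_DISPATCH are real OpenCV CMake options; the source path and job count are illustrative):

```shell
# Configure with no platform-specific baseline, so the plain C++ (simulated)
# universal intrinsics from intrin_cpp.hpp are compiled instead of the
# SSE/NEON/VSX implementations.
cmake -DCPU_BASELINE="" -DCPU_DISPATCH="" ../opencv
make -j8
```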
modules/imgproc/src/resize.cpp
Outdated
| void inline v_load_expand_deinterlace(const short *ptr, v_int32x4 &out_even, v_int32x4 &out_odd) |
| { |
| static const v_uint8x16 perm(0,1,4,5,8,9,12,13,2,3,6,7,10,11,14,15); |
| v_int16x8 in = v_reinterpret_as_s16(v_permute_1(perm, v_reinterpret_as_u8(v_load(ptr)))); |
I think it makes sense to guard only this part with CV_SIMD128_PERMUTE. It's possible to substitute the permute & expand with:
v_int32x4 v_src = v_reinterpret_as_s32(v_load(ptr));
out_even = (v_src << 16) >> 16;
out_odd = v_src >> 16;
BTW, the substitution uses the same number of vector operations, so it could give a similar performance gain. @pmur Could you compare the performance of both implementations for VSX?
Ah, neat trick. That should perform nearly identically.
| template<typename ST, typename DT, typename AT, typename DVT> |
| struct HResizeLinearVec_X4 |
| { |
| int operator()(const uchar** _src, uchar** _dst, int count, const int* xofs, |
Wouldn't it be simpler to define the operator as
int operator()(const ST** src, DT** dst, int count, const int* xofs, const AT* alpha, int, int, int cn, int, int xmax) const
| DVT a_odd; |
| v_load_expand_deinterlace(&alpha[dx*2], a_even, a_odd); |
| DVT s0(S0[sx0], S0[sx1], S0[sx2], S0[sx3]); |
The v_lut intrinsic could be used here: DVT s0 = v_lut(S0, xofs + dx);
Almost. There is also an expanding type conversion happening here between ST and DT in most instantiations of the template.
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Looks like the change causes a performance regression for U8 on the SSE2 and SSE3 baselines, and on the SSE4_2 baseline for downscaling. Going to investigate it further.
@terfendail have you had a chance to investigate? Would it be acceptable to guard these optimizations with !CV_SSE? I think architectural differences are showing themselves here. Maybe someone with ARM hardware can benchmark?
There appears to be a 2x unroll of HResizeLinear against k; however, the k value is only incremented by 1 during the unroll. This results in k - 1 duplicate passes when k > 1. Likewise, the final pass may not respect the work done by the vector loop; start it with the offset returned by the vector op if implemented. Note, no vector ops are implemented today. The performance difference is most noticeable on a linear downscale. A set of performance tests is added to characterize this. The performance improvement is 10-50% depending on the scaling.
I've prepared a PR as a result of the performance investigation.
@terfendail thanks, I've cherry-picked your patch without modification, and cherry-picked the changes to avoid the regression caused by VSX v_load_expand_q.
modules/imgproc/src/resize.cpp
Outdated
| IPP by default. For now, disable it on these targets until a more |
| exhaustive performance analysis can be done. |
| */ |
| #if CV_SIMD128 && !CV_SSE |
The !CV_SSE guard isn't necessary now.
Performance is mostly gated by the gather operations for the x inputs. Likewise, provide a 2x unroll against k, which halves the number of alpha gathers for larger k. While not a 4x improvement, it still performs substantially better under P9, a 1.4x improvement; the P8 baseline sees 1.05-1.10x due to the reduced VSX instruction set. For float types, this results in a more modest 1.2x improvement.
With a little help, we can do this quickly without GPRs on all VSX-enabled targets.
Is it possible to increase the timeout for the videoio tests?
Thank you for the updates! Tests from videoio hang sporadically on the "custom" builder (Ubuntu 18.04 + GStreamer), so a timeout will not help there. Just ignore the videoio test.
modules/imgproc/src/resize.cpp
Outdated
| else if(cn == 3) |
| { |
| const int step = 4; |
| const int len0 = xmax & -step; |
That's my fault. Here it should be const int len0 = xmax - step;, and the loop condition should be dx <= len0.
Otherwise overflow is possible, because the actual loop step isn't a multiple of 4.
Per feedback, ensure we don't overrun. This was caught via the failure observed in Test_TensorFlow.inception_accuracy.
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
| for( dx = 0; dx < len0; dx += 3*step/4 ) |
| { |
| v_int16x8 a = v_load(alpha+dx*2); |
| v_store(&D[dx], v_dotprod(v_reinterpret_as_s16(v_load_expand_q(S+xofs[dx]) | (v_load_expand_q(S+xofs[dx]+cn)<<16)), a)); |
v_load_expand_q(S+xofs[dx]+cn)<<16)
Out of buffer access: #16137
One more bug was introduced by this patch: #16189
Fix a typo in the unrolling of HResizeLinear when k > 1.
Next, add a VSX optimization for the 8u -> 32s operation. This provides a 1.4x speedup for P9 and 1.1x for P8 baselines.