Vectorize minMaxIdx functions #15488
alalek merged 9 commits into opencv:3.4 from ChipKerchner:vectorizeMinMax2
Conversation
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Need someone with ARM NEON experience to make sure the new v_reduce_min & v_reduce_max intrinsics for the v_uint16x8 and v_int16x8 types are correct. I used the v_uint8x16 and v_int8x16 versions as examples. Testing (both unit and independent) passes.
modules/core/src/minmax.cpp
Outdated

#if CV_SIMD128
#ifdef _MSC_VER
#define forceinline __forceinline
There is a similar CV_ALWAYS_INLINE macro in cvdef.h.
modules/core/src/minmax.cpp
Outdated

#endif
#endif

#define MINMAXIDX_REDUCE(suffix, RT, valMin, valMax, idxMin, idxMax, none, \
Is it possible to reduce the size of this macro, or even omit it?
I don't see how to omit it. Suggestions?
Found a bug in v_int64x2 comparisons. See #15738.

@ChipKerchner Friendly reminder.
}

#define OPENCV_HAL_IMPL_NEON_REDUCE_OP_16(_Tpvec, _Tpnvec, scalartype, func, vectorfunc, suffix) \
inline scalartype v_reduce_##func(const _Tpvec& a) \
Documentation for the v_reduce support matrix is here.
If you want to use them in OpenCV algorithm code, then the corresponding documentation and intrinsic tests should be updated too.
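For the support-matrix discussion above, a scalar reference model makes the contract concrete: whatever pairwise folding a NEON implementation uses internally, v_reduce_min/v_reduce_max over an 8-lane 16-bit vector must return the plain lane-wise minimum/maximum. The sketch below is illustrative C++ (names are mine, not OpenCV's), of the kind an intrinsic test could compare against:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar reference model of the v_reduce_min / v_reduce_max contract for an
// N-lane vector (e.g. N == 8 for v_uint16x8 / v_int16x8). Each array element
// stands for one SIMD lane.
template <typename T, std::size_t N>
T reduce_min_ref(const std::array<T, N>& lanes)
{
    T r = lanes[0];
    for (std::size_t i = 1; i < N; ++i)
        r = std::min(r, lanes[i]);
    return r;
}

template <typename T, std::size_t N>
T reduce_max_ref(const std::array<T, N>& lanes)
{
    T r = lanes[0];
    for (std::size_t i = 1; i < N; ++i)
        r = std::max(r, lanes[i]);
    return r;
}
```

A test for the real intrinsics can then load the same lanes into a v_uint16x8 and assert the hardware result matches this model.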
modules/core/src/minmax.cpp
Outdated

minMaxIdx_init( src, mask, minval, maxval, minidx, maxidx, minVal, maxVal, minIdx, maxIdx,
                (int)0, (int)USHRT_MAX, v_uint16x8::nlanes, len, startidx, j, len0 );

if ( len0 - j >= v_uint16x8::nlanes )
Please use this form: j <= len0 - v_uint16x8::nlanes (used in many vectorized loops).
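To illustrate the requested convention: with signed int bounds the two conditions are equivalent, but `j <= len0 - nlanes` is the form used throughout OpenCV's vectorized loops. A minimal scalar sketch (function name and structure are illustrative, not the PR's code):

```cpp
#include <cassert>

// Counts how many full nlanes-wide blocks the vector body would process,
// using the loop-bound form preferred in OpenCV's vectorized loops.
// Indices j .. len0-1 left over after the loop form the scalar tail.
int count_full_blocks(int len0, int nlanes)
{
    int blocks = 0;
    int j = 0;
    for (; j <= len0 - nlanes; j += nlanes)   // main "vector" body
        ++blocks;
    return blocks;
}
```

Note that this form relies on signed arithmetic: with an unsigned len0, `len0 - nlanes` could wrap around when len0 < nlanes, which is why the bounds here are int.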
{A1 A2 A3 ...} => min(A1,A2,A3,...)
@endcode
For 32-bit integer and 32-bit floating point types. */
For all types except 64-bit integer. */
64-bit floats are not supported either.
if ( !mask )
{
    for( ; k < std::min(len0, j + 32764 * 4 * v_float64x2::nlanes); k += 4 * v_float64x2::nlanes )
It makes it a lot easier to handle the 'mask' case down below (v_load_expand_q is really bad on some platforms).
This loop doesn't work with the mask.
Perhaps the loop below can be iterated over the loaded mask size:
- 4 * v_float64x2::nlanes
+ v_uint16x8::nlanes
with a corresponding comment near the "for" noting that loop iterations are performed over mask values.
I'd rather keep it the way it is, for consistency with the other cases.
…r vectorized loops.
…be same as lane type.
#define MINMAXIDX_REDUCE(suffix, suffix2, maxLimit, IR, valMin, valMax, idxMin, \
                         idxMax, none, minVal, maxVal, minIdx, maxIdx, delta) \
template<typename T, typename VT, typename IT> CV_ALWAYS_INLINE void \
IMO the template or macro is redundant here. If there were an opportunity to use something like v_reinterpret_as_vectype<uchar> instead of v_reinterpret_as_u8, I would prefer a template as the more debug-friendly alternative. But at the moment that is impossible, so the macro looks like the only option. Also, the macro parameters could be reduced to suffix2, VT, IT, T. Something like this:
#define MINMAXIDX_REDUCE(suffix2, VT, IT, T) \
CV_ALWAYS_INLINE void minMaxIdx_reduce( VT &valMin, VT &valMax, IT &idxMin, IT &idxMax, IT &none, \
T &minVal, T &maxVal, size_t &minIdx, size_t &maxIdx, \
size_t delta ) \
{ \
if ( v_check_any(idxMin != none) ) \
{ \
minVal = v_reduce_min(valMin); \
minIdx = (size_t)v_reduce_min(v_select(v_reinterpret_as_##suffix2(v_setall((VT::lane_type)minVal) == valMin), \
idxMin, v_setall_##suffix2((IT::lane_type)-1))) + delta; \
} \
if ( v_check_any(idxMax != none) ) \
{ \
maxVal = v_reduce_max(valMax); \
maxIdx = (size_t)v_reduce_min(v_select(v_reinterpret_as_##suffix2(v_setall((VT::lane_type)maxVal) == valMax), \
idxMax, v_setall_##suffix2((IT::lane_type)-1))) + delta; \
} \
}
This is not the same: some types (float and double) do NOT use max as -1.
As far as I understand, this value is used as a "do not select" value, to restrict the result of the following v_reduce_min to indexes of valid search results only. So I don't see a reason that would forbid replacing this value with a greater one.
Could you please share your thoughts on this?
This code is used to create a position location of the min or max that can never occur. Your code could create NaNs for non-integer types, and use of NaNs can be unpredictable.
Position location types are unsigned integers of different widths for all existing instantiations. I doubt that floating point types could ever be used for that, but in any case such a change would require updating the whole algorithm, since idxMax could still contain not-yet-updated parts of the initial none value, which would also have to be treated as NaNs in that case.
EXPECT_EQ((LaneType)((1 + R::nlanes)*R::nlanes/2), v_reduce_sum(a));
EXPECT_EQ((LaneType)1, (LaneType)v_reduce_min(a));
EXPECT_EQ((LaneType)(R::nlanes), (LaneType)v_reduce_max(a));
EXPECT_EQ((LaneType)(sum), (LaneType)v_reduce_sum(a));
I think there should be a cast to int rather than to LaneType, to avoid overflow.
I mean v_reduce_sum only. It's fine to have LaneType for the other reductions, since their return value is already of LaneType.
}

#if CV_SIMD128_64F
CV_ALWAYS_INLINE double v_reduce_min(const v_float64x2& a)
IMO it would be better to extend the overall set of intrinsics.
I think that until the intrinsics for all platforms use v_float64x2 and v_uint64x2 more fully, this is the better approach.
…rameters in MINMAXIDX_REDUCE macro.
Are there any blockers preventing this PR from being added to the 4.2 release?

What's holding up this review?

@alalek please take a look.
It looks like there are three open code-style discussions. However, all of them are about readability, while the performance gain is brilliant.

Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Vectorize minMaxIdx functions.
minMaxIdx_8u & minMaxIdx_8s - 11.1x improvement on VSX and 8.6x speedup on x86.
minMaxIdx_16u & minMaxIdx_16s - 8.3x improvement on VSX and 7.5x speedup on x86.
minMaxIdx_32s - 5.1x improvement on VSX and 4.2x speedup on x86.
minMaxIdx_32f - 4.1x improvement on VSX and 3.2x speedup on x86.
minMaxIdx_64f - 1.6x improvement on VSX and 1.5x speedup on x86.