Vectorize minMaxIdx functions #15488

Merged
alalek merged 9 commits into opencv:3.4 from ChipKerchner:vectorizeMinMax2
Jan 17, 2020

Conversation

@ChipKerchner
Contributor

@ChipKerchner ChipKerchner commented Sep 9, 2019

Vectorize minMaxIdx functions.

minMaxIdx_8u & minMaxIdx_8s - 11.1x improvement on VSX and 8.6x speedup on x86.
minMaxIdx_16u & minMaxIdx_16s - 8.3x improvement on VSX and 7.5x speedup on x86.
minMaxIdx_32s - 5.1x improvement on VSX and 4.2x speedup on x86.
minMaxIdx_32f - 4.1x improvement on VSX and 3.2x speedup on x86.
minMaxIdx_64f - 1.6x improvement on VSX and 1.5x speedup on x86.
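
For context on what is being vectorized, here is a minimal scalar sketch of a min/max-with-index search. It is an illustration only, not the OpenCV implementation: `MinMaxResult` and `minMaxIdxScalar` are hypothetical names, the element type is fixed to `int`, masks are ignored, and ties are assumed to resolve to the first occurrence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch (not an OpenCV type).
struct MinMaxResult {
    int minVal, maxVal;
    std::size_t minIdx, maxIdx;
};

// Scalar reference: a single pass that keeps the first index at which
// each extreme value occurs.
inline MinMaxResult minMaxIdxScalar(const std::vector<int>& src)
{
    assert(!src.empty());
    MinMaxResult r{src[0], src[0], 0, 0};
    for (std::size_t i = 1; i < src.size(); ++i) {
        if (src[i] < r.minVal) { r.minVal = src[i]; r.minIdx = i; }
        if (src[i] > r.maxVal) { r.maxVal = src[i]; r.maxIdx = i; }
    }
    return r;
}
```

A vectorized version has to produce the same four values while comparing several elements per instruction, which is why the per-lane index bookkeeping discussed later in this conversation matters.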

force_builders=Custom,ARMv7
build_image:Docs=docs-js
#buildworker:Custom=linux-1
#build_image:Custom=mips64el
#build_image:Custom=javascript-simd

buildworker:Custom=linux-3
build_image:Custom=ubuntu:18.04
CPU_BASELINE:Custom=AVX512_SKX
disable_ipp=ON

@ChipKerchner ChipKerchner changed the base branch from master to 3.4 September 9, 2019 19:31
@terfendail
Contributor

Performance for SSE2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.008 | 0.001 | 7.55 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.008 | 0.001 | 7.32 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.008 | 0.002 | 4.07 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.006 | 0.001 | 4.27 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.008 | 0.003 | 2.45 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 2.65 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.32 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.298 | 0.035 | 8.45 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.284 | 0.036 | 7.83 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.318 | 0.073 | 4.37 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.219 | 0.053 | 4.13 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.289 | 0.121 | 2.38 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.289 | 0.115 | 2.52 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.292 | 0.226 | 1.29 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.842 | 0.104 | 8.13 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.850 | 0.106 | 8.02 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 0.965 | 0.214 | 4.51 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.647 | 0.155 | 4.17 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 0.881 | 0.360 | 2.45 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.903 | 0.334 | 2.70 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.912 | 0.671 | 1.36 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 1.996 | 0.232 | 8.60 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.987 | 0.238 | 8.37 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 2.048 | 0.480 | 4.27 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 1.585 | 0.348 | 4.56 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 1.999 | 0.838 | 2.39 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 2.038 | 0.786 | 2.59 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.245 | 1.759 | 1.28 |

Performance for SSE3 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.006 | 0.001 | 5.74 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.006 | 0.001 | 5.65 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.006 | 0.002 | 3.10 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.008 | 0.001 | 5.49 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.008 | 0.003 | 2.43 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 2.64 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.32 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.220 | 0.035 | 6.25 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.217 | 0.036 | 5.99 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.220 | 0.070 | 3.14 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.315 | 0.053 | 5.95 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.288 | 0.118 | 2.45 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.290 | 0.113 | 2.56 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.304 | 0.229 | 1.33 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.689 | 0.104 | 6.65 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.677 | 0.109 | 6.19 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 0.683 | 0.214 | 3.19 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.886 | 0.155 | 5.71 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 0.884 | 0.359 | 2.46 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.902 | 0.339 | 2.66 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.913 | 0.682 | 1.34 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 1.577 | 0.232 | 6.80 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.614 | 0.238 | 6.80 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 1.558 | 0.485 | 3.21 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 1.913 | 0.349 | 5.49 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 1.902 | 0.838 | 2.27 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 1.944 | 0.783 | 2.48 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.155 | 1.775 | 1.21 |

Performance for SSE4_2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.008 | 0.001 | 8.51 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.008 | 0.001 | 10.53 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.009 | 0.001 | 6.11 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.006 | 0.001 | 4.88 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.008 | 0.002 | 4.27 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 2.84 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.30 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.298 | 0.029 | 10.26 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.297 | 0.024 | 12.51 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.348 | 0.054 | 6.49 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.229 | 0.044 | 5.17 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.298 | 0.068 | 4.36 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.305 | 0.106 | 2.87 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.309 | 0.232 | 1.34 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.893 | 0.087 | 10.25 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.883 | 0.070 | 12.66 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 1.028 | 0.158 | 6.51 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.678 | 0.129 | 5.25 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 0.916 | 0.199 | 4.60 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.925 | 0.313 | 2.95 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.936 | 0.692 | 1.35 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 1.988 | 0.196 | 10.16 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.987 | 0.156 | 12.74 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 2.068 | 0.355 | 5.82 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 1.522 | 0.290 | 5.25 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 1.996 | 0.476 | 4.19 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 1.951 | 0.724 | 2.69 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.158 | 1.816 | 1.19 |

Performance for AVX2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.006 | 0.001 | 6.39 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.006 | 0.001 | 7.33 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.008 | 0.002 | 4.89 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.008 | 0.001 | 5.64 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.010 | 0.002 | 4.72 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 3.18 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.43 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.229 | 0.032 | 7.20 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.218 | 0.030 | 7.19 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.309 | 0.059 | 5.23 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.296 | 0.049 | 6.10 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.362 | 0.078 | 4.64 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.292 | 0.095 | 3.09 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.290 | 0.212 | 1.37 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.661 | 0.093 | 7.09 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.654 | 0.079 | 8.30 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 0.896 | 0.172 | 5.21 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.885 | 0.143 | 6.17 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 1.061 | 0.227 | 4.67 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.904 | 0.274 | 3.30 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.911 | 0.641 | 1.42 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 1.530 | 0.209 | 7.31 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.522 | 0.177 | 8.61 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 2.007 | 0.383 | 5.24 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 1.987 | 0.319 | 6.23 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 2.347 | 0.535 | 4.39 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 2.038 | 0.635 | 3.21 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.244 | 1.738 | 1.29 |

@ChipKerchner
Contributor Author

Need someone with ARM NEON experience to make sure the new intrinsics for v_reduce_min & v_reduce_max for the v_uint16x8 and v_int16x8 types are correct.

I used the v_uint8x16 and v_int8x16 versions as an example. Testing (both unit and independent) passes.
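
One way to check a horizontal reduction independently of any SIMD back end is to compare it against a scalar model. The sketch below folds an 8-lane array in halves (8 to 4 to 2 to 1), a common shape for such reductions, and cross-checks the result against a plain scan. It is not the NEON code from this PR, and `reduceMin8` is a hypothetical name used only here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of an 8-lane horizontal min: fold the upper half onto the
// lower half until a single lane remains (8 -> 4 -> 2 -> 1).
inline std::uint16_t reduceMin8(std::array<std::uint16_t, 8> lanes)
{
    // Straightforward answer, computed before the lanes are modified.
    const std::uint16_t expected = *std::min_element(lanes.begin(), lanes.end());

    for (std::size_t width = 4; width >= 1; width /= 2)
        for (std::size_t i = 0; i < width; ++i)
            lanes[i] = std::min(lanes[i], lanes[i + width]);

    assert(lanes[0] == expected);  // the fold must agree with the plain scan
    return lanes[0];
}
```

The same structure works for a max reduction by swapping `std::min` for `std::max`, and for signed lanes by changing the element type.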


#if CV_SIMD128
#ifdef _MSC_VER
#define forceinline __forceinline
Contributor

There is a similar CV_ALWAYS_INLINE macro in cvdef.h.

#endif
#endif

#define MINMAXIDX_REDUCE(suffix, RT, valMin, valMax, idxMin, idxMax, none, \
Contributor

Is it possible to reduce the size of this macro? Or even omit it?

Contributor Author

I don't see how to omit it. Suggestions?

@ChipKerchner
Contributor Author

Found a bug in v_int64x2 comparisons. See #15738.

@asmorkalov
Contributor

@ChipKerchner Friendly reminder.

}

#define OPENCV_HAL_IMPL_NEON_REDUCE_OP_16(_Tpvec, _Tpnvec, scalartype, func, vectorfunc, suffix) \
inline scalartype v_reduce_##func(const _Tpvec& a) \
Member

Documentation for v_reduce support matrix is here.

If you want to use them in OpenCV algorithm code, then the corresponding documentation and intrinsic tests should be updated too.

Contributor Author

Updated.

Member

@alalek alalek left a comment

Thank you for the update!

minMaxIdx_init( src, mask, minval, maxval, minidx, maxidx, minVal, maxVal, minIdx, maxIdx,
(int)0, (int)USHRT_MAX, v_uint16x8::nlanes, len, startidx, j, len0 );

if ( len0 - j >= v_uint16x8::nlanes )
Member

Please use this form: j <= len0 - v_uint16x8::nlanes (used in many vectorized loops)
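
A minimal sketch of the loop shape being requested, written in plain C++ so it runs without OpenCV. `kLanes` stands in for `v_uint16x8::nlanes`, `maxWithBlockedLoop` is a hypothetical name, and the inner `std::max_element` call is only a placeholder for the real vector operations. The indices are kept as signed `int` on purpose: with an unsigned type, `len0 - kLanes` would wrap around for inputs shorter than one block.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for v_uint16x8::nlanes in this sketch.
constexpr int kLanes = 8;

inline int maxWithBlockedLoop(const std::vector<int>& src)
{
    assert(!src.empty());
    const int len0 = static_cast<int>(src.size());
    int best = src[0];
    int j = 0;

    // Main loop in the suggested form: it runs only while a full block of
    // kLanes elements remains. For short inputs len0 - kLanes is negative,
    // so the loop body is skipped entirely.
    for (; j <= len0 - kLanes; j += kLanes)
        best = std::max(best, *std::max_element(src.begin() + j,
                                                src.begin() + j + kLanes));

    // Scalar tail for the remaining (fewer than kLanes) elements.
    for (; j < len0; ++j)
        best = std::max(best, src[j]);

    return best;
}
```

Both `len0 - j >= nlanes` and `j <= len0 - nlanes` express the same condition for signed values; the second form just matches the convention the reviewer refers to.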

{A1 A2 A3 ...} => min(A1,A2,A3,...)
@endcode
For 32-bit integer and 32-bit floating point types. */
For all types except 64-bit integer. */
Member

64-bit floats are not supported either.


if ( !mask )
{
for( ; k < std::min(len0, j + 32764 * 4 * v_float64x2::nlanes); k += 4 * v_float64x2::nlanes )
Member

Why do we need x4 unrolling?

Contributor Author

It makes it a lot easier to handle the 'mask' case down below (v_load_expand_q is really bad on some platforms).

Member

This loop doesn't work with the mask.

Perhaps the loop below can iterate over the loaded mask size:

-4 * v_float64x2::nlanes
+v_uint16x8::nlanes

with a corresponding comment near the "for" noting that loop iterations are performed over mask values.

Contributor Author

I'd rather keep it the way it is for consistency with the other cases.


#define MINMAXIDX_REDUCE(suffix, suffix2, maxLimit, IR, valMin, valMax, idxMin, \
idxMax, none, minVal, maxVal, minIdx, maxIdx, delta) \
template<typename T, typename VT, typename IT> CV_ALWAYS_INLINE void \
Contributor

@terfendail terfendail Dec 3, 2019

IMO a template or macro is redundant here. If it were possible to use something like v_reinterpret_as_vectype<uchar> instead of v_reinterpret_as_u8, I would prefer a template as the more debug-friendly alternative. At the moment that is impossible, so the macro looks like the only option. The macro parameters could also be reduced to suffix2, VT, IT, T. Something like this:

#define MINMAXIDX_REDUCE(suffix2, VT, IT, T) \
CV_ALWAYS_INLINE void minMaxIdx_reduce( VT &valMin, VT &valMax, IT &idxMin, IT &idxMax, IT &none, \
                                        T &minVal, T &maxVal, size_t &minIdx, size_t &maxIdx, \
                                        size_t delta ) \
{ \
    if ( v_check_any(idxMin != none) ) \
    { \
        minVal = v_reduce_min(valMin); \
        minIdx = (size_t)v_reduce_min(v_select(v_reinterpret_as_##suffix2(v_setall((VT::lane_type)minVal) == valMin), \
                     idxMin, v_setall_##suffix2((IT::lane_type)-1))) + delta; \
    } \
    if ( v_check_any(idxMax != none) ) \
    { \
        maxVal = v_reduce_max(valMax); \
        maxIdx = (size_t)v_reduce_min(v_select(v_reinterpret_as_##suffix2(v_setall((VT::lane_type)maxVal) == valMax), \
                     idxMax, v_setall_##suffix2((IT::lane_type)-1))) + delta; \
    } \
}

Contributor Author

@ChipKerchner ChipKerchner Dec 3, 2019

This is not the same - some types (float and double) do NOT use max as -1

Contributor

As far as I understand, this value is used as a "do not select" value, restricting the result of the following v_reduce_min to the indexes of valid search results only. So I don't see a reason that prevents replacing this value with a greater one.
Could you please share your thoughts on this?

Contributor Author

@ChipKerchner ChipKerchner Dec 4, 2019

This code is used to create a position for the min or max that can never occur. Your code could be creating NaNs for non-integer types, and the use of NaNs can be unpredictable.

Contributor

Position types are unsigned integers of different widths in all existing instantiations. I doubt that floating-point types will be used for positions in the future, but in any case such a change would require updating the whole algorithm: idxMax could still contain not-yet-updated parts of the initial none value, which would also have to be treated as NaNs in that case.
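
As a reading aid for the exchange above, here is a plain-C++ model of the select-then-reduce step from the suggested macro. It is a sketch under stated assumptions, not the PR's code: the lane count `kIdxLanes` and the name `firstIndexOfMin` are hypothetical, values are `float`, index lanes are 32-bit unsigned integers, and the input is assumed to contain no NaNs. Lanes whose value is not the overall minimum contribute the all-ones marker, so the following min-reduction over indexes can only pick a valid position.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::size_t kIdxLanes = 4;  // illustrative lane count

// Scalar model of:
//   minVal = reduce_min(valMin);
//   minIdx = reduce_min(select(valMin == minVal, idxMin, all-ones));
inline std::uint32_t firstIndexOfMin(
    const std::array<float, kIdxLanes>& valMin,
    const std::array<std::uint32_t, kIdxLanes>& idxMin)
{
    const float minVal = *std::min_element(valMin.begin(), valMin.end());

    // "Do not select" marker: the largest unsigned value, i.e. (lane_type)-1.
    const std::uint32_t none = std::numeric_limits<std::uint32_t>::max();

    std::uint32_t best = none;
    for (std::size_t i = 0; i < kIdxLanes; ++i) {
        // Keep the lane's index only if that lane holds the overall minimum.
        const std::uint32_t candidate = (valMin[i] == minVal) ? idxMin[i] : none;
        best = std::min(best, candidate);
    }

    assert(best != none);  // at least one lane always matches minVal
    return best;
}
```

Because the marker lives in the integer index lanes rather than in the value lanes, it stays an ordinary unsigned integer in this model even when the values themselves are floating point.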

EXPECT_EQ((LaneType)((1 + R::nlanes)*R::nlanes/2), v_reduce_sum(a));
EXPECT_EQ((LaneType)1, (LaneType)v_reduce_min(a));
EXPECT_EQ((LaneType)(R::nlanes), (LaneType)v_reduce_max(a));
EXPECT_EQ((LaneType)(sum), (LaneType)v_reduce_sum(a));
Contributor

I think there should be a cast to int rather than to LaneType, to avoid overflow.

Contributor

I mean v_reduce_sum only. It's fine to have LaneType for the other reductions since their return value is already of LaneType.
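
To illustrate the overflow concern, the sketch below computes the expected sum 1 + 2 + ... + nlanes and checks whether it survives a round trip through a given lane type. The helper names `expectedSum` and `fitsInLane` are hypothetical and not part of the OpenCV test suite. With 16 lanes the sum is 136, which does not fit in a signed 8-bit lane, so comparing after a cast to the lane type would be comparing already-truncated values.

```cpp
#include <cassert>
#include <cstdint>

// Sum 1 + 2 + ... + nlanes: the value expected from a sum reduction when
// lane i holds i + 1.
constexpr int expectedSum(int nlanes)
{
    return (1 + nlanes) * nlanes / 2;
}

// Returns true if the expected sum is unchanged after being converted to
// LaneType and back to int, i.e. if a LaneType cast would be lossless.
template <typename LaneType>
constexpr bool fitsInLane(int nlanes)
{
    return static_cast<int>(static_cast<LaneType>(expectedSum(nlanes)))
           == expectedSum(nlanes);
}
```

For example, `fitsInLane<std::int8_t>(16)` is false while `fitsInLane<std::int32_t>(16)` is true, which is the motivation for comparing the sum as `int`.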

}

#if CV_SIMD128_64F
CV_ALWAYS_INLINE double v_reduce_min(const v_float64x2& a)
Contributor

IMO it would be better to extend the overall set of intrinsics

Contributor Author

I think until the intrinsics for all the platforms use v_float64x2 and v_uint64x2 more fully, this is a better approach.

@tompollok
Contributor

Are there any blockers for this PR to be able to add it to 4.2 release?

@ChipKerchner
Contributor Author

What's holding up this review?

@asmorkalov
Contributor

@alalek please take a look.

@terfendail
Contributor

It looks like there are three open code style discussions:

  • Regarding MINMAXIDX_REDUCE macro simplification
  • Regarding range condition for unrolled vectorized loop
  • Regarding local definition of a few v_float64x2 v_reduce intrinsics

However, all of these discussions are about readability, while the performance gain is brilliant.

Performance for SSE2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.006 | 0.001 | 5.84 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.006 | 0.001 | 5.70 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.009 | 0.002 | 4.44 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.008 | 0.001 | 5.48 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.009 | 0.003 | 2.62 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 2.61 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.28 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.230 | 0.040 | 5.76 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.217 | 0.035 | 6.14 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.308 | 0.071 | 4.33 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.288 | 0.052 | 5.59 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.331 | 0.122 | 2.70 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.289 | 0.110 | 2.63 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.293 | 0.232 | 1.26 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.661 | 0.113 | 5.85 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.668 | 0.103 | 6.48 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 0.878 | 0.206 | 4.26 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.890 | 0.154 | 5.79 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 1.046 | 0.361 | 2.90 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.902 | 0.319 | 2.82 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.911 | 0.688 | 1.32 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 1.547 | 0.255 | 6.08 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.581 | 0.227 | 6.96 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 1.991 | 0.462 | 4.31 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 1.931 | 0.332 | 5.81 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 2.193 | 0.838 | 2.62 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 1.959 | 0.759 | 2.58 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.155 | 1.761 | 1.22 |

Performance for SSE3 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.008 | 0.001 | 7.63 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.008 | 0.001 | 7.37 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.008 | 0.002 | 3.97 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.009 | 0.001 | 6.49 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.006 | 0.003 | 1.84 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 2.61 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.28 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.286 | 0.035 | 8.17 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.285 | 0.036 | 7.95 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.292 | 0.073 | 4.00 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.338 | 0.053 | 6.39 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.219 | 0.122 | 1.79 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.293 | 0.115 | 2.55 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.294 | 0.232 | 1.27 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.865 | 0.104 | 8.33 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.852 | 0.106 | 8.06 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 0.845 | 0.215 | 3.92 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.988 | 0.150 | 6.57 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 0.655 | 0.344 | 1.90 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.873 | 0.324 | 2.70 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.872 | 0.661 | 1.32 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 1.899 | 0.220 | 8.65 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.925 | 0.231 | 8.33 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 1.898 | 0.462 | 4.11 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 2.166 | 0.338 | 6.41 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 1.484 | 0.809 | 1.84 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 1.994 | 0.733 | 2.72 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.186 | 1.780 | 1.23 |

Performance for SSE4_2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.008 | 0.001 | 8.77 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.008 | 0.001 | 10.76 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.008 | 0.001 | 5.61 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.006 | 0.001 | 5.02 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.008 | 0.002 | 4.38 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 2.89 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.006 | 1.33 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.298 | 0.030 | 9.99 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.305 | 0.023 | 13.01 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.304 | 0.052 | 5.81 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.229 | 0.043 | 5.26 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.302 | 0.069 | 4.38 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.305 | 0.104 | 2.93 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.304 | 0.223 | 1.36 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.885 | 0.085 | 10.37 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.887 | 0.069 | 12.91 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 1.031 | 0.156 | 6.62 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 0.686 | 0.125 | 5.47 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 0.896 | 0.198 | 4.53 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.909 | 0.298 | 3.05 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.924 | 0.664 | 1.39 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 2.004 | 0.194 | 10.36 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 1.997 | 0.151 | 13.26 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 2.175 | 0.343 | 6.35 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 1.494 | 0.282 | 5.30 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 1.948 | 0.455 | 4.28 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 1.980 | 0.702 | 2.82 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.159 | 1.745 | 1.24 |

Performance for AVX2 baseline

| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| minMaxLoc::Size_MatType::(127x61, 8UC1) | 0.008 | 0.001 | 8.39 |
| minMaxLoc::Size_MatType::(127x61, 8SC1) | 0.008 | 0.001 | 10.81 |
| minMaxLoc::Size_MatType::(127x61, 16UC1) | 0.008 | 0.002 | 4.82 |
| minMaxLoc::Size_MatType::(127x61, 16SC1) | 0.010 | 0.001 | 6.91 |
| minMaxLoc::Size_MatType::(127x61, 32SC1) | 0.006 | 0.002 | 2.90 |
| minMaxLoc::Size_MatType::(127x61, 32FC1) | 0.008 | 0.003 | 3.11 |
| minMaxLoc::Size_MatType::(127x61, 64FC1) | 0.008 | 0.005 | 1.52 |
| minMaxLoc::Size_MatType::(640x480, 8UC1) | 0.295 | 0.032 | 9.23 |
| minMaxLoc::Size_MatType::(640x480, 8SC1) | 0.296 | 0.023 | 13.15 |
| minMaxLoc::Size_MatType::(640x480, 16UC1) | 0.289 | 0.058 | 4.96 |
| minMaxLoc::Size_MatType::(640x480, 16SC1) | 0.349 | 0.048 | 7.21 |
| minMaxLoc::Size_MatType::(640x480, 32SC1) | 0.218 | 0.078 | 2.80 |
| minMaxLoc::Size_MatType::(640x480, 32FC1) | 0.301 | 0.097 | 3.11 |
| minMaxLoc::Size_MatType::(640x480, 64FC1) | 0.301 | 0.202 | 1.49 |
| minMaxLoc::Size_MatType::(1280x720, 8UC1) | 0.883 | 0.094 | 9.43 |
| minMaxLoc::Size_MatType::(1280x720, 8SC1) | 0.885 | 0.066 | 13.37 |
| minMaxLoc::Size_MatType::(1280x720, 16UC1) | 0.892 | 0.172 | 5.17 |
| minMaxLoc::Size_MatType::(1280x720, 16SC1) | 1.048 | 0.143 | 7.32 |
| minMaxLoc::Size_MatType::(1280x720, 32SC1) | 0.682 | 0.234 | 2.91 |
| minMaxLoc::Size_MatType::(1280x720, 32FC1) | 0.915 | 0.286 | 3.19 |
| minMaxLoc::Size_MatType::(1280x720, 64FC1) | 0.917 | 0.622 | 1.47 |
| minMaxLoc::Size_MatType::(1920x1080, 8UC1) | 2.038 | 0.210 | 9.70 |
| minMaxLoc::Size_MatType::(1920x1080, 8SC1) | 2.042 | 0.152 | 13.44 |
| minMaxLoc::Size_MatType::(1920x1080, 16UC1) | 2.040 | 0.383 | 5.32 |
| minMaxLoc::Size_MatType::(1920x1080, 16SC1) | 2.407 | 0.324 | 7.43 |
| minMaxLoc::Size_MatType::(1920x1080, 32SC1) | 1.573 | 0.544 | 2.89 |
| minMaxLoc::Size_MatType::(1920x1080, 32FC1) | 2.088 | 0.660 | 3.17 |
| minMaxLoc::Size_MatType::(1920x1080, 64FC1) | 2.277 | 1.673 | 1.36 |

@alalek alalek merged commit 301626b into opencv:3.4 Jan 17, 2020
@ChipKerchner ChipKerchner deleted the vectorizeMinMax2 branch January 27, 2020 15:53

6 participants