Vectorize minMaxIdx functions #15488
alalek merged 9 commits into opencv:3.4 from ChipKerchner:vectorizeMinMax2
Conversation
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Need someone with ARM NEON experience to make sure the new v_reduce_min & v_reduce_max intrinsics for the v_uint16x8 and v_int16x8 types are correct. I used the v_uint8x16 and v_int8x16 versions as examples. Testing (both unit and independent) passes.
modules/core/src/minmax.cpp
Outdated

#if CV_SIMD128
#ifdef _MSC_VER
#define forceinline __forceinline
There is a similar CV_ALWAYS_INLINE macro in cvdef.h.
modules/core/src/minmax.cpp
Outdated

#endif
#endif

#define MINMAXIDX_REDUCE(suffix, RT, valMin, valMax, idxMin, idxMax, none, \
Is it possible to reduce the size of this macro, or even omit it?
I don't see how to omit it. Suggestions?
Found a bug in v_int64x2 comparisons. See #15738.

@ChipKerchner Friendly reminder.
}

#define OPENCV_HAL_IMPL_NEON_REDUCE_OP_16(_Tpvec, _Tpnvec, scalartype, func, vectorfunc, suffix) \
inline scalartype v_reduce_##func(const _Tpvec& a) \
Documentation for the v_reduce support matrix is here.
If you want to use them in OpenCV algorithm code, then the corresponding documentation and intrinsic tests should be updated too.
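For the support-matrix discussion above, a scalar reference model makes the contract concrete: whatever pairwise folding a NEON implementation uses internally, v_reduce_min/v_reduce_max over an 8-lane 16-bit vector must return the plain lane-wise minimum/maximum. The sketch below is illustrative C++ (names are mine, not OpenCV's), of the kind an intrinsic test could compare against:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Scalar reference model of the v_reduce_min / v_reduce_max contract for an
// N-lane vector (e.g. N == 8 for v_uint16x8 / v_int16x8). Each array element
// stands for one SIMD lane.
template <typename T, std::size_t N>
T reduce_min_ref(const std::array<T, N>& lanes)
{
    T r = lanes[0];
    for (std::size_t i = 1; i < N; ++i)
        r = std::min(r, lanes[i]);
    return r;
}

template <typename T, std::size_t N>
T reduce_max_ref(const std::array<T, N>& lanes)
{
    T r = lanes[0];
    for (std::size_t i = 1; i < N; ++i)
        r = std::max(r, lanes[i]);
    return r;
}
```

A test for the real intrinsics can then load the same lanes into a v_uint16x8 and assert the hardware result matches this model.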
modules/core/src/minmax.cpp
Outdated

minMaxIdx_init( src, mask, minval, maxval, minidx, maxidx, minVal, maxVal, minIdx, maxIdx,
                (int)0, (int)USHRT_MAX, v_uint16x8::nlanes, len, startidx, j, len0 );

if ( len0 - j >= v_uint16x8::nlanes )
Please use this form: j <= len0 - v_uint16x8::nlanes (used in many vectorized loops).
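To illustrate the requested convention: with signed int bounds the two conditions are equivalent, but `j <= len0 - nlanes` is the form used throughout OpenCV's vectorized loops. A minimal scalar sketch (function name and structure are illustrative, not the PR's code):

```cpp
#include <cassert>

// Counts how many full nlanes-wide blocks the vector body would process,
// using the loop-bound form preferred in OpenCV's vectorized loops.
// Indices j .. len0-1 left over after the loop form the scalar tail.
int count_full_blocks(int len0, int nlanes)
{
    int blocks = 0;
    int j = 0;
    for (; j <= len0 - nlanes; j += nlanes)   // main "vector" body
        ++blocks;
    return blocks;
}
```

Note that this form relies on signed arithmetic: with an unsigned len0, `len0 - nlanes` could wrap around when len0 < nlanes, which is why the bounds here are int.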
{A1 A2 A3 ...} => min(A1,A2,A3,...)
@endcode
For 32-bit integer and 32-bit floating point types. */
For all types except 64-bit integer. */
64-bit floats are not supported either.
if ( !mask )
{
    for( ; k < std::min(len0, j + 32764 * 4 * v_float64x2::nlanes); k += 4 * v_float64x2::nlanes )
It makes it a lot easier to handle the 'mask' case down below (v_load_expand_q is really bad on some platforms).
This loop doesn't work with the mask.
Perhaps the loop below can be iterated over the loaded mask size:
- 4 * v_float64x2::nlanes
+ v_uint16x8::nlanes
with a corresponding comment near the "for" noting that loop iterations are performed over mask values.
I'd rather keep it the way it is, for consistency with the other cases.
…r vectorized loops.
…be same as lane type.
#define MINMAXIDX_REDUCE(suffix, suffix2, maxLimit, IR, valMin, valMax, idxMin, \
                         idxMax, none, minVal, maxVal, minIdx, maxIdx, delta) \
template<typename T, typename VT, typename IT> CV_ALWAYS_INLINE void \
IMO the template or macro is redundant here. If there were an opportunity to use something like v_reinterpret_as_vectype<uchar> instead of v_reinterpret_as_u8, I would prefer a template as the more debug-friendly alternative. But at the moment that is impossible, so the macro looks like the only option. Also, the macro parameters could be reduced to suffix2, VT, IT, T. Something like this:
#define MINMAXIDX_REDUCE(suffix2, VT, IT, T) \
CV_ALWAYS_INLINE void minMaxIdx_reduce( VT &valMin, VT &valMax, IT &idxMin, IT &idxMax, IT &none, \
T &minVal, T &maxVal, size_t &minIdx, size_t &maxIdx, \
size_t delta ) \
{ \
if ( v_check_any(idxMin != none) ) \
{ \
minVal = v_reduce_min(valMin); \
minIdx = (size_t)v_reduce_min(v_select(v_reinterpret_as_##suffix2(v_setall((VT::lane_type)minVal) == valMin), \
idxMin, v_setall_##suffix2((IT::lane_type)-1))) + delta; \
} \
if ( v_check_any(idxMax != none) ) \
{ \
maxVal = v_reduce_max(valMax); \
maxIdx = (size_t)v_reduce_min(v_select(v_reinterpret_as_##suffix2(v_setall((VT::lane_type)maxVal) == valMax), \
idxMax, v_setall_##suffix2((IT::lane_type)-1))) + delta; \
} \
}
This is not the same: some types (float and double) do NOT use max as -1.
As far as I understand, this value is used as a "do not select" value, to restrict the result of the following v_reduce_min to indexes of valid search results only. So I don't see a reason that would forbid replacing this value with a greater one.
Could you please share your thoughts on this?
This code is used to create a position location of the min or max that can never occur. Your code could create NaNs for non-integer types, and use of NaNs can be unpredictable.
Position location types are unsigned integers of different widths for all existing instantiations. I doubt that floating point types could ever be used for that, but in any case such a change would require updating the whole algorithm, since idxMax could still contain not-yet-updated parts of the initial none value, which would also have to be treated as NaNs in that case.
EXPECT_EQ((LaneType)((1 + R::nlanes)*R::nlanes/2), v_reduce_sum(a));
EXPECT_EQ((LaneType)1, (LaneType)v_reduce_min(a));
EXPECT_EQ((LaneType)(R::nlanes), (LaneType)v_reduce_max(a));
EXPECT_EQ((LaneType)(sum), (LaneType)v_reduce_sum(a));
I think there should be a cast to int rather than to LaneType, to avoid overflow.
I mean v_reduce_sum only. It's fine to have LaneType for the other reductions, since their return value is already of LaneType.
}

#if CV_SIMD128_64F
CV_ALWAYS_INLINE double v_reduce_min(const v_float64x2& a)
IMO it would be better to extend the overall set of intrinsics.
I think that until the intrinsics for all platforms use v_float64x2 and v_uint64x2 more fully, this is the better approach.
…rameters in MINMAXIDX_REDUCE macro.
Are there any blockers preventing this PR from being added to the 4.2 release?

What's holding up this review?

@alalek please take a look.
It looks like there are three open code-style discussions. However, all of them are about readability, while the performance gain is brilliant.

Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4_2 baseline
Performance for AVX2 baseline
Vectorize minMaxIdx functions.
minMaxIdx_8u & minMaxIdx_8s - 11.1x improvement on VSX and 8.6x speedup on x86.
minMaxIdx_16u & minMaxIdx_16s - 8.3x improvement on VSX and 7.5x speedup on x86.
minMaxIdx_32s - 5.1x improvement on VSX and 4.2x speedup on x86.
minMaxIdx_32f - 4.1x improvement on VSX and 3.2x speedup on x86.
minMaxIdx_64f - 1.6x improvement on VSX and 1.5x speedup on x86.