core: vectorize dotProd_32s for VSX #15339
modules/core/src/matmul.simd.hpp
v_float64 b(0.0, 0.0);
int vIter = len / v_int32::nlanes;
for( i = 0; i < vIter; i++ )
Personally I think this should be
int vIter = len & -v_int32::nlanes;
for( i = 0; i < vIter; i += v_int32::nlanes )
And then you can remove the
i *= v_int32::nlanes;
OpenCV vectorized loops usually look like this:
for (; i <= len - v_float32::nlanes; i += v_float32::nlanes)
alalek left a comment:
Please use OpenCV SIMD intrinsics.
Avoid using of CV_VSX, CV_NEON, CV_SSE, etc.
CV_SIMD / CV_SIMD128 is a proper guard.
modules/core/src/matmul.simd.hpp
for( i = 0; i < vIter; i++ )
{
    a += v_cvt_f64(v_int64((vec_mule(v_load(src1).val, v_load(src2).val))));
    b += v_cvt_f64(v_int64((vec_mulo(v_load(src1).val, v_load(src2).val))));
(int32_t)a * (int32_t)b is not the same as (double)a * (int32_t)b (see the original C++ implementation above). Overflows are not handled properly here.
vec_mul{e,o} performs two full integer multiplications into a larger type (e.g. vmulesw). No precision is lost until the product is converted to float64, so the conversion should always produce the same result as the scalar code.
v_mul_expand() would also work correctly, but it is suboptimal: it unzips the result, which is unneeded work here.
IMHO, platform-specific implementation parts are hardly aligned with the universal intrinsics concept. However, at the moment I don't have a good solution for this case.
@alalek Maybe it makes sense to extend the v_dotprod intrinsic to support s32/u32 input?
modules/core/src/matmul.simd.hpp
#if CV_SIMD128_64F
double r = 0.0;
int i = 0;
int lenAligned = len & -v_float32::nlanes;
This should be v_int32x4::nlanes and not v_float32::nlanes
Use 4x FMA chains to sum on SIMD128 FP64 targets. On x86 this showed about a 1.4x improvement. For PPC, do a full multiply (32x32 -> 64-bit), convert to DP, then accumulate. This may be slightly less precise for some inputs, but it is about 1.5x faster than the FMA version above, which is itself about 1.5x faster than the original, for a ~2.5x overall speedup.
Performance for the SSE2, SSE3, SSE4_2, and AVX2 baselines
{
    vec_double2 out;
    __asm__ ("xvcvsxddp %x0,%x1" : "=wa"(out) : "wa"(a.val));
Too late from me, but any use of the operand modifier x in inline asm that affects clang should be guarded by the definition CV_COMPILER_VSX_BROKEN_ASM, since a wide range of clang versions don't support it.
Also, xvcvsxddp is already covered by vec_ctd for both compilers in vsx_utils.h. I updated issue #15506 to contain the clang build failure logs; it will be resolved by #15510.
This uses a few features specific to PPC to run 4 multiplication
chains in parallel using the expanding multiply instructions.
This results in ~2.5x speedup in the dot perf tests.