
core: vectorize dotProd_32s for VSX #15339

Merged
opencv-pushbot merged 1 commit into opencv:3.4 from pmur:dotprod-32s-vsx on Aug 31, 2019

Conversation

@pmur (Contributor) commented Aug 19, 2019

This uses a few features specific to PPC to run 4 multiplication
chains in parallel using the expanding multiply instructions.

This results in ~2.5x speedup in the dot perf tests.

v_float64 b(0.0, 0.0);
int vIter = len / v_int32::nlanes;

for( i = 0; i < vIter; i++ )
Contributor:

Personally I think this should be

int vIter = len & -v_int32::nlanes;
for( i = 0; i < vIter; i += v_int32::nlanes )

And then you can remove the

i *= v_int32::nlanes;

Member:

OpenCV vectorized loops usually look like this:

for (; i <= len - v_float32::nlanes; i += v_float32::nlanes)
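This canonical shape can be sketched in scalar code (nlanes hard-coded to 4 as a stand-in for v_float32::nlanes; the inner loop stands in for one vector operation):

```cpp
#include <cassert>

// Scalar sketch of the canonical OpenCV loop shape: the "vector" body runs
// while a full group of nlanes elements remains, and a scalar tail
// finishes the rest, so no multiple-of-nlanes precondition is needed.
float sum(const float* p, int len)
{
    const int nlanes = 4;   // stand-in for v_float32::nlanes
    float s = 0.0f;
    int i = 0;
    for (; i <= len - nlanes; i += nlanes)   // "vector" body
        for (int k = 0; k < nlanes; k++)
            s += p[i + k];
    for (; i < len; i++)                     // scalar tail
        s += p[i];
    return s;
}
```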

@alalek (Member) left a comment:

Please use OpenCV SIMD intrinsics.
Avoid using CV_VSX, CV_NEON, CV_SSE, etc. directly; CV_SIMD / CV_SIMD128 is the proper guard.

for( i = 0; i < vIter; i++ )
{
a += v_cvt_f64(v_int64((vec_mule(v_load(src1).val, v_load(src2).val))));
b += v_cvt_f64(v_int64((vec_mulo(v_load(src1).val, v_load(src2).val))));
Member:

(int32_t)a * (int32_t)b
is not the same as
(double)a * (int32_t)b (see original C++ implementation above).

Overflows are not handled properly here.

Contributor (Author):

vec_mul{e,o} performs 2 full integer multiplications into a larger type (e.g. vmulesw). No precision is lost until the product is converted to float64, so the conversion should always produce the same result as the scalar code.

v_mul_expand() would also work correctly, but is suboptimal: it unzips the lanes, which is unneeded work here.

Contributor:

IMHO, platform-specific implementation parts hardly align with the universal intrinsics concept. However, at the moment I don't have a good solution for this case.
@alalek Maybe it makes sense to extend the v_dotprod intrinsic to support s32/u32 input?

@pmur pmur force-pushed the dotprod-32s-vsx branch from df285b8 to 5ae89f0 on August 20, 2019 16:27
#if CV_SIMD128_64F
double r = 0.0;
int i = 0;
int lenAligned = len & -v_float32::nlanes;
Contributor:

This should be v_int32x4::nlanes, not v_float32::nlanes.

Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
x86 this showed about a 1.4x improvement.

For PPC, do a full multiply (32x32->64b), convert to DP,
then accumulate. This may be slightly less precise for
some inputs, but it is about 1.5x faster than the FMA
version above, for a ~2.5x speedup overall.
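The multi-chain idea from the commit message can be sketched in scalar code: four independent accumulators break the single loop-carried dependency, so the FMA units can overlap work (the real code applies this to FP64 vectors with v_fma):

```cpp
#include <cassert>

// Scalar sketch of "4x FMA chains": four independent accumulators whose
// partial sums are combined only after the loop.
double dot4(const double* x, const double* y, int len)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i <= len - 4; i += 4) {
        s0 += x[i]     * y[i];      // chain 0
        s1 += x[i + 1] * y[i + 1];  // chain 1
        s2 += x[i + 2] * y[i + 2];  // chain 2
        s3 += x[i + 3] * y[i + 3];  // chain 3
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < len; i++)            // scalar tail
        s += x[i] * y[i];
    return s;
}
```

Note the reassociation changes the rounding order slightly, which is why FP64 accumulation here is "slightly less precise for some inputs" than a strictly sequential sum.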
@pmur pmur force-pushed the dotprod-32s-vsx branch from 5ae89f0 to 33fb253 on August 20, 2019 20:29
@alalek (Member) left a comment:

Well done! Thank you 👍

@terfendail (Contributor):

Performance for SSE2 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.17 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.13 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 0.99 |
| dot::MatType_Length::(8UC1, 256) | 0.004 | 0.004 | 0.98 |
| dot::MatType_Length::(8UC1, 512) | 0.017 | 0.018 | 0.98 |
| dot::MatType_Length::(8UC1, 1024) | 0.088 | 0.089 | 0.99 |
| dot::MatType_Length::(32SC1, 32) | 0.001 | 0.000 | 1.39 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.28 |
| dot::MatType_Length::(32SC1, 128) | 0.008 | 0.006 | 1.30 |
| dot::MatType_Length::(32SC1, 256) | 0.032 | 0.025 | 1.30 |
| dot::MatType_Length::(32SC1, 512) | 0.129 | 0.103 | 1.25 |
| dot::MatType_Length::(32SC1, 1024) | 0.539 | 0.427 | 1.26 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 1.14 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.08 |
| dot::MatType_Length::(32FC1, 128) | 0.002 | 0.002 | 0.98 |
| dot::MatType_Length::(32FC1, 256) | 0.009 | 0.009 | 0.97 |
| dot::MatType_Length::(32FC1, 512) | 0.075 | 0.076 | 0.98 |
| dot::MatType_Length::(32FC1, 1024) | 0.346 | 0.326 | 1.06 |
Performance for SSE3 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.09 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.01 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 1.00 |
| dot::MatType_Length::(8UC1, 256) | 0.004 | 0.004 | 1.00 |
| dot::MatType_Length::(8UC1, 512) | 0.018 | 0.018 | 1.01 |
| dot::MatType_Length::(8UC1, 1024) | 0.088 | 0.089 | 0.99 |
| dot::MatType_Length::(32SC1, 32) | 0.001 | 0.000 | 1.28 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.32 |
| dot::MatType_Length::(32SC1, 128) | 0.008 | 0.006 | 1.33 |
| dot::MatType_Length::(32SC1, 256) | 0.033 | 0.025 | 1.33 |
| dot::MatType_Length::(32SC1, 512) | 0.129 | 0.105 | 1.23 |
| dot::MatType_Length::(32SC1, 1024) | 0.539 | 0.430 | 1.26 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 1.01 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.01 |
| dot::MatType_Length::(32FC1, 128) | 0.002 | 0.002 | 1.00 |
| dot::MatType_Length::(32FC1, 256) | 0.009 | 0.009 | 1.02 |
| dot::MatType_Length::(32FC1, 512) | 0.076 | 0.076 | 1.00 |
| dot::MatType_Length::(32FC1, 1024) | 0.337 | 0.334 | 1.01 |
Performance for SSE4_2 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.00 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.03 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 1.02 |
| dot::MatType_Length::(8UC1, 256) | 0.004 | 0.004 | 1.00 |
| dot::MatType_Length::(8UC1, 512) | 0.017 | 0.017 | 1.00 |
| dot::MatType_Length::(8UC1, 1024) | 0.087 | 0.081 | 1.08 |
| dot::MatType_Length::(32SC1, 32) | 0.001 | 0.000 | 1.33 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.32 |
| dot::MatType_Length::(32SC1, 128) | 0.008 | 0.006 | 1.33 |
| dot::MatType_Length::(32SC1, 256) | 0.033 | 0.025 | 1.33 |
| dot::MatType_Length::(32SC1, 512) | 0.129 | 0.101 | 1.28 |
| dot::MatType_Length::(32SC1, 1024) | 0.539 | 0.409 | 1.32 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 1.03 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.06 |
| dot::MatType_Length::(32FC1, 128) | 0.002 | 0.002 | 0.99 |
| dot::MatType_Length::(32FC1, 256) | 0.009 | 0.009 | 1.00 |
| dot::MatType_Length::(32FC1, 512) | 0.075 | 0.076 | 1.00 |
| dot::MatType_Length::(32FC1, 1024) | 0.337 | 0.319 | 1.06 |
Performance for AVX2 baseline

| Performance test | Reference time | PR time | Speedup |
| --- | --- | --- | --- |
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.03 |
| dot::MatType_Length::(8UC1, 64) | 0.000 | 0.000 | 1.02 |
| dot::MatType_Length::(8UC1, 128) | 0.001 | 0.001 | 0.99 |
| dot::MatType_Length::(8UC1, 256) | 0.003 | 0.003 | 1.00 |
| dot::MatType_Length::(8UC1, 512) | 0.013 | 0.013 | 1.01 |
| dot::MatType_Length::(8UC1, 1024) | 0.080 | 0.077 | 1.03 |
| dot::MatType_Length::(32SC1, 32) | 0.000 | 0.000 | 1.08 |
| dot::MatType_Length::(32SC1, 64) | 0.002 | 0.002 | 1.09 |
| dot::MatType_Length::(32SC1, 128) | 0.007 | 0.006 | 1.08 |
| dot::MatType_Length::(32SC1, 256) | 0.027 | 0.025 | 1.09 |
| dot::MatType_Length::(32SC1, 512) | 0.110 | 0.099 | 1.12 |
| dot::MatType_Length::(32SC1, 1024) | 0.448 | 0.420 | 1.06 |
| dot::MatType_Length::(32FC1, 32) | 0.000 | 0.000 | 0.97 |
| dot::MatType_Length::(32FC1, 64) | 0.001 | 0.001 | 1.04 |
| dot::MatType_Length::(32FC1, 128) | 0.002 | 0.002 | 1.00 |
| dot::MatType_Length::(32FC1, 256) | 0.008 | 0.008 | 1.03 |
| dot::MatType_Length::(32FC1, 512) | 0.075 | 0.075 | 1.00 |
| dot::MatType_Length::(32FC1, 1024) | 0.330 | 0.330 | 1.00 |

{
vec_double2 out;

__asm__ ("xvcvsxddp %x0,%x1" : "=wa"(out) : "wa"(a.val));
@seiko2plus (Contributor) commented Sep 14, 2019:

Too late from me, but any use of the operand modifier x in inline asm that affects clang should be guarded by the definition CV_COMPILER_VSX_BROKEN_ASM, since a wide range of clang versions don't support it.
Also, xvcvsxddp is already covered by vec_ctd for both compilers in vsx_utils.h. I updated issue #15506 to include the clang build failure logs; it will be resolved by #15510.
