core: vectorize dotProd_32s for VSX #15339
modules/core/src/matmul.simd.hpp
v_float64 b(0.0, 0.0);
int vIter = len / v_int32::nlanes;
for( i = 0; i < vIter; i++ )
Personally I think this should be
int vIter = len & -v_int32::nlanes;
for( i = 0; i < vIter; i += v_int32::nlanes )
And then you can remove the
i *= v_int32::nlanes;
OpenCV vectorized loops usually look like this:
for (; i <= len - v_float32::nlanes; i += v_float32::nlanes)
alalek left a comment:
Please use OpenCV SIMD intrinsics.
Avoid using of CV_VSX, CV_NEON, CV_SSE, etc.
CV_SIMD / CV_SIMD128 is a proper guard.
modules/core/src/matmul.simd.hpp
for( i = 0; i < vIter; i++ )
{
    a += v_cvt_f64(v_int64((vec_mule(v_load(src1).val, v_load(src2).val))));
    b += v_cvt_f64(v_int64((vec_mulo(v_load(src1).val, v_load(src2).val))));
(int32_t)a * (int32_t)b is not the same as (double)a * (int32_t)b (see the original C++ implementation above). Overflows are not handled properly here.
vec_mul{e,o} performs two full integer multiplications into a larger type (e.g. vmulesw). No precision is lost until the product is converted to float64, so the conversion should always produce the same result as the scalar code.
v_mul_expand() would also work correctly, but it is suboptimal: it unzips the result, which is unneeded work here.
IMHO, platform-specific implementation parts are hardly aligned with the universal intrinsics concept. However, at the moment I don't have a good solution for this case.
@alalek Maybe it makes sense to extend the v_dotprod intrinsic to support s32/u32 input?
modules/core/src/matmul.simd.hpp
#if CV_SIMD128_64F
double r = 0.0;
int i = 0;
int lenAligned = len & -v_float32::nlanes;
This should be v_int32x4::nlanes and not v_float32::nlanes
Use 4x FMA chains to sum on SIMD128 FP64 targets. On x86 this showed about a 1.4x improvement. For PPC, do a full multiply (32x32 -> 64-bit), convert to DP, then accumulate. This may be slightly less precise for some inputs, but it is about 1.5x faster than the FMA version above, which is itself about 1.5x faster than the original, for a ~2.5x overall speedup.
Performance for the SSE2, SSE3, SSE4_2, and AVX2 baselines
{
    vec_double2 out;
    __asm__ ("xvcvsxddp %x0,%x1" : "=wa"(out) : "wa"(a.val));
Too late from me, but any use of the operand modifier x in inline asm that affects clang should be guarded by the definition CV_COMPILER_VSX_BROKEN_ASM, since a wide range of clang versions don't support it.
Also, xvcvsxddp is already covered by vec_ctd for both compilers in vsx_utils.h. I updated issue #15506 to contain the clang build failure logs; it will be resolved by #15510.
This uses a few features specific to PPC to run 4 multiplication
chains in parallel using the expanding multiply instructions.
This results in ~2.5x speedup in the dot perf tests.