core: rework and optimize SIMD implementation of dotProd #15510
alalek merged 2 commits into opencv:3.4
Conversation
IMHO it makes sense to extend v_dotprod to v_uint32 as well.
```cpp
CV_UNUSED(ignore_order);
// `vmsumshm` and `vmsumubm` show a lack of performance when
// the same register is used in `VRT` and `VRC`
return v_int32x4(vec_msum(a.val, b.val, vec_int4_z)) + c;
```
@pmur, @ChipKerchner, has anyone faced this issue before? I couldn't find any notes in ISA 2.7 or 3, and it seems to me like a CPU issue. I tested on both Power8 and Power9 (little-endian) KVM/VM.
I'll ask around. Is using vec_int4_z a workaround to avoid generating those instructions?
@pmur, well, let me explain it in a better way.
Normally we would implement it as follows, with the third parameter of the dot product going directly to vec_msum:
```cpp
return v_int32x4(vec_msum(a.val, b.val, c.val));
```
Assuming that a=rgv1, b=rgv2, c=rgv3 and the return value r=rgv0, the compiler should produce the following instruction:
```
vmsumshm v0,v1,v2,v3
```
Now everything is fine and it shows great performance, but if c overlaps r, the following is produced:
```
vmsumshm v0,v1,v2,v0
```
and based on many tests I found a massive loss of performance.
As a quick fix, I passed vec_int4_z (splats(0)) as the third argument and then added the product to c. That produces three instructions instead of one, as follows, and still shows better performance:
```
vspltisw v10, 0
vmsumshm v11, v1, v2, v10
vadduwm v0, v0, v11
```
You can try it yourself: just remove the workaround from all v_dotprod&_expand and compare the performance; you will find more than a 30% loss.
The following benchmark, with and without the workaround, shows how two instructions (vmsumshm and vadduwm, given that vspltisw 0 is already loaded once) can be faster than the single instruction vmsumshm when the same register is used in VRT and VRC:
| Name of Test | Without | With | With vs Without (x-factor) |
|---|---|---|---|
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.04 |
| dot::MatType_Length::(8UC1, 64) | 0.001 | 0.001 | 1.26 |
| dot::MatType_Length::(8UC1, 128) | 0.002 | 0.002 | 1.28 |
| dot::MatType_Length::(8UC1, 256) | 0.008 | 0.006 | 1.31 |
| dot::MatType_Length::(8UC1, 512) | 0.033 | 0.025 | 1.32 |
| dot::MatType_Length::(8UC1, 1024) | 0.129 | 0.098 | 1.31 |
| dot::MatType_Length::(8SC1, 32) | 0.000 | 0.000 | 1.21 |
| dot::MatType_Length::(8SC1, 64) | 0.001 | 0.001 | 1.33 |
| dot::MatType_Length::(8SC1, 128) | 0.004 | 0.003 | 1.37 |
| dot::MatType_Length::(8SC1, 256) | 0.015 | 0.011 | 1.39 |
| dot::MatType_Length::(8SC1, 512) | 0.061 | 0.044 | 1.39 |
| dot::MatType_Length::(8SC1, 1024) | 0.243 | 0.174 | 1.39 |
The "without workaround" patch can be found here.
I agree this workaround improves the dot product operation significantly. However, I don't think the issue is VRT==VRC. This seems to be a second-order effect due to the very small kernel getting generated.
Well, I did another test covering the whole instruction set (vmsummbm, vmsumshm, vmsumshs, vmsumubm, vmsumuhm, vmsumuhs), and the situation got worse.
Now we lose performance if the register used in VRT is also used in any of (VRA, VRB, VRC).
Please, could you check the test source code and tell me if I missed something?
@seiko2plus thanks. I think I see what is going on. For a very small kernel (e.g. the 8UC1 above), it's pretty much just a chain of dependent instructions.
What you're seeing is the extra latency of serializing the more expensive instructions. Integer vector adds are lower latency, and thus you see a proportional speedup.
This is likely less desirable for a more complicated kernel, but for now I think you can leave the workaround as-is. I would recommend updating the comment, however.
@pmur, if you check the test file that I mentioned before, you may notice that we really face strange behavior, and I still have doubts that it may be a hardware issue whose workaround should be at the compiler level. However, I removed the comment because I'm not 100% sure about the reason. I also made some changes to the code, so the workaround only applies when ignore_order is true, because it may affect other cases negatively. I will try to investigate this issue more deeply when I get free time. Thank you.
I ran a slightly modified version of your example program through the P9 pipeline simulator to characterize the behavior. Moving the data dependency to vadd gives the core freedom to execute the vmsum* sooner, thus hiding its higher latency in these examples.
Edit: this analysis should be equally applicable to most of the 8{S,U}C1 benchmarks above. The only difference is the unavoidable latency of the vector loads in either case.
If you haven't run across it yet: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual describes instruction behavior in great detail.
> I ran a slightly modified version of your example program through the P9 pipeline simulator to characterize the behavior. Moving the data dependency to vadd gives the core freedom to execute the vmsum* sooner, thus hiding its higher latency in these examples.
This could be reasonable, but when I add a vadd after the vmsum in the VRT==VRC case it still loses latency. I did that in this test and here are the results:
```
Latency of v_msum(vmsummbm)                 : 186ms
Latency of v_msum_vadd(vmsummbm)            : 238ms
Latency of v_msum_no_overlap(vmsummbm)      : 53ms
Latency of v_msum_vadd_no_overlap(vmsummbm) : 61ms
```
Edit: "Moving the data dependency to vadd" — I think maybe that's it, thank you.
> If you haven't run across it yet: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual describes instruction behavior in great detail.
Sure, I will look at it; hopefully I can find an answer that satisfies me.
```cpp
VSX_IMPL_2VRG(vec_udword2, vec_uint4, vmuleuw, vec_mule)
VSX_IMPL_2VRG(vec_dword2,  vec_int4,  vmulosw, vec_mulo)
VSX_IMPL_2VRG(vec_udword2, vec_uint4, vmulouw, vec_mulo)
VSX_IMPL_2VRG(vec_dword2,  vec_int4,  vmulosw, vec_mule)
```
Any reason why you changed the vec_mule (even) to use vmulosw (odd) and vice versa?
@ChipKerchner Some operations are indirectly endian sensitive on PPC. The vector intrinsics try to hide this as best as possible.
@seiko2plus thank you for fixing.
These instruction sets are affected by endian mode, and for historical reasons the mnemonics are based on big-endian. In other words, vmulosw in little-endian returns the product of the even elements.
- add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
- add a boolean param to all v_dotprod&_expand intrinsics that changes the addition order between pairs on some platforms, in order to reach maximum optimization when only the sum across all lanes matters
- fix clang build on ppc64le
- support wide universal intrinsics for dotProd_32s
- remove raw SIMD and activate universal intrinsics for dotProd_8
- implement SIMD optimization for dotProd_s16&u16
- extend performance test data types of dotprod
- fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian it must be swapped)
- optimize v_mul_expand(int32) on VSX
```cpp
Multiply values in two registers and sum adjacent result pairs.

@cond Doxygen_Suppress
@param ignore_order - if true, the intrinsic may perform an unordered sum between result pairs on some platforms,
```
IMO it would be better to have 2 separate functions for these cases. So I prefer to retain the v_dotprod intrinsic as-is and add a separate intrinsic, e.g. v_dotprod_fast, for the unordered-sum case. To me that looks like a cleaner and less error-prone approach.
Well, the call is in your hands, but in my opinion I prefer having a boolean parameter over introducing a new intrinsic. It seems more robust to me, and I don't see what errors it could cause. But note, your choice will become a policy we are all going to follow.
The reason I prefer to avoid a behavior-affecting parameter is that it is possible for its value to be determined only at run-time. While in most cases the compiler could pre-evaluate the value and optimize out the condition, there could still be complex cases that ruin the performance.
However, since we are going to define a policy, IMO it makes sense to spend some time figuring out all possible advantages and drawbacks of both approaches (and maybe even discover another option) to decide on the one suitable and convenient for as many cases as possible.
IMO it will be useful to start discussing intrinsics development guideline #15656
@alalek @seiko2plus What do you think?
…prod_fast&v_dotprod_expand_fast: these changes were made based on the "terfendail" review
Great work 👍🏼
```cpp
return v_float64x8(_mm512_cvtepi64_pd(v.val));
#else
// constants encoded as floating-point
__m512i magic_i_lo = _mm512_set1_epi64x(0x4330000000000000); // 2^52
```
This block is broken, there are no intrinsics _mm512_set1_epi64x and _mm512_blend_epi32.
Nice catch, I just forgot to remove that part of the code. It should be just _mm512_cvtepi64_pd(), since the minimum supported feature set is SKX, which enables AVX512DQ.
relates #15506
merge with opencv_extra opencv/opencv_extra#674
This pull request changes
TODO

Performance tests (args used with opencv_perf_core; CPU, OS, and benchmark details elided)
- X86: BASELINE AVX2 / SSE4_1 / SSE3 Geometric mean (ms)
- PPC64LE: BASELINE VSX Geometric mean (ms)
- ARM: BASELINE NEON Geometric mean (ms)