
core: rework and optimize SIMD implementation of dotProd#15510

Merged
alalek merged 2 commits into opencv:3.4 from
seiko2plus:issue15506
Oct 7, 2019

Conversation

@seiko2plus
Contributor

@seiko2plus seiko2plus commented Sep 12, 2019

relates #15506
merge with opencv_extra opencv/opencv_extra#674

This pull request changes

  • add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
  • add a boolean param to all v_dotprod&_expand intrinsics that changes the order of addition between
    pairs on some platforms, in order to reach maximum optimization when only the sum across all lanes matters
  • fix clang build on ppc64le
  • support wide universal intrinsics for dotProd_32s
  • remove raw SIMD and activate universal intrinsics for dotProd_8
  • implement SIMD optimization for dotProd_s16&u16
  • extend performance test data types of dot prod
  • fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian they must be swapped)
  • optimize v_mul_expand(int32) on VSX
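The semantics of the new widening intrinsics can be pinned down with a scalar reference. This is a minimal sketch of the widening multiply-accumulate that v_dotprod_expand performs on u8 lanes; the function name here is illustrative, not the OpenCV universal-intrinsic API:

```cpp
#include <cstddef>
#include <cstdint>

// Scalar reference for the widening dot product: every u8*u8
// product is widened to u32 before accumulation, so no
// intermediate overflow can occur.
// (Illustrative sketch only, not the OpenCV API.)
static inline uint32_t dotprod_expand_u8(const uint8_t* a, const uint8_t* b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (uint32_t)a[i] * (uint32_t)b[i]; // widen, then accumulate
    return sum;
}
```

The SIMD versions compute the same value, but pairwise and lane-parallel; the ignore_order flag then only relaxes the order in which the pair sums are combined.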

TODO

  • performance tests
  • ppc64le clang coverage
  • cover msa & wasm
  • update perf data
force_builders=Custom,armv7,armv8,linux
buildworker:Custom=linux-1
docker_image:Custom=mips64el
pw_compilers=gcc-4.9, power9_gcc-5, power9_gcc-6, power9_gcc-7, clang-4, clang-5, clang-6
pw_with_opencv_extra=branch:issue15506
# https://ocv-power.imavr.com/#/builders/3/builds/84
# imgproc tests crash on clang 4, 5; disabled for now
pw_disable_tests=rgbd, shape, imgproc

Performance tests

Args used with opencv_perf_core

--gtest_filter=*dot* --perf_threads=1 --perf_min_samples=1000 --gtest_output=xml:[path]

X86

CPU
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 142
Model name:            Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
Stepping:              10
CPU MHz:               818.725
CPU max MHz:           4000.0000
CPU min MHz:           400.0000
BogoMIPS:              3984.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
OS
Linux seiko-pc 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5 (2019-06-19) x86_64 GNU/Linux
gcc version 8.3.0 (Debian 8.3.0-2)
Debian GNU/Linux 10 (buster)

Benchmark

BASELINE AVX2 Geometric mean (ms)
Name of Test Before After After vs Before (x-factor)
dot::MatType_Length::(8UC1, 32) 0.000 0.000 1.16
dot::MatType_Length::(8UC1, 64) 0.000 0.000 1.23
dot::MatType_Length::(8UC1, 128) 0.001 0.001 1.26
dot::MatType_Length::(8UC1, 256) 0.003 0.003 1.24
dot::MatType_Length::(8UC1, 512) 0.014 0.012 1.21
dot::MatType_Length::(8UC1, 1024) 0.054 0.046 1.19
dot::MatType_Length::(8SC1, 32) 0.000 0.000 1.15
dot::MatType_Length::(8SC1, 64) 0.000 0.000 1.28
dot::MatType_Length::(8SC1, 128) 0.001 0.001 1.34
dot::MatType_Length::(8SC1, 256) 0.003 0.003 1.33
dot::MatType_Length::(8SC1, 512) 0.014 0.011 1.18
dot::MatType_Length::(8SC1, 1024) 0.051 0.045 1.13
dot::MatType_Length::(16UC1, 32) 0.001 0.000 4.77
dot::MatType_Length::(16UC1, 64) 0.003 0.000 6.30
dot::MatType_Length::(16UC1, 128) 0.010 0.002 6.20
dot::MatType_Length::(16UC1, 256) 0.041 0.007 6.01
dot::MatType_Length::(16UC1, 512) 0.165 0.027 6.05
dot::MatType_Length::(16UC1, 1024) 0.665 0.112 5.94
dot::MatType_Length::(16SC1, 32) 0.001 0.000 6.08
dot::MatType_Length::(16SC1, 64) 0.003 0.000 8.86
dot::MatType_Length::(16SC1, 128) 0.010 0.001 9.09
dot::MatType_Length::(16SC1, 256) 0.041 0.005 8.56
dot::MatType_Length::(16SC1, 512) 0.166 0.021 8.02
dot::MatType_Length::(16SC1, 1024) 0.668 0.085 7.85
dot::MatType_Length::(32SC1, 32) 0.000 0.000 1.35
dot::MatType_Length::(32SC1, 64) 0.001 0.001 1.35
dot::MatType_Length::(32SC1, 128) 0.004 0.003 1.40
dot::MatType_Length::(32SC1, 256) 0.017 0.013 1.34
dot::MatType_Length::(32SC1, 512) 0.069 0.051 1.35
dot::MatType_Length::(32SC1, 1024) 0.335 0.264 1.27
dot::MatType_Length::(32FC1, 32) 0.000 0.000 0.98
dot::MatType_Length::(32FC1, 64) 0.000 0.000 0.97
dot::MatType_Length::(32FC1, 128) 0.002 0.002 1.01
dot::MatType_Length::(32FC1, 256) 0.010 0.010 1.02
dot::MatType_Length::(32FC1, 512) 0.039 0.038 1.00
dot::MatType_Length::(32FC1, 1024) 0.247 0.249 0.99
BASELINE SSE4_1 Geometric mean (ms)
Name of Test Before After After vs Before (x-factor)
dot::MatType_Length::(8UC1, 32) 0.000 0.000 0.99
dot::MatType_Length::(8UC1, 64) 0.000 0.000 0.98
dot::MatType_Length::(8UC1, 128) 0.001 0.001 0.97
dot::MatType_Length::(8UC1, 256) 0.005 0.005 0.97
dot::MatType_Length::(8UC1, 512) 0.020 0.020 1.00
dot::MatType_Length::(8UC1, 1024) 0.079 0.076 1.04
dot::MatType_Length::(8SC1, 32) 0.000 0.000 1.00
dot::MatType_Length::(8SC1, 64) 0.000 0.000 1.00
dot::MatType_Length::(8SC1, 128) 0.001 0.001 1.00
dot::MatType_Length::(8SC1, 256) 0.005 0.005 0.99
dot::MatType_Length::(8SC1, 512) 0.018 0.019 0.98
dot::MatType_Length::(8SC1, 1024) 0.073 0.073 1.00
dot::MatType_Length::(16UC1, 32) 0.001 0.000 3.05
dot::MatType_Length::(16UC1, 64) 0.003 0.001 3.43
dot::MatType_Length::(16UC1, 128) 0.011 0.003 3.54
dot::MatType_Length::(16UC1, 256) 0.044 0.012 3.55
dot::MatType_Length::(16UC1, 512) 0.177 0.050 3.53
dot::MatType_Length::(16UC1, 1024) 0.707 0.199 3.55
dot::MatType_Length::(16SC1, 32) 0.001 0.000 5.09
dot::MatType_Length::(16SC1, 64) 0.003 0.000 6.49
dot::MatType_Length::(16SC1, 128) 0.011 0.002 6.97
dot::MatType_Length::(16SC1, 256) 0.044 0.006 6.92
dot::MatType_Length::(16SC1, 512) 0.177 0.026 6.84
dot::MatType_Length::(16SC1, 1024) 0.708 0.103 6.85
dot::MatType_Length::(32SC1, 32) 0.000 0.000 1.02
dot::MatType_Length::(32SC1, 64) 0.001 0.001 1.06
dot::MatType_Length::(32SC1, 128) 0.005 0.005 1.01
dot::MatType_Length::(32SC1, 256) 0.019 0.018 1.05
dot::MatType_Length::(32SC1, 512) 0.078 0.074 1.06
dot::MatType_Length::(32SC1, 1024) 0.361 0.346 1.04
dot::MatType_Length::(32FC1, 32) 0.000 0.000 1.01
dot::MatType_Length::(32FC1, 64) 0.000 0.000 0.95
dot::MatType_Length::(32FC1, 128) 0.002 0.002 0.99
dot::MatType_Length::(32FC1, 256) 0.010 0.010 1.00
dot::MatType_Length::(32FC1, 512) 0.038 0.039 0.99
dot::MatType_Length::(32FC1, 1024) 0.261 0.236 1.11
BASELINE SSE3 Geometric mean (ms)
Name of Test Before After After vs Before (x-factor)
dot::MatType_Length::(8UC1, 32) 0.000 0.000 0.99
dot::MatType_Length::(8UC1, 64) 0.000 0.000 0.99
dot::MatType_Length::(8UC1, 128) 0.001 0.001 1.00
dot::MatType_Length::(8UC1, 256) 0.005 0.005 1.00
dot::MatType_Length::(8UC1, 512) 0.018 0.019 0.99
dot::MatType_Length::(8UC1, 1024) 0.072 0.073 0.97
dot::MatType_Length::(8SC1, 32) 0.000 0.000 1.10
dot::MatType_Length::(8SC1, 64) 0.000 0.000 1.10
dot::MatType_Length::(8SC1, 128) 0.001 0.001 1.12
dot::MatType_Length::(8SC1, 256) 0.006 0.005 1.11
dot::MatType_Length::(8SC1, 512) 0.023 0.021 1.09
dot::MatType_Length::(8SC1, 1024) 0.092 0.085 1.08
dot::MatType_Length::(16UC1, 32) 0.001 0.000 3.05
dot::MatType_Length::(16UC1, 64) 0.003 0.001 3.42
dot::MatType_Length::(16UC1, 128) 0.011 0.003 3.52
dot::MatType_Length::(16UC1, 256) 0.044 0.013 3.51
dot::MatType_Length::(16UC1, 512) 0.177 0.050 3.54
dot::MatType_Length::(16UC1, 1024) 0.707 0.202 3.51
dot::MatType_Length::(16SC1, 32) 0.001 0.000 5.16
dot::MatType_Length::(16SC1, 64) 0.003 0.000 6.50
dot::MatType_Length::(16SC1, 128) 0.011 0.002 6.93
dot::MatType_Length::(16SC1, 256) 0.044 0.007 6.74
dot::MatType_Length::(16SC1, 512) 0.177 0.026 6.75
dot::MatType_Length::(16SC1, 1024) 0.708 0.108 6.58
dot::MatType_Length::(32SC1, 32) 0.000 0.000 1.03
dot::MatType_Length::(32SC1, 64) 0.001 0.001 1.06
dot::MatType_Length::(32SC1, 128) 0.005 0.005 1.04
dot::MatType_Length::(32SC1, 256) 0.020 0.019 1.02
dot::MatType_Length::(32SC1, 512) 0.078 0.076 1.03
dot::MatType_Length::(32SC1, 1024) 0.367 0.364 1.01
dot::MatType_Length::(32FC1, 32) 0.000 0.000 1.00
dot::MatType_Length::(32FC1, 64) 0.000 0.000 0.97
dot::MatType_Length::(32FC1, 128) 0.002 0.002 1.00
dot::MatType_Length::(32FC1, 256) 0.010 0.010 1.00
dot::MatType_Length::(32FC1, 512) 0.042 0.042 0.99
dot::MatType_Length::(32FC1, 1024) 0.243 0.238 1.02

PPC64LE

CPU
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           8
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7
cat /proc/cpuinfo
processor	: 0
cpu		: POWER9 (architected), altivec supported
clock		: 2200.000000MHz
revision	: 2.2 (pvr 004e 1202)
timebase	: 512000000
platform	: pSeries
model		: IBM pSeries (emulated by qemu)
machine		: CHRP IBM pSeries (emulated by qemu)
MMU		: Radix

OS
Linux 8b2db3b0dfac 4.19.0-2-powerpc64le #1 SMP Debian 4.19.16-1 (2019-01-17) ppc64le ppc64le ppc64le GNU/Linux
gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1~18.10.1)
Ubuntu 18.10 (cosmic)

Benchmark

BASELINE VSX Geometric mean (ms)
Name of Test Before After After vs Before (x-factor)
dot::MatType_Length::(8UC1, 32) 0.000 0.000 1.30
dot::MatType_Length::(8UC1, 64) 0.001 0.001 1.63
dot::MatType_Length::(8UC1, 128) 0.003 0.002 1.76
dot::MatType_Length::(8UC1, 256) 0.012 0.006 1.84
dot::MatType_Length::(8UC1, 512) 0.046 0.025 1.84
dot::MatType_Length::(8UC1, 1024) 0.182 0.098 1.86
dot::MatType_Length::(8SC1, 32) 0.000 0.000 1.04
dot::MatType_Length::(8SC1, 64) 0.001 0.001 1.05
dot::MatType_Length::(8SC1, 128) 0.003 0.003 1.05
dot::MatType_Length::(8SC1, 256) 0.012 0.011 1.05
dot::MatType_Length::(8SC1, 512) 0.047 0.044 1.07
dot::MatType_Length::(8SC1, 1024) 0.193 0.174 1.11
dot::MatType_Length::(16UC1, 32) 0.001 0.001 1.84
dot::MatType_Length::(16UC1, 64) 0.003 0.002 2.00
dot::MatType_Length::(16UC1, 128) 0.012 0.006 2.03
dot::MatType_Length::(16UC1, 256) 0.048 0.023 2.05
dot::MatType_Length::(16UC1, 512) 0.193 0.093 2.07
dot::MatType_Length::(16UC1, 1024) 0.770 0.372 2.07
dot::MatType_Length::(16SC1, 32) 0.001 0.000 2.73
dot::MatType_Length::(16SC1, 64) 0.004 0.001 3.26
dot::MatType_Length::(16SC1, 128) 0.016 0.005 3.36
dot::MatType_Length::(16SC1, 256) 0.063 0.018 3.41
dot::MatType_Length::(16SC1, 512) 0.250 0.074 3.41
dot::MatType_Length::(16SC1, 1024) 1.001 0.293 3.41
dot::MatType_Length::(32SC1, 32) 0.001 0.001 1.07
dot::MatType_Length::(32SC1, 64) 0.003 0.002 1.12
dot::MatType_Length::(32SC1, 128) 0.010 0.009 1.13
dot::MatType_Length::(32SC1, 256) 0.040 0.035 1.13
dot::MatType_Length::(32SC1, 512) 0.159 0.140 1.14
dot::MatType_Length::(32SC1, 1024) 0.634 0.560 1.13
dot::MatType_Length::(32FC1, 32) 0.000 0.000 0.95
dot::MatType_Length::(32FC1, 64) 0.001 0.001 1.00
dot::MatType_Length::(32FC1, 128) 0.004 0.004 1.00
dot::MatType_Length::(32FC1, 256) 0.016 0.016 1.00
dot::MatType_Length::(32FC1, 512) 0.062 0.063 1.00
dot::MatType_Length::(32FC1, 1024) 0.249 0.254 0.98

ARM

CPU
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           2
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         2314.0000
CPU min MHz:         403.0000
BogoMIPS:            52.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

OS
Linux localhost 4.14.62-16641116 #1 SMP PREEMPT Fri Aug 23 14:14:12 KST 2019 aarch64 GNU/Linux
gcc version 8.3.0 (Debian 8.3.0-6)
Debian GNU/Linux 10 (buster)

Benchmark

BASELINE NEON Geometric mean (ms)
Name of Test Before After After vs Before (x-factor)
dot::MatType_Length::(8UC1, 32) 0.004 0.003 1.38
dot::MatType_Length::(8UC1, 64) 0.007 0.005 1.46
dot::MatType_Length::(8UC1, 128) 0.024 0.013 1.77
dot::MatType_Length::(8UC1, 256) 0.095 0.044 2.17
dot::MatType_Length::(8UC1, 512) 0.304 0.154 1.97
dot::MatType_Length::(8UC1, 1024) 0.770 0.444 1.73
dot::MatType_Length::(8SC1, 32) 0.002 0.002 1.20
dot::MatType_Length::(8SC1, 64) 0.004 0.002 1.77
dot::MatType_Length::(8SC1, 128) 0.013 0.006 2.11
dot::MatType_Length::(8SC1, 256) 0.047 0.020 2.37
dot::MatType_Length::(8SC1, 512) 0.182 0.073 2.49
dot::MatType_Length::(8SC1, 1024) 0.737 0.306 2.41
dot::MatType_Length::(16UC1, 32) 0.003 0.002 1.46
dot::MatType_Length::(16UC1, 64) 0.008 0.004 1.86
dot::MatType_Length::(16UC1, 128) 0.029 0.013 2.16
dot::MatType_Length::(16UC1, 256) 0.111 0.048 2.32
dot::MatType_Length::(16UC1, 512) 0.444 0.200 2.22
dot::MatType_Length::(16UC1, 1024) 1.783 0.796 2.24
dot::MatType_Length::(16SC1, 32) 0.003 0.002 1.60
dot::MatType_Length::(16SC1, 64) 0.008 0.004 2.25
dot::MatType_Length::(16SC1, 128) 0.029 0.011 2.67
dot::MatType_Length::(16SC1, 256) 0.111 0.038 2.93
dot::MatType_Length::(16SC1, 512) 0.444 0.152 2.91
dot::MatType_Length::(16SC1, 1024) 1.783 0.615 2.90
dot::MatType_Length::(32SC1, 32) 0.003 0.002 1.31
dot::MatType_Length::(32SC1, 64) 0.008 0.005 1.53
dot::MatType_Length::(32SC1, 128) 0.028 0.017 1.65
dot::MatType_Length::(32SC1, 256) 0.107 0.063 1.71
dot::MatType_Length::(32SC1, 512) 0.426 0.273 1.56
dot::MatType_Length::(32SC1, 1024) 1.664 1.071 1.55
dot::MatType_Length::(32FC1, 32) 0.002 0.002 0.98
dot::MatType_Length::(32FC1, 64) 0.004 0.004 0.98
dot::MatType_Length::(32FC1, 128) 0.012 0.012 1.00
dot::MatType_Length::(32FC1, 256) 0.042 0.043 0.96
dot::MatType_Length::(32FC1, 512) 0.232 0.240 0.97
dot::MatType_Length::(32FC1, 1024) 0.935 0.938 1.00

@terfendail
Contributor

IMHO it makes sense to extend v_dotprod to v_uint32 as well

@seiko2plus seiko2plus force-pushed the issue15506 branch 5 times, most recently from 1bed950 to 68cbd8f Compare September 18, 2019 15:43
@seiko2plus seiko2plus marked this pull request as ready for review September 18, 2019 15:47
@seiko2plus seiko2plus force-pushed the issue15506 branch 4 times, most recently from 8aabf98 to 14e71f0 Compare September 25, 2019 14:49
@seiko2plus seiko2plus changed the title core: rework SIMD implementation of dotProd_32s core: rework and optimize SIMD implementation of dotProd Sep 25, 2019
@seiko2plus seiko2plus force-pushed the issue15506 branch 7 times, most recently from ecbe4d8 to f9df6e4 Compare September 29, 2019 15:56
CV_UNUSED(ignore_order);
// `vmsumshm` and `vmsumubm` show a lack of performance
// when the same register is used in `VRT` and `VRC`
return v_int32x4(vec_msum(a.val, b.val, vec_int4_z)) + c;
Contributor Author

@pmur, @ChipKerchner, has anyone faced this issue before? I couldn't find any notes in ISA 2.07 or 3.0, and it seems to me like a CPU issue. I tested on both POWER8 and POWER9 (little-endian) KVM/VM.

Contributor

I'll ask around. Is using vec_int4_z a workaround to avoid generating those instructions?

Contributor Author

@seiko2plus seiko2plus Sep 30, 2019

@pmur, well, let me explain it in a better way.
Normally we would implement it as follows, with the third param of the dot product going directly to vec_msum:

return v_int32x4(vec_msum(a.val, b.val, c.val));

Assuming that a=rgv1, b=rgv2, c=rgv3 and the return value goes to r=rgv0, the compiler should produce the following instruction:

vmsumshm v0,v1,v2,v3

Everything is fine and it shows great performance, but if c overlaps with r, the following is produced instead:

vmsumshm v0,v1,v2,v0

and based on many tests I found a massive loss of performance.

As a quick fix, I pass vec_int4_z (splats(0)) as the third argument and then add the product to c. That produces three instructions instead of one, as follows, and still shows better performance:

vspltisw v10, 0
vmsumshm v11, v1, v2, v10
vadduwm v0, v0, v11

You can try it yourself: just remove the workaround from all v_dotprod&_expand and compare the performance; you will find more than a 30% loss.
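The pattern behind this workaround can be sketched in scalar code (not the actual VSX intrinsics): instead of feeding the running accumulator into the multiply-sum itself, the product is formed against zero and added afterwards, so only a cheap add carries the loop dependency. Both helpers below are illustrative names:

```cpp
#include <cstddef>
#include <cstdint>

// Models the VRT==VRC case: the accumulator is an input of the
// expensive fused multiply-sum on every iteration.
static int64_t msum_direct(const int16_t* a, const int16_t* b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = (int64_t)a[i] * b[i] + acc; // acc feeds the fused op
    return acc;
}

// Models the workaround: multiply-sum against "zero", then a cheap
// add carries the dependency chain. Same result, shorter critical path.
static int64_t msum_split(const int16_t* a, const int16_t* b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
    {
        int64_t prod = (int64_t)a[i] * b[i] + 0; // product formed against zero
        acc += prod;                             // dependency moved to the add
    }
    return acc;
}
```

Both compute the same dot product; the difference only matters at the instruction-latency level discussed in this thread.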

Contributor Author

The following benchmark, with and without the workaround, shows how two instructions, vmsumshm and vadduwm (since vspltisw 0 is already loaded once), can be faster than the single instruction vmsumshm when the same register is used in VRT and VRC.

Name of Test Without With With vs Without (x-factor)
dot::MatType_Length::(8UC1, 32) 0.000 0.000 1.04
dot::MatType_Length::(8UC1, 64) 0.001 0.001 1.26
dot::MatType_Length::(8UC1, 128) 0.002 0.002 1.28
dot::MatType_Length::(8UC1, 256) 0.008 0.006 1.31
dot::MatType_Length::(8UC1, 512) 0.033 0.025 1.32
dot::MatType_Length::(8UC1, 1024) 0.129 0.098 1.31
dot::MatType_Length::(8SC1, 32) 0.000 0.000 1.21
dot::MatType_Length::(8SC1, 64) 0.001 0.001 1.33
dot::MatType_Length::(8SC1, 128) 0.004 0.003 1.37
dot::MatType_Length::(8SC1, 256) 0.015 0.011 1.39
dot::MatType_Length::(8SC1, 512) 0.061 0.044 1.39
dot::MatType_Length::(8SC1, 1024) 0.243 0.174 1.39
The "without workaround" patch can be found here

Contributor

I agree this workaround improves the dot product operation significantly. However, I don't think the issue is VRT==VRC. This seems to be a second-order effect due to the very small kernel getting generated.

Contributor Author

Well, I did another test covering the whole instruction set (vmsummbm, vmsumshm, vmsumshs, vmsumubm, vmsumuhm, vmsumuhs), and the situation changed for the worse.
Now we lose performance if the register used in VRT is also used in any of (VRA, VRB, VRC).
Please, could you check the test source code and tell me if I missed something?

Contributor

@seiko2plus thanks. I think I see what is going on. For a very small kernel (e.g. the 8UC1 above), it's pretty much just a chain of dependent instructions.

What you're seeing is the extra latency of serializing more expensive instructions. Integer vector adds are lower latency, and thus you see a proportional speedup.

This is likely less desirable for a more complicated kernel, but for now I think you can leave the workaround as-is. I would recommend updating the comment, however.

Contributor Author

@seiko2plus seiko2plus Oct 2, 2019

@pmur, if you check the test file I mentioned before, you may realize that we really are facing strange behavior, and I still have doubts that it may be a hardware issue, in which case the workaround should be at the compiler level. However, I removed the comment because I'm not 100% sure about the reason. I also made some changes to the code, so the workaround only applies when ignore_order is true, because it may affect other cases negatively. I will try to investigate this issue in depth when I get free time. Thank you.

Contributor

@pmur pmur Oct 2, 2019

I ran a slightly modified version of your example program through the P9 pipeline simulator to characterize the behavior. Moving the data dependency to vadd gives the core freedom to execute the vmsum* sooner, thus hiding its higher latency in these examples.

Edit: this analysis should be equally applicable to most of the 8{S,U}C1 benchmarks above. The only difference is the unavoidable latency of the vector loads in either case.

If you haven't run across it yet: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual describes instruction behavior in great detail.

Contributor Author

@seiko2plus seiko2plus Oct 2, 2019

I ran a slightly modified version of your example program through the P9 pipeline simulator to characterize the behavior. Moving the data dependency to vadd gives the core freedom to execute the vmsum* sooner, thus hiding its higher latency in these examples.

This could be reasonable, but wait: when I add vadd to vmsum in the VRT==VRC case, it still loses latency. I did it in this test and here's the result.

Latency of v_msum(vmsummbm) : 186ms
Latency of v_msum_vadd(vmsummbm) : 238ms
Latency of v_msum_no_overlap(vmsummbm) : 53ms
Latency of v_msum_vadd_no_overlap(vmsummbm) : 61ms

Edit: regarding "Moving the data dependency to vadd", I think that may be it. Thank you.

If you haven't run across it yet: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual describes instruction behavior in great detail.

Sure, I will look at it; hopefully I can find a satisfying answer.

@seiko2plus seiko2plus force-pushed the issue15506 branch 2 times, most recently from 3797471 to afe2e05 Compare September 30, 2019 12:44
VSX_IMPL_2VRG(vec_udword2, vec_uint4, vmuleuw, vec_mule)
VSX_IMPL_2VRG(vec_dword2, vec_int4, vmulosw, vec_mulo)
VSX_IMPL_2VRG(vec_udword2, vec_uint4, vmulouw, vec_mulo)
VSX_IMPL_2VRG(vec_dword2, vec_int4, vmulosw, vec_mule)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why you changed the vec_mule (even) to use vmulosw (odd) and vice versa?

Contributor

@ChipKerchner Some operations are indirectly endian sensitive on PPC. The vector intrinsics try to hide this as best as possible.

@seiko2plus thank you for fixing.

Contributor Author

These instruction sets are affected by endian mode, and for historical reasons the mnemonics are based on big-endian. In other words, vmulosw in little-endian returns the product of the even elements.
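The even/odd swap can be illustrated with a scalar model of the two widening multiplies over four int32 lanes (illustrative helpers, not the VSX API). Indices are array order as seen by a little-endian C program; the big-endian-based vmulesw/vmulosw mnemonics number elements the other way round, which is why the GCC workaround has to swap the wrappers:

```cpp
#include <cstdint>

// "Even" widening multiply as the C program sees it: lanes 0 and 2.
static void mule_i32(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    out[0] = (int64_t)a[0] * b[0]; // even elements in array order
    out[1] = (int64_t)a[2] * b[2];
}

// "Odd" widening multiply as the C program sees it: lanes 1 and 3.
static void mulo_i32(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    out[0] = (int64_t)a[1] * b[1]; // odd elements in array order
    out[1] = (int64_t)a[3] * b[3];
}
```

On little-endian VSX, the instruction named for "odd" elements yields what this model calls the even lanes, hence the swapped VSX_IMPL_2VRG mappings in the diff above.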

@seiko2plus seiko2plus force-pushed the issue15506 branch 3 times, most recently from ead686b to 0522f87 Compare October 2, 2019 05:43
  - add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
  - add a boolean param to all v_dotprod&_expand intrinsics that changes the order of addition between
    pairs on some platforms, in order to reach maximum optimization when only the sum across all lanes matters
  - fix clang build on ppc64le
  - support wide universal intrinsics for dotProd_32s
  - remove raw SIMD and activate universal intrinsics for dotProd_8
  - implement SIMD optimization for dotProd_s16&u16
  - extend performance test data types of dotprod
  - fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian they must be swapped)
  - optimize v_mul_expand(int32) on VSX
Member

@alalek alalek left a comment

Well done! Thank you 👍

Multiply values in two registers and sum adjacent result pairs.

@cond Doxygen_Suppress
@param ignore_order - if it's true, the intrinsic may perform an unordered sum between result pairs on some platforms,
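The ignore_order contract can be modeled in scalar code (hypothetical helpers, not the OpenCV API): two different pairings of the lane products produce different intermediate layouts but the same final reduction for integer lanes, which is why a platform is free to reorder when only the total sum matters:

```cpp
#include <cstdint>

// Ordered pairing: (0,1) then (2,3), as the documented layout specifies.
static int64_t dot_ordered(const int32_t a[4], const int32_t b[4])
{
    int64_t p0 = (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1]; // pair (0,1)
    int64_t p1 = (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3]; // pair (2,3)
    return p0 + p1;
}

// Reordered pairing: (0,2) then (1,3), as a platform might prefer
// when ignore_order is true. Same total for integer lanes.
static int64_t dot_interleaved(const int32_t a[4], const int32_t b[4])
{
    int64_t p0 = (int64_t)a[0] * b[0] + (int64_t)a[2] * b[2]; // pair (0,2)
    int64_t p1 = (int64_t)a[1] * b[1] + (int64_t)a[3] * b[3]; // pair (1,3)
    return p0 + p1;
}
```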
Contributor

@terfendail terfendail Oct 4, 2019

IMO it would be better to have 2 separate functions for these cases. So I prefer to retain the v_dotprod intrinsic as-is and add a separate intrinsic, e.g. v_dotprod_fast, for the unordered-sum case. To me that looks like a cleaner and less error-prone approach.

Contributor Author

Well, the call is in your hands, but in my opinion I prefer having a boolean parameter over introducing new intrinsics; it seems more robust to me, and I don't see any errors it could cause. But your choice will set a policy we are all going to follow.

Contributor

@terfendail terfendail Oct 7, 2019

The reason I prefer to avoid a behavior-affecting parameter is that its value may only be determined at run-time. While in most cases the compiler could pre-evaluate the value and optimize out the condition, there could still be complex cases that ruin the performance.
However, since we are going to define a policy, IMO it makes sense to spend some time figuring out all the possible advantages and drawbacks of both approaches (and maybe even discover another option) to decide on the one suitable and convenient for as many cases as possible.
IMO it would be useful to start discussing an intrinsics development guideline in #15656.
@alalek @seiko2plus What do you think?

@seiko2plus seiko2plus requested a review from alalek October 5, 2019 16:17
@seiko2plus seiko2plus requested a review from terfendail October 5, 2019 16:47
@seiko2plus seiko2plus force-pushed the issue15506 branch 2 times, most recently from 162c9c1 to c5521cd Compare October 5, 2019 17:01
…prod_fast&v_dotprod_expand_fast

  these changes were made based on terfendail's review
@alalek alalek merged commit f2fe6f4 into opencv:3.4 Oct 7, 2019
@tompollok
Contributor

Great work 👍🏼

This was referenced Oct 8, 2019
return v_float64x8(_mm512_cvtepi64_pd(v.val));
#else
// constants encoded as floating-point
__m512i magic_i_lo = _mm512_set1_epi64x(0x4330000000000000); // 2^52
Contributor

This block is broken, there are no intrinsics _mm512_set1_epi64x and _mm512_blend_epi32.

Contributor Author

Nice catch, I just forgot to remove that part of the code. It should be just _mm512_cvtepi64_pd(), since the minimum supported feature set is SKX, which enables AVX512DQ.
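For context, the removed fallback was built on the classic "magic constant" trick of encoding integers into the mantissa of a double. A scalar sketch of that trick for a 32-bit input (illustrative only; the merged code simply uses _mm512_cvtepi64_pd):

```cpp
#include <cstdint>
#include <cstring>

// OR-ing a 32-bit integer into the low mantissa bits of 2^52
// produces the double 2^52 + x exactly, so subtracting 2^52
// recovers (double)x without an int-to-float conversion instruction.
static double u32_to_f64_magic(uint32_t x)
{
    uint64_t bits = 0x4330000000000000ULL | x; // 2^52 with x in the mantissa
    double d;
    std::memcpy(&d, &bits, sizeof d);          // reinterpret bits as a double
    return d - 4503599627370496.0;             // subtract 2^52
}
```

Converting a full int64 this way needs separate high/low halves and a final add, which is why the fallback used two magic constants; with AVX512DQ guaranteed, the single cvt instruction replaces all of it.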


7 participants