core: rework and optimize SIMD implementation of dotProd #15510
alalek merged 2 commits into opencv:3.4
Conversation
IMHO it makes sense to extend v_dotprod to v_uint32 as well.
```cpp
CV_UNUSED(ignore_order);
// `vmsumshm` and `vmsumubm` show a lack of performance when
// the same register is used in `VRT` and `VRC`
return v_int32x4(vec_msum(a.val, b.val, vec_int4_z)) + c;
```
@pmur, @ChipKerchner, has anyone faced this issue before? I couldn't find any notes in ISA 2.7 or 3, and it seems to me like a CPU issue. I tested on both Power8 and Power9 (little-endian) KVM/VM.
I'll ask around. Is using vec_int4_z a workaround to avoid generating those instructions?
@pmur, well, let me explain it in a better way.
Normally we would implement it as follows, with the third parameter of the dot product going directly to vec_msum:
```cpp
return v_int32x4(vec_msum(a.val, b.val, c.val));
```
Assuming that a=rgv1, b=rgv2, c=rgv3 and the return value r=rgv0, the compiler should produce the following instruction:
```
vmsumshm v0,v1,v2,v3
```
Now everything is fine and it shows great performance, but if c overlaps r, the following is produced:
```
vmsumshm v0,v1,v2,v0
```
and based on many tests I found a massive loss of performance.
As a quick fix, I passed vec_int4_z (splats(0)) as the third argument and then added the product to c. That produces three instructions instead of one, as follows, and still shows better performance:
```
vspltisw v10, 0
vmsumshm v11, v1, v2, v10
vadduwm v0, v0, v11
```
You can try it yourself: just remove the workaround from all v_dotprod&_expand and compare the performance; you will find more than a 30% loss.
The following benchmark, with and without the workaround, shows how two instructions (vmsumshm and vadduwm, given that vspltisw 0 is already loaded once) can be faster than the single instruction vmsumshm when the same register is used in VRT and VRC:
| Name of Test | Without | With | With vs Without (x-factor) |
|---|---|---|---|
| dot::MatType_Length::(8UC1, 32) | 0.000 | 0.000 | 1.04 |
| dot::MatType_Length::(8UC1, 64) | 0.001 | 0.001 | 1.26 |
| dot::MatType_Length::(8UC1, 128) | 0.002 | 0.002 | 1.28 |
| dot::MatType_Length::(8UC1, 256) | 0.008 | 0.006 | 1.31 |
| dot::MatType_Length::(8UC1, 512) | 0.033 | 0.025 | 1.32 |
| dot::MatType_Length::(8UC1, 1024) | 0.129 | 0.098 | 1.31 |
| dot::MatType_Length::(8SC1, 32) | 0.000 | 0.000 | 1.21 |
| dot::MatType_Length::(8SC1, 64) | 0.001 | 0.001 | 1.33 |
| dot::MatType_Length::(8SC1, 128) | 0.004 | 0.003 | 1.37 |
| dot::MatType_Length::(8SC1, 256) | 0.015 | 0.011 | 1.39 |
| dot::MatType_Length::(8SC1, 512) | 0.061 | 0.044 | 1.39 |
| dot::MatType_Length::(8SC1, 1024) | 0.243 | 0.174 | 1.39 |
The "without workaround" patch can be found here.
I agree this workaround improves the dot product operation significantly. However, I don't think the issue is VRT==VRC. This seems to be a second-order effect due to the very small kernel getting generated.
Well, I did another test covering the whole instruction set (vmsummbm, vmsumshm, vmsumshs, vmsumubm, vmsumuhm, vmsumuhs), and the situation got worse.
Now we lose performance if the register used in VRT is also used in any of (VRA, VRB, VRC).
Please, could you check the test source code and tell me if I missed something?
@seiko2plus thanks. I think I see what is going on. For a very small kernel (e.g. the 8UC1 above), it's pretty much just a chain of dependent instructions.
What you're seeing is the extra latency of serializing the more expensive instructions. Integer vector adds are lower latency, and thus you see a proportional speedup.
This is likely less desirable for a more complicated kernel, but for now I think you can leave the workaround as-is. I would recommend updating the comment, however.
@pmur, if you check the test file that I mentioned before, you may notice that we really face strange behavior, and I still have doubts that it may be a hardware issue whose workaround should be at the compiler level. However, I removed the comment because I'm not 100% sure about the reason. I also made some changes to the code, so the workaround only applies when ignore_order is true, because it may affect other cases negatively. I will try to investigate this issue more deeply when I get free time. Thank you.
I ran a slightly modified version of your example program through the P9 pipeline simulator to characterize the behavior. Moving the data dependency to vadd gives the core freedom to execute the vmsum* sooner, thus hiding its higher latency in these examples.
Edit: this analysis should be equally applicable to most of the 8{S,U}C1 benchmarks above. The only difference is the unavoidable latency of the vector loads in either case.
If you haven't run across it yet: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual describes instruction behavior in great detail.
> I ran a slightly modified version of your example program through the P9 pipeline simulator to characterize the behavior. Moving the data dependency to vadd gives the core freedom to execute the vmsum* sooner, thus hiding its higher latency in these examples.
This could be reasonable, but when I add a vadd after the vmsum in the VRT==VRC case it still loses latency. I did that in this test and here are the results:
```
Latency of v_msum(vmsummbm)                 : 186ms
Latency of v_msum_vadd(vmsummbm)            : 238ms
Latency of v_msum_no_overlap(vmsummbm)      : 53ms
Latency of v_msum_vadd_no_overlap(vmsummbm) : 61ms
```
Edit: "Moving the data dependency to vadd" — I think maybe that's it, thank you.
> If you haven't run across it yet: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual describes instruction behavior in great detail.
Sure, I will look at it; hopefully I can find an answer that satisfies me.
```cpp
VSX_IMPL_2VRG(vec_udword2, vec_uint4, vmuleuw, vec_mule)
VSX_IMPL_2VRG(vec_dword2,  vec_int4,  vmulosw, vec_mulo)
VSX_IMPL_2VRG(vec_udword2, vec_uint4, vmulouw, vec_mulo)
VSX_IMPL_2VRG(vec_dword2,  vec_int4,  vmulosw, vec_mule)
```
Any reason why you changed the vec_mule (even) to use vmulosw (odd) and vice versa?
@ChipKerchner Some operations are indirectly endian sensitive on PPC. The vector intrinsics try to hide this as best as possible.
@seiko2plus thank you for fixing.
These instruction sets are affected by endian mode, and for historical reasons the mnemonics are based on big-endian. In other words, vmulosw in little-endian returns the product of the even elements.
- add new universal intrinsics v_dotprod[int32], v_dotprod_expand[u&int8, u&int16, int32], v_cvt_f64(int64)
- add a boolean param to all v_dotprod&_expand intrinsics that changes the addition order between pairs on some platforms, in order to reach maximum optimization when only the sum across all lanes matters
- fix clang build on ppc64le
- support wide universal intrinsics for dotProd_32s
- remove raw SIMD and activate universal intrinsics for dotProd_8
- implement SIMD optimization for dotProd_s16&u16
- extend performance test data types of dotprod
- fix GCC VSX workaround of vec_mule and vec_mulo (in little-endian it must be swapped)
- optimize v_mul_expand(int32) on VSX
```cpp
Multiply values in two registers and sum adjacent result pairs.

@cond Doxygen_Suppress
@param ignore_order - if true, the intrinsic may perform an unordered sum between result pairs on some platforms,
```
IMO it would be better to have 2 separate functions for these cases. So I prefer to retain the v_dotprod intrinsic as-is and add a separate intrinsic, e.g. v_dotprod_fast, for the unordered-sum case. To me that looks like a cleaner and less error-prone approach.
Well, the call is in your hands, but in my opinion I prefer having a boolean parameter over introducing a new intrinsic. It seems more robust to me, and I don't see what errors it could cause. But note, your choice will become a policy we are all going to follow.
The reason I prefer to avoid a behavior-affecting parameter is that it is possible for its value to be determined only at run-time. While in most cases the compiler could pre-evaluate the value and optimize out the condition, there could still be complex cases that ruin the performance.
However, since we are going to define a policy, IMO it makes sense to spend some time figuring out all possible advantages and drawbacks of both approaches (and maybe even discover another option) to decide on the one suitable and convenient for as many cases as possible.
IMO it will be useful to start discussing intrinsics development guideline #15656
@alalek @seiko2plus What do you think?
…prod_fast&v_dotprod_expand_fast: these changes were made based on the "terfendail" review
Great work 👍🏼
```cpp
return v_float64x8(_mm512_cvtepi64_pd(v.val));
#else
// constants encoded as floating-point
__m512i magic_i_lo = _mm512_set1_epi64x(0x4330000000000000); // 2^52
```
This block is broken, there are no intrinsics _mm512_set1_epi64x and _mm512_blend_epi32.
Nice catch, I just forgot to remove that part of the code. It should be just _mm512_cvtepi64_pd(), since the minimum supported feature set is SKX, which enables AVX512DQ.
relates #15506
merge with opencv_extra opencv/opencv_extra#674
This pull request changes
TODO

Performance tests (args used with opencv_perf_core; CPU, OS, and benchmark details elided)
- X86: BASELINE AVX2 / SSE4_1 / SSE3 Geometric mean (ms)
- PPC64LE: BASELINE VSX Geometric mean (ms)
- ARM: BASELINE NEON Geometric mean (ms)