Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX).#15274
Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX).#15274alalek merged 4 commits intoopencv:3.4from ChipKerchner:lkpyramidToHal
Conversation
modules/video/src/lkpyramid.cpp
Outdated
| v_store_aligned(A11buf, qA11); | ||
| v_store_aligned(A12buf, qA12); | ||
| v_store_aligned(A22buf, qA22); | ||
| iA11 += A11buf[0] + A11buf[1] + A11buf[2] + A11buf[3]; |
There was a problem hiding this comment.
Probably we should use v_reduce_sum() here instead of temporary buffer.
modules/video/src/lkpyramid.cpp
Outdated
| _mm_store_ps(bbuf, _mm_add_ps(qb0, qb1)); | ||
| v_store_aligned(bbuf, qb0 + qb1); | ||
| ib1 += bbuf[0] + bbuf[2]; | ||
| ib2 += bbuf[1] + bbuf[3]; |
There was a problem hiding this comment.
@terfendail Could you suggest something here to avoid temporary buffer?
There was a problem hiding this comment.
It could be done with
v_float32x4 qf0, qf1;
v_recombine(v_interleave_pairs(qb0 + qb1), v_setzero_f32(), qf0, qf1);
ib1 += v_reduce_sum(qf0);
ib2 += v_reduce_sum(qf1);
modules/video/src/lkpyramid.cpp
Outdated
| v_store(dIptr, v00); | ||
|
|
||
| t0 = v_reinterpret_as_s32(v00) >> 16; // Iy0 Iy1 Iy2 Iy3 | ||
| t1 = v_reinterpret_as_s32(v_reinterpret_as_u32(v00) << 16) >> 16; // Ix0 Ix1 Ix2 Ix3 |
There was a problem hiding this comment.
It is also deinterleaving the Ix and Iy components.
modules/video/src/lkpyramid.cpp
Outdated
| __m128 qA11 = _mm_setzero_ps(), qA12 = _mm_setzero_ps(), qA22 = _mm_setzero_ps(); | ||
| #if CV_SSE2 || CV_VSX | ||
| v_int32x4 qw0 = v_setall_s32(iw00 + (iw01 << 16)); | ||
| v_int32x4 qw1 = v_setall_s32(iw10 + (iw11 << 16)); |
There was a problem hiding this comment.
Looks like this part of code assumes Little-endian platforms only.
And there are a lot of v_reinterpret_as_s16(qw0) statements below.
There was a problem hiding this comment.
Changing to v_int16x8 with interleaved data.
modules/video/src/lkpyramid.cpp
Outdated
| __m128i qdelta_d = _mm_set1_epi32(1 << (W_BITS1-1)); | ||
| __m128i qdelta = _mm_set1_epi32(1 << (W_BITS1-5-1)); | ||
| __m128 qA11 = _mm_setzero_ps(), qA12 = _mm_setzero_ps(), qA22 = _mm_setzero_ps(); | ||
| #if CV_SSE2 || CV_VSX |
There was a problem hiding this comment.
Could you please use #if CV_SIMD128 && !CV_NEON here and below
| v_int32x4 t0, t1; | ||
| v_int16x8 v00, v01, v10, v11, t00, t01, t10, t11; | ||
|
|
||
| v00 = v_reinterpret_as_s16(v_load_expand(src + x)); |
There was a problem hiding this comment.
There could be out of bound reading on the last iteration of the cycle. Probably it make sense to unroll the cycle to simultaneous processing of 8 elements.
There was a problem hiding this comment.
Unrolled to 8 elements.
modules/video/src/lkpyramid.cpp
Outdated
| v_zip(v10, v11, t10, t11); | ||
|
|
||
| t0 = v_dotprod(t00, qw0, qdelta_d) + v_dotprod(t10, qw1); | ||
| t1 = v_dotprod(t01, v_reinterpret_as_s16(qw0), qdelta_d) + |
There was a problem hiding this comment.
v_reinterpret_as_s16 is unnecessary here
…ince we've already loaded the data.
| ib2 += bbuf[1] + bbuf[3]; | ||
| #if CV_SIMD128 && !CV_NEON | ||
| v_float32x4 qf0, qf1; | ||
| v_recombine(v_interleave_pairs(qb0 + qb1), v_setzero_f32(), qf0, qf1); |
There was a problem hiding this comment.
IMO it would be better to use v_recombine(v_interleave_pairs(qb0), v_interleave_pairs(qb1), qf0, qf1);. However I don't think it will essentially affect the performance.
There was a problem hiding this comment.
I don't think that is exactly the same. It's missing the 'add'.
There was a problem hiding this comment.
Addition will be performed as a part of following v_reduce_sum
There was a problem hiding this comment.
I didn't factor in the lines below. You are correct.
Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX). Didn't convert NEON since it looks different from the SSE2.