
Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX). #15274

Merged: alalek merged 4 commits into opencv:3.4 from ChipKerchner:lkpyramidToHal on Aug 28, 2019
Conversation

@ChipKerchner (Contributor):

Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX). I didn't convert the NEON path since it is structured differently from the SSE2 one.

v_store_aligned(A11buf, qA11);
v_store_aligned(A12buf, qA12);
v_store_aligned(A22buf, qA22);
iA11 += A11buf[0] + A11buf[1] + A11buf[2] + A11buf[3];

Member:

Probably we should use v_reduce_sum() here instead of a temporary buffer.

_mm_store_ps(bbuf, _mm_add_ps(qb0, qb1));
v_store_aligned(bbuf, qb0 + qb1);
ib1 += bbuf[0] + bbuf[2];
ib2 += bbuf[1] + bbuf[3];

Member:

@terfendail Could you suggest something here to avoid temporary buffer?

@ChipKerchner (Author):

It could be done with

v_float32x4 qf0, qf1;
v_recombine(v_interleave_pairs(qb0 + qb1), v_setzero_f32(), qf0, qf1);
ib1 += v_reduce_sum(qf0);
ib2 += v_reduce_sum(qf1);

v_store(dIptr, v00);

t0 = v_reinterpret_as_s32(v00) >> 16; // Iy0 Iy1 Iy2 Iy3
t1 = v_reinterpret_as_s32(v_reinterpret_as_u32(v00) << 16) >> 16; // Ix0 Ix1 Ix2 Ix3

@ChipKerchner (Author):

It is also deinterleaving the Ix and Iy components.

__m128 qA11 = _mm_setzero_ps(), qA12 = _mm_setzero_ps(), qA22 = _mm_setzero_ps();
#if CV_SSE2 || CV_VSX
v_int32x4 qw0 = v_setall_s32(iw00 + (iw01 << 16));
v_int32x4 qw1 = v_setall_s32(iw10 + (iw11 << 16));

Member:


Looks like this part of code assumes Little-endian platforms only.

And there are a lot of v_reinterpret_as_s16(qw0) statements below.

@ChipKerchner (Author):

Changing to v_int16x8 with interleaved data.

__m128i qdelta_d = _mm_set1_epi32(1 << (W_BITS1-1));
__m128i qdelta = _mm_set1_epi32(1 << (W_BITS1-5-1));
__m128 qA11 = _mm_setzero_ps(), qA12 = _mm_setzero_ps(), qA22 = _mm_setzero_ps();
#if CV_SSE2 || CV_VSX

Contributor:


Could you please use #if CV_SIMD128 && !CV_NEON here and below?

v_int32x4 t0, t1;
v_int16x8 v00, v01, v10, v11, t00, t01, t10, t11;

v00 = v_reinterpret_as_s16(v_load_expand(src + x));

Contributor:


There could be an out-of-bounds read on the last iteration of the loop. It probably makes sense to unroll the loop to process 8 elements at a time.

@ChipKerchner (Author):

Unrolled to 8 elements.

v_zip(v10, v11, t10, t11);

t0 = v_dotprod(t00, qw0, qdelta_d) + v_dotprod(t10, qw1);
t1 = v_dotprod(t01, v_reinterpret_as_s16(qw0), qdelta_d) +

Contributor:


v_reinterpret_as_s16 is unnecessary here

ib2 += bbuf[1] + bbuf[3];
#if CV_SIMD128 && !CV_NEON
v_float32x4 qf0, qf1;
v_recombine(v_interleave_pairs(qb0 + qb1), v_setzero_f32(), qf0, qf1);

Contributor:


IMO it would be better to use v_recombine(v_interleave_pairs(qb0), v_interleave_pairs(qb1), qf0, qf1);. However, I don't think it will materially affect performance.

@ChipKerchner (Author):

I don't think that is exactly the same. It's missing the 'add'.

Contributor:

The addition will be performed as part of the following v_reduce_sum.

@ChipKerchner (Author):

I didn't factor in the lines below. You are correct.

alalek merged commit 30a60d3 into opencv:3.4 on Aug 28, 2019
alalek mentioned this pull request on Aug 30, 2019
ChipKerchner deleted the lkpyramidToHal branch on September 3, 2019