
Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX). #15274

Merged: alalek merged 4 commits into opencv:3.4 from ChipKerchner:lkpyramidToHal on Aug 28, 2019
Conversation

@ChipKerchner (Contributor):

Convert lkpyramid from SSE SIMD to HAL - 90% faster on Power (VSX). I didn't convert the NEON path since it is structured differently from the SSE2 one.

v_store_aligned(A11buf, qA11);
v_store_aligned(A12buf, qA12);
v_store_aligned(A22buf, qA22);
iA11 += A11buf[0] + A11buf[1] + A11buf[2] + A11buf[3];

Member:

Probably we should use v_reduce_sum() here instead of a temporary buffer.

_mm_store_ps(bbuf, _mm_add_ps(qb0, qb1));
v_store_aligned(bbuf, qb0 + qb1);
ib1 += bbuf[0] + bbuf[2];
ib2 += bbuf[1] + bbuf[3];

Member:

@terfendail Could you suggest something here to avoid temporary buffer?

@ChipKerchner (Author):

It could be done with

v_float32x4 qf0, qf1;
v_recombine(v_interleave_pairs(qb0 + qb1), v_setzero_f32(), qf0, qf1);
ib1 += v_reduce_sum(qf0);
ib2 += v_reduce_sum(qf1);

v_store(dIptr, v00);

t0 = v_reinterpret_as_s32(v00) >> 16; // Iy0 Iy1 Iy2 Iy3
t1 = v_reinterpret_as_s32(v_reinterpret_as_u32(v00) << 16) >> 16; // Ix0 Ix1 Ix2 Ix3

@ChipKerchner (Author):

It is also deinterleaving the Ix and Iy components.

__m128 qA11 = _mm_setzero_ps(), qA12 = _mm_setzero_ps(), qA22 = _mm_setzero_ps();
#if CV_SSE2 || CV_VSX
v_int32x4 qw0 = v_setall_s32(iw00 + (iw01 << 16));
v_int32x4 qw1 = v_setall_s32(iw10 + (iw11 << 16));

Member:


Looks like this part of code assumes Little-endian platforms only.

And there are a lot of v_reinterpret_as_s16(qw0) statements below.

@ChipKerchner (Author):

Changing to v_int16x8 with interleaved data.

__m128i qdelta_d = _mm_set1_epi32(1 << (W_BITS1-1));
__m128i qdelta = _mm_set1_epi32(1 << (W_BITS1-5-1));
__m128 qA11 = _mm_setzero_ps(), qA12 = _mm_setzero_ps(), qA22 = _mm_setzero_ps();
#if CV_SSE2 || CV_VSX

Contributor:


Could you please use #if CV_SIMD128 && !CV_NEON here and below?

v_int32x4 t0, t1;
v_int16x8 v00, v01, v10, v11, t00, t01, t10, t11;

v00 = v_reinterpret_as_s16(v_load_expand(src + x));

Contributor:


There could be an out-of-bounds read on the last iteration of the loop. It probably makes sense to unroll the loop to process 8 elements at a time.

@ChipKerchner (Author):

Unrolled to 8 elements.

v_zip(v10, v11, t10, t11);

t0 = v_dotprod(t00, qw0, qdelta_d) + v_dotprod(t10, qw1);
t1 = v_dotprod(t01, v_reinterpret_as_s16(qw0), qdelta_d) +

Contributor:


v_reinterpret_as_s16 is unnecessary here

ib2 += bbuf[1] + bbuf[3];
#if CV_SIMD128 && !CV_NEON
v_float32x4 qf0, qf1;
v_recombine(v_interleave_pairs(qb0 + qb1), v_setzero_f32(), qf0, qf1);

Contributor:


IMO it would be better to use v_recombine(v_interleave_pairs(qb0), v_interleave_pairs(qb1), qf0, qf1);. However, I don't think it will materially affect performance.

@ChipKerchner (Author):

I don't think that is exactly the same. It's missing the 'add'.

Contributor:

The addition will be performed as part of the following v_reduce_sum.

@ChipKerchner (Author):

I didn't factor in the lines below. You are correct.

alalek merged commit 30a60d3 into opencv:3.4 on Aug 28, 2019
alalek mentioned this pull request on Aug 30, 2019
ChipKerchner deleted the lkpyramidToHal branch on September 3, 2019