Skip to content

Improving VSX performance of integral function#15494

Merged
alalek merged 7 commits intoopencv:3.4from
everton1984:hal_vector_get_n
Nov 20, 2019
Merged

Improving VSX performance of integral function#15494
alalek merged 7 commits intoopencv:3.4from
everton1984:hal_vector_get_n

Conversation

@everton1984
Copy link
Copy Markdown
Contributor

@everton1984 everton1984 commented Sep 10, 2019

This pullrequest is adding support for the get function described in intrin_cpp.hpp line 322 for VSX and adapting the integral HAL implementations to use it. The way the integral functions are accessing a single element from the vector register require 3 instructions on VSX 2 of which are inside the loop. This new implementation requires only a single instruction.

force_builders=Linux AVX2,armv7,armv8,Custom,Win32,Linux32
buildworker:Custom=linux-1,linux-2,linux-4
docker_image:Custom=powerpc64le

build_image:Custom Win=msvs2017

#buildworker:Custom=linux-3
#build_image:Custom=ubuntu:18.04
#CPU_BASELINE:Custom=AVX512_SKX
#disable_ipp=ON

#buildworker:Custom=linux-1
#build_image:Custom=mips64el
#build_image:Custom=javascript-simd

@mshabunin
Copy link
Copy Markdown
Contributor

I think it is better to implement get() method and corresponding tests for all intrinsics variants (x86, NEON, ...) and remove CV_VSX macros in the code.

@everton1984
Copy link
Copy Markdown
Contributor Author

I don't know if that's really a good idea, the implementation enhances performance for Power9 but I'm not really confident it would do the same for other platforms. Although I could try to simply implement 'get' with the shift approach summ_pixels was using. What do you think?

@mshabunin
Copy link
Copy Markdown
Contributor

Yes, I meant this. If it there is native intrinsic for get - we should use it, otherwise rotate and get0. One thing is, this method should be template, because v_rotate is template.

@everton1984
Copy link
Copy Markdown
Contributor Author

I tried to follow intrin_cpp's implementation approach

    _Tp get(const int i) const { return s[i]; }

just to keep consistency. Don't you think this is a better approach? I kinda fear implementing a template version would break consistency a bit.

@mshabunin
Copy link
Copy Markdown
Contributor

We can try this, but I doubt it will work with v_rotate, because underlying x86 and NEON intrinsics accept immediate value which must be known at compile time.

I've enabled ARM builders for this PR, so you can try to add a commit and see if it works with CI builds (https://pullrequest.opencv.org/#/summary/).

@alalek
Copy link
Copy Markdown
Member

alalek commented Oct 14, 2019

BTW, Merge conflicts should be fixed through rebasing commits onto fresh upstream HEAD commit. It is nice to squash commits with rebasing.

@everton1984
Copy link
Copy Markdown
Contributor Author

I'm planning to squash as soon as the tests are ok.

@everton1984 everton1984 force-pushed the hal_vector_get_n branch 2 times, most recently from 82e6227 to eee079c Compare October 17, 2019 14:02
el4h += el4l;
prev = vx_setall_s32(v_rotate_right<v_int32::nlanes - 1>(el4h).get0());

prev = vx_setall_s32(v_extract_n<v_int32::nlanes - 1>(el4h));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more specific intrinsic fits better in this case:

  • something like v_broadcast_element<v_int32::nlanes - 1>(el4h)
  • can be implemented through permutation/shuffle

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean v_broadcast_element would grab the nth-element and return a vector full with copies of it?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Plain C++ code:

/** Broadcast i-th element of vector
Scheme:
@code
{ v[0] v[1] v[2] ... v[SZ] } => { v[i], v[i], v[i] ... v[i] }
@endcode
Restriction: 0 <= i < nlanes (supported: TBD)
Supported types: TBD (lets start from `int32_t` - make sense after revieving of real intrinsic support)
 */
template<int i, typename _Tp, int n>
inline v_reg<_Tp, n> v_broadcast_element(const v_reg<_Tp, n>& a)
{
    return v_reg<_Tp, n>::all(a.s[i]);
}

Do you see any objections?

Everton Constantino added 3 commits November 14, 2019 11:41
integral function gains a bit of performance.
instruction v_extract_n to get the n-th element of a vector register.
@everton1984 everton1984 force-pushed the hal_vector_get_n branch 3 times, most recently from 57b535a to e1aa8df Compare November 14, 2019 15:10
- updated docs
- commented out code to repair compilation
- added WASM and MSA default implementations
@alalek
Copy link
Copy Markdown
Member

alalek commented Nov 17, 2019

Thank you for update!

I pushed commit with adding corresponding tests for new intrinsics.

Some code compilation is failed on Linux x64 (fixed / commented out), Windows (_mm256_extract_epi64 issue), WASM/MSA backends (added stubs).


  • _mm256_extract_epi64 issue (Windows) is MSVS 2015 specific. MSVS 2019 build is fine.
  • _mm_extract_epi64 (Linux32): available on 64-bit mode only (GCC bug)

- x86: avoid _mm256_extract_epi64/32/16/8 with MSVS 2015
- x86: _mm_extract_epi64 is 64-bit only
template<int i>
inline v_uint32x8 v_broadcast_element(v_uint32x8 v)
{
return v_uint32x8(_mm256_shuffle_epi32(v.val, _MM_SHUFFLE(i,i,i,i)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove commented out code in a separate commit tomorrow. Code is still available in the history of this PR.
@everton1984 Do you have any objections? (or time to update these extended cases too)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alalek Hi! Sorry.it took me so long to answer, I was on holiday. By extended cases you mean WASM/MSA?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean v_broadcast_element for types other than int32/uint32/float32 (see commented out code).

It is not about WASM/MSA.

Copy link
Copy Markdown
Contributor Author

@everton1984 everton1984 Nov 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alalek ok let me try to tackle it a bit, I'm not an Intel expert as you might have noticed :) but since there are tests written I might be able to explore. I'll post something as soon as I have it working.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These "other" cases are not tested or used, so we can remove these implementations if you don't have objections.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I don't see a problem, although I can see the need for at least a 16-bit version if someone wants to tackle vectorising the integral function for other datatypes in the future.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done! Thank you 👍

@alalek alalek merged commit 75315fb into opencv:3.4 Nov 20, 2019
@alalek alalek mentioned this pull request Nov 22, 2019
OPENCV_HAL_IMPL_VSX_TRANSPOSE4x4(v_float32x4, vec_float4)

template<int i>
inline v_int8x16 v_broadcast_element(v_int8x16 v)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know its too late, but GCC will not able to optimize such permute like this one into vsplt[b,h,w] instructions, intrinsic vec_splat() must be used instead. in case of use vec_splat() with double word vector, the compiler will use the instruction xxpermdi A.K.A vec_permi() which is far way better than vperm.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to propose patch for this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alalek I think I need to create another PR for this right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh never mind, @seiko2plus already fixed it. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants