Improving VSX performance of integral function by everton1984 · Pull Request #15494 · opencv/opencv

everton1984 · 2019-09-10T11:49:59Z

This pull request adds support for the get function described in intrin_cpp.hpp (line 322) for VSX and adapts the integral HAL implementations to use it. The way the integral functions currently access a single element of a vector register requires 3 instructions on VSX, 2 of which are inside the loop. The new implementation requires only a single instruction.

force_builders=Linux AVX2,armv7,armv8,Custom,Win32,Linux32
buildworker:Custom=linux-1,linux-2,linux-4
docker_image:Custom=powerpc64le

build_image:Custom Win=msvs2017

#buildworker:Custom=linux-3
#build_image:Custom=ubuntu:18.04
#CPU_BASELINE:Custom=AVX512_SKX
#disable_ipp=ON

#buildworker:Custom=linux-1
#build_image:Custom=mips64el
#build_image:Custom=javascript-simd

mshabunin · 2019-09-16T14:35:39Z

I think it is better to implement the get() method and corresponding tests for all intrinsics variants (x86, NEON, ...) and remove the CV_VSX macros from the code.

everton1984 · 2019-09-16T18:38:58Z

I don't know if that's really a good idea. The implementation improves performance on Power9, but I'm not confident it would do the same on other platforms. I could, though, simply implement 'get' with the shift approach summ_pixels was using. What do you think?

mshabunin · 2019-09-16T19:53:25Z

Yes, that is what I meant. If there is a native intrinsic for get, we should use it; otherwise rotate and get0. One caveat: this method should be a template, because v_rotate is a template.
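The fallback described here (rotate the wanted lane into position 0, then read lane 0) can be sketched in plain C++. The `vec4` type and the `rotate_right` and `extract_n` helpers below are hypothetical stand-ins for illustration, not OpenCV's universal intrinsics; the index is a template parameter because, as noted above, the rotate it builds on is a template.

```cpp
#include <array>

// Hypothetical 4-lane register model; not an OpenCV type.
struct vec4 {
    std::array<int, 4> s;
    int get0() const { return s[0]; }   // read lane 0
};

// Shift lanes toward index 0 by `imm` positions.
// Zero fill of the vacated lanes is this model's choice.
// The shift is a template parameter, mirroring an instruction immediate.
template<int imm>
inline vec4 rotate_right(const vec4& a)
{
    static_assert(0 <= imm && imm < 4, "shift out of range");
    vec4 r{};
    for (int k = 0; k + imm < 4; ++k)
        r.s[k] = a.s[k + imm];
    return r;
}

// Fallback extract: rotate lane i into position 0, then read it.
template<int i>
inline int extract_n(const vec4& a)
{
    return rotate_right<i>(a).get0();
}
```

A backend with a native single-lane read would specialize `extract_n` to use it and skip the rotate entirely.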

everton1984 · 2019-09-16T20:04:31Z

I tried to follow intrin_cpp's implementation approach

    _Tp get(const int i) const { return s[i]; }

just to keep consistency. Don't you think this is a better approach? I kinda fear implementing a template version would break consistency a bit.

mshabunin · 2019-09-16T20:57:24Z

We can try this, but I doubt it will work with v_rotate, because the underlying x86 and NEON intrinsics accept an immediate value, which must be known at compile time.

I've enabled ARM builders for this PR, so you can try to add a commit and see if it works with CI builds (https://pullrequest.opencv.org/#/summary/).
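To illustrate the compile-time constraint mentioned above: a runtime index cannot be handed to an operation that takes its lane as an immediate, so a runtime `get(i)` would need one branch per lane. The sketch below uses hypothetical names (`vec4`, `extract_imm`, `get_runtime`) and plain C++ in place of real intrinsics.

```cpp
#include <array>
#include <stdexcept>

struct vec4 { std::array<int, 4> s; };  // hypothetical register model

// Stand-in for an intrinsic whose lane index is an immediate:
// `i` must be a compile-time constant.
template<int i>
inline int extract_imm(const vec4& a)
{
    static_assert(0 <= i && i < 4, "lane index out of range");
    return a.s[i];
}

// A runtime index cannot be passed as a template argument, so the only
// way to reach extract_imm is one branch per lane.
inline int get_runtime(const vec4& a, int i)
{
    switch (i) {
    case 0: return extract_imm<0>(a);
    case 1: return extract_imm<1>(a);
    case 2: return extract_imm<2>(a);
    case 3: return extract_imm<3>(a);
    default: throw std::out_of_range("lane index out of range");
    }
}
```

With a template index the branch disappears, which is the reason a template interface suits immediate-based backends better.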

modules/core/include/opencv2/core/hal/intrin_vsx.hpp

alalek · 2019-10-14T15:32:26Z

BTW, merge conflicts should be fixed by rebasing the commits onto the fresh upstream HEAD commit. It is also nice to squash commits while rebasing.

everton1984 · 2019-10-14T15:49:18Z

I'm planning to squash as soon as the tests are ok.

alalek · 2019-10-18T21:15:31Z

modules/imgproc/src/sumpixels.cpp

                el4h += el4l;
-                prev = vx_setall_s32(v_rotate_right<v_int32::nlanes - 1>(el4h).get0());
+
+                prev = vx_setall_s32(v_extract_n<v_int32::nlanes - 1>(el4h));


Probably a more specific intrinsic fits better in this case: something like v_broadcast_element<v_int32::nlanes - 1>(el4h), which can be implemented through permutation/shuffle.

You mean v_broadcast_element would grab the n-th element and return a vector filled with copies of it?

Yes.

Plain C++ code:

    /** Broadcast i-th element of vector
    Scheme:
    @code
    { v[0] v[1] v[2] ... v[SZ] } => { v[i], v[i], v[i] ... v[i] }
    @endcode
    Restriction: 0 <= i < nlanes (supported: TBD)
    Supported types: TBD (let's start from `int32_t` - makes sense after reviewing of real intrinsic support)
    */
    template<int i, typename _Tp, int n>
    inline v_reg<_Tp, n> v_broadcast_element(const v_reg<_Tp, n>& a)
    {
        return v_reg<_Tp, n>::all(a.s[i]);
    }

Do you see any objections?
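How a broadcast of the last lane would serve a row prefix-sum loop like the one in the diff can be sketched without SIMD. In the sketch below, `vec4`, `broadcast_element`, and `row_prefix_sum` are hypothetical names, and the in-register scan is written as a scalar loop; it is an illustration under those assumptions, not the code in `sumpixels.cpp`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct vec4 { std::array<int, 4> s; };  // hypothetical 4-lane register model

// Copy lane i into every lane (plain C++ model of a broadcast).
template<int i>
inline vec4 broadcast_element(const vec4& a)
{
    static_assert(0 <= i && i < 4, "lane index out of range");
    return vec4{{a.s[i], a.s[i], a.s[i], a.s[i]}};
}

// Inclusive prefix sum of a row, processed 4 elements at a time.
// `prev` carries the running total of all earlier chunks in every lane.
inline std::vector<int> row_prefix_sum(const std::vector<int>& src)
{
    assert(src.size() % 4 == 0);  // sketch only handles whole chunks
    std::vector<int> dst(src.size());
    vec4 prev{{0, 0, 0, 0}};
    for (std::size_t x = 0; x < src.size(); x += 4) {
        vec4 el{{src[x], src[x + 1], src[x + 2], src[x + 3]}};
        // Scan within the chunk (a SIMD version would typically use shifts and adds).
        for (int k = 1; k < 4; ++k)
            el.s[k] += el.s[k - 1];
        // Add the carry from earlier chunks to every lane and store.
        for (int k = 0; k < 4; ++k) {
            el.s[k] += prev.s[k];
            dst[x + k] = el.s[k];
        }
        // The last lane now holds the running total; broadcast it as the next carry.
        prev = broadcast_element<3>(el);
    }
    return dst;
}
```

The broadcast replaces the two-step "extract the last lane, then set all lanes" sequence with a single operation, which is the point of the suggestion.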

alalek · 2019-11-17T00:07:47Z

Thank you for update!

I pushed commit with adding corresponding tests for new intrinsics.

Compilation of some code failed on Linux x64 (fixed / commented out), Windows (_mm256_extract_epi64 issue), and the WASM/MSA backends (added stubs).

_mm256_extract_epi64 issue (Windows) is MSVS 2015 specific. MSVS 2019 build is fine.
_mm_extract_epi64 (Linux32): available on 64-bit mode only (GCC bug)
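One way to cope with an intrinsic that exists only in 64-bit mode is a preprocessor guard with a portable fallback. The sketch below is an assumption about how such a guard could look, not necessarily how this PR's fix is written; `extract_i64` is a hypothetical helper, and the intrinsic path is taken only when the compiler reports SSE4.1 and a 64-bit x86 target.

```cpp
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1 intrinsics
#endif

// Hypothetical helper: read 64-bit lane i of a 128-bit value held in memory.
template<int i>
inline std::int64_t extract_i64(const std::int64_t (&lanes)[2])
{
    static_assert(0 <= i && i < 2, "lane index out of range");
#if defined(__SSE4_1__) && (defined(__x86_64__) || defined(_M_X64))
    // 64-bit x86 with SSE4.1: the native extract is available.
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lanes));
    return _mm_extract_epi64(v, i);
#else
    // 32-bit builds and other targets: plain memory read.
    return lanes[i];
#endif
}
```

Both paths return the same value, so callers do not need to know which one was compiled in.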

alalek · 2019-11-18T14:31:01Z

modules/core/include/opencv2/core/hal/intrin_avx.hpp

+template<int i>
+inline v_uint32x8 v_broadcast_element(v_uint32x8 v)
+{
+    return v_uint32x8(_mm256_shuffle_epi32(v.val, _MM_SHUFFLE(i,i,i,i)));


I will remove commented out code in a separate commit tomorrow. Code is still available in the history of this PR.
@everton1984 Do you have any objections? (or time to update these extended cases too)

@alalek Hi! Sorry it took me so long to answer, I was on holiday. By extended cases, do you mean WASM/MSA?

I mean v_broadcast_element for types other than int32/uint32/float32 (see commented out code).

It is not about WASM/MSA.

@alalek ok let me try to tackle it a bit, I'm not an Intel expert as you might have noticed :) but since there are tests written I might be able to explore. I'll post something as soon as I have it working.

These "other" cases are not tested or used, so we can remove these implementations if you don't have objections.

TBH I don't see a problem, although I can see the need for at least a 16-bit version if someone wants to tackle vectorising the integral function for other datatypes in the future.

alalek

Well done! Thank you 👍

seiko2plus · 2020-03-14T10:17:38Z

modules/core/include/opencv2/core/hal/intrin_vsx.hpp

 OPENCV_HAL_IMPL_VSX_TRANSPOSE4x4(v_float32x4, vec_float4)

+template<int i>
+inline v_int8x16 v_broadcast_element(v_int8x16 v)


I know it's too late, but GCC will not be able to optimize a permute like this one into vsplt[b,h,w] instructions; the intrinsic vec_splat() must be used instead. When vec_splat() is used with a doubleword vector, the compiler will use the xxpermdi instruction (a.k.a. vec_permi()), which is far better than vperm.

Feel free to propose a patch for this.

@alalek I think I need to create another PR for this, right?

oh never mind, @seiko2plus already fixed it. Thanks.

alalek reviewed Sep 18, 2019

modules/core/include/opencv2/core/hal/intrin_vsx.hpp (outdated)

everton1984 force-pushed the hal_vector_get_n branch 2 times, most recently from 82e6227 to eee079c on October 17, 2019 14:02

alalek reviewed Oct 18, 2019

everton1984 force-pushed the hal_vector_get_n branch from eee079c to a3249cf on October 29, 2019 15:32

Everton Constantino added 3 commits November 14, 2019 11:41

Adding support for vector get function on VSX datatypes so the integral function gains a bit of performance.

1f1a54b

Removing get as a datatype member function and implementing a new HAL instruction v_extract_n to get the n-th element of a vector register.

d523383

Adding SSE/NEON/AVX intrinsics.

e597402

everton1984 force-pushed the hal_vector_get_n branch 3 times, most recently from 57b535a to e1aa8df on November 14, 2019 15:10

Implement new HAL instruction v_broadcast_element on VSX/AVX/NEON/SSE.

242688d

everton1984 force-pushed the hal_vector_get_n branch from e1aa8df to 242688d on November 14, 2019 15:18

core(simd): add tests for v_extract_n/v_broadcast_element

3bd70c0

- updated docs
- commented out code to repair compilation
- added WASM and MSA default implementations

core(simd): fix compilation

0479b41

- x86: avoid _mm256_extract_epi64/32/16/8 with MSVS 2015
- x86: _mm_extract_epi64 is 64-bit only

alalek reviewed Nov 18, 2019

cleanup

51be521

alalek approved these changes Nov 20, 2019

alalek merged commit 75315fb into opencv:3.4 Nov 20, 2019

alalek mentioned this pull request Nov 22, 2019

Merge 3.4 #15974

Merged

seiko2plus reviewed Mar 14, 2020

seiko2plus mentioned this pull request Mar 14, 2020

core:vsx reimplement v_broadcast_element() #16812

Merged

fengyuentau mentioned this pull request Apr 23, 2024

core: add universal intrinsics for fp16 #25196

Merged
