Improving VSX performance of integral function#15494
Conversation
|
I think it is better to implement |
|
I don't know if that's really a good idea, the implementation enhances performance for Power9 but I'm not really confident it would do the same for other platforms. Although I could try to simply implement 'get' with the shift approach summ_pixels was using. What do you think? |
|
Yes, I meant this. If it there is native intrinsic for get - we should use it, otherwise rotate and get0. One thing is, this method should be template, because v_rotate is template. |
|
I tried to follow intrin_cpp's implementation approach just to keep consistency. Don't you think this is a better approach? I kinda fear implementing a template version would break consistency a bit. |
|
We can try this, but I doubt it will work with v_rotate, because underlying x86 and NEON intrinsics accept immediate value which must be known at compile time. I've enabled ARM builders for this PR, so you can try to add a commit and see if it works with CI builds (https://pullrequest.opencv.org/#/summary/). |
|
BTW, Merge conflicts should be fixed through rebasing commits onto fresh upstream HEAD commit. It is nice to squash commits with rebasing. |
|
I'm planning to squash as soon as the tests are ok. |
82e6227 to
eee079c
Compare
modules/imgproc/src/sumpixels.cpp
Outdated
| el4h += el4l; | ||
| prev = vx_setall_s32(v_rotate_right<v_int32::nlanes - 1>(el4h).get0()); | ||
|
|
||
| prev = vx_setall_s32(v_extract_n<v_int32::nlanes - 1>(el4h)); |
There was a problem hiding this comment.
Probably more specific intrinsic fits better in this case:
- something like
v_broadcast_element<v_int32::nlanes - 1>(el4h) - can be implemented through permutation/shuffle
There was a problem hiding this comment.
You mean v_broadcast_element would grab the nth-element and return a vector full with copies of it?
There was a problem hiding this comment.
Yes.
Plain C++ code:
/** Broadcast i-th element of vector
Scheme:
@code
{ v[0] v[1] v[2] ... v[SZ] } => { v[i], v[i], v[i] ... v[i] }
@endcode
Restriction: 0 <= i < nlanes (supported: TBD)
Supported types: TBD (lets start from `int32_t` - make sense after revieving of real intrinsic support)
*/
template<int i, typename _Tp, int n>
inline v_reg<_Tp, n> v_broadcast_element(const v_reg<_Tp, n>& a)
{
return v_reg<_Tp, n>::all(a.s[i]);
}
Do you see any objections?
eee079c to
a3249cf
Compare
integral function gains a bit of performance.
instruction v_extract_n to get the n-th element of a vector register.
57b535a to
e1aa8df
Compare
e1aa8df to
242688d
Compare
- updated docs - commented out code to repair compilation - added WASM and MSA default implementations
|
Thank you for update! I pushed commit with adding corresponding tests for new intrinsics. Some code compilation is failed on Linux x64 (fixed / commented out), Windows (
|
- x86: avoid _mm256_extract_epi64/32/16/8 with MSVS 2015 - x86: _mm_extract_epi64 is 64-bit only
| template<int i> | ||
| inline v_uint32x8 v_broadcast_element(v_uint32x8 v) | ||
| { | ||
| return v_uint32x8(_mm256_shuffle_epi32(v.val, _MM_SHUFFLE(i,i,i,i))); |
There was a problem hiding this comment.
I will remove commented out code in a separate commit tomorrow. Code is still available in the history of this PR.
@everton1984 Do you have any objections? (or time to update these extended cases too)
There was a problem hiding this comment.
@alalek Hi! Sorry.it took me so long to answer, I was on holiday. By extended cases you mean WASM/MSA?
There was a problem hiding this comment.
I mean v_broadcast_element for types other than int32/uint32/float32 (see commented out code).
It is not about WASM/MSA.
There was a problem hiding this comment.
@alalek ok let me try to tackle it a bit, I'm not an Intel expert as you might have noticed :) but since there are tests written I might be able to explore. I'll post something as soon as I have it working.
There was a problem hiding this comment.
These "other" cases are not tested or used, so we can remove these implementations if you don't have objections.
There was a problem hiding this comment.
TBH I don't see a problem, although I can see the need for at least a 16-bit version if someone wants to tackle vectorising the integral function for other datatypes in the future.
| OPENCV_HAL_IMPL_VSX_TRANSPOSE4x4(v_float32x4, vec_float4) | ||
|
|
||
| template<int i> | ||
| inline v_int8x16 v_broadcast_element(v_int8x16 v) |
There was a problem hiding this comment.
I know its too late, but GCC will not able to optimize such permute like this one into vsplt[b,h,w] instructions, intrinsic vec_splat() must be used instead. in case of use vec_splat() with double word vector, the compiler will use the instruction xxpermdi A.K.A vec_permi() which is far way better than vperm.
There was a problem hiding this comment.
Feel free to propose patch for this.
There was a problem hiding this comment.
@alalek I think I need to create another PR for this right?
There was a problem hiding this comment.
oh never mind, @seiko2plus already fixed it. Thanks.
This pullrequest is adding support for the get function described in intrin_cpp.hpp line 322 for VSX and adapting the integral HAL implementations to use it. The way the integral functions are accessing a single element from the vector register require 3 instructions on VSX 2 of which are inside the loop. This new implementation requires only a single instruction.