Conversation
Force-pushed: 5abee9b → 86f21d4, then 86f21d4 → d26e6da
sivanov-work left a comment:
LGTM: added minor comments
```cpp
float* sc = scratch.OutLine<float>();

for (int i = 0; i < scratch.length(); ++i)
    sc[i] = static_cast<float>(_scalar[i % chan]);
```
Naive question: since the scratch size is known at compile time, is it possible to unroll the loop for that fixed size?
In case the compiler doesn't unroll it automatically...
It is not entirely clear yet. Could you explain in more detail?
Sure:
The scratch buffer size depends on a macro definition here: https://github.com/opencv/opencv/pull/21158/files#diff-fe0ac6b5a07c1bec59c3756100eb186c9117a82dd1d951ef6997b8423a89943dR1505
We have a constant buffer size for both the CV_SIMD on and off cases, so it is possible to introduce a macro branch on CV_SIMD here and unroll the loop manually, along the lines of:

```cpp
#ifdef CV_SIMD
sc[0] = _scalar[0];
sc[1] = _scalar[1];
...<16 + 2 times>
#else
```

On the other hand, the C++ compiler can do this loop unrolling itself, often in a more optimized way. But I don't know the details of how that applies here: would the compiler unroll the loop or not, given that scratch.length() is a runtime value rather than a compile-time constant?
So my point is: we could simply replace the dynamic scratch.length() with the constant from the macro where the scratch buffer size is determined, and trust the C++ compiler to unroll or vectorize the loop.
P.S. This comment may (or may not) be applicable ONLY if this code is on a performance-critical path.
It seems to me there is no sense in doing this for the current case, because this loop runs only once, when the function is called for the first line of the image being processed. And the compiler usually handles this well by itself.
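To make the trade-off discussed above concrete, here is a minimal sketch (not the PR's actual code; `kScratchLen`, `kChannels`, and `broadcast_scalar` are hypothetical names) of the same broadcast loop written against a compile-time constant bound, which leaves the compiler free to fully unroll or vectorize it:

```cpp
#include <cstddef>

// Assumed compile-time constants for illustration: a scratch buffer of
// maxNlanes (16) + 2 floats, filled from a 4-channel scalar.
constexpr int kScratchLen = 18;
constexpr int kChannels = 4;

// With kScratchLen known at compile time, the compiler can unroll or
// vectorize this loop; with a runtime scratch.length() it may not.
void broadcast_scalar(const double* scalar, float* sc) {
    for (int i = 0; i < kScratchLen; ++i)
        sc[i] = static_cast<float>(scalar[i % kChannels]);
}
```

Whether this actually beats the runtime-bound version depends on the compiler and flags, which is why measuring on the critical path first, as the reply below suggests, is reasonable.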
```cpp
#undef MUL_SIMD

struct add_op {};
```
A simple find on the page doesn't return anything about this structure being used: where is it used?
```cpp
{
#if CV_SIMD
    // 512 bits / 32 bits: an AVX-512 SIMD vector can contain 16 float32 elements.
    constexpr int maxNlanes = 16;
```
Is it the same initialization as the one used for AddC? If yes, then maybe we need a single common function to reuse in both?
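The reviewer's suggestion can be sketched as follows; all names here (`initScratchScalar`, `setupAddC`, `setupSubC`, `kMaxNlanes`) are illustrative assumptions, not the PR's actual API:

```cpp
// Assumed upper bound from the discussion: a 512-bit AVX-512 vector
// holds 16 float32 lanes.
constexpr int kMaxNlanes = 16;

// One shared helper for filling the float scratch buffer from the
// scalar, reusable by both the AddC and SubC setup paths.
void initScratchScalar(const double* scalar, int chan, float* sc, int len) {
    for (int i = 0; i < len; ++i)
        sc[i] = static_cast<float>(scalar[i % chan]);
}

// Both kernels' setup paths would then delegate to the same helper.
void setupAddC(const double* scalar, int chan, float* sc, int len) {
    initScratchScalar(scalar, chan, sc, len);
}
void setupSubC(const double* scalar, int chan, float* sc, int len) {
    initScratchScalar(scalar, chan, sc, len);
}
```

Factoring the initialization out keeps the two kernels' scratch layouts in sync by construction rather than by copy-paste.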
```diff
 typename std::enable_if<(std::is_same<DST, ushort>::value ||
                          std::is_same<DST, short>::value), void>::type
-addc_simd_common_impl(const SRC* inx, DST* outx, const v_float32& sc, const int nlanes)
+arithmOpScalar_simd_common_impl(op_tag t, const SRC* inx, DST* outx, const v_float32& sc, const int nlanes)
```
👍🏻
Now the code for sum + sub is unified?
What do you think about renaming op to something less domain-specific, e.g. operation or op/operation_dispatcher? There are too many short-named variables here, and function names stay hidden among them.
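For readers unfamiliar with the pattern behind the rename above, here is a minimal tag-dispatch sketch; the names (`add_tag`, `sub_tag`, `apply_op`, `arithmOpScalar_impl`) and the scalar float loop are illustrative assumptions, not the PR's SIMD code:

```cpp
// Empty tag types select the arithmetic operation at compile time, so
// one template body can serve both the AddC and SubC kernels.
struct add_tag {};
struct sub_tag {};

inline float apply_op(add_tag, float a, float b) { return a + b; }
inline float apply_op(sub_tag, float a, float b) { return a - b; }

template <typename OpTag>
void arithmOpScalar_impl(OpTag t, const float* in, float* out,
                         float sc, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = apply_op(t, in[i], sc);  // overload resolved at compile time
}
```

The tag parameter carries no data, so there is no runtime dispatch cost; the overload is resolved during template instantiation.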
@alalek CI checks passed and the PR was approved. Could you please merge it?
* GAPI Fluid: SIMD for SubC kernel.
* Applied comments
SIMD for GAPI Fluid SubC kernel.
Performance report:
FluidSubCSIMD.xlsx