
GAPI: SIMD optimization for AddWeighted kernel. #18466

Merged
alalek merged 9 commits into opencv:master from anna-khakimova:ak/simd_addw_bitwise
Feb 5, 2021

Conversation

@anna-khakimova
Member

@anna-khakimova anna-khakimova commented Sep 30, 2020

SIMD optimization for AddWeighted kernel via universal intrinsics.

force_builders=Linux AVX2,Custom
disable_ipp:Custom=ON

buildworker:Custom=linux-3
build_image:Custom=ubuntu:18.04
CPU_BASELINE:Custom=AVX512_SKX

Xbuildworker:Custom=linux-1,linux-2,linux-4
Xbuild_image:Custom=powerpc64le

Perf report for AddWeighted:

AddWeighted_avx512_avx2_sse42_perf_report.xlsx
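For context, AddWeighted computes the per-element blend out(x) = saturate(src1(x)*alpha + src2(x)*beta + gamma). A minimal scalar reference sketch for the 8-bit case (hypothetical helper names, not the G-API Fluid code being optimized here):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Round to nearest and clamp to the uchar range, like saturate_cast<uchar>.
static uint8_t saturate_u8(float v) {
    int i = static_cast<int>(std::lround(v));
    return static_cast<uint8_t>(std::min(255, std::max(0, i)));
}

// Scalar reference: out[i] = saturate(in1[i]*alpha + in2[i]*beta + gamma).
static void addw_scalar(const uint8_t* in1, const uint8_t* in2, uint8_t* out,
                        float alpha, float beta, float gamma, int length) {
    for (int i = 0; i < length; ++i)
        out[i] = saturate_u8(in1[i] * alpha + in2[i] * beta + gamma);
}
```

The SIMD versions discussed below produce the same results, but process a full vector of elements per iteration via universal intrinsics.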

@anna-khakimova anna-khakimova changed the title from "AddW and bitwise kernels" to "UI SIMD for AddW and bitwise kernels" on Sep 30, 2020
@anna-khakimova anna-khakimova force-pushed the ak/simd_addw_bitwise branch 8 times, most recently from 4187207 to a1eec7b Compare September 30, 2020 20:16
@anna-khakimova
Member Author

@terfendail, please take a look at the SIMD optimization of the And, Or, and Xor kernels.

@anna-khakimova anna-khakimova changed the title from "UI SIMD for AddW and bitwise kernels" to "Univ Intrinsics SIMD for AddW and bitwise kernels" on Oct 5, 2020
@anna-khakimova anna-khakimova changed the title from "Univ Intrinsics SIMD for AddW and bitwise kernels" to "Uni Intrinsics SIMD for AddW and bitwise kernels" on Oct 5, 2020
@asmorkalov
Contributor

@anna-khakimova @anton-potapov Friendly reminder.

@asmorkalov
Contributor

@anna-khakimova could you rebase the PR and fix conflicts?

@anna-khakimova
Member Author

Hello @asmorkalov
This task has been temporarily suspended.

@anna-khakimova anna-khakimova force-pushed the ak/simd_addw_bitwise branch 3 times, most recently from 26b9ac6 to 1dc3ab2 Compare December 25, 2020 17:14
Contributor

@OrestChura OrestChura left a comment


I can also suggest the same changes for the addw_simd function code, but that is a more complicated question; let's discuss it separately.

@anna-khakimova
Member Author

> I can also suggest the same changes for the addw_simd function code, but that is a more complicated question; let's discuss it separately.

Reworked.

@anna-khakimova
Member Author

@alalek , @terfendail , @rgarnov , @OrestChura please check.

Comment on lines +238 to +242
v_float32 a = vx_setall_f32(_alpha);
v_float32 b = vx_setall_f32(_beta);
v_float32 g = vx_setall_f32(_gamma);

x = addw_simd(in1, in2, out, a, b, g, length);
Member


Why?

I don't know any case where passing SIMD vectors through parameters (into non-inline function) can improve performance.

Member Author

@anna-khakimova anna-khakimova Jan 28, 2021


I wasn't trying to improve performance with the "Refactoring.Step2" commit. I just tried to remove an excess intermediate function, which has the same name, addw_simd(), but looks like this:

template<typename SRC, typename DST>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], DST out[],
                               float _alpha, float _beta, float _gamma, int length)
{
    // Cases where the dst type is float are successfully vectorized by the compiler.
    if (std::is_same<DST, float>::value)
        return 0;

    v_float32 alpha = vx_setall_f32(_alpha);
    v_float32 beta = vx_setall_f32(_beta);
    v_float32 gamma = vx_setall_f32(_gamma);

    if (std::is_same<SRC, ushort>::value && std::is_same<DST, ushort>::value)
    {
        return addw_short2short(reinterpret_cast<const ushort*>(in1), reinterpret_cast<const ushort*>(in2),
                                reinterpret_cast<ushort*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, short>::value && std::is_same<DST, short>::value)
    {
        return addw_short2short(reinterpret_cast<const short*>(in1), reinterpret_cast<const short*>(in2),
                                reinterpret_cast<short*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, short>::value && std::is_same<DST, uchar>::value)
    {
        return addw_short2uchar(reinterpret_cast<const short*>(in1), reinterpret_cast<const short*>(in2),
                                reinterpret_cast<uchar*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, ushort>::value && std::is_same<DST, uchar>::value)
    {
        return addw_short2uchar(reinterpret_cast<const ushort*>(in1), reinterpret_cast<const ushort*>(in2),
                                reinterpret_cast<uchar*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, uchar>::value && std::is_same<DST, uchar>::value)
    {
        constexpr int nlanes = v_uint8::nlanes;

        if (length < nlanes)
            return 0;

        int x = 0;

        for (;;)
        {
            for (; x <= length - nlanes; x += nlanes)
            {
                v_float32 a1 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x]));
                v_float32 a2 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x + nlanes / 4]));
                v_float32 a3 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x + nlanes / 2]));
                v_float32 a4 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x + 3 * nlanes / 4]));
                v_float32 b1 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x]));
                v_float32 b2 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x + nlanes / 4]));
                v_float32 b3 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x + nlanes / 2]));
                v_float32 b4 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x + 3 * nlanes / 4]));

                v_int32 sum1 = v_round(v_fma(a1, alpha, v_fma(b1, beta, gamma))),
                        sum2 = v_round(v_fma(a2, alpha, v_fma(b2, beta, gamma))),
                        sum3 = v_round(v_fma(a3, alpha, v_fma(b3, beta, gamma))),
                        sum4 = v_round(v_fma(a4, alpha, v_fma(b4, beta, gamma)));

                vx_store(reinterpret_cast<uchar*>(&out[x]), v_pack_u(v_pack(sum1, sum2), v_pack(sum3, sum4)));
            }

            if (x < length)
            {
                x = length - nlanes;
                continue;  // process one more time (unaligned tail)
            }
            break;
        }
        return x;
    }
    return 0;
}

Member Author


The functions addw_short2uchar() and addw_short2short() were renamed to addw_simd() and became a template with two specializations. Please don't rely on GitHub's diff; it often shows the difference incorrectly.
Since the intermediate function shown above was removed, initialization of the alpha, beta, and gamma vectors had to be moved into run_addweighted(), one level higher in the call stack.

Member Author


So now addw_simd() looks like this:

template<typename SRC, typename DST>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], DST out[],
                               const v_float32& alpha, const v_float32& beta,
                               const v_float32& gamma, int length)
{
    GAPI_Assert((std::is_same<DST, ushort>::value) ||
                (std::is_same<DST, short>::value));

    constexpr int nlanes = (std::is_same<DST, ushort>::value) ? static_cast<int>(v_uint16::nlanes) :
                                                                static_cast<int>(v_int16::nlanes);

    if (length < nlanes)
        return 0;

    int x = 0;
    for (;;)
    {
        for (; x <= length - nlanes; x += nlanes)
        {
            v_float32 a1 = v_load_f32(&in1[x]);
            v_float32 a2 = v_load_f32(&in1[x + nlanes / 2]);
            v_float32 b1 = v_load_f32(&in2[x]);
            v_float32 b2 = v_load_f32(&in2[x + nlanes / 2]);

            addw_short_store(&out[x], v_round(v_fma(a1, alpha, v_fma(b1, beta, gamma))),
                                      v_round(v_fma(a2, alpha, v_fma(b2, beta, gamma))));
        }

        if (x < length)
        {
            x = length - nlanes;
            continue;  // process one more time (unaligned tail)
        }
        break;
    }
    return x;
}

template<typename SRC>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], uchar out[],
                               const v_float32& alpha, const v_float32& beta,
                               const v_float32& gamma, int length)
{
    constexpr int nlanes = v_uint8::nlanes;

    if (length < nlanes)
        return 0;

    int x = 0;

    for (;;)
    {
        for (; x <= length - nlanes; x += nlanes)
        {
            v_float32 a1 = v_load_f32(&in1[x]);
            v_float32 a2 = v_load_f32(&in1[x + nlanes / 4]);
            v_float32 a3 = v_load_f32(&in1[x + nlanes / 2]);
            v_float32 a4 = v_load_f32(&in1[x + 3 * nlanes / 4]);
            v_float32 b1 = v_load_f32(&in2[x]);
            v_float32 b2 = v_load_f32(&in2[x + nlanes / 4]);
            v_float32 b3 = v_load_f32(&in2[x + nlanes / 2]);
            v_float32 b4 = v_load_f32(&in2[x + 3 * nlanes / 4]);

            v_int32 sum1 = v_round(v_fma(a1, alpha, v_fma(b1, beta, gamma))),
                    sum2 = v_round(v_fma(a2, alpha, v_fma(b2, beta, gamma))),
                    sum3 = v_round(v_fma(a3, alpha, v_fma(b3, beta, gamma))),
                    sum4 = v_round(v_fma(a4, alpha, v_fma(b4, beta, gamma)));

            vx_store(&out[x], v_pack_u(v_pack(sum1, sum2), v_pack(sum3, sum4)));
        }

        if (x < length)
        {
            x = length - nlanes;
            continue;  // process one more time (unaligned tail)
        }
        break;
    }
    return x;
}

template<typename SRC>
CV_ALWAYS_INLINE int addw_simd(const SRC*, const SRC*, float*,
                               const v_float32&, const v_float32&,
                               const v_float32&, int length)
{
    // Cases where the dst type is float are successfully vectorized by the compiler.
    return 0;
}
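The `x = length - nlanes; continue;` construction in both specializations above handles the tail by stepping back and re-running the vector loop over the last full chunk, overlapping elements that were already processed. This is safe because the per-element computation is pure, so recomputing overlapped elements writes the same values. A scalar model of the pattern (hypothetical names; `nlanes` stands in for the vector width):

```cpp
#include <cassert>
#include <vector>

// Scalar model of the overlapping-tail loop used in addw_simd():
// process 'length' elements in chunks of 'nlanes'; if a partial tail
// remains, step back so the last chunk overlaps the previous one.
static int double_in_chunks(const int* in, int* out, int length, int nlanes) {
    if (length < nlanes)
        return 0;  // caller falls back to a plain scalar loop
    int x = 0;
    for (;;) {
        for (; x <= length - nlanes; x += nlanes)
            for (int k = 0; k < nlanes; ++k)   // stands in for one vector op
                out[x + k] = 2 * in[x + k];
        if (x < length) {
            x = length - nlanes;  // overlap: redo the last full chunk
            continue;
        }
        break;
    }
    return x;  // number of elements processed (all of them here)
}
```

With `length = 7` and `nlanes = 4`, the loop processes elements 0–3, then steps back to 3 and processes 3–6, covering the whole array without a scalar tail loop.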

Member Author


But if you prefer, I can move

v_float32 a = vx_setall_f32(_alpha);
v_float32 b = vx_setall_f32(_beta);
v_float32 g = vx_setall_f32(_gamma);

into each addw_simd() template specialization, though that would cause code duplication.

Member Author


I think it would be better and easier to add a static assert.
@alalek could you please comment?

Member


OK, this may work here.

Member Author


Applied.

Member


> if you prefer, I can move

This should be moved into SIMD functions.

Think about runtime dispatching (not in a scope of this PR) where we don't even know SIMD size and registers types of underlying functions.
Which SIMD value do you want to send there?


Consider following these rules:

  • no SIMD variables in dispatching code (generic C++ part)
  • no SIMD arguments in SIMD functions which are called from generic C++ code.
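The two rules above can be sketched as follows. This is an analogy, not the actual OpenCV universal-intrinsic API: `Vec4` stands in for `v_float32` and `broadcast()` for `vx_setall_f32()`. Scalar coefficients cross the dispatch boundary; broadcasting into vector registers happens only on the SIMD side.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // mock of v_float32

static Vec4 broadcast(float v) { return {v, v, v, v}; }  // mock of vx_setall_f32

// "SIMD" function: receives plain scalars and builds its vector constants
// internally, so its signature is independent of any vector type or width.
static void addw_vec4(const float* in1, const float* in2, float* out,
                      float alpha, float beta, float gamma) {
    Vec4 va = broadcast(alpha), vb = broadcast(beta), vg = broadcast(gamma);
    for (std::size_t i = 0; i < 4; ++i)  // one simulated vector operation
        out[i] = in1[i] * va[i] + in2[i] * vb[i] + vg[i];
}

// Dispatcher: generic C++ with scalar arguments only — no SIMD variables,
// so it stays valid even if the underlying vector width changes at runtime.
static void dispatch_addw(const float* in1, const float* in2, float* out,
                          float alpha, float beta, float gamma) {
    addw_vec4(in1, in2, out, alpha, beta, gamma);
}
```

Keeping vector types out of the dispatcher is what makes runtime dispatching possible: the generic code never needs to know whether the selected implementation uses 128-, 256-, or 512-bit registers.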

Member Author


Applied.

@anna-khakimova anna-khakimova force-pushed the ak/simd_addw_bitwise branch 2 times, most recently from 6fad040 to cca8f7e Compare January 28, 2021 15:28
@anna-khakimova anna-khakimova changed the title from "GAPI: Uni Intrinsics SIMD for AddWeighted kernel" to "GAPI: SIMD optimization for AddWeighted kernel." on Jan 28, 2021
@anna-khakimova anna-khakimova force-pushed the ak/simd_addw_bitwise branch 2 times, most recently from e2cebd1 to bb30bac Compare February 3, 2021 15:19
@anna-khakimova
Member Author

@alalek I've applied all the comments. It seems to me that Vitaly is on vacation. Could you please review, dismiss Vitaly's change request, and merge?

@anna-khakimova anna-khakimova requested a review from alalek February 4, 2021 08:27
Member

@alalek alalek left a comment


@rgarnov @dmatveev In this patch, the SIMD code targeting NEON is explicitly disabled. That is a very unusual case, especially for such simple loops.

We need to properly re-measure performance with NEON intrinsics enabled on the target platform.

On the public TX-1 (no access to the target platform) I see a 1.5-3.5x speedup.

Anna's report shows a 12% degradation.

We need a third, independent measurement through the OpenCV perf test scripts to confirm the performance observations.

// Fluid kernels: addWeighted
//
//---------------------------
#if CV_SSE2 | CV_AVX2 | CV_AVX512_SKX
Member


#if CV_SSE2 | CV_AVX2 | CV_AVX512_SKX

Why is just #if CV_SSE2 not enough?

Member Author


Applied.


@anna-khakimova
Member Author

@dbudniko please take a look.

@anna-khakimova
Member Author

anna-khakimova commented Feb 5, 2021

> @rgarnov @dmatveev In this patch, the SIMD code targeting NEON is explicitly disabled. That is a very unusual case, especially for such simple loops.
>
> We need to properly re-measure performance with NEON intrinsics enabled on the target platform.
>
> On the public TX-1 (no access to the target platform) I see a 1.5-3.5x speedup.
>
> Anna's report shows a 12% degradation.
>
> We need a third, independent measurement through the OpenCV perf test scripts to confirm the performance observations.

@alalek
@dbudniko tested this patch on the target ARM platform yesterday and got the following results:

perf_report_Dmitry.xlsx

As you can see, Dmitry's results match those that Orest and I obtained: no speedup is observed.

@anna-khakimova
Member Author

@alalek please take a look.

Contributor

@terfendail terfendail left a comment


I haven't collected performance statistics for this update yet; however, the change looks fine to me.

@alalek alalek merged commit fb3b297 into opencv:master Feb 5, 2021
@alalek alalek mentioned this pull request Apr 9, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
GAPI: SIMD optimization for AddWeighted kernel.

* Add, sub, absdiff kernels optimization

* AddW kernel

* And, or kernels

* AddWeighted refactoring and SIMD opt for AbsDiffC kernel

* Remove simd opt of AbsDiffC kernel

* Refactoring

* Applied comments

* Refactoring.Step2

* Applied comments.Step2


6 participants