GAPI: SIMD optimization for AbsDiffC kernel#19233

Merged
alalek merged 6 commits into opencv:master from anna-khakimova:ak/simd_absdiffc on Feb 8, 2021
Conversation

@anna-khakimova
Member

@anna-khakimova anna-khakimova commented Dec 29, 2020

SIMD optimization for AbsDiffC kernel via univ intrinsics.

@rgarnov, @OrestChura please take a look.

Full performance report from latest revision:
AbsDiffC_full_perf_report.xlsx

build_gapi_standalone:Linux x64=ade-0.1.1f

force_builders=Linux AVX2,Custom,ARMv7

disable_ipp:Custom=ON
buildworker:Custom=linux-3
build_image:Custom=ubuntu:18.04
CPU_BASELINE:Custom=AVX512_SKX

Xbuildworker:Custom=linux-1,linux-2,linux-4
Xbuild_image:Custom=powerpc64le

@anna-khakimova anna-khakimova changed the title from "SIMD optimization for AbsDiffC kernel" to "GAPI: SIMD optimization for AbsDiffC kernel" on Dec 29, 2020
@anna-khakimova anna-khakimova force-pushed the ak/simd_absdiffc branch 7 times, most recently from 1e9ddaa to 261f45f on January 19, 2021 08:38
@anna-khakimova anna-khakimova force-pushed the ak/simd_absdiffc branch 6 times, most recently from 0df5429 to f72ef73 on January 20, 2021 15:03
@anna-khakimova
Member Author

@alalek please review.

Member

@alalek alalek left a comment


Need to reduce usage of native intrinsics.
The amount of code should be reduced too; there is no need to start with optimizing the one-time initialization part (we just can't measure these benefits through perf tests).


What about code dispatching between SSE4.2 / AVX2 / AVX512 in a single binary? // cc @dmatveev

return v_float32x16(_mm512_setr_ps(*scalar, *(scalar + 1), *scalar, *(scalar + 1),
*scalar, *(scalar + 1), *scalar, *(scalar + 1),
*scalar, *(scalar + 1), *scalar, *(scalar + 1),
*scalar, *(scalar + 1), *scalar, *(scalar + 1)));
Member


v_float32x16 ctor must be used instead of native intrinsics.

Member Author


Outdated

CV_ALWAYS_INLINE int absdiffc_simd_c1c2c4(const T in[], T out[],
const v_float32& s, const int length)
{
constexpr int nlanes = static_cast<int>(v_uint16::nlanes);
Member


typename T
v_uint16::nlanes

Code should be consistent.
Don't use assumptions in generic implementation (especially silently).

Member Author

@anna-khakimova anna-khakimova Jan 21, 2021


The point is that this function handles both the case when the data is of unsigned short type and when it is of signed short type. In both cases nlanes is the same: nlanes = vector length in bits / number of bits in the type. Here that is 128 (SSE4.2) / 16 = 8, so for both U16 and S16, nlanes = 8 on SSE4.2. So there is no particular need to separate these two cases.

Contributor


So, we are expecting T=ushort or T=short here; in this case, maybe it would be better to check that explicitly with asserts, something like:

bool isShort = std::is_same<T, ushort>::value || std::is_same<T, short>::value;
GAPI_Assert(isShort == true);

This should also be applied to absdiffc_simd_c3_impl, I think; it has the same issue.

Member Author


Don't use assumptions in generic implementation (especially silently).

Changed.

Comment on lines +1021 to +1024
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x)),
a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4)),
a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2)),
a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));
Member


Avoid declarations of multiple vars at once:

  • debugger is not able to show the right statement if this code goes out of buffer range

Member Author


Done.

Comment on lines +1021 to +1024
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x)),
a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4)),
a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2)),
a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));
Member


vx_load_expand_q
vx_load_expand_q
vx_load_expand_q
vx_load_expand_q

Reduce pressure on CPU's LOAD units. Fetched memory is equal to vx_load(in + x).
Load v_uint8 first and then repack in registers.

Member Author


I didn't quite understand your proposal. Could you please clarify your idea?

Member

@alalek alalek Jan 21, 2021


Replace the 4 load instructions with one.

Member Author

@anna-khakimova anna-khakimova Jan 21, 2021


But I need to initialize 4 vectors for further work with them. How can I load four vectors with one vx_load call?

Member Author

@anna-khakimova anna-khakimova Jan 21, 2021


@terfendail Could you please comment or clarify Alexander's proposal? How will Alexander's approach affect performance?

Contributor


I think Alexander means something like

v_uint16 ld0, ld1;
v_expand(vx_load(in+x), ld0, ld1);
v_float32 a1 = v_cvt_f32(v_expand_low(ld0));
v_float32 a2 = v_cvt_f32(v_expand_high(ld0));
v_float32 a3 = v_cvt_f32(v_expand_low(ld1));
v_float32 a4 = v_cvt_f32(v_expand_high(ld1));

Member Author


@terfendail Ok. Thank you so much for clarification!

Member Author

@anna-khakimova anna-khakimova Jan 26, 2021


I think Alexander means something like

v_uint16 ld0, ld1;
v_expand(vx_load(in+x), ld0, ld1);
v_float32 a1 = v_cvt_f32(v_expand_low(ld0));
v_float32 a2 = v_cvt_f32(v_expand_high(ld0));
v_float32 a3 = v_cvt_f32(v_expand_low(ld1));
v_float32 a4 = v_cvt_f32(v_expand_high(ld1));

@alalek I applied your proposal for 8U and gathered a performance report for AVX512 vectors. I observed an average performance degradation of 12.6%. For 8UC3 test cases the degradation is up to 33.3%. So I wouldn't like to apply this proposal to my snippet. Please take a look at the comparative performance report: //cc @dmatveev

vx_load_expand_q_vs_v_expand_high_low.xlsx

Member Author

@anna-khakimova anna-khakimova Jan 26, 2021


You can see applied proposal in the "Performance experiment" commit.

@anna-khakimova
Member Author

anna-khakimova commented Jan 21, 2021

Need to reduce usage of native intrinsics.
The amount of code should be reduced too; there is no need to start with optimizing the one-time initialization part (we just can't measure these benefits through perf tests).

What about code dispatching between SSE4.2 / AVX2 / AVX512 in a single binary? // cc @dmatveev

@alalek If I understand correctly, dispatching between SSE4.2 / AVX2 / AVX512 in a single binary will be possible only if I move my new universal intrinsics v_cvt_f32() and v_set_scalar() into the intrin_sse.hpp, intrin_avx.hpp, and intrin_avx512.hpp files, which, as a reviewer, you would find highly undesirable.
Or is there some other way to organize dynamic dispatching without adding new intrinsics to the files mentioned above?
Could you express your opinion on this matter, please? What approach do you propose?

@anna-khakimova
Member Author

Need to reduce usage of native intrinsics.
The amount of code should be reduced too; there is no need to start with optimizing the one-time initialization part (we just can't measure these benefits through perf tests).
What about code dispatching between SSE4.2 / AVX2 / AVX512 in a single binary? // cc @dmatveev

@alalek If I understand correctly, dispatching between SSE4.2 / AVX2 / AVX512 in a single binary will be possible only if I move my new universal intrinsics v_cvt_f32() and v_set_scalar() into the intrin_sse.hpp, intrin_avx.hpp, and intrin_avx512.hpp files, which, as a reviewer, you would find highly undesirable.
Or is there some other way to organize dynamic dispatching without adding new intrinsics to the files mentioned above?
Could you express your opinion on this matter, please? What approach do you propose?

@terfendail Could you please comment our proposals?

float init[6] = { *scalar, *(scalar + 1), *(scalar + 2), *scalar,
*(scalar + 1), *(scalar + 2) };

v_float32 s1 = v_set_scalar<3>(scalar);
Contributor


I think it would be better to extend the init array to v_float32::nlanes + 2 and then just load
s1 = vx_load(init + 0)
For 2 and 4 channels you could use the same approach, or try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32()), whichever shows better performance.

Member


agreed about simplifying/minimization of initialization code (no real performance impact)

Member Author

@anna-khakimova anna-khakimova Jan 26, 2021


I think it would be better to extend the init array to v_float32::nlanes + 2 and then just load
s1 = vx_load(init + 0)

Thanks for the advice. I'll try.

For 2 and 4 channels you could use the same approach, or try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32()), whichever shows better performance.

try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32())

It is not such a good idea. vx_lut_pairs() calls the (prefix)_i32gather_epi64() intrinsic, which has a latency of about 25. For comparison, vx_load() has a latency of 7.

Member Author

@anna-khakimova anna-khakimova Jan 26, 2021


And vx_lut_quads() has a total latency of about 33, while vx_load() has a latency of 7.

{
for (; x <= length - nlanes; x += nlanes)
{
v_float32 a1 = v_cvt_f32(vx_load_expand(in + x)),
Contributor


You could use v_cvt_f32(v_reinterpret_as_s32(vx_load_expand(in + x))) and avoid defining v_cvt_f32 for uint32

Member Author


Done.

@anna-khakimova anna-khakimova force-pushed the ak/simd_absdiffc branch 2 times, most recently from b9f5681 to b308343 on January 26, 2021 21:23
@anna-khakimova
Member Author

@alalek please review.

float init[size];
for (int i = 0; i < size; ++i)
{
init[i] = *(scalar + i % chan);
Member


No need to obfuscate code:

-*(scalar + i % chan)
+scalar[i % chan]

Member Author


Outdated.

T out[], int width)
{
constexpr int chan = 4;
constexpr int size = static_cast<int>(v_float32::nlanes) + 2;
Member


+ 2

Why?

Member Author

@anna-khakimova anna-khakimova Jan 29, 2021


As I've already written in the post above, loading into each subsequent coefficient vector occurs with an offset:

v_float32 s1 = vx_load(init);

#if CV_SIMD_WIDTH == 32
    v_float32 s2 = vx_load(init + 2);
    v_float32 s3 = vx_load(init + 1);
#else
    v_float32 s2 = vx_load(init + 1);
    v_float32 s3 = vx_load(init + 2); 
#endif

The maximal offset is 2.
Also, @terfendail has already written about it here.

The size of a vector equals nlanes. If loading starts at the second element of the init array, then it finishes at the (nlanes + 2)-th element of init.

Member


There is no such code in this function.

Member Author


Ok. It's a typo.

Member Author


Outdated.

float init[size];
for (int i = 0; i < size; ++i)
{
init[i] = *(scalar + i % chan);
Member


@dmatveev AFAIK, the Fluid backend performs per-row processing.
So it makes sense to implement support for initializer code for such constants.

Member Author


Ok. A scratch buffer was applied.

const v_float32& s1, const v_float32& s2,
const v_float32& s3, const int length)
{
CV_StaticAssert((std::is_same<T, ushort>::value) ||
Member


Is there CV_StaticAssert() support in standalone mode? IE?

Member Author


Changed to static_assert()

@anna-khakimova anna-khakimova force-pushed the ak/simd_absdiffc branch 2 times, most recently from fcb13db to 8ffecb7 on February 4, 2021 10:01
@anna-khakimova anna-khakimova requested a review from alalek February 4, 2021 10:23
{
for (int i = 0; i < num_vectors; ++i)
{
vectors[i] = v_load_f32(in + x + i * nlanes / 4);
Member


No need to perform hand-made registers spilling. Compilers are smart enough and can do that for you if necessary (moreover AVX512 has up to 32 vector registers)

This data is:

  • loaded once
  • used once

Move data loading to corresponding places.

Member Author

@anna-khakimova anna-khakimova Feb 5, 2021


@alalek Could you please clarify what you mean by hand-made register spilling? If you mean v_load_f32(), it isn't hand-made register spilling; it is just an overloaded function for ease of writing templates.
If you mean the for loop that initializes 12 vectors, are you sure you want to see 12 load lines instead of one?
I don't quite understand the essence of your request. Please clarify.

Member


  1. Data is loaded from the memory.
  2. On the same line data is stored back to the memory.
  3. Data re-loaded later once again for processing.

Do you see here redundant steps?

P.S. No need to load all 12 SIMD vectors at once. Load data on demand.

Member Author


P.S. No need to load all 12 SIMD vectors at once. Load data on demand.

It's necessary because of the specifics of the algorithm.

Member Author


No need to perform hand-made registers spilling. Compilers are smart enough and can do that for you if necessary (moreover AVX512 has up to 32 vector registers)

This data is:

  • loaded once
  • used once

Move data loading to corresponding places.

Reworked.

Comment on lines +1251 to +1371
static void initScratch(const GMatDesc& in, const cv::Scalar& _scalar, Buffer& scratch)
{
Member


Great 👍

@anna-khakimova anna-khakimova force-pushed the ak/simd_absdiffc branch 2 times, most recently from 0fd3dea to 1ebbd0c on February 5, 2021 16:02
@anna-khakimova anna-khakimova requested a review from alalek February 5, 2021 16:06
@anna-khakimova
Member Author

@alalek All comments were applied. Please check.

@anna-khakimova anna-khakimova force-pushed the ak/simd_absdiffc branch 2 times, most recently from d9a52e1 to fb7f668 on February 8, 2021 09:13
@anna-khakimova
Member Author

@alalek CI builds finished successfully. There are no unapplied comments.
Please check.

@alalek alalek merged commit 7ab3a80 into opencv:master Feb 8, 2021
@alalek alalek mentioned this pull request Apr 9, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
GAPI: SIMD optimization for AbsDiffC kernel

* SIMD optimization for AbsDiffC kernel

* Applied comments

* Applying comments and refactoring: Remove new univ intrinsics.

* Performance experiment

* Applied comments.Step2

* Applied comments. Step3
5 participants