
Convert HOG from SSE SIMD to HAL - 35-45% faster on Power (VSX)#15199

Merged
alalek merged 4 commits into opencv:3.4 from ChipKerchner:hogToHal
Aug 8, 2019


@ChipKerchner
Contributor

@ChipKerchner ChipKerchner commented Jul 31, 2019

Convert HOG from SSE SIMD to HAL - 35-45% faster on Power (VSX).

force_builders=ARMv8,Custom
buildworker:Custom=linux-1,linux-2,linux-4
docker_image:Custom=powerpc64le

@mshabunin
Contributor

Shouldn't it be transformed differently?

Before:

#if CV_SSE2
...
#elif CV_NEON
...
#else
...
#endif

After:

#if CV_SIMD128 // or CV_SIMD for wide universal intrinsics
...
#else
...
#endif
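A minimal sketch of the suggested shape, assuming a hypothetical helper `scaleAdd` that is not part of the HOG code: one `CV_SIMD128` branch written with universal intrinsics serves SSE2, NEON and VSX alike, and a scalar loop handles the remainder (or everything when no SIMD backend is available). The `__has_include` guard and the fallback definition of `CV_SIMD128` are additions made here only so the sketch also compiles without OpenCV headers on the include path.

```cpp
// Assumption: OpenCV headers may or may not be available at build time.
#if defined(__has_include)
#  if __has_include(<opencv2/core/hal/intrin.hpp>)
#    include <opencv2/core/hal/intrin.hpp>
#  endif
#endif
#ifndef CV_SIMD128
#  define CV_SIMD128 0   // no universal intrinsics: scalar path only
#endif

// Hypothetical helper (not from the PR): dst[i] = src[i] * scale + offset.
static void scaleAdd(const float* src, float* dst, int n, float scale, float offset)
{
    int i = 0;
#if CV_SIMD128
    // One code path for every 128-bit backend (SSE2, NEON, VSX).
    cv::v_float32x4 vscale  = cv::v_setall_f32(scale);
    cv::v_float32x4 voffset = cv::v_setall_f32(offset);
    for (; i <= n - 4; i += 4)
        cv::v_store(dst + i, cv::v_muladd(cv::v_load(src + i), vscale, voffset));
#endif
    // Scalar tail, and the full loop when CV_SIMD128 is 0.
    for (; i < n; i++)
        dst[i] = src[i] * scale + offset;
}
```

The platform-specific `#if CV_SSE2 ... #elif CV_NEON` ladder disappears because the backend selection happens inside the intrinsics header rather than in the calling code.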

@ChipKerchner
Contributor Author

I could remove the NEON-specific code and use CV_SIMD128 for all 3 platforms (SSE2, NEON, VSX) instead.

I just did NOT have a way of testing NEON.

@alalek
Member

alalek commented Aug 1, 2019

Yes, this is the right way.
We will run the tests on NEON-capable hardware.


v_int32x4 sign = (ione & v_reinterpret_as_s32(_angle < fzero));
v_int32x4 _hidx = v_trunc(_angle);
_hidx -= sign;
Contributor


I suppose the v_floor intrinsic could be used here to compute _hidx instead of lines 502-504.
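To make the comparison concrete, here is a scalar model of one lane, offered as a sketch rather than the PR's code. The function names `truncThenFix` and `floorDirect` are hypothetical. It assumes `v_trunc` rounds toward zero and that the comparison yields an all-ones mask for negative lanes, so `ione & mask` is 1 exactly when the angle is negative.

```cpp
#include <cmath>
#include <cstdint>

// Scalar model of the three reviewed lines: truncate toward zero,
// then subtract 1 for every negative input.
static int32_t truncThenFix(float angle)
{
    int32_t sign = (angle < 0.0f) ? 1 : 0;       // ione & mask(angle < 0)
    int32_t hidx = static_cast<int32_t>(angle);  // models v_trunc
    return hidx - sign;
}

// Scalar model of the suggested replacement (v_floor).
static int32_t floorDirect(float angle)
{
    return static_cast<int32_t>(std::floor(angle));
}
```

In this model the two agree for positive values and for negative non-integers (both give -1 for -0.5), but they differ on exact negative integers: -2.0 becomes -3 with truncate-and-fix and -2 with floor. Whether that edge case matters for the HOG bin computation is not discussed in this thread.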

v_int32x4 mask0 = _hidx >> 31;
v_int32x4 it0 = mask0 & _nbins;
mask0 = (_hidx < _nbins);
v_int32x4 it1 = ~mask0 & _nbins;
Contributor


I think the code would be simpler if >= were used instead of inverting the < comparison.
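A scalar sketch of one lane shows why the two forms are interchangeable. The helper names are hypothetical, and the final combining step `hidx + it0 - it1` is an assumption, since the thread does not show the line that follows the reviewed snippet.

```cpp
#include <cstdint>

// All-ones (-1) when cond is true, 0 otherwise: a scalar stand-in
// for one lane of a SIMD comparison result.
static int32_t laneMask(bool cond) { return cond ? -1 : 0; }

// As written in the reviewed snippet: compare with <, then invert.
static int32_t upperFixInverted(int32_t hidx, int32_t nbins)
{
    int32_t mask0 = laneMask(hidx < nbins);
    return ~mask0 & nbins;
}

// As suggested: compare with >= directly, no inversion needed.
static int32_t upperFixDirect(int32_t hidx, int32_t nbins)
{
    return laneMask(hidx >= nbins) & nbins;
}

// Assumed use of both corrections: wrap hidx into [0, nbins).
static int32_t wrapBin(int32_t hidx, int32_t nbins)
{
    int32_t it0 = laneMask(hidx < 0) & nbins;   // models (hidx >> 31) & nbins
    int32_t it1 = upperFixDirect(hidx, nbins);  // nbins when hidx >= nbins
    return hidx + it0 - it1;
}
```

Both `upperFix` variants return `nbins` when the index has run past the last bin and 0 otherwise, so the `>=` form saves one bitwise NOT per vector without changing the result.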

Member

@alalek alalek left a comment


Well done!

int32x4_t ifour = vdupq_n_s32(4);
#if CV_SIMD128
const float a[] = { 0.0, 1.0, 2.0, 3.0 };
v_float32x4 idx = v_load((float *)a);
Member


v_float32x4 idx(0.0f, 1.0f, 2.0f, 3.0f); here and above (line 251).

Member

@alalek alalek left a comment


Well done! Thank you 👍

@alalek alalek merged commit d513fb4 into opencv:3.4 Aug 8, 2019
@ChipKerchner ChipKerchner deleted the hogToHal branch August 8, 2019 18:59
@alalek alalek mentioned this pull request Aug 13, 2019