Skip to content

Vectorize flipHoriz and flipVert functions.#15555

Merged
alalek merged 6 commits intoopencv:3.4from
ChipKerchner:flipVectorize
Nov 1, 2019
Merged

Vectorize flipHoriz and flipVert functions.#15555
alalek merged 6 commits intoopencv:3.4from
ChipKerchner:flipVectorize

Conversation

@ChipKerchner
Copy link
Copy Markdown
Contributor

@ChipKerchner ChipKerchner commented Sep 20, 2019

Vectorize flipHoriz and flipVert functions.

flipVert - 2x faster on VSX and 1.5x on x86
flipHoriz - up to 40x+faster on VSX and x86.

Also improved for NEON.

force_builders=Linux AVX2,Custom
buildworker:Custom=linux-3
build_image:Custom=ubuntu:18.04
CPU_BASELINE:Custom=AVX512_SKX
disable_ipp=ON

@ChipKerchner
Copy link
Copy Markdown
Contributor Author

Need someone to review the NEON code.

#elif CV_VSX
static const vec_uchar16 perm = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
vec_uchar16 vec = vsx_ld(0, ptr);
return v_uint8x16(vec_perm(vec, vec, perm));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vec_revb(__int128) should do this in one instruction on P9. It may help reduce register pressure in tight kernels.

@alalek
Copy link
Copy Markdown
Member

alalek commented Sep 24, 2019

Please also check this CI results: https://ocv-power.imavr.com/#/opencv_pullrequests (PowerPC builds are failed)

@tomoaki0705
Copy link
Copy Markdown
Contributor

I don't get why these v_load_mirror series goes in to universal intrinsic.
That's the place to write those wrapper functions.

@tomoaki0705
Copy link
Copy Markdown
Contributor

I've confirmed that current HEAD ( 6fedc7d ) passes on various Arm boards.
NEON code works fine.

Though, I still recommend to put v_load_mirror series in to universal intrinsic.

@ChipKerchner
Copy link
Copy Markdown
Contributor Author

ChipKerchner commented Oct 8, 2019

I've prepared a change for an universal intrinsics v_reverse that addresses the above requests. #15662

@ChipKerchner
Copy link
Copy Markdown
Contributor Author

Could this PR be looked at and approved now that v_reverse works for all vector platforms?

#elif CV_VSX
static const vec_uchar16 perm = {8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7};
vec_uchar16 vec = vsx_ld(0, ptr);
return v_uint8x16(vec_perm(vec, vec, perm));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as xxswapd (e.g vec_permxxdi(vec,vec,2))?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated code.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

T t0, t1;

t0 = *((T*)((uchar*)src + i));
t1 = *((T*)((uchar*)src + j - sizeof(T)));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful.
At least on ARM these T* pointers are required to be aligned - see #14710

if( ((size_t)src0|(size_t)dst0|(size_t)src1|(size_t)dst1) % sizeof(int) == 0 )
{
#if CV_SIMD
for( ; i <= size.width - (v_int32::nlanes * 4); i += v_int32::nlanes * 4 )
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SIMD code doesn't require alignment check above, so it can be placed before if.

@alalek alalek merged commit ed7e427 into opencv:3.4 Nov 1, 2019
@alalek alalek mentioned this pull request Nov 4, 2019
@ChipKerchner ChipKerchner deleted the flipVectorize branch November 5, 2019 17:54
@mshabunin
Copy link
Copy Markdown
Contributor

mshabunin commented Dec 11, 2019

Looks like there is a problem with unaligned v_load on ARM platform:

Program received signal SIGBUS, Bus error.
cv::hal_baseline::v_load (ptr=0x7effe83c) at ../opencv/modules/core/include/opencv2/core/hal/intrin_neon.hpp:1212
1212    ../opencv/modules/core/include/opencv2/core/hal/intrin_neon.hpp: No such file or directory.
(gdb) bt
#0  cv::hal_baseline::v_load (ptr=0x7effe83c) at ../opencv/modules/core/include/opencv2/core/hal/intrin_neon.hpp:1212
#1  0x76829e7e in cv::flipHoriz_single<cv::hal_baseline::v_uint64x2> (esz=8, size=..., dstep=32, dst=0x7effe83c "", 
    sstep=32, src=0x7effe83c "") at ../opencv/modules/core/src/copy.cpp:581
#2  cv::flipHoriz (src=0x7effe83c "", sstep=32, dst=0x7effe83c "", dstep=32, size=..., esz=8)
    at ../opencv/modules/core/src/copy.cpp:698
#3  0x7682b2f0 in cv::flip (_src=..., _dst=..., flip_mode=1) at ../opencv/modules/core/src/copy.cpp:977
#4  0x76bc2e72 in cv::intersectConvexConvex (_p1=..., _p2=..., _p12=..., handleNested=true)
    at ../opencv/modules/imgproc/src/geometry.cpp:530
#5  0x004ead06 in opencv_test::(anonymous namespace)::Imgproc_IntersectConvexConvex_intersection_1_Test::Body (
    this=0x8bb830) at ../opencv/modules/imgproc/test/test_intersectconvexconvex.cpp:161
#6  0x004eab5c in opencv_test::(anonymous namespace)::Imgproc_IntersectConvexConvex_intersection_1_Test::TestBody (
    this=0x8bb830) at ../opencv/modules/imgproc/test/test_intersectconvexconvex.cpp:146
<cut>

Can be reproduced by running this test case: ./bin/opencv_test_imgproc --gtest_filter=Imgproc_IntersectConvexConvex.intersection_1

@tomoaki0705
Copy link
Copy Markdown
Contributor

tomoaki0705 commented Dec 12, 2019

I did some quick experiments. https://github.com/tomoaki0705/unalignedLoad
@ChipKerchner is right. There is no "unaligned load instruction" in NEON.
Still, I figured out that there is a corner case in the combination of Armv7 + 64bit loading.

As @mshabunin showed, the error comes when the load happens with v_uint64x2

#1  0x76829e7e in cv::flipHoriz_single<cv::hal_baseline::v_uint64x2>

Also, it happens on Armv7 (32bit) but not on Arm v8 (64bit)

Here's a summarized table result

Platform 32/64bit GCC loading structure memory address result
Jetson TX1 64 5.4.0 uint64x2 unaligned success
Jetson TX1 64 5.4.0 uint64x2 aligned success
ODROID-XU4 32 5.4.0 uint64x2 unaligned failure (SIGBUS)
ODROID-XU4 32 5.4.0 uint64x2 aligned success
ODROID-XU4 32 5.4.0 uint32x4 unaligned success

So, when loading from unaligned address using v_uint64x2 on Armv7, SIGBUS raises.
I can't believe this, but as far as I tried, the result shows above result.

How to fix, it's a difficult question.
May be replace v_uint64x2 with v_uint32x4 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants