
Improve performance on Arm64#20011

Merged
alalek merged 4 commits intoopencv:3.4from
Developer-Ecosystem-Engineering:3.4
Jun 1, 2021

Conversation

Contributor

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Apr 29, 2021

This patch will

  • Enable dot product intrinsics for macOS arm64 builds
  • Improve HAL primitives
    • reduction (sum, min, max, sad)
    • signmask
    • mul_expand
    • check_any / check_all

Results on an M1 MacBook Pro

% ./build-run
cc -march=armv8.4-a+dotprod -O3 -o neon-perf-test-O3 neon-perf-test.c
./neon-perf-test-O3 1000000
CONFIG:
 BUFFER SIZE : 16000
 VECTORS     : 1000
 ITERATIONS  : 1000000
 TOTAL SIZE  : 16000000000
                 Warmup : 75.268 ms

reduce_sum uint8x16:
                 OpenCV : 396.690 ms
             vaddlvq_u8 : 375.593 ms 1.056x

reduce_sum int8x16:
                 OpenCV : 395.530 ms
             vaddlvq_s8 : 375.733 ms 1.053x

reduce_sum uint16x8:
                 OpenCV : 321.458 ms
            vaddlvq_u16 : 320.453 ms 1.003x

reduce_sum int16x8:
                 OpenCV : 321.622 ms
            vaddlvq_s16 : 320.413 ms 1.004x

reduce_sum float32x4:
                 OpenCV : 396.202 ms
             vaddvq_f32 : 321.988 ms 1.230x

reduce_sum4 float32x4:
                 OpenCV : 218.655 ms
             vpaddq_f32 : 121.952 ms 1.793x

signmask uint8x16:
                 OpenCV : 550.870 ms
                  vtbl1 : 395.717 ms 1.392x

signmask uint16x8:
                 OpenCV : 507.869 ms
            vaddlvq_u16 : 320.548 ms 1.584x

signmask uint64x2:
                 OpenCV : 592.592 ms
             vaddvq_u64 : 320.520 ms 1.849x

check_all uint8x16:
                 OpenCV : 443.745 ms
              vminvq_u8 : 375.694 ms 1.181x

check_all uint16x8:
                 OpenCV : 443.904 ms
             vminvq_u16 : 375.623 ms 1.182x

check_all uint32x4:
                 OpenCV : 444.041 ms
             vminvq_u32 : 375.776 ms 1.182x

check_any uint8x16:
                 OpenCV : 403.869 ms
              vmaxvq_u8 : 375.780 ms 1.075x

check_any uint16x8:
                 OpenCV : 403.862 ms
             vmaxvq_u16 : 375.621 ms 1.075x

check_any uint32x4:
                 OpenCV : 403.871 ms
             vmaxvq_u32 : 375.621 ms 1.075x

vmul_expand uint8x16:
                 OpenCV : 202.145 ms
vmull_u8, vmull_high_u8 : 202.269 ms 0.999x  (better codegen on gcc, although run here is done with clang)

vmul_expand int8x16:
                 OpenCV : 202.147 ms
vmull_s8, vmull_high_s8 : 202.143 ms 1.000x  (better codegen on gcc, although run here is done with clang)

v_dotprod_expand uint8x16:
                 OpenCV : 512.093 ms
              vdotq_u32 : 202.430 ms 2.530x

v_dotprod_expand_fast uint8x16:
                 OpenCV : 278.198 ms

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=linux,docs,Android armeabi-v7a,ARMv7,ARMv8

Member

alalek commented May 1, 2021

/cc @fpetrogalli @jondea @tomoaki0705

Contributor

@tomoaki0705 tomoaki0705 left a comment


Great contribution.

Still, if this PR is supposed to be an M1-specific modification, I think it should be split into two different PRs (please see my comment). It's doing more than it says.

Lastly, not only for this PR but for anyone intending to write SIMD optimizations: please leverage the existing performance test code rather than your own well-written comparison code like neon-perf-test.c.
There is already test code that covers this performance measurement, so please stop using your own.
I would really appreciate it if you could leverage the existing test code, since that will make it much easier for other contributors to test locally.

@Developer-Ecosystem-Engineering
Contributor Author

Thank you for the feedback; we will review and come back.

  - Removes Apple Silicon specific workarounds
  - Makes #ifdef sections smaller for v_mul_expand cases
  - Moves dot product optimization to compiler optimization check
  - Adds 4x4 matrix transpose optimization
Contributor

@tomoaki0705 tomoaki0705 left a comment


Great work!
Mainly two points to fix:

  • Critical but not crucial: a compiler error about vtrn1q_u64
  • Critical and crucial: you've misunderstood some parts of the dispatch feature. I left some pointers, and I hope they'll help you.

v_##_Tpvec& b2, v_##_Tpvec& b3) \
{ \
/* -- Pass 1: 64b transpose */ \
_Tpvec##_t t0 = vtrn1q_##suffix##64(a0.val, a2.val); \
Contributor


vtrn1q_u64 takes uint64x2_t as an input, not uint32x4_t.
You need to recast using vreinterpretq_u64_u32.
That's the cause of this error message:

/core/hal/intrin_neon.hpp:2015:44: error: cannot convert 'const uint32x4_t' {aka 'const __vector(4) unsigned int'} to 'uint64x2_t' {aka '__vector(2) long unsigned int'}

Contributor Author


Fixed in the latest.

Based on the latest, we've removed dotprod entirely and will revisit it in a future PR.

Added explicit casts with v_transpose4x4().

This should resolve all open items with this PR.
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering changed the title Improve performance on Apple silicon Improve performance on Arm64 May 24, 2021
@Developer-Ecosystem-Engineering
Contributor Author

We've updated based on the feedback, thanks @tomoaki0705. We will revisit DOTPROD in a new PR so we can disentangle it from the rest of these changes, which we believe are ready at this point.

@tomoaki0705
Contributor

Great work!
Just to make sure before merge, it seems that these comment lines could be removed.


// EXPECT_EQ((typename R::lane_type)dataA[i + n] * dataB[i + n], resD[i]);

Remove two extraneous comments
@Developer-Ecosystem-Engineering
Contributor Author

Great work!
Just to make sure before merge, it seems that these comment lines could be removed.

Removed the commented-out lines.

@tomoaki0705
Contributor

Looks good to me

@alalek alalek merged commit 814550d into opencv:3.4 Jun 1, 2021

Labels

optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc)
