
Improve performance on Arm64#20011

Merged
alalek merged 4 commits intoopencv:3.4from
Developer-Ecosystem-Engineering:3.4
Jun 1, 2021

Conversation

Contributor

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Apr 29, 2021

This patch will

  • Enable dot product intrinsics for macOS arm64 builds
  • Improve HAL primitives
    • reduction (sum, min, max, sad)
    • signmask
    • mul_expand
    • check_any / check_all

Results on an M1 MacBook Pro

% ./build-run
cc -march=armv8.4-a+dotprod -O3 -o neon-perf-test-O3 neon-perf-test.c
./neon-perf-test-O3 1000000
CONFIG:
 BUFFER SIZE : 16000
 VECTORS     : 1000
 ITERATIONS  : 1000000
 TOTAL SIZE  : 16000000000
                 Warmup : 75.268 ms

reduce_sum uint8x16:
                 OpenCV : 396.690 ms
             vaddlvq_u8 : 375.593 ms 1.056x

reduce_sum int8x16:
                 OpenCV : 395.530 ms
             vaddlvq_s8 : 375.733 ms 1.053x

reduce_sum uint16x8:
                 OpenCV : 321.458 ms
            vaddlvq_u16 : 320.453 ms 1.003x

reduce_sum int16x8:
                 OpenCV : 321.622 ms
            vaddlvq_s16 : 320.413 ms 1.004x

reduce_sum float32x4:
                 OpenCV : 396.202 ms
             vaddvq_f32 : 321.988 ms 1.230x

reduce_sum4 float32x4:
                 OpenCV : 218.655 ms
             vpaddq_f32 : 121.952 ms 1.793x

signmask uint8x16:
                 OpenCV : 550.870 ms
                  vtbl1 : 395.717 ms 1.392x

signmask uint16x8:
                 OpenCV : 507.869 ms
            vaddlvq_u16 : 320.548 ms 1.584x

signmask uint64x2:
                 OpenCV : 592.592 ms
             vaddvq_u64 : 320.520 ms 1.849x

check_all uint8x16:
                 OpenCV : 443.745 ms
              vminvq_u8 : 375.694 ms 1.181x

check_all uint16x8:
                 OpenCV : 443.904 ms
             vminvq_u16 : 375.623 ms 1.182x

check_all uint32x4:
                 OpenCV : 444.041 ms
             vminvq_u32 : 375.776 ms 1.182x

check_any uint8x16:
                 OpenCV : 403.869 ms
              vmaxvq_u8 : 375.780 ms 1.075x

check_any uint16x8:
                 OpenCV : 403.862 ms
             vmaxvq_u16 : 375.621 ms 1.075x

check_any uint32x4:
                 OpenCV : 403.871 ms
             vmaxvq_u32 : 375.621 ms 1.075x

vmul_expand uint8x16:
                 OpenCV : 202.145 ms
vmull_u8, vmull_high_u8 : 202.269 ms 0.999x  (better codegen on gcc, although run here is done with clang)

vmul_expand int8x16:
                 OpenCV : 202.147 ms
vmull_s8, vmull_high_s8 : 202.143 ms 1.000x  (better codegen on gcc, although run here is done with clang)

v_dotprod_expand uint8x16:
                 OpenCV : 512.093 ms
              vdotq_u32 : 202.430 ms 2.530x

v_dotprod_expand_fast uint8x16:
                 OpenCV : 278.198 ms

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=linux,docs,Android armeabi-v7a,ARMv7,ARMv8

Member

alalek commented May 1, 2021

/cc @fpetrogalli @jondea @tomoaki0705

Contributor

@tomoaki0705 tomoaki0705 left a comment


Great contribution.

Still, if this PR is supposed to be an M1-specific modification, I think it should be split into two different PRs (please see my comment). It's doing more than it says.

Lastly, not only for this PR but for anyone intending to write SIMD optimizations: please leverage the existing performance test code rather than your own well-written comparison code like neon-perf-test.c.
There is already test code that covers this performance measurement, so please stop using your own.
I would really appreciate it if you could leverage the existing test code, since that will make it much easier for other contributors to test locally.

@Developer-Ecosystem-Engineering
Contributor Author

Thank you for the feedback; we will review and come back.

  - Removes Apple Silicon specific workarounds
  - Makes #ifdef sections smaller for v_mul_expand cases
  - Moves dot product optimization to compiler optimization check
  - Adds 4x4 matrix transpose optimization
Contributor

@tomoaki0705 tomoaki0705 left a comment


Great work!
Mainly two points to fix:

  • Critical but not crucial: a compiler error about vtrn1q_u64
  • Critical and crucial: you've misunderstood some parts of the dispatch feature. I left some pointers, and I hope they'll help you.

v_##_Tpvec& b2, v_##_Tpvec& b3) \
{ \
/* -- Pass 1: 64b transpose */ \
_Tpvec##_t t0 = vtrn1q_##suffix##64(a0.val, a2.val); \
Contributor


vtrn1q_u64 takes uint64x2_t as an input, not uint32x4_t.
You need to recast using vreinterpretq_u64_u32.
That's the cause of this error message:

/core/hal/intrin_neon.hpp:2015:44: error: cannot convert 'const uint32x4_t' {aka 'const __vector(4) unsigned int'} to 'uint64x2_t' {aka '__vector(2) long unsigned int'}

Contributor Author


Fixed in the latest.

Based on the latest, we've removed dotprod entirely and will revisit it in a future PR.

Added explicit casts with v_transpose4x4().

This should resolve all open items with this PR.
@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering changed the title Improve performance on Apple silicon Improve performance on Arm64 May 24, 2021
@Developer-Ecosystem-Engineering
Contributor Author

We've updated based on the feedback, thanks @tomoaki0705. We will revisit DOTPROD in a new PR so we can disentangle it from the rest of these changes, which we believe are ready at this point.

@tomoaki0705
Contributor

Great work!
Just to make sure before merge, it seems that these comment lines could be removed.


// EXPECT_EQ((typename R::lane_type)dataA[i + n] * dataB[i + n], resD[i]);

Remove two extraneous comments
@Developer-Ecosystem-Engineering
Contributor Author

Great work!
Just to make sure before merge, it seems that these comment lines could be removed.

Removed the commented-out lines.

@tomoaki0705
Contributor

Looks good to me

@alalek alalek merged commit 814550d into opencv:3.4 Jun 1, 2021

Labels

optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc)
