[hal][neon] Optimize the v_dotprod_fast intrinsics for aarch64. #19486
alalek merged 5 commits into opencv:3.4
On Armv8 in AArch64 execution mode, we can skip the sequence v<op>_<ty>(vget_high_<ty>(x), vget_high_<ty>(y)) in favour of v<op>_high_<ty>(x, y). This gives recent compilers a better chance of using fewer data movement operations and of allocating registers better. See for example: https://godbolt.org/z/bPq7vd
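To make the substitution concrete, here is a minimal sketch of the two forms for one instance of the pattern (<op> = mlal, <ty> = s16). The helper name dot_lo_hi_s16 is hypothetical and is not part of OpenCV or of this patch; the preprocessor guards and the scalar fallback are assumptions added only so that the sketch builds on any host.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not an OpenCV API):
// out[i] = a[i] * b[i] + a[i + 4] * b[i + 4], for i = 0..3.
inline void dot_lo_hi_s16(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
#if defined(__ARM_NEON)
    int16x8_t x = vld1q_s16(a);
    int16x8_t y = vld1q_s16(b);
    // Widening multiply of the low halves: four int32 products.
    int32x4_t p = vmull_s16(vget_low_s16(x), vget_low_s16(y));
#if defined(__aarch64__)
    // AArch64: the _high_ form reads the upper halves of x and y directly.
    p = vmlal_high_s16(p, x, y);
#else
    // Armv7: the upper halves have to be extracted explicitly.
    p = vmlal_s16(p, vget_high_s16(x), vget_high_s16(y));
#endif
    vst1q_s32(out, p);
#else
    // Scalar fallback (assumption) so the sketch also builds on non-Arm hosts.
    for (int i = 0; i < 4; ++i)
        out[i] = int32_t(a[i]) * b[i] + int32_t(a[i + 4]) * b[i + 4];
#endif
}
```

All three paths compute the same values; only the AArch64 path uses the _high_ intrinsic, which is why it has to stay behind an AArch64-only guard to keep Armv7 builds working.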
Here is the speedup (as in new version / old version) I measured on Graviton2 on AWS (Neoverse N1). I used The total cycle count of the function
@fpetrogalli Please take a look at the CI builds for iOS. The patch breaks the build.
The same for Android on arm-v7a.
@asmorkalov - yes! Thank you, I have noticed. I am trying to reproduce the error. I am quite surprised because the macro
/cc @tomoaki0705
tomoaki0705 left a comment:
Generally looks good.
Please see small notes from me.
The fix is needed to prevent warnings when building for Armv7.
alalek left a comment:
@fpetrogalli Thank you for contribution!
@tomoaki0705 Thank you for review!