Conversation
This patch will
- Enable dot product intrinsics for macOS arm64 builds
- Enable for macOS arm64 builds
- Improve HAL primitives
- reduction (sum, min, max, sad)
- signmask
- mul_expand
- check_any / check_all
Results on a M1 Macbook Pro
% ./build-run
cc -march=armv8.4-a+dotprod -O3 -o neon-perf-test-O3 neon-perf-test.c
./neon-perf-test-O3 1000000
CONFIG:
BUFFER SIZE : 16000
VECTORS : 1000
ITERATIONS : 1000000
TOTAL SIZE : 16000000000
Warmup : 75.268 ms
reduce_sum uint8x16:
OpenCV : 396.690 ms
vaddlvq_u8 : 375.593 ms 1.056x
reduce_sum int8x16:
OpenCV : 395.530 ms
vaddlvq_s8 : 375.733 ms 1.053x
reduce_sum uint16x8:
OpenCV : 321.458 ms
vaddlvq_u16 : 320.453 ms 1.003x
reduce_sum int16x8:
OpenCV : 321.622 ms
vaddlvq_s16 : 320.413 ms 1.004x
reduce_sum float32x4:
OpenCV : 396.202 ms
vaddvq_f32 : 321.988 ms 1.230x
reduce_sum4 float32x4:
OpenCV : 218.655 ms
vpaddq_f32 : 121.952 ms 1.793x
signmask uint8x16:
OpenCV : 550.870 ms
vtbl1 : 395.717 ms 1.392x
signmask uint16x8:
OpenCV : 507.869 ms
vaddlvq_u16 : 320.548 ms 1.584x
signmask uint64x2:
OpenCV : 592.592 ms
vaddvq_u64 : 320.520 ms 1.849x
check_all uint8x16:
OpenCV : 443.745 ms
vminvq_u8 : 375.694 ms 1.181x
check_all uint16x8:
OpenCV : 443.904 ms
vminvq_u16 : 375.623 ms 1.182x
check_all uint32x4:
OpenCV : 444.041 ms
vminvq_u32 : 375.776 ms 1.182x
check_any uint8x16:
OpenCV : 403.869 ms
vmaxvq_u8 : 375.780 ms 1.075x
check_any uint16x8:
OpenCV : 403.862 ms
vmaxvq_u16 : 375.621 ms 1.075x
check_any uint32x4:
OpenCV : 403.871 ms
vmaxvq_u32 : 375.621 ms 1.075x
vmul_expand uint8x16:
OpenCV : 202.145 ms
vmull_u8, vmull_high_u8 : 202.269 ms 0.999x (better codegen on gcc, although run here is done with clang)
vmul_expand int8x16:
OpenCV : 202.147 ms
vmull_s8, vmull_high_s8 : 202.143 ms 1.000x (better codegen on gcc, although run here is done with clang)
v_dotprod_expand uint8x16:
OpenCV : 512.093 ms
vdotq_u32 : 202.430 ms 2.530x
v_dotprod_expand_fast uint8x16:
OpenCV : 278.198 ms
tomoaki0705
left a comment
There was a problem hiding this comment.
Great contribution.
Still, if this PR is supposed to be M1 specific modification, I think it should be separated into 2 different PR (please see my comment)
It's doing more than it says.
Lastly, not only for this PR, but for anyone who is intended to write SIMD optimization, please leverage using existing performance test code, rather than using your own well-written-comparison-test-code like neon-perf-test.c
There is already a test code that covers the performance measurement, and please STOP using your own code.
I really appreciate it if you could consider leveraging existing test code since that will make it way more easier for other contributors to test locally.
|
Thank you for the feedback we will review and come back |
- Removes Apple Silicon specific workarounds - Makes #ifdef sections smaller for v_mul_expand cases - Moves dot product optimization to compiler optimization check - Adds 4x4 matrix transpose optimization
tomoaki0705
left a comment
There was a problem hiding this comment.
Great works!
Mainly two points to fix
- Critical but not crucial, compiler error about
vtrn1q_u64 - Critical and crucial, you've misunderstood some parts of dispatch feature. I left some pointers, and I hope it'll help you.
| v_##_Tpvec& b2, v_##_Tpvec& b3) \ | ||
| { \ | ||
| /* -- Pass 1: 64b transpose */ \ | ||
| _Tpvec##_t t0 = vtrn1q_##suffix##64(a0.val, a2.val); \ |
There was a problem hiding this comment.
vtrn1q_u64 takes uint64x2_t as an input, not uint32x4.
You need to recast using vreinterpretq_u64_u32
That's the cause of the error message
/core/hal/intrin_neon.hpp:2015:44: error: cannot convert 'const uint32x4_t' {aka 'const __vector(4) unsigned int'} to 'uint64x2_t' {aka '__vector(2) long unsigned int'}
There was a problem hiding this comment.
Fixed in the latest
Based on the latest, we've removed dotprod entirely and will revisit in a future PR. Added explicit cats with v_transpose4x4() This should resolve all opens with this PR
|
We've updated based on the feedback thanks @tomoaki0705. We will review DOTPROD in a new PR so we can disentangle it from the rest of these changes which we believe are ready at this point. |
|
Great work! |
Remove two extraneous comments
Removed the commented out lines |
|
Looks good to me |
This patch will
Results on a M1 Macbook Pro
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.