Use universal SIMD intrinsics for SIFT #17707
    int i, j, k, len = (radius*2+1)*(radius*2+1);

    float expf_scale = -1.f/(2.f * sigma * sigma);
    #if CV_SIMD
I think it makes sense to use the aligned version regardless of CV_SIMD. Even without explicit vectorization, it could help the compiler's autovectorizer.
Also, there is a utils::BufferArea class in core/utils/buffer_area.private.hpp that could be used to allocate memory for the internal buffer.
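To illustrate what an aligned load such as vx_load_aligned requires, here is a minimal standard-library sketch of carving an aligned float buffer out of over-allocated storage. It is a stand-in for illustration only: `AlignedFloatBuffer` and `is_aligned` are hypothetical names, and this is not the utils::BufferArea API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper (not part of OpenCV): owns raw storage and exposes a
// float pointer aligned to `alignment` bytes (must be a power of two).
struct AlignedFloatBuffer
{
    std::vector<unsigned char> storage;  // over-allocated backing store
    float* data = nullptr;               // aligned view into `storage`

    AlignedFloatBuffer(std::size_t count, std::size_t alignment)
        : storage(count * sizeof(float) + alignment)
    {
        void* p = storage.data();
        std::size_t space = storage.size();
        // std::align moves p forward to the next suitably aligned address
        // and returns nullptr if the request does not fit in `space`.
        void* aligned = std::align(alignment, count * sizeof(float), p, space);
        assert(aligned != nullptr);
        data = static_cast<float*>(aligned);
    }

    // Copying would leave `data` pointing into the source object's storage.
    AlignedFloatBuffer(const AlignedFloatBuffer&) = delete;
    AlignedFloatBuffer& operator=(const AlignedFloatBuffer&) = delete;
};

// True if p is a multiple of `alignment` bytes.
static bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With a 32-byte alignment (the width of an AVX2 register), `AlignedFloatBuffer buf(len, 32)` yields a `buf.data` pointer that satisfies `is_aligned(buf.data, 32)`, which is the precondition an aligned vector load relies on.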
    v_float32 ori = vx_load_aligned( Ori + k );
    v_int32 bin = v_round( nd360 * ori );

    bin = v_select(bin >= __n, bin - __n, bin);
Could you please check the performance of the bin = bin - (__n & (bin >= __n)) version? It could be a bit faster, judging by instruction throughput.
Tested on the AVX2 baseline:
| Performance test | v_select | bin - (__n & (bin >= __n)) |
|---|---|---|
| detect/57 | 105.21 | 106.29 |
| detect/58 | 98.78 | 98.99 |
| detect/59 | 206.89 | 211.78 |
| extract/18 | 53.14 | 53.80 |
| extract/19 | 37.49 | 38.14 |
| extract/20 | 153.39 | 153.34 |
| detectAndExtract/18 | 138.48 | 140.28 |
| detectAndExtract/19 | 154.51 | 154.39 |
| detectAndExtract/20 | 360.15 | 358.27 |
I'm not sure which is better.
Disassembly of calcOrientationHist:
bin - (__n & (bin >= __n))
b0150: c5 d5 66 c8 vpcmpgtd ymm1,ymm5,ymm0 ; ymm1 = __n > ymm0 ; ---
b0154: c5 f5 df cc vpandn ymm1,ymm1,ymm4 ; ymm1 = (~ymm1 & __n) ; ymm1 = (__n <= ymm0) & __n
b0158: c5 fd fa c1 vpsubd ymm0,ymm0,ymm1 ; ymm0 -= ymm1
b015c: c5 f5 72 e0 1f vpsrad ymm1,ymm0,0x1f ; ymm1 = ymm0 >> 0x1f (extract sign bit)
b0161: c5 dd db c9 vpand ymm1,ymm4,ymm1 ; ymm1 &= __n
b0165: c5 fd fe c1 vpaddd ymm0,ymm0,ymm1 ; ymm0 += ymm1
v_select
b0113: c5 dd 76 e4 vpcmpeqd ymm4,ymm4,ymm4
...
b0150: c5 e5 66 c8 vpcmpgtd ymm1,ymm3,ymm0 ; ymm1 = __n > ymm0
b0154: c5 fd fe d5 vpaddd ymm2,ymm0,ymm5 ; ymm2 = ymm0 + (- __n)
b0158: c5 dd ef c9 vpxor ymm1,ymm4,ymm1 ; ymm1 ^= all-ones (ymm4), i.e. ymm1 = ~ymm1
b015c: c4 e3 7d 4c c2 10 vpblendvb ymm0,ymm0,ymm2,ymm1 ; ymm0 = __n <= ymm0 ? ymm2 : ymm0
b0162: c5 e5 fe c8 vpaddd ymm1,ymm3,ymm0 ; ymm1 = __n + ymm0
b0166: c5 ed 72 e0 1f vpsrad ymm2,ymm0,0x1f ; ymm2 = (sign bit of ymm0)
b016b: c4 e3 7d 4c c1 20 vpblendvb ymm0,ymm0,ymm1,ymm2 ; ymm0 = ymm2 < 0 ? ymm1 : ymm0
Looks like there is no difference in performance. Let's retain the v_select version, as it is clearer.
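The two variants compared in this thread can be modelled one lane at a time in plain C++. This is a hedged sketch, not the PR's code: it assumes a vector comparison yields an all-ones mask (-1) where it holds and 0 elsewhere, and all function names are hypothetical. It also shows why the expression needs the parentheses used in the table header: in C++, `-` binds tighter than `&`, so `bin - __n & (bin >= __n)` would parse as `(bin - __n) & mask`.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one SIMD lane: comparison result as an all-ones or zero mask.
static inline int32_t lane_mask(bool cond) { return cond ? -1 : 0; }

// Select form: take (bin - n) where bin >= n, otherwise keep bin.
static inline int32_t wrap_select(int32_t bin, int32_t n)
{
    const int32_t mask = lane_mask(bin >= n);
    return (mask & (bin - n)) | (~mask & bin);
}

// Mask-and-subtract form: subtract n where bin >= n, subtract 0 elsewhere.
static inline int32_t wrap_masked_sub(int32_t bin, int32_t n)
{
    return bin - (n & lane_mask(bin >= n));
}

// What the unparenthesized expression would compute: (bin - n) & mask,
// which zeroes every lane where bin < n instead of leaving it unchanged.
static inline int32_t wrap_misparsed(int32_t bin, int32_t n)
{
    return (bin - n) & lane_mask(bin >= n);
}

// Asserts that both correct forms agree for every bin in [-n, 2n].
static void check_variants_agree(int32_t n)
{
    for (int32_t bin = -n; bin <= 2 * n; ++bin)
        assert(wrap_select(bin, n) == wrap_masked_sub(bin, n));
}
```

Both correct forms give identical results, which matches the conclusion above that the choice comes down to readability and measured performance.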
Test cases
Performance for SSE2 baseline
Performance for SSE3 baseline
Performance for SSE4.2 baseline
Performance for AVX2 baseline
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.