use universal SIMD intrinsics for SIFT by Yosshi999 · Pull Request #17707 · opencv/opencv

Yosshi999 · 2020-06-30T15:35:19Z

Test cases

feature2d_detect.
  detect/57  # GetParam() = (SIFT_DEFAULT, "cv/detectors_descriptors_evaluation/images_datasets/leuven/img1.png")
  detect/58  # GetParam() = (SIFT_DEFAULT, "stitching/a3.png")
  detect/59  # GetParam() = (SIFT_DEFAULT, "stitching/s2.jpg")
feature2d_extract.
  extract/18  # GetParam() = (SIFT_DEFAULT, "cv/detectors_descriptors_evaluation/images_datasets/leuven/img1.png")
  extract/19  # GetParam() = (SIFT_DEFAULT, "stitching/a3.png")
  extract/20  # GetParam() = (SIFT_DEFAULT, "stitching/s2.jpg")
feature2d_detectAndExtract.
  detectAndExtract/18  # GetParam() = (SIFT_DEFAULT, "cv/detectors_descriptors_evaluation/images_datasets/leuven/img1.png")
  detectAndExtract/19  # GetParam() = (SIFT_DEFAULT, "stitching/a3.png")
  detectAndExtract/20  # GetParam() = (SIFT_DEFAULT, "stitching/s2.jpg")

Performance for SSE2 baseline

Performance test	Reference time	PR time	Speedup
detect/57	110.77	110.82	0.999549
detect/58	103.91	102.08	1.017927
detect/59	221.66	218.72	1.013442
extract/18	68.4	61.09	1.11966
extract/19	46.52	42.5	1.094588
extract/20	206.91	180.9	1.143781
detectAndExtract/18	160.03	149.95	1.067222
detectAndExtract/19	184.96	170.37	1.085637
detectAndExtract/20	435.27	397.29	1.095598
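
The Speedup column in these tables appears to be the reference time divided by the PR time, so values above 1 mean the PR is faster. A minimal sketch of that arithmetic follows; the function name `speedup` is hypothetical and not part of the PR.

```cpp
#include <cassert>

// Speedup as it appears to be defined in the tables above:
// reference time divided by PR time (same unit for both).
double speedup(double reference_time, double pr_time) {
    assert(pr_time > 0.0);  // guard against division by zero
    return reference_time / pr_time;
}
```

For example, `speedup(68.4, 61.09)` reproduces the extract/18 row of the SSE2 table to the precision shown there.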

Performance for SSE3 baseline

Performance test	Reference time	PR time	Speedup
detect/57	109.4	108.8	1.005515
detect/58	103.53	101.53	1.019699
detect/59	221.11	214.64	1.030143
extract/18	69.48	61.42	1.131228
extract/19	47.06	42.83	1.098763
extract/20	211.52	181.71	1.164053
detectAndExtract/18	159.5	149.9	1.064043
detectAndExtract/19	184.8	170.27	1.085335
detectAndExtract/20	438.35	398.1	1.101105

Performance for SSE4.2 baseline

Performance test	Reference time	PR time	Speedup
detect/57	111.3	108.26	1.028081
detect/58	104.05	101.76	1.022504
detect/59	220.64	230.93	0.955441
extract/18	69.47	59.24	1.172687
extract/19	47.29	41.35	1.143652
extract/20	211.78	174.29	1.215101
detectAndExtract/18	160.09	147.61	1.084547
detectAndExtract/19	186.66	167.1	1.117056
detectAndExtract/20	439.57	405.87	1.083032

Performance for AVX2 baseline

Performance test	Reference time	PR time	Speedup
detect/57	105.27	105.53	0.997536
detect/58	97.97	98.5	0.994619
detect/59	208.17	211.63	0.983651
extract/18	52.31	53.26	0.982163
extract/19	37.25	37.68	0.988588
extract/20	150.7	153.38	0.982527
detectAndExtract/18	138.54	140.22	0.988019
detectAndExtract/19	152.58	155.6	0.980591
detectAndExtract/20	358.02	359.65	0.995468

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under OpenCV (BSD) License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
The PR is proposed to proper branch
There is reference to original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

alalek

Looks good to me! Thank you 👍

terfendail · 2020-07-09T13:28:31Z

modules/features2d/src/sift.simd.hpp

    int i, j, k, len = (radius*2+1)*(radius*2+1);

    float expf_scale = -1.f/(2.f * sigma * sigma);
+#if CV_SIMD


I think it makes sense to use the aligned version regardless of CV_SIMD. Even without explicit vectorization, it could help the compiler's autovectorizer.

There is also a utils::BufferArea class in core/utils/buffer_area.private.hpp that could be used to allocate memory for the internal buffer.
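
As a rough illustration of the alignment point, here is a sketch in standard C++ only. It does not use utils::BufferArea or any OpenCV API; `kAlign`, `kMaxLen`, `is_aligned` and `scaled_sum` are hypothetical names, and the 32-byte alignment is an assumption chosen to match one AVX2 register. The scratch buffer is declared with `alignas` in the same scope as the loops, so the compiler can see the alignment whether or not the loops are vectorized by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlign = 32;  // assumption: width of one AVX2 register in bytes
constexpr int kMaxLen = 64;         // hypothetical upper bound on the buffer length

// True when p sits on a kAlign-byte boundary.
inline bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Writes len scaled values into an aligned scratch buffer, then sums them.
float scaled_sum(const float* src, int len, float scale) {
    assert(len >= 0 && len <= kMaxLen);
    alignas(kAlign) float buf[kMaxLen];  // aligned with or without explicit SIMD
    assert(is_aligned(buf));
    for (int i = 0; i < len; ++i)
        buf[i] = src[i] * scale;
    float sum = 0.f;
    for (int i = 0; i < len; ++i)
        sum += buf[i];
    return sum;
}
```

Whether the autovectorizer actually benefits depends on the compiler and flags, so this is something to measure rather than assume.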

terfendail · 2020-07-09T13:51:11Z

modules/features2d/src/sift.simd.hpp

+        v_float32 ori = vx_load_aligned( Ori + k );
+        v_int32 bin = v_round( nd360 * ori );
+
+        bin = v_select(bin >= __n, bin - __n, bin);


Could you please check the performance of the bin = bin - (__n & (bin >= __n)) version? It could be a bit faster according to instruction throughput.
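
To make the two candidates concrete, the sketch below models a single 32-bit SIMD lane with scalar integers, where a comparison yields all ones (-1) or 0. It is not the OpenCV universal-intrinsics API; all function names are hypothetical. It also shows why the parentheses matter: in C++, `-` binds tighter than `&`, so an expression written as `bin - n & mask` parses as `(bin - n) & mask`.

```cpp
#include <cassert>
#include <cstdint>

// Lane-wise comparison: all bits set when a >= b, otherwise zero.
inline std::int32_t lane_ge(std::int32_t a, std::int32_t b) {
    return a >= b ? -1 : 0;
}

// Select-style wrap: take bin - n where the mask is set, bin elsewhere.
inline std::int32_t wrap_select(std::int32_t bin, std::int32_t n) {
    assert(n > 0);
    const std::int32_t mask = lane_ge(bin, n);
    return (mask & (bin - n)) | (~mask & bin);
}

// Mask-and-subtract wrap: subtract n only where the mask is set.
inline std::int32_t wrap_mask(std::int32_t bin, std::int32_t n) {
    assert(n > 0);
    return bin - (n & lane_ge(bin, n));
}

// How the unparenthesized spelling parses: the mask is applied to the
// difference, which zeroes every lane where bin < n.
inline std::int32_t wrap_misparsed(std::int32_t bin, std::int32_t n) {
    assert(n > 0);
    return (bin - n) & lane_ge(bin, n);
}
```

With an example value of n = 36, `wrap_select` and `wrap_mask` agree for every bin in [0, 2n), while `wrap_misparsed(10, 36)` yields 0 instead of 10.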

Yosshi999 · 2020-07-09

Tested on the AVX2 baseline:

Performance test	v_select	bin - (__n & (bin >= __n))
detect/57	105.21	106.29
detect/58	98.78	98.99
detect/59	206.89	211.78
extract/18	53.14	53.80
extract/19	37.49	38.14
extract/20	153.39	153.34
detectAndExtract/18	138.48	140.28
detectAndExtract/19	154.51	154.39
detectAndExtract/20	360.15	358.27

I'm not sure which is better.

Disassembly of calcOrientationHist

bin - (__n & (bin >= __n))

    b0150: c5 d5 66 c8 vpcmpgtd ymm1,ymm5,ymm0 ; ymm1 = __n > ymm0 ; ---
    b0154: c5 f5 df cc vpandn ymm1,ymm1,ymm4 ; ymm1 = (!ymm1 & __n) ; ymm1 = (__n <= ymm0) & ymm0
    b0158: c5 fd fa c1 vpsubd ymm0,ymm0,ymm1 ; ymm0 -= ymm1
    b015c: c5 f5 72 e0 1f vpsrad ymm1,ymm0,0x1f ; ymm1 = ymm0 >> 0x1f (extract sign bit)
    b0161: c5 dd db c9 vpand ymm1,ymm4,ymm1 ; ymm1 &= __n
    b0165: c5 fd fe c1 vpaddd ymm0,ymm0,ymm1 ; ymm0 += ymm1

v_select

    b0113: c5 dd 76 e4 vpcmpeqd ymm4,ymm4,ymm4
    ...
    b0150: c5 e5 66 c8 vpcmpgtd ymm1,ymm3,ymm0 ; ymm1 = __n > ymm0
    b0154: c5 fd fe d5 vpaddd ymm2,ymm0,ymm5 ; ymm2 = ymm0 + (- __n)
    b0158: c5 dd ef c9 vpxor ymm1,ymm4,ymm1 ; ymm1 ^= 1
    b015c: c4 e3 7d 4c c2 10 vpblendvb ymm0,ymm0,ymm2,ymm1 ; ymm0 = __n <= ymm0 ? ymm2 : ymm0
    b0162: c5 e5 fe c8 vpaddd ymm1,ymm3,ymm0 ; ymm1 = __n + ymm0
    b0166: c5 ed 72 e0 1f vpsrad ymm2,ymm0,0x1f ; ymm2 = (sign bit of ymm0)
    b016b: c4 e3 7d 4c c1 20 vpblendvb ymm0,ymm0,ymm1,ymm2 ; ymm0 = ymm2 < 0 ? ymm1 : ymm0
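
The vpsrad / vpand / vpaddd sequence that closes the first listing looks like a branch-free form of "if bin is negative, add n". The scalar sketch below models that idea for one 32-bit lane; the function names are hypothetical and this is not code from the PR. Note that right-shifting a negative signed value is implementation-defined before C++20, although mainstream compilers shift arithmetically.

```cpp
#include <cassert>
#include <cstdint>

// Branch-free "if (bin < 0) bin += n" for one 32-bit lane.
inline std::int32_t wrap_negative(std::int32_t bin, std::int32_t n) {
    assert(n > 0);
    // Arithmetic shift by 31 replicates the sign bit:
    // -1 (all ones) for negative bin, 0 otherwise (like vpsrad ..., 0x1f).
    const std::int32_t sign_mask = bin >> 31;
    return bin + (n & sign_mask);  // vpand, then vpaddd
}

// Both wraps in sequence: first bin >= n, then bin < 0.
inline std::int32_t wrap_both(std::int32_t bin, std::int32_t n) {
    assert(n > 0);
    const std::int32_t ge_mask = bin >= n ? -1 : 0;
    bin -= n & ge_mask;
    return wrap_negative(bin, n);
}
```

For inputs in [-n, 2n), `wrap_both` returns a value in [0, n), which is the range a histogram bin index needs.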

terfendail · 2020-07-10

Looks like there is no difference in performance. Let's retain the v_select version, as it is clearer.

use universal SIMD intrinsics for SIFT

920c180

alalek approved these changes Jul 8, 2020

opencv-pushbot merged commit 8931c68 into opencv:3.4 Jul 8, 2020

alalek mentioned this pull request Jul 8, 2020

Merge 3.4 #17785

Merged

terfendail reviewed Jul 9, 2020

alalek mentioned this pull request Jul 11, 2020

3.4: broken PowerPC build #17815

Closed

opencv-pushbot merged 1 commit into opencv:3.4 from Yosshi999:gsoc_sift-universal-intrinsic
