use universal SIMD intrinsics for SIFT #17707

Merged
opencv-pushbot merged 1 commit into opencv:3.4 from Yosshi999:gsoc_sift-universal-intrinsic
Jul 8, 2020

@Yosshi999
Contributor

Test cases
feature2d_detect.
  detect/57  # GetParam() = (SIFT_DEFAULT, "cv/detectors_descriptors_evaluation/images_datasets/leuven/img1.png")
  detect/58  # GetParam() = (SIFT_DEFAULT, "stitching/a3.png")
  detect/59  # GetParam() = (SIFT_DEFAULT, "stitching/s2.jpg")
feature2d_extract.
  extract/18  # GetParam() = (SIFT_DEFAULT, "cv/detectors_descriptors_evaluation/images_datasets/leuven/img1.png")
  extract/19  # GetParam() = (SIFT_DEFAULT, "stitching/a3.png")
  extract/20  # GetParam() = (SIFT_DEFAULT, "stitching/s2.jpg")
feature2d_detectAndExtract.
  detectAndExtract/18  # GetParam() = (SIFT_DEFAULT, "cv/detectors_descriptors_evaluation/images_datasets/leuven/img1.png")
  detectAndExtract/19  # GetParam() = (SIFT_DEFAULT, "stitching/a3.png")
  detectAndExtract/20  # GetParam() = (SIFT_DEFAULT, "stitching/s2.jpg")
Performance for SSE2 baseline
| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| detect/57 | 110.77 | 110.82 | 0.999549 |
| detect/58 | 103.91 | 102.08 | 1.017927 |
| detect/59 | 221.66 | 218.72 | 1.013442 |
| extract/18 | 68.4 | 61.09 | 1.11966 |
| extract/19 | 46.52 | 42.5 | 1.094588 |
| extract/20 | 206.91 | 180.9 | 1.143781 |
| detectAndExtract/18 | 160.03 | 149.95 | 1.067222 |
| detectAndExtract/19 | 184.96 | 170.37 | 1.085637 |
| detectAndExtract/20 | 435.27 | 397.29 | 1.095598 |
Performance for SSE3 baseline
| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| detect/57 | 109.4 | 108.8 | 1.005515 |
| detect/58 | 103.53 | 101.53 | 1.019699 |
| detect/59 | 221.11 | 214.64 | 1.030143 |
| extract/18 | 69.48 | 61.42 | 1.131228 |
| extract/19 | 47.06 | 42.83 | 1.098763 |
| extract/20 | 211.52 | 181.71 | 1.164053 |
| detectAndExtract/18 | 159.5 | 149.9 | 1.064043 |
| detectAndExtract/19 | 184.8 | 170.27 | 1.085335 |
| detectAndExtract/20 | 438.35 | 398.1 | 1.101105 |
Performance for SSE4.2 baseline
| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| detect/57 | 111.3 | 108.26 | 1.028081 |
| detect/58 | 104.05 | 101.76 | 1.022504 |
| detect/59 | 220.64 | 230.93 | 0.955441 |
| extract/18 | 69.47 | 59.24 | 1.172687 |
| extract/19 | 47.29 | 41.35 | 1.143652 |
| extract/20 | 211.78 | 174.29 | 1.215101 |
| detectAndExtract/18 | 160.09 | 147.61 | 1.084547 |
| detectAndExtract/19 | 186.66 | 167.1 | 1.117056 |
| detectAndExtract/20 | 439.57 | 405.87 | 1.083032 |
Performance for AVX2 baseline
| Performance test | Reference time | PR time | Speedup |
|---|---|---|---|
| detect/57 | 105.27 | 105.53 | 0.997536 |
| detect/58 | 97.97 | 98.5 | 0.994619 |
| detect/59 | 208.17 | 211.63 | 0.983651 |
| extract/18 | 52.31 | 53.26 | 0.982163 |
| extract/19 | 37.25 | 37.68 | 0.988588 |
| extract/20 | 150.7 | 153.38 | 0.982527 |
| detectAndExtract/18 | 138.54 | 140.22 | 0.988019 |
| detectAndExtract/19 | 152.58 | 155.6 | 0.980591 |
| detectAndExtract/20 | 358.02 | 359.65 | 0.995468 |
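
For reading the Speedup column: the listed values are consistent with reference time divided by PR time, so values above 1 mean the PR is faster and values below 1 mean it is slower. A minimal sketch of that arithmetic follows; the function name `speedup` is made up for this illustration and is not part of the OpenCV perf tooling.

```cpp
#include <cassert>

// Speedup as it appears in the tables: reference time / PR time.
// > 1 means the PR is faster, < 1 means it is slower.
// (Hypothetical helper for illustration only.)
double speedup(double reference_time, double pr_time) {
    assert(pr_time > 0.0);  // a zero or negative timing would be meaningless
    return reference_time / pr_time;
}
```

For example, `speedup(68.4, 61.09)` corresponds to the extract/18 row of the SSE2 table.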

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under OpenCV (BSD) License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Member

@alalek alalek left a comment

Looks good to me! Thank you 👍

@opencv-pushbot opencv-pushbot merged commit 8931c68 into opencv:3.4 Jul 8, 2020
@alalek alalek mentioned this pull request Jul 8, 2020
int i, j, k, len = (radius*2+1)*(radius*2+1);

float expf_scale = -1.f/(2.f * sigma * sigma);
#if CV_SIMD
Contributor

I think it makes sense to use the aligned version regardless of CV_SIMD. Even without explicit vectorization, it could help the compiler's autovectorizer.
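
As a rough illustration of this suggestion (not the code from the patch), the sketch below keeps its buffers aligned unconditionally and processes them with a plain scalar loop that a compiler may choose to vectorize. The names `AlignedBuffer`, `kAlign`, `kLen`, `is_aligned` and `weighted_sum` are hypothetical, and the 32-byte alignment is an assumption chosen to match 256-bit registers. Whether alignment actually changes the generated code depends on the compiler and target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed values for this sketch only.
constexpr std::size_t kAlign = 32;  // enough for a 256-bit vector register
constexpr std::size_t kLen = 64;    // buffer capacity in floats

// The buffer is aligned whether or not explicit SIMD is compiled in.
struct AlignedBuffer {
    alignas(kAlign) float data[kLen];
};

// True if p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Plain scalar loop over the first n elements; no intrinsics are used,
// so any vectorization here is left entirely to the compiler.
float weighted_sum(const AlignedBuffer& w, const AlignedBuffer& x, std::size_t n) {
    assert(n <= kLen);
    float acc = 0.f;
    for (std::size_t i = 0; i < n; ++i)
        acc += w.data[i] * x.data[i];
    return acc;
}
```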

Contributor

Also, there is a utils::BufferArea class in core/utils/buffer_area.private.hpp that could be used to allocate memory for the internal buffer.

v_float32 ori = vx_load_aligned( Ori + k );
v_int32 bin = v_round( nd360 * ori );

bin = v_select(bin >= __n, bin - __n, bin);
Contributor

Could you please check the performance of the bin = bin - (__n & (bin >= __n)) version? It could be a bit faster, judging by instruction throughput.
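
To make the two variants concrete, here is a scalar model of a single SIMD lane, written as a sketch rather than with the real universal intrinsics. It assumes that a lane-wise comparison yields all-ones (-1) for true and 0 for false, which is what the mask trick relies on. The helper names `lane_ge`, `wrap_select`, `wrap_mask` and `variants_agree` are invented for this example. Note that the parentheses matter in C++: `-` binds tighter than `&`, so `bin - n & mask` would parse as `(bin - n) & mask`.

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for a lane-wise comparison: all-ones if a >= b, else 0.
inline std::int32_t lane_ge(std::int32_t a, std::int32_t b) {
    return a >= b ? -1 : 0;
}

// Select-style variant: blend (bin - n) and bin according to the mask.
inline std::int32_t wrap_select(std::int32_t bin, std::int32_t n) {
    std::int32_t mask = lane_ge(bin, n);
    return (mask & (bin - n)) | (~mask & bin);
}

// Mask-arithmetic variant: subtract n only in lanes where the mask is set.
inline std::int32_t wrap_mask(std::int32_t bin, std::int32_t n) {
    return bin - (n & lane_ge(bin, n));
}

// Checks that both variants give the same result for bin in [-n, 2n].
bool variants_agree(std::int32_t n) {
    assert(n > 0);
    for (std::int32_t bin = -n; bin <= 2 * n; ++bin)
        if (wrap_select(bin, n) != wrap_mask(bin, n))
            return false;
    return true;
}
```

Both functions compute the same value; the question raised in the review is only which form maps to cheaper instructions.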

Contributor Author

@Yosshi999 Yosshi999 Jul 9, 2020

Tested on the AVX2 baseline:

| Performance test | `v_select` | `bin - (__n & (bin >= __n))` |
|---|---|---|
| detect/57 | 105.21 | 106.29 |
| detect/58 | 98.78 | 98.99 |
| detect/59 | 206.89 | 211.78 |
| extract/18 | 53.14 | 53.80 |
| extract/19 | 37.49 | 38.14 |
| extract/20 | 153.39 | 153.34 |
| detectAndExtract/18 | 138.48 | 140.28 |
| detectAndExtract/19 | 154.51 | 154.39 |
| detectAndExtract/20 | 360.15 | 358.27 |

I'm not sure which is better.

disasm of calcOrientationHist

bin - (__n & (bin >= __n))

   b0150:	c5 d5 66 c8          	vpcmpgtd ymm1,ymm5,ymm0  ; ymm1 = __n > ymm0    ; ---
   b0154:	c5 f5 df cc          	vpandn ymm1,ymm1,ymm4    ; ymm1 = (~ymm1 & __n)  ; ymm1 = (__n <= ymm0) & __n
   b0158:	c5 fd fa c1          	vpsubd ymm0,ymm0,ymm1    ; ymm0 -= ymm1
   b015c:	c5 f5 72 e0 1f       	vpsrad ymm1,ymm0,0x1f    ; ymm1 = ymm0 >> 0x1f (extract sign bit)
   b0161:	c5 dd db c9          	vpand  ymm1,ymm4,ymm1    ; ymm1 &= __n
   b0165:	c5 fd fe c1          	vpaddd ymm0,ymm0,ymm1    ; ymm0 += ymm1

v_select

   b0113:	c5 dd 76 e4          	vpcmpeqd ymm4,ymm4,ymm4
...
   b0150:	c5 e5 66 c8          	vpcmpgtd ymm1,ymm3,ymm0  ; ymm1 = __n > ymm0
   b0154:	c5 fd fe d5          	vpaddd ymm2,ymm0,ymm5      ; ymm2 = ymm0 + (- __n)
   b0158:	c5 dd ef c9          	vpxor  ymm1,ymm4,ymm1        ; ymm1 = ~ymm1 (XOR with all-ones ymm4)
   b015c:	c4 e3 7d 4c c2 10    	vpblendvb ymm0,ymm0,ymm2,ymm1   ; ymm0 = __n <= ymm0 ? ymm2 : ymm0
   b0162:	c5 e5 fe c8          	vpaddd ymm1,ymm3,ymm0     ; ymm1 = __n + ymm0
   b0166:	c5 ed 72 e0 1f       	vpsrad ymm2,ymm0,0x1f          ; ymm2 = (sign bit of ymm0)
   b016b:	c4 e3 7d 4c c1 20    	vpblendvb ymm0,ymm0,ymm1,ymm2  ; ymm0 = ymm2 < 0 ? ymm1 : ymm0

Contributor

It looks like there is no difference in performance. Let's keep the v_select version, since it is clearer.
