Use intrinsics for `cvRound` on x86_64 `__GNUC__` (clang/gcc linux) too. by legrosbuffle · Pull Request #24001 · opencv/opencv

legrosbuffle · 2023-07-17T14:36:40Z

There is no reason to enable this only on windows. We've measured a 7x improvement in speed for cvRound using the intrinsic.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

[ X] I agree to contribute to the project under Apache 2 License.
[ X] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
[ X] The PR is proposed to the proper branch
There is a reference to the original bug report and related work
[ X] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
[ X] The feature is well documented and sample code can be built with the project CMake

asmorkalov · 2023-07-18T12:45:20Z

@legrosbuffle Thanks a lot for the contribution. I made quick benchmark of imgproc module with and without the patch and outcome is the following:

opencv uses __builtin_lrintf as default solution. The assembly code has function call, but not compiler intrinsic. See https://godbolt.org/z/hvb46xx3h
I tried old Core i5-2500K and modern AMD Ryzen 7 2700X. Overall result is very close for two hosts. There is speedup in several cases for cvtColor conversions (very lightweight) and Hough Transform. Other cases are flat. The best example (AMD):

Geometric mean (ms)

                                                                      Name of Test                                                                        imgproc  imgproc  imgproc  
                                                                                                                                                            4.x     patch    patch   
                                                                                                                                                             1      round    round   
                                                                                                                                                                      1        1     
                                                                                                                                                                               vs    
                                                                                                                                                                            imgproc  
                                                                                                                                                                              4.x    
                                                                                                                                                                               1     
                                                                                                                                                                           (x-factor)
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)                                                                      15.444   4.715     3.28   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)                                                                       1.775    0.463     3.83   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)                                                                      15.557   4.697     3.31   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)                                                                       1.771    0.458     3.87   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)                                                                     16.753   4.433     3.78   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)                                                                      1.746    0.420     4.16   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)                                                                     16.694   4.394     3.80   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)                                                                      1.740    0.428     4.06   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)                                                                       144.136  132.952    1.08   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)                                                                         14.148   3.270     4.33   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)                                                                       141.822  32.338     4.39   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)                                                                         14.132   3.250     4.35   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)                                                                      136.346  30.763     4.43   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)                                                                        14.062   3.157     4.45   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)                                                                      136.306  30.152     4.52   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)                                                                        14.072   3.159     4.45   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)                                                                        17.115  16.516     1.04   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)                                                                         1.780    0.456     3.91   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)                                                                        16.586  16.874     0.98   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)                                                                         1.782    0.454     3.93   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)                                                                       16.688   4.429     3.77   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)                                                                        1.748    0.432     4.05   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)                                                                       16.667   3.610     4.62   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)                                                                        1.739    0.424     4.10   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)                                                                         144.735  130.719    1.11   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)                                                                           14.305   3.742     3.82   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)                                                                         139.298  32.419     4.30   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)                                                                           14.284   3.697     3.86   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)                                                                        140.510  121.088    1.16   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)                                                                          14.101   3.142     4.49   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)                                                                        139.432  30.010     4.65   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)                                                                          14.097   3.164     4.46

There are no speedup/regression in features2d and objdetect.

asmorkalov · 2023-07-18T12:46:29Z

@legrosbuffle could you describe your use case?

legrosbuffle · 2023-07-18T14:45:36Z

We're mostly interested in Resize and HoughLines. Here's a benchmark that somehow represents what we're measuring in profiles for Resize (using the public google/benchmark library ):

#include "testing/base/public/benchmark.h"
#include "third_party/OpenCV/core/fast_math.hpp"
#include "third_party/OpenCV/core/mat.hpp"
#include "third_party/OpenCV/imgproc.hpp"

namespace {

template <typename FloatT>
void BM_CvRound(benchmark::State& state) {
  FloatT input = 1.23456789;
  for (const auto s : state) {
    testing::DoNotOptimize(input);
    int rounded = cvRound(input);
    testing::DoNotOptimize(rounded);
  }
}
BENCHMARK(BM_CvRound<float>);
BENCHMARK(BM_CvRound<double>);

// `cv::resize` makes intensive use of `cvRound`.
void BM_CvResize(benchmark::State& state) {
  constexpr int kSrcWidth = 2048;
  constexpr int kSrcHeight = 1536;
  constexpr int kDstWidth = 512;
  constexpr int kDstHeight = 384;
  const int interpolation = state.range(0);
  cv::Mat src(kSrcWidth, kSrcHeight, CV_8UC1, cv::Scalar(0));
  cv::Mat dst(kDstWidth, kDstHeight, CV_8UC1, cv::Scalar(0));
  for (const auto s : state) {
    testing::DoNotOptimize(src);
    cv::resize(src, dst, dst.size(), 0, 0, interpolation);
    testing::DoNotOptimize(dst);
  }
}
BENCHMARK(BM_CvResize)->Arg(cv::INTER_LINEAR)->Arg(cv::INTER_AREA);

}  // namespace

Diff between base and patch on machines we care about:

BM_CvRound<float>     -85.15%         (p=0.000 n=8+10)
BM_CvRound<double>    -86.63%         (p=0.000 n=8+10)

BM_CvResize/1      -17.29%         (p=0.000 n=9+10)
BM_CvResize/3      -28.96%        (p=0.000 n=10+10)

vpisarev · 2023-07-19T07:04:57Z

@legrosbuffle, thanks for PR.

We just want to give compiler more chances to vectorize code. I believe, with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics. Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

May we get more detailed information about your test environment:

operating system, including version.
GCC version
hardware used
version of OpenCV used
are there any extra compiler flags used to build OpenCV?

legrosbuffle · 2023-07-19T09:36:01Z

We just want to give compiler more chances to vectorize code. I believe, with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics.

Do you have an example where the buitin version vectorizes ? Neither GCC nor clang seem to vectorize any version of cvRound without fast-math: https://godbolt.org/z/fjoqv5jxo

Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

When building with AVX2, the compiler will select the proper encoding (see godbolt link).

May we get more detailed information about your test environment:

operating system, including version.

Google's linux disrib. Results are similar on gLinux and prod.

GCC version

We're using clang, but results looks similar on GCC and clang (see my godbolt link above)

hardware used

Skylake

version of OpenCV used

HEAD

are there any extra compiler flags used to build OpenCV?

The only relevant one is -DCV_ENABLE_INTRINSICS.

…linux) too. We've measured a 7x improvement in speed for `cvRound` using the intrinsic.

asmorkalov · 2023-07-21T08:38:11Z

@legrosbuffle Thanks a lot for the patch. We discussed the solution on OpenCV core team meeting and decide to merge it. I removed x86_64 check to cover x86 32-bit configuration too. I'll merge it as soon as CI passed.

asmorkalov

👍

asmorkalov added optimization category: core labels Jul 18, 2023

asmorkalov requested review from opencv-alalek and vpisarev July 18, 2023 12:47

Use intrinsics for cvRound on x86 and x86_64 __GNUC__ (clang/gcc …

3cce299

…linux) too. We've measured a 7x improvement in speed for `cvRound` using the intrinsic.

asmorkalov force-pushed the legrosbuffle-cvround-intrinsic branch from 23118c9 to 3cce299 Compare July 21, 2023 08:35

asmorkalov approved these changes Jul 21, 2023

View reviewed changes

opencv-alalek approved these changes Jul 21, 2023

View reviewed changes

asmorkalov merged commit 12acf56 into opencv:4.x Jul 23, 2023

asmorkalov mentioned this pull request Jul 27, 2023

(5.x) Merge 4.x #24070

Merged

asmorkalov mentioned this pull request Aug 7, 2023

Rewrite Universal Intrinsic code by using new API: Core module. #23980

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use intrinsics for `cvRound` on x86_64 `GNUC` (clang/gcc linux) too.#24001

Use intrinsics for `cvRound` on x86_64 `GNUC` (clang/gcc linux) too.#24001
asmorkalov merged 1 commit intoopencv:4.xfrom
legrosbuffle:legrosbuffle-cvround-intrinsic

legrosbuffle commented Jul 17, 2023

Uh oh!

asmorkalov commented Jul 18, 2023

Uh oh!

asmorkalov commented Jul 18, 2023

Uh oh!

legrosbuffle commented Jul 18, 2023

Uh oh!

vpisarev commented Jul 19, 2023

Uh oh!

legrosbuffle commented Jul 19, 2023

Uh oh!

asmorkalov commented Jul 21, 2023

Uh oh!

asmorkalov left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

legrosbuffle commented Jul 17, 2023

Pull Request Readiness Checklist

Uh oh!

asmorkalov commented Jul 18, 2023

Uh oh!

asmorkalov commented Jul 18, 2023

Uh oh!

legrosbuffle commented Jul 18, 2023

Uh oh!

vpisarev commented Jul 19, 2023

Uh oh!

legrosbuffle commented Jul 19, 2023

Uh oh!

asmorkalov commented Jul 21, 2023

Uh oh!

asmorkalov left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants