Skip to content

Use intrinsics for cvRound on x86_64 __GNUC__ (clang/gcc linux) too.#24001

Merged
asmorkalov merged 1 commit intoopencv:4.xfrom
legrosbuffle:legrosbuffle-cvround-intrinsic
Jul 23, 2023
Merged

Use intrinsics for cvRound on x86_64 __GNUC__ (clang/gcc linux) too.#24001
asmorkalov merged 1 commit intoopencv:4.xfrom
legrosbuffle:legrosbuffle-cvround-intrinsic

Conversation

@legrosbuffle
Copy link
Copy Markdown
Contributor

There is no reason to enable this only on windows. We've measured a 7x improvement in speed for cvRound using the intrinsic.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • [ X] I agree to contribute to the project under Apache 2 License.
  • [ X] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • [ X] The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • [ X] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • [ X] The feature is well documented and sample code can be built with the project CMake

@asmorkalov
Copy link
Copy Markdown
Contributor

@legrosbuffle Thanks a lot for the contribution. I made quick benchmark of imgproc module with and without the patch and outcome is the following:

  • opencv uses __builtin_lrintf as default solution. The assembly code has function call, but not compiler intrinsic. See https://godbolt.org/z/hvb46xx3h
  • I tried old Core i5-2500K and modern AMD Ryzen 7 2700X. Overall result is very close for two hosts. There is speedup in several cases for cvtColor conversions (very lightweight) and Hough Transform. Other cases are flat. The best example (AMD):
Geometric mean (ms)

                                                                      Name of Test                                                                        imgproc  imgproc  imgproc  
                                                                                                                                                            4.x     patch    patch   
                                                                                                                                                             1      round    round   
                                                                                                                                                                      1        1     
                                                                                                                                                                               vs    
                                                                                                                                                                            imgproc  
                                                                                                                                                                              4.x    
                                                                                                                                                                               1     
                                                                                                                                                                           (x-factor)
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)                                                                      15.444   4.715     3.28   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)                                                                       1.775    0.463     3.83   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)                                                                      15.557   4.697     3.31   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)                                                                       1.771    0.458     3.87   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)                                                                     16.753   4.433     3.78   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)                                                                      1.746    0.420     4.16   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)                                                                     16.694   4.394     3.80   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)                                                                      1.740    0.428     4.06   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)                                                                       144.136  132.952    1.08   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)                                                                         14.148   3.270     4.33   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)                                                                       141.822  32.338     4.39   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)                                                                         14.132   3.250     4.35   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)                                                                      136.346  30.763     4.43   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)                                                                        14.062   3.157     4.45   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)                                                                      136.306  30.152     4.52   
HoughLines3f::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)                                                                        14.072   3.159     4.45   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 0.5)                                                                        17.115  16.516     1.04   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 0.5)                                                                         1.780    0.456     3.91   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.01, 1.1)                                                                        16.586  16.874     0.98   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 1, 0.1, 1.1)                                                                         1.782    0.454     3.93   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 0.5)                                                                       16.688   4.429     3.77   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 0.5)                                                                        1.748    0.432     4.05   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.01, 1.1)                                                                       16.667   3.610     4.62   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("cv/shared/pic5.png", 10, 0.1, 1.1)                                                                        1.739    0.424     4.10   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 0.5)                                                                         144.735  130.719    1.11   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 0.5)                                                                           14.305   3.742     3.82   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.01, 1.1)                                                                         139.298  32.419     4.30   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 1, 0.1, 1.1)                                                                           14.284   3.697     3.86   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 0.5)                                                                        140.510  121.088    1.16   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 0.5)                                                                          14.101   3.142     4.49   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.01, 1.1)                                                                        139.432  30.010     4.65   
HoughLines::Image_RhoStep_ThetaStep_Threshold::("stitching/a1.png", 10, 0.1, 1.1)                                                                          14.097   3.164     4.46 
  • There are no speedup/regression in features2d and objdetect.

@asmorkalov
Copy link
Copy Markdown
Contributor

@legrosbuffle could you describe your use case?

@legrosbuffle
Copy link
Copy Markdown
Contributor Author

We're mostly interested in Resize and HoughLines. Here's a benchmark that somehow represents what we're measuring in profiles for Resize (using the public google/benchmark library ):

#include "testing/base/public/benchmark.h"
#include "third_party/OpenCV/core/fast_math.hpp"
#include "third_party/OpenCV/core/mat.hpp"
#include "third_party/OpenCV/imgproc.hpp"

namespace {

template <typename FloatT>
void BM_CvRound(benchmark::State& state) {
  FloatT input = 1.23456789;
  for (const auto s : state) {
    testing::DoNotOptimize(input);
    int rounded = cvRound(input);
    testing::DoNotOptimize(rounded);
  }
}
BENCHMARK(BM_CvRound<float>);
BENCHMARK(BM_CvRound<double>);

// `cv::resize` makes intensive use of `cvRound`.
void BM_CvResize(benchmark::State& state) {
  constexpr int kSrcWidth = 2048;
  constexpr int kSrcHeight = 1536;
  constexpr int kDstWidth = 512;
  constexpr int kDstHeight = 384;
  const int interpolation = state.range(0);
  cv::Mat src(kSrcWidth, kSrcHeight, CV_8UC1, cv::Scalar(0));
  cv::Mat dst(kDstWidth, kDstHeight, CV_8UC1, cv::Scalar(0));
  for (const auto s : state) {
    testing::DoNotOptimize(src);
    cv::resize(src, dst, dst.size(), 0, 0, interpolation);
    testing::DoNotOptimize(dst);
  }
}
BENCHMARK(BM_CvResize)->Arg(cv::INTER_LINEAR)->Arg(cv::INTER_AREA);

}  // namespace

Diff between base and patch on machines we care about:

BM_CvRound<float>     -85.15%         (p=0.000 n=8+10)
BM_CvRound<double>    -86.63%         (p=0.000 n=8+10)
BM_CvResize/1      -17.29%         (p=0.000 n=9+10)
BM_CvResize/3      -28.96%        (p=0.000 n=10+10)

@vpisarev
Copy link
Copy Markdown
Contributor

@legrosbuffle, thanks for PR.

We just want to give compiler more chances to vectorize code. I believe, with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics. Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

May we get more detailed information about your test environment:

  1. operating system, including version.
  2. GCC version
  3. hardware used
  4. version of OpenCV used
  5. are there any extra compiler flags used to build OpenCV?

@legrosbuffle
Copy link
Copy Markdown
Contributor Author

We just want to give compiler more chances to vectorize code. I believe, with higher-level intrinsics like __builtin_lrintf() the chances are higher than with manually embedded scalar SSE2 intrinsics.

Do you have an example where the buitin version vectorizes ? Neither GCC nor clang seem to vectorize any version of cvRound without fast-math: https://godbolt.org/z/fjoqv5jxo

Another reason is that the code may be compiled with AVX2 or AVX512 and then those SSE2 intrinsics would look weird.

When building with AVX2, the compiler will select the proper encoding (see godbolt link).

May we get more detailed information about your test environment:

  1. operating system, including version.

Google's linux disrib. Results are similar on gLinux and prod.

  1. GCC version

We're using clang, but results looks similar on GCC and clang (see my godbolt link above)

  1. hardware used

Skylake

  1. version of OpenCV used

HEAD

  1. are there any extra compiler flags used to build OpenCV?

The only relevant one is -DCV_ENABLE_INTRINSICS.

…linux) too.

We've measured a 7x improvement in speed for `cvRound` using the intrinsic.
@asmorkalov asmorkalov force-pushed the legrosbuffle-cvround-intrinsic branch from 23118c9 to 3cce299 Compare July 21, 2023 08:35
@asmorkalov
Copy link
Copy Markdown
Contributor

@legrosbuffle Thanks a lot for the patch. We discussed the solution on OpenCV core team meeting and decide to merge it. I removed x86_64 check to cover x86 32-bit configuration too. I'll merge it as soon as CI passed.

Copy link
Copy Markdown
Contributor

@asmorkalov asmorkalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants