fast_math.hpp performance improvements #15122

Merged
opencv-pushbot merged 3 commits into opencv:3.4 from pmur:fast-math-improvements
Aug 14, 2019

pmur commented Jul 22, 2019

Optimize fast_math.hpp primitives for P8+ and GCC-based systems.

Leverage compiler builtins to implement most rounding primitives. This should allow the compiler to choose more efficient instructions for the target architecture. PPC has dedicated rounding instructions. Hopefully this carries over to other architectures too.

Notably, __builtin_lrint{,f} functions just call out to libm. Instead, include an inline solution similar to ARM.
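
As a rough illustration of the idea described above (not the code from this PR): a floor-to-int helper can be written with integer manipulation, or the operation can be handed to a compiler builtin so that the compiler is free to pick a dedicated rounding instruction where the target has one. The names `demo_floor_int` and `demo_floor_builtin` are made up for this sketch.

```cpp
#include <cmath>

// Hypothetical helper: floor via truncation plus a correction step.
static inline int demo_floor_int(double value)
{
    int i = (int)value;       // truncates toward zero
    return i - (i > value);   // step down when truncation rounded up (negative non-integers)
}

// Hypothetical helper: let the compiler choose the instruction sequence.
static inline int demo_floor_builtin(double value)
{
#if defined(__GNUC__)
    return (int)__builtin_floor(value);   // builtin available in GCC-compatible compilers
#else
    return (int)std::floor(value);        // portable fallback
#endif
}
```

Both helpers should agree for values that fit in an `int`; whether the builtin version is actually faster depends on the target and compiler flags.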

This reduces the testing time by 200-300 seconds on POWER9.

allow_multiple_commits=1

alalek commented Jul 22, 2019

  1. Optimization patches should go into the 3.4 branch first. We merge changes from 3.4 into master regularly (weekly/bi-weekly).
  2. Clang mimics GCC (it also defines __GNUC__).
    Please try to use this condition (with an extra macro to bypass this code path):
#if defined(__GNUC__) && !defined(__clang__) \
    && !defined(OPENCV_SKIP_FASTMATH_GCC_BUILTINS)
  3. It would be nice to have simple perf tests for these functions.
  4. You can check PowerPC build results here: https://ocv-power.imavr.com/#/opencv_pullrequests
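
A minimal sketch of how such a guard could be wired up, assuming a value-style helper macro (`DEMO_USE_GCC_BUILTINS`) and a helper function (`demo_ceil`) that are invented for illustration; only `OPENCV_SKIP_FASTMATH_GCC_BUILTINS` comes from the comment above.

```cpp
// Evaluate the compiler check once and store the result as 0 or 1,
// so later code can test the value with a plain #if.
#if defined(__GNUC__) && !defined(__clang__) \
    && !defined(OPENCV_SKIP_FASTMATH_GCC_BUILTINS)
#  define DEMO_USE_GCC_BUILTINS 1
#else
#  define DEMO_USE_GCC_BUILTINS 0
#endif

// Hypothetical ceil-to-int helper with a builtin path and an integer fallback.
static inline int demo_ceil(float value)
{
#if DEMO_USE_GCC_BUILTINS
    return (int)__builtin_ceilf(value);   // GCC-only path
#else
    int i = (int)value;                   // truncates toward zero
    return i + (i < value);               // step up for positive non-integers
#endif
}
```

Defining `OPENCV_SKIP_FASTMATH_GCC_BUILTINS` before this code is seen would force the fallback branch, which is the bypass the comment asks for.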

@pmur pmur force-pushed the fast-math-improvements branch from d5417f8 to d6f8a47 Compare July 25, 2019 19:44
@pmur pmur changed the base branch from master to 3.4 July 25, 2019 19:45
@pmur pmur force-pushed the fast-math-improvements branch 2 times, most recently from cf75cf7 to d1769c4 Compare July 29, 2019 20:30
without the -fno-math-errno option. */
#ifdef OPENCV_USE_FASTMATH_GCC_BUILTINS
# define _OPENCV_FASTMATH_ENABLE_GCC_MATH_BUILTINS ((defined __GNUC__ && !defined __clang__) \
&& defined OPENCV_USE_FASTMATH_GCC_BUILTINS)

Does it really work as expected?

#if defined _OPENCV_FASTMATH_ENABLE_GCC_MATH_BUILTINS
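
One way to read the question above, sketched with made-up names: `#if defined X` only checks whether `X` exists, not what it expands to, so a macro that is defined with a false value still enables the branch. Separately, the C and C++ standards leave the behavior undefined when macro expansion inside `#if` produces the `defined` operator, so testing the value of a macro whose body contains `defined` is not portable either.

```cpp
// Hypothetical macro: defined, but with a false value.
#define DEMO_FEATURE 0

// Tests for existence only: this branch is taken even though the value is 0.
#if defined DEMO_FEATURE
static const bool demo_taken_by_defined = true;
#else
static const bool demo_taken_by_defined = false;
#endif

// Tests the value: this branch is skipped because the macro expands to 0.
#if DEMO_FEATURE
static const bool demo_taken_by_value = true;
#else
static const bool demo_taken_by_value = false;
#endif
```

A common way around both problems is to evaluate the `defined(...)` checks directly in an `#if` and define the helper macro to a plain 0 or 1.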

// 3. version for float
#define ARM_ROUND_FLT(value) ARM_ROUND(value, "vcvtr.s32.f32 %[temp], %[value]\n vmov %[res], %[temp]")
#define CV_INLINE_ROUND_FLT(value) ARM_ROUND(value, "vcvtr.s32.f32 %[temp], %[value]\n vmov %[res], %[temp]")
#elif defined __PPC64__ && defined __GNUC__ && defined _ARCH_PWR8

@pmur pmur force-pushed the fast-math-improvements branch from d1769c4 to 17584ef Compare August 1, 2019 19:43

template <typename T>
static void CvRoundMat(const cv::Mat & src, cv::Mat & dst)
static void CvRoundMat(const cv::Mat & src, cv::Mat & dst, int (*round)(T))

The performance tests themselves are degraded by the use of a function pointer.
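
A sketch of the concern, using `std::vector` instead of `cv::Mat` and invented names throughout: when the rounding function arrives as a function pointer, each element may go through an indirect call unless the optimizer can see the target, whereas passing it as a template parameter makes the target part of the type so it can be inlined into the loop. This is one possible alternative, not necessarily what the PR ended up doing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical rounding primitive shared by both variants below.
struct DemoFloor
{
    int operator()(float v) const
    {
        int i = (int)v;
        return i - (i > v);
    }
};
static int demo_floor(float v) { return DemoFloor()(v); }

// Variant 1: callee passed as a function pointer.
template <typename T>
static void demo_round_ptr(const std::vector<T>& src, std::vector<int>& dst, int (*round)(T))
{
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = round(src[i]);   // possibly an indirect call per element
}

// Variant 2: callee type is a template parameter, known at compile time.
template <typename T, typename RoundFn>
static void demo_round_fn(const std::vector<T>& src, std::vector<int>& dst, RoundFn round)
{
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = round(src[i]);   // direct call that the compiler can inline
}
```

Both variants produce the same output; the difference that matters for a perf test is only how much call overhead ends up in the measured loop.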

pmur added 3 commits August 7, 2019 14:59
Add a basic sanity test to verify the rounding functions
work as expected.

Likewise, extend the rounding performance test to cover the
additional float -> int fast math functions.

Add a new macro definition OPENCV_USE_FASTMATH_GCC_BUILTINS to enable
usage of GCC inline math functions, if available and requested by the
user.

Likewise, enable it for POWER. This is nearly always a substantial
improvement over using integer manipulation as most operations can
be done in several instructions with no branching. The result is a
1.5-1.8x speedup in the ceil/floor operations.

1. As tested with AT 12.0-1 (GCC 8.3.1) compiler on P9 LE.

Implement cvRound using inline asm. No compiler support
exists today to properly optimize this. This results in
about a 4x speedup over the default rounding. Likewise,
simplify the growing number of rounding function overloads.

For P9-enabled targets, utilize the classification
testing instruction to test for Inf/NaN values. Operation
speedup is about 1.2x for FP32, and 1.5x for FP64 operands.

For P8 targets, fall back to the GCC nan inline. It provides
a 1.1/1.4x improvement for FP32/FP64 arguments.
@pmur pmur force-pushed the fast-math-improvements branch from 17584ef to f38a61c Compare August 7, 2019 20:03
@alalek alalek left a comment

Looks good to me! Thank you 👍

@opencv-pushbot opencv-pushbot merged commit f38a61c into opencv:3.4 Aug 14, 2019
@alalek alalek mentioned this pull request Aug 16, 2019