fast_math.hpp performance improvements #15122
Merged
opencv-pushbot merged 3 commits into opencv:3.4 on Aug 14, 2019
alalek
reviewed
Aug 1, 2019
without the -fno-math-errno option. */
#ifdef OPENCV_USE_FASTMATH_GCC_BUILTINS
# define _OPENCV_FASTMATH_ENABLE_GCC_MATH_BUILTINS ((defined __GNUC__ && !defined __clang__) \
    && defined OPENCV_USE_FASTMATH_GCC_BUILTINS)
Does it really work as expected?
#if defined _OPENCV_FASTMATH_ENABLE_GCC_MATH_BUILTINS
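The concern behind this comment can be illustrated with a small sketch. A macro whose expansion contains `defined` has undefined behavior when it is evaluated in `#if`, and `#if defined X` is true whenever `X` is defined, whatever it expands to. One common alternative, shown here with hypothetical `DEMO_*` names (not the ones used in fast_math.hpp), is to evaluate the condition once and record the result as 0 or 1:

```cpp
// Hypothetical macro names; these are assumptions for illustration only.
// Evaluate the condition once, in a real #if, and record the result as 0 or 1.
#if defined(__GNUC__) && !defined(__clang__) && defined(DEMO_USE_GCC_BUILTINS)
#  define DEMO_ENABLE_GCC_BUILTINS 1
#else
#  define DEMO_ENABLE_GCC_BUILTINS 0
#endif

// Test the value with #if, not with #if defined: the macro is always defined.
inline int demo_floor(double value)
{
#if DEMO_ENABLE_GCC_BUILTINS
    return (int)__builtin_floor(value);
#else
    int i = (int)value;       // truncates toward zero
    return i - (i > value);   // step down by one for negative non-integers
#endif
}
```

Either branch returns the same result, so the guard only selects how the value is computed.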
// 3. version for float
#define ARM_ROUND_FLT(value) ARM_ROUND(value, "vcvtr.s32.f32 %[temp], %[value]\n vmov %[res], %[temp]")
#define CV_INLINE_ROUND_FLT(value) ARM_ROUND(value, "vcvtr.s32.f32 %[temp], %[value]\n vmov %[res], %[temp]")
#elif defined __PPC64__ && defined __GNUC__ && defined _ARCH_PWR8
We probably need to apply a __CUDACC__ guard here too:
https://www.ibm.com/developerworks/community/blogs/fe313521-2e95-46f2-817d-44a4f27eba32/entry/how_to_develop_nvidia_cuda_applications_on_ibm_power8?lang=en
alalek
reviewed
Aug 7, 2019
modules/core/perf/perf_cvround.cpp
Outdated
template <typename T>
static void CvRoundMat(const cv::Mat & src, cv::Mat & dst)
static void CvRoundMat(const cv::Mat & src, cv::Mat & dst, int (*round)(T))
The performance tests themselves are degraded by the use of a function pointer.
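One way to avoid the indirect call, sketched here under assumptions rather than taken from the PR, is to pass the rounding operation as a template parameter so that each instantiation names its target statically. `DemoFloor` and `demoRoundAll` are hypothetical names, and `std::vector` stands in for `cv::Mat` to keep the sketch self-contained:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a rounding primitive such as cvFloor.
struct DemoFloor
{
    int operator()(float v) const
    {
        int i = (int)v;       // truncates toward zero
        return i - (i > v);   // step down by one for negative non-integers
    }
};

// The callable's type is a template parameter, so the call target is known
// at compile time and the compiler is free to inline it into the loop.
template <typename T, typename RoundFn>
static void demoRoundAll(const std::vector<T>& src, std::vector<int>& dst, RoundFn roundFn)
{
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = roundFn(src[i]);
}
```

With a function pointer parameter the loop measures the call overhead as well as the rounding itself, which is what the comment objects to.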
Add a basic sanity test to verify the rounding functions work as expected. Likewise, extend the rounding performance test to cover the additional float -> int fast math functions.
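A sanity check of this kind might look like the following sketch. It uses `std::lrint` as a hypothetical stand-in for the function under test and assumes the default round-to-nearest floating-point mode, in which ties go to the even integer; it is not the test added by the commit:

```cpp
#include <cmath>

// Hypothetical stand-in for the rounding function under test.
inline int demo_round(double v) { return (int)std::lrint(v); }

// Returns true when a few representative inputs round as expected,
// assuming the default round-to-nearest (ties-to-even) mode.
inline bool demo_round_sanity()
{
    return demo_round(2.4) == 2
        && demo_round(2.6) == 3
        && demo_round(-2.6) == -3
        && demo_round(2.5) == 2     // tie goes to the even neighbour
        && demo_round(3.5) == 4
        && demo_round(-2.5) == -2;
}
```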
Add a new macro definition OPENCV_USE_FASTMATH_GCC_BUILTINS to enable usage of GCC inline math functions, if available and requested by the user. Likewise, enable it for POWER. This is nearly always a substantial improvement over using integer manipulation as most operations can be done in several instructions with no branching. The result is a 1.5-1.8x speedup in the ceil/floor operations (as tested with the AT 12.0-1 (GCC 8.3.1) compiler on P9 LE).
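The two approaches being compared can be sketched as follows for the ceiling operation. The function names are hypothetical, and the builtin path is guarded only by `__GNUC__` for brevity, which is an assumption and not the guard used in the PR:

```cpp
// Integer-manipulation version: truncate, then correct with a comparison.
inline int demo_ceil_int(double v)
{
    int i = (int)v;       // truncates toward zero
    return i + (i < v);   // step up by one for positive non-integers
}

// Builtin version on GCC-compatible compilers; same fallback elsewhere.
inline int demo_ceil_builtin(double v)
{
#if defined(__GNUC__)
    return (int)__builtin_ceil(v);
#else
    return demo_ceil_int(v);
#endif
}
```

Both return the same value for inputs that fit in an `int`; the difference is in the instructions the compiler can emit for each.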
Implement cvRound using inline asm. No compiler support exists today to properly optimize this. This results in about a 4x speedup over the default rounding. Likewise, simplify the growing number of rounding function overloads.

For P9 enabled targets, utilize the classification testing instruction to test for Inf/NaN values. Operation speedup is about 1.2x for FP32, and 1.5x for FP64 operands. For P8 targets, fall back to the GCC nan inline. It provides a 1.1x/1.4x improvement for FP32/FP64 arguments.
Optimize fast_math.hpp primitives for P8+ and GCC based systems.
Leverage compiler builtins to implement most rounding primitives. This should allow the compiler to choose more efficient instructions for the target architecture. PPC has dedicated rounding instructions. Hopefully this carries over to other architectures too.
Notably, __builtin_lrint{,f} functions just call out to libm. Instead, include an inline solution similar to ARM.
This reduces the testing time by 200-300 seconds on POWER9.