core: update convertFp16 using CV_CPU_CALL_FP16#8838
core: update convertFp16 using CV_CPU_CALL_FP16#8838opencv-pushbot merged 1 commit intoopencv:masterfrom
Conversation
alalek
left a comment
There was a problem hiding this comment.
Looks great! Thank you!
Please see my comments below.
| @@ -0,0 +1,178 @@ | |||
| /*M/////////////////////////////////////////////////////////////////////////////////////// | |||
There was a problem hiding this comment.
Please rename this file to "convert.fp16.cpp"
You can check effect via CMake output (FP16 (0 files)):
-- Dispatched code generation: SSE4_1 FP16 AVX AVX2
-- requested: SSE4_1 AVX FP16 AVX2
-- SSE4_1 (0 files): + SSE4_1
-- FP16 (0 files): + SSE4_1 POPCNT SSE4_2 FP16 AVX
-- AVX (1 files): + SSE4_1 POPCNT SSE4_2 AVX
-- AVX2 (1 files): + SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
There was a problem hiding this comment.
Right!
I'll do this
| void cvtScaleHalf_SIMD32f16f( const float* src, size_t sstep, short* dst, size_t dstep, cv::Size size ) | ||
| { | ||
| #if !defined CV_CPU_BASELINE_COMPILE_AVX | ||
| _mm256_zeroupper(); |
There was a problem hiding this comment.
We should call this at the end of AVX function. Calling this in the beginning of function is optional (and has less effect).
https://stackoverflow.com/a/28356924
BTW, I wrapped this into CV_INSTRUMENT_REGION() (via this macro)
There was a problem hiding this comment.
That's a great macro.
I'll follow this comment
|
|
||
| for ( ; x < size.width; x++ ) | ||
| { | ||
| dst[x] = convertFp16SW(src[x]); |
There was a problem hiding this comment.
Calling of "non optimized code" is not recommended.
It is better to put implementations of these functions into header file under "anonymous namespace" (to avoid calling of SSE instructions without vzeroupper).
| void cvtScaleHalf_SIMD32f16f( const float* src, size_t sstep, short* dst, size_t dstep, cv::Size size ) | ||
| { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Probably we don't need these stubs.
There was a problem hiding this comment.
Right. I'll remove these stubs.
|
@alalek , thank you very much for your very kind and nice comment. About 3, I'm still lost of how to go. First of all, I implemented the optimized code using or Since both define have On the other hand, this is vectorized optimization, which means that we always have to consider that the line is not a multiple of the vector width. In another word, non-optimized code for few elements. From these two points, I came to a conclusion that in this case, I've no choice but to implement as following.
So, from my point of view, I'm lost about where to put the SW non-optimized version. I basically agree about your suggestion, but I'm not sure how to follow it. |
| namespace cv | ||
| { | ||
| float convertFp16SW(short fp16); | ||
| short convertFp16SW(float fp32); |
There was a problem hiding this comment.
I mean, try to move implementations of these functions into header file under anonymous namespace.
So compilation unit with SSE* generated code will have one version of these functions, and AVX*(FP16) compilation unit will have own version too.
Anonymous namespace is needed to eliminate linker error about duplicate identifiers or unwanted code "replacement".
namespace {
#if !CV_FP16_TYPE
// const numbers for floating points format
const unsigned int kShiftSignificand = 13;
const unsigned int kMaskFp16Significand = 0x3ff;
const unsigned int kBiasFp16Exponent = 15;
const unsigned int kBiasFp32Exponent = 127;
#endif
#if CV_FP16_TYPE
float convertFp16SW(short fp16)
{
// Fp16 -> Fp32
Cv16suf a;
a.i = fp16;
return (float)a.h;
}
#else
float convertFp16SW(short fp16)
{
...
}
#endif
#if CV_FP16_TYPE
short convertFp16SW(float fp32)
{
// Fp32 -> Fp16
Cv16suf a;
a.h = (__fp16)fp32;
return a.i;
}
#else
short convertFp16SW(float fp32)
{
...
}
#endif
} // namespace
Loops for tails should be left "as is" in optimized version too.
|
@alalek , I appreciate your comment. Meanwhile, I'll highly appreciate if you could give an feedback on this implementation. |
|
@tomoaki0705 Looks good to me! Thank you! Looks like build failures are related to changes from BTW, |
|
@alalek, thank you very much for your feedback. Thank you very much, again. |
modules/core/src/convert.fp16.cpp
Outdated
| { | ||
| namespace opt_FP16 | ||
| { | ||
| #if defined(__x86_64__) || defined(_M_X64) |
There was a problem hiding this comment.
This "#if" makes file this file empty for x86 32-bit builds.
Please use this pattern:
#if !defined(CV_NEON) || !CV_NEON
... x86 code ...
#elif CV_NEON
... ARM code ...
#else
#error "Unsupported build configuration"
#endif
There was a problem hiding this comment.
Thank you! sorry for such a stupid mistake.
* avoid link error (move the implementation of software version to header) * make getConvertFuncFp16 local (move from precomp.hpp to convert.hpp) * fix error on 32bit x86
|
Well done! Thank you for contribution! |
This pullrequest changes
FP16