Skip to content

core: update convertFp16 using CV_CPU_CALL_FP16#8838

Merged
opencv-pushbot merged 1 commit intoopencv:masterfrom
tomoaki0705:dispatchFp16
Jun 6, 2017
Merged

core: update convertFp16 using CV_CPU_CALL_FP16#8838
opencv-pushbot merged 1 commit intoopencv:masterfrom
tomoaki0705:dispatchFp16

Conversation

@tomoaki0705
Copy link
Copy Markdown
Contributor

This pullrequest changes

  • follow the new CPU optimization mode for FP16

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thank you!
Please see my comments below.

@@ -0,0 +1,178 @@
/*M///////////////////////////////////////////////////////////////////////////////////////
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename this file to "convert.fp16.cpp"

You can check effect via CMake output (FP16 (0 files)):

--     Dispatched code generation:  SSE4_1 FP16 AVX AVX2
--       requested:                 SSE4_1 AVX FP16 AVX2
--       SSE4_1 (0 files):          + SSE4_1
--       FP16 (0 files):            + SSE4_1 POPCNT SSE4_2 FP16 AVX
--       AVX (1 files):             + SSE4_1 POPCNT SSE4_2 AVX
--       AVX2 (1 files):            + SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right!
I'll do this

void cvtScaleHalf_SIMD32f16f( const float* src, size_t sstep, short* dst, size_t dstep, cv::Size size )
{
#if !defined CV_CPU_BASELINE_COMPILE_AVX
_mm256_zeroupper();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should call this at the end of AVX function. Calling this in the beginning of function is optional (and has less effect).
https://stackoverflow.com/a/28356924

BTW, I wrapped this into CV_INSTRUMENT_REGION() (via this macro)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great macro.
I'll follow this comment


for ( ; x < size.width; x++ )
{
dst[x] = convertFp16SW(src[x]);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling of "non optimized code" is not recommended.
It is better to put implementations of these functions into header file under "anonymous namespace" (to avoid calling of SSE instructions without vzeroupper).

void cvtScaleHalf_SIMD32f16f( const float* src, size_t sstep, short* dst, size_t dstep, cv::Size size )
{
return;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we don't need these stubs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I'll remove these stubs.

@tomoaki0705
Copy link
Copy Markdown
Contributor Author

@alalek , thank you very much for your very kind and nice comment.
I pushed a commit which applies your suggestion 1, 2 and 4.

About 3, I'm still lost of how to go.

First of all, I implemented the optimized code using CV_CPU_CALL_FP16 which is either

#  define CV_CPU_CALL_FP16(fn, args) return (opt_FP16::fn args)

or

#  define CV_CPU_CALL_FP16(fn, args) if (CV_CPU_HAS_SUPPORT_FP16) return (opt_FP16::fn args)

Since both define have return inside, if the optimized function is called, the code following the function call will NOT be executed.

On the other hand, this is vectorized optimization, which means that we always have to consider that the line is not a multiple of the vector width. In another word, non-optimized code for few elements.

From these two points, I came to a conclusion that in this case, I've no choice but to implement as following.

  • in convert.cpp
cvtScaleHalf_<float, short>( const float* src, size_t sstep, short* dst, size_t dstep, Size size )
{
   // optimized version
    CV_CPU_CALL_FP16(cvtScaleHalf_SIMD32f16f, (src, sstep, dst, dstep, size));
    // call to the non-optimized version (software version)
    dst[x] = convertFp16SW(src[x]);
}
  • in convert.hpp
// declaration of SW version
short convertFp16SW(float f);
float convertFp16SW(short s);

// optimized version
void cvtScaleHalf_SIMD32f16f(src, sstep, dst, dstep, size);
void cvtScaleHalf_SIMD16f32f(src, sstep, dst, dstep, size);
  • in convert.fp16.cpp
        for ( ; x <= size.width - cVectorWidth ; x += cVectorWidth )
        {
            __m128i v_src = _mm_loadu_si128((__m128i*)(src + x));

            __m256 v_dst = _mm256_cvtph_ps(v_src);

            _mm256_storeu_ps(dst + x, v_dst);
        }

        for ( ; x < size.width; x++ )
        {
            dst[x] = convertFp16SW(src[x]);
        }

So, from my point of view, I'm lost about where to put the SW non-optimized version.
Could you give me bit more suggestion about what to write in which file, please ?

I basically agree about your suggestion, but I'm not sure how to follow it.
Best,

namespace cv
{
float convertFp16SW(short fp16);
short convertFp16SW(float fp32);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, try to move implementations of these functions into header file under anonymous namespace.
So compilation unit with SSE* generated code will have one version of these functions, and AVX*(FP16) compilation unit will have own version too.
Anonymous namespace is needed to eliminate linker error about duplicate identifiers or unwanted code "replacement".

namespace {
#if !CV_FP16_TYPE
// const numbers for floating points format
const unsigned int kShiftSignificand    = 13;
const unsigned int kMaskFp16Significand = 0x3ff;
const unsigned int kBiasFp16Exponent    = 15;
const unsigned int kBiasFp32Exponent    = 127;
#endif
  
#if CV_FP16_TYPE
float convertFp16SW(short fp16)
{
    // Fp16 -> Fp32
    Cv16suf a;
    a.i = fp16;
    return (float)a.h;
}
#else
float convertFp16SW(short fp16)
{
    ...
}
#endif

#if CV_FP16_TYPE
short convertFp16SW(float fp32)
{
    // Fp32 -> Fp16
    Cv16suf a;
    a.h = (__fp16)fp32;
    return a.i;
}
#else
short convertFp16SW(float fp32)
{
    ...
}
#endif

} // namespace

Loops for tails should be left "as is" in optimized version too.

@tomoaki0705
Copy link
Copy Markdown
Contributor Author

@alalek , I appreciate your comment.
Based on your feedback, I moved the implementations to the header file.
I'm getting many errors on the build test, so I'm checking why it's happening.

Meanwhile, I'll highly appreciate if you could give an feedback on this implementation.
Are we on the same page ?

@alalek
Copy link
Copy Markdown
Member

alalek commented Jun 5, 2017

@tomoaki0705 Looks good to me! Thank you!

Looks like build failures are related to changes from out.cpp file.

BTW, getConvertFuncFp16() can be local "static" (can be removed from precomp.hpp).

@tomoaki0705
Copy link
Copy Markdown
Contributor Author

@alalek, thank you very much for your feedback.
Indeed, I forgot to revert the local modification about out.cpp
I also changed getConvertFuncFp16() as local static.

Thank you very much, again.

{
namespace opt_FP16
{
#if defined(__x86_64__) || defined(_M_X64)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "#if" makes file this file empty for x86 32-bit builds.
Please use this pattern:

#if !defined(CV_NEON) || !CV_NEON
... x86 code ...
#elif CV_NEON
... ARM code ...
#else
#error "Unsupported build configuration"
#endif

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! sorry for such a stupid mistake.

 * avoid link error (move the implementation of software version to header)
 * make getConvertFuncFp16 local (move from precomp.hpp to convert.hpp)
 * fix error on 32bit x86
@alalek
Copy link
Copy Markdown
Member

alalek commented Jun 6, 2017

Well done! Thank you for contribution!
👍

@opencv-pushbot opencv-pushbot merged commit e269ef9 into opencv:master Jun 6, 2017
@tomoaki0705 tomoaki0705 deleted the dispatchFp16 branch June 6, 2017 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants