Added MSA implementations for mips platforms.#15422
Conversation
|
I have added MSA implementations for MIPS platforms. The changes include:
Performance and functionality testsuites have been run locally for verification. According to the opencv_perf tests results, the MSA optimized version is 39% faster than the No-MSA version. |
alalek
left a comment
There was a problem hiding this comment.
Great job 👍
General notes:
- please split PR: SIMD backend + build scripts should be reviewed and merged first. Other parts please move into separate PRs. Please check related comment below about RAW MSA code.
- as an "optimization" activity please re-target this PR onto 3.4 branch
3rdparty/libpng/CMakeLists.txt
Outdated
| set(PNG_MIPS_MSA "on" CACHE STRING "Enable MIPS_MSA optimizations: | ||
| off: disable the optimizations") | ||
| set_property(CACHE PNG_MIPS_MSA PROPERTY STRINGS | ||
| ${PNG_MIPS_MSA_POSSIBLE_VALUES}) |
There was a problem hiding this comment.
-...
+set(PNG_MIPS_MSA ON CACHE BOOL ...)
-if(${PNG_MIPS_MSA} STREQUAL "on")
+if(PNG_MIPS_MSA)instead of strings manipulation.
Also it is better to add this pre-check:
if(";${CPU_BASELINE_FINAL};" MATCHES ";MSA;")
| ocv_optimization_process_obsolete_option(ENABLE_VFPV3 VFPV3 OFF) | ||
| ocv_optimization_process_obsolete_option(ENABLE_NEON NEON OFF) | ||
|
|
||
| ocv_optimization_process_obsolete_option(ENABLE_MSA MSA OFF) |
There was a problem hiding this comment.
Please don't introduce obsolete options.
CPU_BASELINE=MSA should be used instead (it is defaulted to this values below on the line 350)
| #define CV_CPU_AVX512_CEL 261 | ||
| #define CV_CPU_AVX512_ICL 262 | ||
|
|
||
| #define CV_CPU_MSA 300 |
| // of this distribution and at http://opencv.org/license.html. | ||
|
|
||
| #ifndef OPENCV_CORE_HAL_MSA_MACROS_H | ||
| #define OPENCV_CORE_HAL_MSA_MACROS_H |
There was a problem hiding this comment.
Please move this file into core/hal directory.
platforms/linux/mips.toolchain.cmake
Outdated
|
|
||
| if(USE_MSA) | ||
| message(WARNING "You use obsolete variable USE_MSA to enable MSA instruction set. Use -DENABLE_MSA=ON instead." ) | ||
| set(ENABLE_MSA TRUE) |
| //! @name Check SIMD support | ||
| //! @{ | ||
| //! @brief Check CPU capability of SIMD operation | ||
| static inline bool hasSIMD128() |
There was a problem hiding this comment.
Please remove this call. It is gone (handled by compile time checks).
There was a problem hiding this comment.
GitHub code reference is buggy again - correct line is 1748 about hasSIMD128().
There was a problem hiding this comment.
Please remove static inline bool hasSIMD128() - it is not used anymore in OpenCV
(wrong code is linked due GitHub bug, but file is correct)
| } | ||
| #endif | ||
|
|
||
| ////// FP16 suport /////// |
| #define SLDI_B2_0(RTYPE, in0, in1, out0, out1, slide_val) \ | ||
| { \ | ||
| out0 = (RTYPE) __msa_sldi_b((v16i8) __msa_fill_b(0), (v16i8) in0, slide_val); \ | ||
| out1 = (RTYPE) __msa_sldi_b((v16i8) __msa_fill_b(0), (v16i8) in1, slide_val); \ |
There was a problem hiding this comment.
Please use source from libpng 1.6.37
Add separate patch file if required (to avoid fixes lost after 3rdparty upgrade in the future)
| vs12 = v_fma(w1, r2, vs12); | ||
| vs13 = v_fma(w1, r3, vs13); | ||
| #else | ||
| vs00 += w0*r0; |
There was a problem hiding this comment.
Perhaps this can be replaced to v_fma() unconditionally for all platforms.
modules/imgproc/src/demosaicing.cpp
Outdated
| { | ||
| v8u16 r0 = msa_ld1q_u16((const ushort*)bayer); | ||
| v8u16 r1 = msa_ld1q_u16((const ushort*)(bayer + bayer_step)); | ||
| v8u16 r2 = msa_ld1q_u16((const ushort*)(bayer + bayer_step * 2)); |
There was a problem hiding this comment.
Main principle of "universal intrinsics" (SIMD) approach is "code once, run anywhere".
You need to write code once (on some platform), debug it with some backend (there is even C++ backend for SIMD for reference) and then it should run on other backends with reasonable speed.
So we are strictly against huge code blocks using native intrinsic if this code can be replaced by universal SIMD code without significant lost of speed.
Universal SIMD code is regularly tested via x86/ARM backends, native RAW MIPS has very low testing coverage.
See discussion here: #9763 (comment)
Could you please move these changes into a separate PR?
Lets merge first addition of SIMD backend and related build scripts (so we can add cross-platforms builds for that into our infrastructure (but no tests)).
|
Hi Alexander
Thanks a lot for your quick response and detailed comments.
I will adjust and optimize the source code as you have commented, then submit separated PRs.
Best Regards,
Fei Wu
|
|
@alalek Since I need to separate the PR into two, should I close this pull request first?
|
|
Lets keep basic part of MSA integration in this PR (at least to track conversation above). You can "force push" commits in your branch to update PR ("mipsopen-fwu:msa-dev"). |
|
@alalek |
Yes. You can resolve conflicts during rebase. Before that you may enable diff3 conflict style for more information about changes: https://stackoverflow.com/questions/27417656/should-diff3-be-default-conflictstyle-on-git To abort rebase process use |
|
I rebased PR onto 3.4 branch - fetch changes from your fork. Old saved commits are here: https://github.com/alalek/opencv/commits/pr15422_0903 |
|
@alalek Thanks for rebasing! So what I need to do for the next step is to separate SIMD backend & build scripts from my previous commits, then commit and 'force push' to my branch (msa-dev) to update PR, right? |
…build scripts for MIPS platforms are added. Signed-off-by: Fei Wu <fwu@wavecomp.com>
31292d0 to
7cd5726
Compare
|
@alalek I have separated MSA backend and build scripts from the old commit in this PR which has been rebased to 3.4 branch. |
|
I have added builder with cross-compilation from Ubuntu 18:04 x64:
Update:
|
|
@alalek Can you build the project for mips platform successfully now? |
Signed-off-by: Fei Wu <fwu@wavecomp.com>
|
@alalek
In file included from /tmp/iJM73d75Px/dump1.h:55:0:
--These warnings are introduced by the original filter_msa_intrinsics.c file from libpng-1.6.37. We didn't make any changes on it. If needed, I can make a fix. cc1plus: warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)' forming offset [25, 32] is out of the bounds [0, 24] of object 'scharr' with type 'const int [6]' [-Warray-bounds] --This warning is not introduced by my commit. |
|
Thank you for recommended build instructions! Currently on CI we prefer using cross-compilation mode through builtin Ubuntu compiler kits. We believe it is enough for basic validation builds (no tests). It is easy enough to integrate and upgrade dependencies from our side.
|
…g warnings for libpng. Signed-off-by: Fei Wu <fwu@wavecomp.com>
|
@alalek I checked the failed build log for 'Android pack' : https://pullrequest.opencv.org/buildbot/builders/precommit_pack_android/builds/11090/steps/build%20sdk/logs/stdio This error Error ‘ opcode not supported on this processor: mips32 (mips32) `pause' ’ is not introduced by my commit. I think maybe this error can be fixed by adding __mips_isa_rev check in modules/core/src/parallel_impl.cpp: |
… is less than 2. Signed-off-by: Fei Wu <fwu@wavecomp.com>
|
@alalek I have committed the fix for compiling error in 'Android pack' build |
|
Hi, Alexander
The PR #15422 is kept in ‘waiting for builds’ state currently. How can I make it go ahead?
I’ll submit another PR for the other parts of MSA optimizations after 15422 is merged.
Thanks.
Best Regards,
Fei Wu
发件人: Alexander Alekhin <notifications@github.com>
发送时间: 2019年9月6日 12:59
收件人: opencv/opencv <opencv@noreply.github.com>
抄送: Fei Wu <fwu@wavecomp.com>; Author <author@noreply.github.com>
主题: [EXTERNAL]Re: [opencv/opencv] Added MSA implementations for mips platforms. (#15422)
Thank you for recommended build instructions!
I believe it make sense to add them into PR description and/or add comment in r6 toolchain file with SDK link + this PR link.
Currently on CI we prefer using cross-compilation mode through builtin Ubuntu compiler kits. We believe it is enough for basic validation builds (no tests). It is easy enough to integrate and upgrade dependencies from our side.
1. It is Linux x86-64 build. No -mmsa option is expected here.
Each header is included directly into ABI checker. Need to add guard in hal/msa_macros.h:
#ifdef __mips_msa
...
#endif
1. In 3rdparty/libpng/CMakeLists.txt near add_definitions(-DPNG_MIPS_MSA_OPT=2) add this line:
ocv_warnings_disable(CMAKE_C_FLAGS -Wshadow)
1. OK, I will try to check this with GCC-8 on other archs and find out how to workaround that.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#15422?email_source=notifications&email_token=AM4VXTJFXXMCNA6HTJRRTHDQIHPPRA5CNFSM4ISK27G2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6BXEDA#issuecomment-528708108>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AM4VXTJEXMOKMS37GT66CGTQIHPPRANCNFSM4ISK27GQ>.
|
alalek
left a comment
There was a problem hiding this comment.
I added 'patch' files for '3rdparty' code.
Please take a look on comments below.
| ocv_update(CPU_MSA_TEST_FILE "${OpenCV_SOURCE_DIR}/cmake/checks/cpu_msa.cpp") | ||
| ocv_update(CPU_KNOWN_OPTIMIZATIONS "MSA") | ||
| ocv_update(CPU_MSA_FLAGS_ON "") | ||
| ocv_update(CPU_FP16_IMPLIES "MSA") |
There was a problem hiding this comment.
FP16 is not enabled in current builds.
What did you mean here?
| CPU_AVX512_CEL = 261, //!< Cascade Lake with AVX-512F/CD/BW/DQ/VL/IFMA/VBMI/VNNI | ||
| CPU_AVX512_ICL = 262, //!< Ice Lake with AVX-512F/CD/BW/DQ/VL/IFMA/VBMI/VNNI/VBMI2/BITALG/VPOPCNTDQ | ||
|
|
||
| CPU_MSA = 300, |
| //! @name Check SIMD support | ||
| //! @{ | ||
| //! @brief Check CPU capability of SIMD operation | ||
| static inline bool hasSIMD128() |
There was a problem hiding this comment.
Please remove static inline bool hasSIMD128() - it is not used anymore in OpenCV
(wrong code is linked due GitHub bug, but file is correct)
…ptimizations.cmake. 2. Use CV_CPU_COMPILE_MSA instead of __mips_msa for MSA feature check in cv_cpu_dispatch.h. 3. Removed hasSIMD128() in intrin_msa.hpp. 4. Define CPU_MSA as 150. Signed-off-by: Fei Wu <fwu@wavecomp.com>
bb24134 to
f75fcb6
Compare
|
| OPENCV_HAL_IMPL_MSA_EXPAND(v_uint32x4, v_uint64x2, uint, u32, s32, v4u32, v4i32) | ||
| OPENCV_HAL_IMPL_MSA_EXPAND(v_int32x4, v_int64x2, int, s32, s32, v4i32, v4i32) | ||
|
|
||
| #if 0 |
There was a problem hiding this comment.
Why this version of implementation is disabled? Is it slow, inaccurate or something else?
There was a problem hiding this comment.
The '#if 0' version is slower according to the test results. The reason is that MSA doesn't actually provide data type expanding instructions for vectors. The 'movl' related conversion are done by C.
So it would take more steps to use 'movl' interface than doing converion directly.
modules/core/src/arithm.simd.hpp
Outdated
| typedef bin_loader<OP, T1, Tvec> ldr; | ||
| enum {wide_step = Tvec::nlanes}; | ||
| #if !CV_NEON && CV_SIMD_WIDTH == 16 | ||
| #if !CV_NEON && !CV_MSA && CV_SIMD_WIDTH == 16 |
There was a problem hiding this comment.
Is it necessary? Do manual loop unrolling affect performance?
There was a problem hiding this comment.
In case data width is less than wide_step * 2, the performance would be better with this change. Otherwise, the performance shouldn't be better. Considering 'width' should be larger in most cases, this change would not be necessary. I will remove it.
modules/core/src/matmul.simd.hpp
Outdated
| i += blockSize; | ||
| } | ||
| } | ||
| #elif CV_MSA |
There was a problem hiding this comment.
Do performance of this implementation differs essentially from generic universal intrinsics version? I suppose generic UI version should work quite fast due to availability of native intrinsic for v_dotrpod. In case generic version provide similar performance I prefer to avoid platform specific code wherever possible.
BTW This code is anyway disabled since CV_SIMD is available and is checked prior to this.
There was a problem hiding this comment.
I will remove CV_MSA related code block dotProd_8u(), it's not necessary.
|
@terfendail
|
2. Removed unnecessary CV_MSA related code block in dotProd_8u(). Signed-off-by: Fei Wu <fwu@wavecomp.com>
| i += blockSize; | ||
| } | ||
| } | ||
| #elif CV_MSA |
There was a problem hiding this comment.
#if CV_SIMD blocks this code execution here too (and compilation). The same note is about #elif CV_NEON code block (dead code too)
There was a problem hiding this comment.
@mipsopen-fwu Have you compared performance of this MSA specific implementation against generic UI implementation?
There was a problem hiding this comment.
@terfendail
This CV_MSA related code block was added by mistake when I merged msa implementations from 4.0.1 branch to 4.1.1 and 3.4 branch. On our pervious 4.0.1 development branch, this MSA specific implementation doesn't exist. It's correct to remove them from the pull request.
Even though, in order to make clear the performance difference, I did MatType_Length_dot test on mips32r5 HW by setting testparam=(8UC1, 1024) and samples=15. From the result, we can see that the performance of generic UI implementation(using CV_SIMD) is better against CV_MSA specific implementation which was added by mistake in this code block.
The following is the test result:
- The total running time for CV_MSA implementation is:
[----------] 18 tests from MatType_Length_dot (2177 ms total) - The total running time for CV_SIMD implementations is:
[----------] 18 tests from MatType_Length_dot (618 ms total)
| elseif(MIPS) | ||
| ocv_update(CPU_MSA_TEST_FILE "${OpenCV_SOURCE_DIR}/cmake/checks/cpu_msa.cpp") | ||
| ocv_update(CPU_KNOWN_OPTIMIZATIONS "MSA") | ||
| ocv_update(CPU_MSA_FLAGS_ON "") |
There was a problem hiding this comment.
Why is it empty? It should contain "-mmsa" or something similar.
There was a problem hiding this comment.
This option seems not be used elsewhere, but it should be better to be defined as '-mmsa' for future use. I will modify it.
There was a problem hiding this comment.
This option is used, but it not obvious:
- there is a list of
CPU_KNOWN_OPTIMIZATIONSwhich containMSA - cmake first tries to compile file
CPU_${OPT}_TEST_FILEwith optionCPU_${OPT}_FLAGS_ON, whereOPT=MSA - if compilation succeeded and
MSAis requested by user (viaCPU_BASELINE), then these options will be added to compiler
Is the -mmsa option the only one required by implemented intrinsics? What about -fhard-float or others? We should move these options from toolchain file to CompilerOptimization.cmake.
There was a problem hiding this comment.
Yes. -mmsa option is enough. When '-mmsa' is included, '-mfp64' and '-mhard-float' will be enabled by default. '-mmsa' has the same effect as '-mmsa -mfp64 -mhard-float'.
| inline bool v_check_any(const v_float32x4& a) | ||
| { return v_check_any(v_reinterpret_as_u32(a)); } | ||
|
|
||
| #if CV_SIMD128_64F |
There was a problem hiding this comment.
v_check_all(const v_int64x2&) shouldn't be guarded by CV_SIMD128_64F as well.
There was a problem hiding this comment.
I will remove CV_SIMD128_64F guarding form intrin_msa.hpp, only the define is kept.
2. Removed CV_SIMD128_64F guardings in intrin_msa.hpp. Signed-off-by: Fei Wu <fwu@wavecomp.com>
Signed-off-by: Fei Wu <fwu@wavecomp.com>
Signed-off-by: Fei Wu fwu@wavecomp.com
Materials: