Add Neon optimised RGB2Lab conversion #19883

Merged
alalek merged 7 commits into opencv:3.4 from jondea:arm-neon-optimised-color-lab-3.4
May 28, 2021

Conversation

@jondea
Contributor

@jondea jondea commented Apr 9, 2021

A Neon-specific implementation of RGB2Lab increases single-threaded performance by ~25%. Here are the numbers, measured on an AWS c6gd.4xlarge with GCC 9.3 (the numbers are similar with GCC 10):

| Test set | Test number | After/before ratio | Speedup with 1/million bounds [%] |
|---|---|---|---|
| cvtColor8u | 8 | 0.76835 | 23.2 ± 0.2 |
| cvtColor8u | 34 | 0.76204 | 23.8 ± 0.2 |
| cvtColor8u | 67 | 0.76667 | 23.3 ± 0.2 |
| cvtColor8u | 69 | 0.76773 | 23.2 ± 0.2 |
| cvtColor8u | 71 | 0.76231 | 23.8 ± 0.2 |
| cvtColor8u | 73 | 0.76184 | 23.8 ± 0.2 |
| cvtColor8u | 90 | 0.76851 | 23.1 ± 0.2 |
| cvtColor8u | 103 | 0.76143 | 23.9 ± 0.2 |
| cvtColor8u | 128 | 0.73870 | 26.1 ± 0.1 |
| cvtColor8u | 154 | 0.73760 | 26.2 ± 0.2 |
| cvtColor8u | 187 | 0.73891 | 26.1 ± 0.1 |
| cvtColor8u | 189 | 0.73889 | 26.1 ± 0.1 |
| cvtColor8u | 191 | 0.73802 | 26.2 ± 0.2 |
| cvtColor8u | 193 | 0.73817 | 26.2 ± 0.2 |
| cvtColor8u | 210 | 0.73879 | 26.1 ± 0.1 |
| cvtColor8u | 223 | 0.73745 | 26.3 ± 0.2 |
| cvtColor8u | 248 | 0.73756 | 26.2 ± 0.1 |
| cvtColor8u | 274 | 0.73613 | 26.4 ± 0.1 |
| cvtColor8u | 307 | 0.73768 | 26.2 ± 0.1 |
| cvtColor8u | 309 | 0.73767 | 26.2 ± 0.1 |
| cvtColor8u | 311 | 0.73676 | 26.3 ± 0.2 |
| cvtColor8u | 313 | 0.73672 | 26.3 ± 0.2 |
| cvtColor8u | 330 | 0.73748 | 26.3 ± 0.1 |
| cvtColor8u | 343 | 0.73591 | 26.4 ± 0.1 |
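
The speedup column above appears to be the run-time reduction implied by the after/before ratio, i.e. (1 - ratio) × 100, rounded to one decimal place. The sketch below reproduces that arithmetic; the function names are hypothetical and are not part of OpenCV or its performance-test tooling, and the error bounds in the table are not modelled.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: percentage reduction in run time implied by an
// after/before timing ratio (a ratio of 0.75 means 25% less time).
double speedup_percent(double after_over_before) {
    assert(after_over_before > 0.0);  // a timing ratio must be positive
    return (1.0 - after_over_before) * 100.0;
}

// Same value rounded to one decimal place, as the table reports it.
double speedup_percent_rounded(double after_over_before) {
    return std::round(speedup_percent(after_over_before) * 10.0) / 10.0;
}
```

For example, `speedup_percent_rounded(0.76835)` gives 23.2 and `speedup_percent_rounded(0.73591)` gives 26.4, matching the first and last rows. Note that this is a reduction in time, not a throughput gain: a ratio of 0.75 corresponds to 1/0.75, i.e. roughly 33% more throughput.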

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=linux,docs,ARMv8,ARMv7

@alalek
Member

alalek commented Apr 9, 2021

Marking this as RFC, because it doesn't follow the OpenCV guideline to avoid using raw native intrinsics in OpenCV modules.

@jondea
Contributor Author

jondea commented Apr 9, 2021

Hi @alalek, thank you for looking into this. We (@fpetrogalli and I) submitted it like this because we weren't sure what the correct approach was. Also, sorry if this is a silly question, but what does it mean to mark it as RFC?

One possible solution to the raw intrinsics issue is to keep the #if CV_NEON block in color_lab.cpp but rewrite it using the HAL intrinsics. Another would be to split it into a Neon-specific file, as is done with resize.cpp, resize.avx2.cpp and resize.sse4_1.cpp, for example. Is either of these acceptable or preferable? Or is there another way that would achieve the same goal?

@vpisarev
Contributor

@jondea, thank you for the contribution!
As @alalek said, for the tiny OpenCV core team it's simply infeasible to maintain separate code branches for the growing amount of code and the growing number of platforms that we support. With time we hope to port most of the remaining native branches to HAL/universal intrinsics. There will be some exceptions, like deep learning, where the number of critical kernels is not that big and where we can afford separate branches, but overall universal intrinsics are by far the preferable option.

I'd start with the first option that you suggested - keep the separate branch under CV_NEON, but rewrite it using HAL intrinsics. I briefly looked at the current implementation and I found it too bulky for the equivalent C code that it accelerates. So, I'm 60-80% sure that the HAL code that you write will be faster than the existing implementation not just on ARM, but on the other platforms as well. And then we will just replace that code with yours, i.e. remove #if CV_NEON ... #endif around your code and remove the other branch.

@jondea
Contributor Author

jondea commented Apr 27, 2021

The changes have been rewritten to use only HAL intrinsics; any feedback would be appreciated.

Contributor

@fpetrogalli fpetrogalli left a comment

Hi @jondea,

just a couple of minor observations.

Thank you for your work.

Francesco

Contributor

@fpetrogalli fpetrogalli left a comment

LGTM, with a nit.

The final word is with the maintainers, of course!

Thank you, Francesco

Co-authored-by: Francesco Petrogalli <25690309+fpetrogalli@users.noreply.github.com>
@jondea
Contributor Author

jondea commented May 10, 2021

@vpisarev @alalek is there anything else which needs to be done before this can be merged?

@jondea
Contributor Author

jondea commented May 19, 2021

Hi @vpisarev @alalek would you be able to take another look at this please and let me know if it can be merged?

@vpisarev vpisarev self-assigned this May 24, 2021
@vpisarev
Contributor

@jondea, thank you very much! I tested the code both on Mac-Intel and Mac-ARM (M1), it works well, the claimed acceleration is achieved. On Intel it's no slower than the previous version, but, unfortunately, it's 128-bit only.

In any case, it can be merged as-is, and later we can modify this code to use some new variations of the v_lut() intrinsic.

👍

@vpisarev vpisarev self-requested a review May 25, 2021 05:45
@fpetrogalli
Contributor

> @jondea, thank you very much! I tested the code both on Mac-Intel and Mac-ARM (M1), it works well, the claimed acceleration is achieved. On Intel it's no slower than the previous version, but, unfortunately, it's 128-bit only.

@vpisarev, @jondea is working on an equivalent version that uses SVE2 intrinsics. He is using the intrinsic svld1uh_gather_s32index_s32 for the variation of v_lut() that does a gather from the indexes. It is a Vector Length Agnostic (VLA) version, so it could be ported easily to HAL once we make the nlanes field a runtime value and we have the corresponding indexed LUT intrinsic.
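
As a rough illustration of what an indexed LUT gather computes, here is a scalar model in plain C++. It is only a sketch by analogy: `gather_lut` is a hypothetical name, it is not OpenCV's v_lut() nor the SVE2 intrinsic, and the choice of 16-bit unsigned table entries widened to 32-bit results with 32-bit signed indices is an assumption based on the name of the intrinsic mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of an indexed gather: out[i] = table[indices[i]].
// Each 16-bit table entry is zero-extended to 32 bits in the result.
std::vector<std::int32_t> gather_lut(const std::vector<std::uint16_t>& table,
                                     const std::vector<std::int32_t>& indices) {
    std::vector<std::int32_t> out;
    out.reserve(indices.size());
    for (std::int32_t idx : indices) {
        // Every index must address a valid table entry.
        assert(idx >= 0 && static_cast<std::size_t>(idx) < table.size());
        out.push_back(static_cast<std::int32_t>(table[static_cast<std::size_t>(idx)]));
    }
    return out;
}
```

A vector implementation would perform the same lookups for all lanes in one instruction; in a vector-length-agnostic version the number of lanes handled per iteration is only known at run time, which is why a runtime lane count is mentioned as a prerequisite.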

@vpisarev vpisarev requested a review from asmorkalov May 28, 2021 04:28
@asmorkalov asmorkalov requested review from asmorkalov and removed request for asmorkalov May 28, 2021 07:24
@alalek alalek merged commit 8ecfbdb into opencv:3.4 May 28, 2021
This was referenced May 29, 2021

Labels

optimization, platform: arm (ARM boards related issues: RPi, NVIDIA TK/TX, etc), RFC

5 participants