Skip to content

Add the HAL implementation for the merge function on RISC-V Vector.#26216

Merged
asmorkalov merged 1 commit intoopencv:4.xfrom
hanliutong:rvv-hal-merge
Oct 9, 2024
Merged

Add the HAL implementation for the merge function on RISC-V Vector.#26216
asmorkalov merged 1 commit intoopencv:4.xfrom
hanliutong:rvv-hal-merge

Conversation

@hanliutong
Copy link
Copy Markdown
Contributor

We have seen some performance gains after refactoring the SIMD loop code in OpenCV to be compatible with the RISC-V Vector backend. However, we still see a lot of gaps where we can use the HAL to implement RVV-specific optimizations to further improve performance/eliminate performance losses.

In the Core module, we find that the performance of the merge on 32F/64F when channel is 3 or 4 is reduced in compared to the scalar version, especially on the BPI-F3 board (~0.5x). This patch add the HAL implementation for the merge function on RISC-V Vector.

  • Performance result: compiled by GCC 14.2 and Clang 19.1.0, tested on k230 and BPI-F3

1727617232236

Compared to using UI, this patch (named native in the figure) eliminates the above performance loss, and the acceleration ratio is greater than the UI version

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@asmorkalov
Copy link
Copy Markdown
Contributor

@mshabunin could you take a look?

__riscv_vsse8_v_u8m1(dst + i*cn, sizeof(uchar)*2, a, vl);
}
#if defined(__clang__)
#pragma clang loop vectorize(disable)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do yoiu disable vectorization? Was it causing slowdown?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since it's actually quite short, vectorizing this loop will cause performance degradation.
This is actually a ”tail“, but the static analysis of compiler does not know that it is the "tail".

if( k == 1 )
{
const uchar* src0 = src[0];
for( ; i <= len - vl; i += vl)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try scalable vector length in the loop to avoid separate tail processing? Did it perform worse?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the scalable vector length loop is worse and I guess the reason is due to the setvl instruction inside the loop body.

@asmorkalov asmorkalov merged commit 69803e7 into opencv:4.x Oct 9, 2024
@asmorkalov asmorkalov mentioned this pull request Oct 23, 2024
asmorkalov pushed a commit that referenced this pull request Oct 24, 2024
Use LMUL=2 in the RISC-V Vector (RVV) backend of Universal Intrinsic. #26318

The modification of this patch involves the RVV backend of Universal Intrinsic, replacing `LMUL=1` with `LMUL=2`.

Now each Universal Intrinsic type actually corresponds to two RVV vector registers, and each Intrinsic function also operates two vector registers. Considering that algorithms written using Universal Intrinsic usually do not use the maximum number of registers, this can help the RVV backend utilize more register resources without modifying the algorithm implementation

This patch is generally beneficial in performance.

We compiled OpenCV with `Clang-19.1.1` and `GCC-14.2.0` , ran it on `CanMV-k230` and `Banana-Pi F3`. Then we have four scenarios on combinations of compilers and devices. In `opencv_perf_core`, there are 3363 cases, of which:
- 901 (26.8%) cases achieved more than `5%` performance improvement in all four scenarios, and the average speedup of these test cases (compared to scalar) increased from `3.35x` to `4.35x`
- 75 (2.2%) cases had more than `5%` performance loss in all four scenarios, indicating that these cases are better with `LMUL=1` instead of `LMUL=2`. This involves `Mat_Transform`, `hasNonZero`, `KMeans`, `meanStdDev`, `merge` and `norm2`. Among them, `Mat_Transform` only has performance degradation in a few cases (`8UC3`), and the actual execution time of `hasNonZero` is so short that it can be ignored. For `KMeans`, `meanStdDev`, `merge` and `norm2`, we should be able to use the HAL to optimize/restore their performance. (In fact, we have already done this for `merge`  #26216 )

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
@@ -0,0 +1,363 @@
#ifndef OPENCV_HAL_RVV_MERGE_HPP_INCLUDED
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add OpenCV license header

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the shorter version below, or another longer version?

// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ting-xin pushed a commit to ting-xin/opencv that referenced this pull request Nov 17, 2024
Use LMUL=2 in the RISC-V Vector (RVV) backend of Universal Intrinsic. opencv#26318

The modification of this patch involves the RVV backend of Universal Intrinsic, replacing `LMUL=1` with `LMUL=2`.

Now each Universal Intrinsic type actually corresponds to two RVV vector registers, and each Intrinsic function also operates two vector registers. Considering that algorithms written using Universal Intrinsic usually do not use the maximum number of registers, this can help the RVV backend utilize more register resources without modifying the algorithm implementation

This patch is generally beneficial in performance.

We compiled OpenCV with `Clang-19.1.1` and `GCC-14.2.0` , ran it on `CanMV-k230` and `Banana-Pi F3`. Then we have four scenarios on combinations of compilers and devices. In `opencv_perf_core`, there are 3363 cases, of which:
- 901 (26.8%) cases achieved more than `5%` performance improvement in all four scenarios, and the average speedup of these test cases (compared to scalar) increased from `3.35x` to `4.35x`
- 75 (2.2%) cases had more than `5%` performance loss in all four scenarios, indicating that these cases are better with `LMUL=1` instead of `LMUL=2`. This involves `Mat_Transform`, `hasNonZero`, `KMeans`, `meanStdDev`, `merge` and `norm2`. Among them, `Mat_Transform` only has performance degradation in a few cases (`8UC3`), and the actual execution time of `hasNonZero` is so short that it can be ignored. For `KMeans`, `meanStdDev`, `merge` and `norm2`, we should be able to use the HAL to optimize/restore their performance. (In fact, we have already done this for `merge`  opencv#26216 )

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
thewoz pushed a commit to CobbsLab/OPENCV that referenced this pull request Feb 13, 2025
Use LMUL=2 in the RISC-V Vector (RVV) backend of Universal Intrinsic. opencv#26318

The modification of this patch involves the RVV backend of Universal Intrinsic, replacing `LMUL=1` with `LMUL=2`.

Now each Universal Intrinsic type actually corresponds to two RVV vector registers, and each Intrinsic function also operates two vector registers. Considering that algorithms written using Universal Intrinsic usually do not use the maximum number of registers, this can help the RVV backend utilize more register resources without modifying the algorithm implementation

This patch is generally beneficial in performance.

We compiled OpenCV with `Clang-19.1.1` and `GCC-14.2.0` , ran it on `CanMV-k230` and `Banana-Pi F3`. Then we have four scenarios on combinations of compilers and devices. In `opencv_perf_core`, there are 3363 cases, of which:
- 901 (26.8%) cases achieved more than `5%` performance improvement in all four scenarios, and the average speedup of these test cases (compared to scalar) increased from `3.35x` to `4.35x`
- 75 (2.2%) cases had more than `5%` performance loss in all four scenarios, indicating that these cases are better with `LMUL=1` instead of `LMUL=2`. This involves `Mat_Transform`, `hasNonZero`, `KMeans`, `meanStdDev`, `merge` and `norm2`. Among them, `Mat_Transform` only has performance degradation in a few cases (`8UC3`), and the actual execution time of `hasNonZero` is so short that it can be ignored. For `KMeans`, `meanStdDev`, `merge` and `norm2`, we should be able to use the HAL to optimize/restore their performance. (In fact, we have already done this for `merge`  opencv#26216 )

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants