[HAL RVV] unify and impl polar_to_cart | add perf test#26999
[HAL RVV] unify and impl polar_to_cart | add perf test#26999asmorkalov merged 4 commits intoopencv:4.xfrom
Conversation
6d3d8d0 to
12b22a6
Compare
|
My performance numbers for Muse Pi v 30 (gcc 14.2): |
|
@GenshinImpactStarts Thanks a lot for the contribution! I want to ask you to try UI branch for the function too. See cartToPolar32f_: |
|
@asmorkalov Do you want me to turn on CV_SIMD_SCALABLE for cartToPolar32f_, or should I work on the UI for polarToCart? |
|
Try to turn on for CV_SIMD_SCALABLE for cartToPolar32f_ and fix the implementation, if it's possible. |
|
OK. I will work on this in another PR #27000 . |
12b22a6 to
6f61e59
Compare
|
My performance results (K1 vs RK3568): |
6f61e59 to
87059be
Compare
|
UI implementation of sincos has been replaced with v_sincos from #25892. HAL implementation also improved. Comment and perf test updated. Updated parts in comment:
|
|
According to your results, UI works better than the AVX2 branch. It makes sense to drop it and keep the code clean. |
c081d0c to
850ec68
Compare
|
Updated performance results (K1 vs RK3568): Impressive 👍 |
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
c7bbbd6 to
992714d
Compare
|
There is speedup on aarch64 (jetson orin): |
Summary
cv_hal_polarToCart32fandcv_hal_polarToCart64finterfaces.polarToCartperformance testscv::polarToCartuse CALL_HAL in the same way ascv::cartToPolarTested through:
HAL performance test
UPDATE: Current implementation is no more depending on vlen.
NOTE: Due to the 4th point in the summary above, the
scalaranduitest is based on the modified code of this PR. The impact of this patch onscalaranduiis evaluated in the next section,Effect of Point 4.Vlen 256 (Muse Pi):
Effect of Point 4
To make
cv::polarToCartbehave the same ascv::cartToPolar, the implementation detail of the former has been moved to the latter's location (frommathfuncs.cpptomathfuncs_core.simd.hpp).Reason for Changes:
This function works as follows:
$y = \text{mag} \times \sin(\text{angle})$ and $x = \text{mag} \times \cos(\text{angle})$ . The original implementation first calculates the values of $\sin$ and $\cos$ , storing the results in the output buffers $x$ and $y$ , and then multiplies the result by $\text{mag}$ .
However, when the function is used as an in-place operation (one of the output buffers is also an input buffer), the original implementation allocates an extra buffer to store the$\sin$ and $\cos$ values in case the $\text{mag}$ value gets overwritten. This extra buffer allocation prevents
cv::polarToCartfrom functioning in the same way ascv::cartToPolar.Therefore, the multiplication is now performed immediately without storing intermediate values. Since the original implementation also had AVX2 optimizations, I have applied the same optimizations to the AVX2 version of this implementation.
UPDATE: UI use v_sincos from #25892 now. The original implementation has AVX2 optimizations but is slower much than current UI so it's removed, and AVX2 perf test is below. Scalar implementation isn't changed because it's faster than using UI's method.
Test Result
scalaranduitest is done on Muse PI, and AVX2 test is done on Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz.scalartest:uitest:AVX2 test:
A slight performance loss occurs because the check for whether$mag$ is nullptr is performed with every calculation, instead of being done once per batch. This is to reuse current
SinCos_32ffunction.Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.