Skip to content

invSqrt SIMD_SCALABLE implementation & HAL tests refactoring#26887

Merged
asmorkalov merged 7 commits intoopencv:4.xfrom
kyler1cartesis:4.x
Feb 19, 2025
Merged

invSqrt SIMD_SCALABLE implementation & HAL tests refactoring#26887
asmorkalov merged 7 commits intoopencv:4.xfrom
kyler1cartesis:4.x

Conversation

@kyler1cartesis
Copy link
Copy Markdown
Contributor

@kyler1cartesis kyler1cartesis commented Feb 7, 2025

Enable CV_SIMD_SCALABLE for invSqrt.

  • Banana Pi BF3 (SpacemiT K1) RISC-V
  • Compiler: Syntacore Clang 18.1.4 (build 2024.12)
Geometric mean (ms)

                Name of Test                  baseline   simd      simd   
                                                       scalable  scalable
                                                                    vs
                                                                 baseline
                                                                (x-factor)
InvSqrtf::InvSqrtfFixture::(127x61, 32FC1)     0.163    0.051      3.23   
InvSqrtf::InvSqrtfFixture::(127x61, 64FC1)     0.241    0.103      2.35   
InvSqrtf::InvSqrtfFixture::(640x480, 32FC1)    6.460    1.893      3.41   
InvSqrtf::InvSqrtfFixture::(640x480, 64FC1)    9.687    3.999      2.42   
InvSqrtf::InvSqrtfFixture::(1280x720, 32FC1)   19.292   5.701      3.38   
InvSqrtf::InvSqrtfFixture::(1280x720, 64FC1)   29.452   11.963     2.46   
InvSqrtf::InvSqrtfFixture::(1920x1080, 32FC1)  43.326   12.805     3.38   
InvSqrtf::InvSqrtfFixture::(1920x1080, 64FC1)  65.566   26.881     2.44

@dkurt dkurt marked this pull request as draft February 7, 2025 15:03
@dkurt dkurt self-assigned this Feb 7, 2025
min_hal_t = std::min(min_hal_t, t);

t = (double)getTickCount();
bool solveStatus = solve(a0, b, x0, (nfunc == HAL_LU ? DECOMP_LU : DECOMP_CHOLESKY));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is weak because it tests explicit HAL call and the same HAL call at

CALL_HAL_RET(Cholesky64f, cv_hal_Cholesky64f, output, A, astep, m, b, bstep, n)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to invert the ground truth calculation: generate A and x with randu, calculate b with multiplication, use A and b for test and x as ground truth.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've inverted the test but keep comparison between A * x and A * x0 because test x vs x0 gives bigger error on size=15:

[ RUN      ] Core_HAL/mat_decomp.accuracy/7, where GetParam() = (5, Cholesky, 15)
/home/d.kurtaev/opencv/modules/core/test/test_hal_core.cpp:189: Failure
Expected: (cvtest::norm(x, x0, NORM_INF | NORM_RELATIVE)) <= (eps), actual: 0.03707 vs 1e-05

@dkurt dkurt marked this pull request as ready for review February 13, 2025 18:13
@dkurt dkurt changed the title Added RVV HAL invSqrt32f() implementation & tests Added RVV HAL invSqrt implementation & tests Feb 13, 2025
@asmorkalov
Copy link
Copy Markdown
Contributor

What if just add CV_SIMD_SCALABLE here:

and corresponding 64-bit guard for corresponding function? IMHO, it's more efficient to tune code with universal intrinsics rather then duplicate the code in HAL.

@dkurt
Copy link
Copy Markdown
Member

dkurt commented Feb 15, 2025

@asmorkalov, yes, that worked also fine. So Can we enable in this PR only invSqrt but I will open a new PR with changes to other methods?

Geometric mean (ms)

                Name of Test                   hal     simd      simd
                                                     scalable  scalable
                                                                  vs
                                                                 hal
                                                              (x-factor)
InvSqrtf::InvSqrtfFixture::(127x61, 32FC1)    0.051   0.051      1.01   
InvSqrtf::InvSqrtfFixture::(127x61, 64FC1)    0.102   0.103      1.00
InvSqrtf::InvSqrtfFixture::(640x480, 32FC1)   1.903   1.893      1.00
InvSqrtf::InvSqrtfFixture::(640x480, 64FC1)   4.008   3.999      1.00
InvSqrtf::InvSqrtfFixture::(1280x720, 32FC1)  5.713   5.701      1.00
InvSqrtf::InvSqrtfFixture::(1280x720, 64FC1)  12.000  11.963     1.00
InvSqrtf::InvSqrtfFixture::(1920x1080, 32FC1) 12.832  12.805     1.00
InvSqrtf::InvSqrtfFixture::(1920x1080, 64FC1) 26.911  26.881     1.00

@dkurt dkurt changed the title Added RVV HAL invSqrt implementation & tests invSqrt SIMD_SCALABLE implementation & tests Feb 15, 2025
@kyler1cartesis
Copy link
Copy Markdown
Contributor Author

kyler1cartesis commented Feb 17, 2025

Newton approximation method could be used in the future, maybe there even will be more accurate instruction (now only 7-bit). But somehow my algorithm didn't work.
What accuracy of a result is needed?

@dkurt dkurt added this to the 4.12.0 milestone Feb 17, 2025
@dkurt
Copy link
Copy Markdown
Member

dkurt commented Feb 17, 2025

@kyler1cartesis, I think it depends on the input values range and the algorithm. As we discussed offline, some of the OpenCV algorithms does not require perfect accuracy and probably you may do the following research:

  1. Find OpenCV modules in main repo and https://github.com/opencv/opencv_contrib/ which are heavily depend on inv_sqrt
  2. Check if they already have universal intrinsics optimizations
  3. Try apply fast inv sqrt versions for them. Despite local operation accuracy degradation overall algorithm may keep the same quality.

If there is a performance benefit on using fast inverse sqrt on some of the methods, it worth to add a new universal intrinsic (with __riscv_vfrsqrt7_v_* in RVV branch as you tried previously)

min_hal_t = std::min(min_hal_t, t);

t = (double)getTickCount();
bool solveStatus = solve(a0, b, x0, (nfunc == HAL_LU ? DECOMP_LU : DECOMP_CHOLESKY));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to invert the ground truth calculation: generate A and x with randu, calculate b with multiplication, use A and b for test and x as ground truth.

@dkurt dkurt changed the title invSqrt SIMD_SCALABLE implementation & tests invSqrt SIMD_SCALABLE implementation & HAL tests refactoring Feb 18, 2025
@asmorkalov
Copy link
Copy Markdown
Contributor

Results for Spacemit muse pi v30 (gcc 14.2.1):

Geometric mean (ms)

               Name of Test                 4.x-1  patch-1  patch-1  
                                                               vs    
                                                             4.x-1   
                                                           (x-factor)
InvSqrt::InvSqrtFixture::(127x61, 32FC1)    0.261   0.050     5.20   
InvSqrt::InvSqrtFixture::(127x61, 64FC1)    0.339   0.102     3.32   
InvSqrt::InvSqrtFixture::(640x480, 32FC1)   10.456  1.895     5.52   
InvSqrt::InvSqrtFixture::(640x480, 64FC1)   13.814  4.003     3.45   
InvSqrt::InvSqrtFixture::(1280x720, 32FC1)  31.338  5.701     5.50   
InvSqrt::InvSqrtFixture::(1280x720, 64FC1)  41.446 11.933     3.47   
InvSqrt::InvSqrtFixture::(1920x1080, 32FC1) 70.409 12.767     5.51   
InvSqrt::InvSqrtFixture::(1920x1080, 64FC1) 92.907 26.813     3.46   

@asmorkalov asmorkalov merged commit d32d4da into opencv:4.x Feb 19, 2025
27 of 29 checks passed
NanQin555 pushed a commit to NanQin555/opencv that referenced this pull request Feb 24, 2025
invSqrt SIMD_SCALABLE implementation & HAL tests refactoring opencv#26887

Enable CV_SIMD_SCALABLE for invSqrt.

* Banana Pi BF3 (SpacemiT K1) RISC-V
* Compiler: Syntacore Clang 18.1.4 (build 2024.12)

```
Geometric mean (ms)

                Name of Test                  baseline   simd      simd   
                                                       scalable  scalable
                                                                    vs
                                                                 baseline
                                                                (x-factor)
InvSqrtf::InvSqrtfFixture::(127x61, 32FC1)     0.163    0.051      3.23   
InvSqrtf::InvSqrtfFixture::(127x61, 64FC1)     0.241    0.103      2.35   
InvSqrtf::InvSqrtfFixture::(640x480, 32FC1)    6.460    1.893      3.41   
InvSqrtf::InvSqrtfFixture::(640x480, 64FC1)    9.687    3.999      2.42   
InvSqrtf::InvSqrtfFixture::(1280x720, 32FC1)   19.292   5.701      3.38   
InvSqrtf::InvSqrtfFixture::(1280x720, 64FC1)   29.452   11.963     2.46   
InvSqrtf::InvSqrtfFixture::(1920x1080, 32FC1)  43.326   12.805     3.38   
InvSqrtf::InvSqrtfFixture::(1920x1080, 64FC1)  65.566   26.881     2.44
```
@asmorkalov asmorkalov mentioned this pull request Mar 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants