Skip to content

[HAL RVV] impl sqrt and invSqrt#27015

Merged
asmorkalov merged 6 commits intoopencv:4.xfrom
GenshinImpactStarts:sqrt
Mar 12, 2025
Merged

[HAL RVV] impl sqrt and invSqrt#27015
asmorkalov merged 6 commits intoopencv:4.xfrom
GenshinImpactStarts:sqrt

Conversation

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor

@GenshinImpactStarts GenshinImpactStarts commented Mar 5, 2025

Implement through the existing interfaces cv_hal_sqrt32f, cv_hal_sqrt64f, cv_hal_invSqrt32f, cv_hal_invSqrt64f.

Perf test done on MUSE-PI and CanMV K230. Because the performance of scalar is much worse than universal intrinsic, only ui and hal rvv is compared.

In RVV's UI, invSqrt is computed using 1 / sqrt(). This patch first uses frsqrt and then applies the Newton-Raphson method to achieve higher precision. For the initial value, I tried using the famous fast inverse square root algorithm, which involves one bit shift and one subtraction. However, on both MUSE-PI and CanMV K230, the performance was slightly lower (about 3%), so I chose to use frsqrt for the initial value instead.

BTW, I think this patch can directly replace RVV's UI.

UPDATE: Due to strange vector registers allocation strategy in clang, for invSqrt, clang use LMUL m4 while gcc use LMUL m8, which leads to some performance loss in clang. So the test for clang is appended.

$ opencv_test_core --gtest_filter="Core_HAL/mathfuncs.*"
$ opencv_perf_core --gtest_filter="SqrtFixture.*" --perf_min_samples=300 --perf_force_samples=300

CanMV K230:

              Name of Test                 ui    rvv      rvv    
                                                           vs    
                                                           ui    
                                                       (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)    0.052  0.027     1.96   
Sqrt::SqrtFixture::(127x61, 5, true)     0.101  0.026     3.80   
Sqrt::SqrtFixture::(127x61, 6, false)    0.106  0.059     1.79   
Sqrt::SqrtFixture::(127x61, 6, true)     0.207  0.058     3.55   
Sqrt::SqrtFixture::(640x480, 5, false)   1.988  0.956     2.08   
Sqrt::SqrtFixture::(640x480, 5, true)    3.920  0.948     4.13   
Sqrt::SqrtFixture::(640x480, 6, false)   4.179  2.342     1.78   
Sqrt::SqrtFixture::(640x480, 6, true)    8.220  2.290     3.59   
Sqrt::SqrtFixture::(1280x720, 5, false)  5.969  2.881     2.07   
Sqrt::SqrtFixture::(1280x720, 5, true)   11.731 2.857     4.11   
Sqrt::SqrtFixture::(1280x720, 6, false)  12.533 7.031     1.78   
Sqrt::SqrtFixture::(1280x720, 6, true)   24.643 6.917     3.56   
Sqrt::SqrtFixture::(1920x1080, 5, false) 13.423 6.483     2.07   
Sqrt::SqrtFixture::(1920x1080, 5, true)  26.379 6.436     4.10   
Sqrt::SqrtFixture::(1920x1080, 6, false) 28.200 15.833    1.78   
Sqrt::SqrtFixture::(1920x1080, 6, true)  55.434 15.565    3.56   

MUSE-PI:

                                                 GCC              |        clang            
              Name of Test                 ui    rvv      rvv     |   ui    rvv      rvv    
                                                           vs     |                   vs    
                                                           ui     |                   ui    
                                                       (x-factor) |               (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)    0.027  0.018     1.46    | 0.027  0.016     1.65   
Sqrt::SqrtFixture::(127x61, 5, true)     0.050  0.017     2.98    | 0.050  0.017     2.99   
Sqrt::SqrtFixture::(127x61, 6, false)    0.053  0.031     1.72    | 0.052  0.032     1.64   
Sqrt::SqrtFixture::(127x61, 6, true)     0.100  0.030     3.31    | 0.101  0.035     2.86   
Sqrt::SqrtFixture::(640x480, 5, false)   0.955  0.483     1.98    | 0.959  0.499     1.92   
Sqrt::SqrtFixture::(640x480, 5, true)    1.873  0.489     3.83    | 1.873  0.520     3.60   
Sqrt::SqrtFixture::(640x480, 6, false)   2.027  1.163     1.74    | 2.037  1.218     1.67   
Sqrt::SqrtFixture::(640x480, 6, true)    3.961  1.153     3.44    | 3.961  1.341     2.95   
Sqrt::SqrtFixture::(1280x720, 5, false)  2.916  1.538     1.90    | 2.912  1.598     1.82   
Sqrt::SqrtFixture::(1280x720, 5, true)   5.735  1.534     3.74    | 5.726  1.661     3.45   
Sqrt::SqrtFixture::(1280x720, 6, false)  6.121  3.585     1.71    | 6.109  3.725     1.64   
Sqrt::SqrtFixture::(1280x720, 6, true)   12.059 3.501     3.44    | 12.053 4.080     2.95   
Sqrt::SqrtFixture::(1920x1080, 5, false) 6.540  3.535     1.85    | 6.540  3.643     1.80   
Sqrt::SqrtFixture::(1920x1080, 5, true)  12.943 3.445     3.76    | 12.908 3.706     3.48   
Sqrt::SqrtFixture::(1920x1080, 6, false) 13.714 8.062     1.70    | 13.711 8.376     1.64   
Sqrt::SqrtFixture::(1920x1080, 6, true)  27.011 7.989     3.38    | 27.115 9.245     2.93   

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

perf result updated

@asmorkalov
Copy link
Copy Markdown
Contributor

Hm, UI implementation uses the same __riscv_vfsqrt for sqrt at the end and does nothing else, but it ~2 times slower. Looks very unexpected.

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

GenshinImpactStarts commented Mar 6, 2025

Perhaps, just like multiplication and division are both single instructions but division is much slower, fsqrt is also quite slow. Additionally, the riscv-v-spec-1.0 mentions examples of approximate computation for both division and square root:

A.7. Division approximation example

# v1 = v1 / v2 to almost 23 bits of precision.

vfrec7.v v3, v2         # Estimate 1/v2
    li t0, 0x40000000
vmv.v.x v4, t0          # Splat 2.0
vfnmsac.vv v4, v2, v3   # 2.0 - v2 * est(1/v2)
vfmul.vv v3, v3, v4     # Better estimate of 1/v2
vmv.v.x v4, t0          # Splat 2.0
vfnmsac.vv v4, v2, v3   # 2.0 - v2 * est(1/v2)
vfmul.vv v3, v3, v4     # Better estimate of 1/v2
vfmul.vv v1, v1, v3     # Estimate of v1/v2

A.8. Square root approximation example

# v1 = sqrt(v1) to almost 23 bits of precision.

fmv.w.x ft0, x0             # Mask off zero inputs
vmfne.vf v0, v1, ft0        # to avoid div by zero
vfrsqrt7.v v2, v1, v0.t     # Estimate 1/sqrt(x)
vmfne.vf v0, v2, ft0, v0.t  # Additionally mask off +inf inputs
    li t0, 0xbf000000
fmv.w.x ft0, t0             # -0.5
vfmul.vf v3, v1, ft0, v0.t  # -0.5 * x
vfmul.vv v4, v2, v2, v0.t   # est * est
    li t0, 0x3fc00000
vmv.v.x v5, t0, v0.t        # Splat 1.5
vfmadd.vv v4, v3, v5, v0.t  # 1.5 - 0.5 * x * est * est
vfmul.vv v1, v1, v4, v0.t   # estimate to 14 bits
vfmul.vv v4, v1, v1, v0.t   # est * est
vfmadd.vv v4, v3, v5, v0.t  # 1.5 - 0.5 * x * est * est
vfmul.vv v1, v1, v4, v0.t   # estimate to 23 bits

BTW, when testing on MUSE-PI and CanMV K230, I found that frsqrt is surprisingly fast—so much so that even a bit-shift operation followed by a subtraction is not faster than frsqrt.

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

perf result updated.

Strangely, simply adding two comparison operations, replacing a multiplication with a masked multiplication, and changing mask agnostic to mask undisturbed caused ~2× performance regression.

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

perf result updated again. Note that clang has strange behavior when allocating vector registers, test for clang appended.

UPDATE: Due to strange vector registers allocation strategy in clang, for invSqrt, clang use LMUL m4 while gcc use LMUL m8, which leads to some performance loss in clang. So the test for clang is appended.

Very good mask agnostic/undisturbed, driving me crazy.

@fengyuentau fengyuentau self-requested a review March 7, 2025 07:15
@fengyuentau

This comment was marked as outdated.

Comment on lines +37 to +49
// just to prevent the compiler from calculating mask before the invSqrt, which will run out
// of registers and cause memory access.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this happen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the for loop (calculating invSqrt), at most four variables, x, x2, y, t, need to be stored simultaneously, so it occupies at least 4 LMUL registers. However, after exiting the for loop, only x and y need to be retained. Therefore, when using mask registers, the number of registers used should not exceed the previous 4 LMUL.

However, without a memory barrier, both GCC and Clang from the riscv-collab/riscv-gnu-toolchain release Nightly: December 16, 2024 tend to do the mask calculation first, causing the for loop to require 4 LMUL registers plus one additional mask register at the same time.

@fengyuentau

This comment was marked as off-topic.

@GenshinImpactStarts
Copy link
Copy Markdown
Contributor Author

@fengyuentau So, should I move these reusable parts to a common file or leave them here? I also plan to replace these helper classes with the unified one that was just merged. I’d like to complete all of this in the last commit to avoid redundant CI checks.

@GenshinImpactStarts

This comment was marked as resolved.

@fengyuentau
Copy link
Copy Markdown
Member

@GenshinImpactStarts Thank you for reminder. Here is the results (K1 vs. RK3568):

              Name of Test                 rk   patch-gcc patch-clang patch-gcc  patch-clang
                                                                          vs         vs
                                                                          rk         rk
                                                                      (x-factor) (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)    0.016    0.017      0.018       0.99       0.94
Sqrt::SqrtFixture::(127x61, 5, true)     0.026    0.016      0.018       1.60       1.44
Sqrt::SqrtFixture::(127x61, 6, false)    0.052    0.031      0.032       1.66       1.62
Sqrt::SqrtFixture::(127x61, 6, true)     0.090    0.031      0.035       2.96       2.55
Sqrt::SqrtFixture::(640x480, 5, false)   0.618    0.476      0.494       1.30       1.25
Sqrt::SqrtFixture::(640x480, 5, true)    1.027    0.472      0.510       2.18       2.01
Sqrt::SqrtFixture::(640x480, 6, false)   2.082    1.154      1.199       1.80       1.74
Sqrt::SqrtFixture::(640x480, 6, true)    3.626    1.140      1.329       3.18       2.73
Sqrt::SqrtFixture::(1280x720, 5, false)  1.918    1.479      1.525       1.30       1.26
Sqrt::SqrtFixture::(1280x720, 5, true)   3.146    1.488      1.579       2.11       1.99
Sqrt::SqrtFixture::(1280x720, 6, false)  6.209    3.470      3.613       1.79       1.72
Sqrt::SqrtFixture::(1280x720, 6, true)   10.839   3.454      3.999       3.14       2.71
Sqrt::SqrtFixture::(1920x1080, 5, false) 4.348    3.337      3.436       1.30       1.27
Sqrt::SqrtFixture::(1920x1080, 5, true)  7.042    3.347      3.594       2.10       1.96
Sqrt::SqrtFixture::(1920x1080, 6, false) 14.020   7.781      8.106       1.80       1.73
Sqrt::SqrtFixture::(1920x1080, 6, true)  24.665   7.746      8.970       3.18       2.75

@fengyuentau
Copy link
Copy Markdown
Member

@fengyuentau So, should I move these reusable parts to a common file or leave them here? I also plan to replace these helper classes with the unified one that was just merged. I’d like to complete all of this in the last commit to avoid redundant CI checks.

I guess it is fine to leave as-is since we can also include this header in where it is needed.

GenshinImpactStarts and others added 6 commits March 11, 2025 07:02
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
@asmorkalov asmorkalov merged commit 60de3ff into opencv:4.x Mar 12, 2025
26 of 28 checks passed
@GenshinImpactStarts GenshinImpactStarts deleted the sqrt branch March 12, 2025 07:32
@asmorkalov asmorkalov mentioned this pull request Apr 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants