[HAL RVV] impl sqrt and invSqrt by GenshinImpactStarts · Pull Request #27015 · opencv/opencv

GenshinImpactStarts · 2025-03-05T10:16:47Z

Implement through the existing interfaces cv_hal_sqrt32f, cv_hal_sqrt64f, cv_hal_invSqrt32f, cv_hal_invSqrt64f.

Perf test done on MUSE-PI and CanMV K230. Because the performance of scalar is much worse than universal intrinsic, only ui and hal rvv is compared.

In RVV's UI, invSqrt is computed using 1 / sqrt(). This patch first uses frsqrt and then applies the Newton-Raphson method to achieve higher precision. For the initial value, I tried using the famous fast inverse square root algorithm, which involves one bit shift and one subtraction. However, on both MUSE-PI and CanMV K230, the performance was slightly lower (about 3%), so I chose to use frsqrt for the initial value instead.

BTW, I think this patch can directly replace RVV's UI.

UPDATE: Due to strange vector registers allocation strategy in clang, for invSqrt, clang use LMUL m4 while gcc use LMUL m8, which leads to some performance loss in clang. So the test for clang is appended.

$ opencv_test_core --gtest_filter="Core_HAL/mathfuncs.*"
$ opencv_perf_core --gtest_filter="SqrtFixture.*" --perf_min_samples=300 --perf_force_samples=300

CanMV K230:

              Name of Test                 ui    rvv      rvv    
                                                           vs    
                                                           ui    
                                                       (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)    0.052  0.027     1.96   
Sqrt::SqrtFixture::(127x61, 5, true)     0.101  0.026     3.80   
Sqrt::SqrtFixture::(127x61, 6, false)    0.106  0.059     1.79   
Sqrt::SqrtFixture::(127x61, 6, true)     0.207  0.058     3.55   
Sqrt::SqrtFixture::(640x480, 5, false)   1.988  0.956     2.08   
Sqrt::SqrtFixture::(640x480, 5, true)    3.920  0.948     4.13   
Sqrt::SqrtFixture::(640x480, 6, false)   4.179  2.342     1.78   
Sqrt::SqrtFixture::(640x480, 6, true)    8.220  2.290     3.59   
Sqrt::SqrtFixture::(1280x720, 5, false)  5.969  2.881     2.07   
Sqrt::SqrtFixture::(1280x720, 5, true)   11.731 2.857     4.11   
Sqrt::SqrtFixture::(1280x720, 6, false)  12.533 7.031     1.78   
Sqrt::SqrtFixture::(1280x720, 6, true)   24.643 6.917     3.56   
Sqrt::SqrtFixture::(1920x1080, 5, false) 13.423 6.483     2.07   
Sqrt::SqrtFixture::(1920x1080, 5, true)  26.379 6.436     4.10   
Sqrt::SqrtFixture::(1920x1080, 6, false) 28.200 15.833    1.78   
Sqrt::SqrtFixture::(1920x1080, 6, true)  55.434 15.565    3.56

MUSE-PI:

                                                 GCC              |        clang            
              Name of Test                 ui    rvv      rvv     |   ui    rvv      rvv    
                                                           vs     |                   vs    
                                                           ui     |                   ui    
                                                       (x-factor) |               (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)    0.027  0.018     1.46    | 0.027  0.016     1.65   
Sqrt::SqrtFixture::(127x61, 5, true)     0.050  0.017     2.98    | 0.050  0.017     2.99   
Sqrt::SqrtFixture::(127x61, 6, false)    0.053  0.031     1.72    | 0.052  0.032     1.64   
Sqrt::SqrtFixture::(127x61, 6, true)     0.100  0.030     3.31    | 0.101  0.035     2.86   
Sqrt::SqrtFixture::(640x480, 5, false)   0.955  0.483     1.98    | 0.959  0.499     1.92   
Sqrt::SqrtFixture::(640x480, 5, true)    1.873  0.489     3.83    | 1.873  0.520     3.60   
Sqrt::SqrtFixture::(640x480, 6, false)   2.027  1.163     1.74    | 2.037  1.218     1.67   
Sqrt::SqrtFixture::(640x480, 6, true)    3.961  1.153     3.44    | 3.961  1.341     2.95   
Sqrt::SqrtFixture::(1280x720, 5, false)  2.916  1.538     1.90    | 2.912  1.598     1.82   
Sqrt::SqrtFixture::(1280x720, 5, true)   5.735  1.534     3.74    | 5.726  1.661     3.45   
Sqrt::SqrtFixture::(1280x720, 6, false)  6.121  3.585     1.71    | 6.109  3.725     1.64   
Sqrt::SqrtFixture::(1280x720, 6, true)   12.059 3.501     3.44    | 12.053 4.080     2.95   
Sqrt::SqrtFixture::(1920x1080, 5, false) 6.540  3.535     1.85    | 6.540  3.643     1.80   
Sqrt::SqrtFixture::(1920x1080, 5, true)  12.943 3.445     3.76    | 12.908 3.706     3.48   
Sqrt::SqrtFixture::(1920x1080, 6, false) 13.714 8.062     1.70    | 13.711 8.376     1.64   
Sqrt::SqrtFixture::(1920x1080, 6, true)  27.011 7.989     3.38    | 27.115 9.245     2.93

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

GenshinImpactStarts · 2025-03-05T18:28:37Z

perf result updated

asmorkalov · 2025-03-06T06:25:04Z

Hm, UI implementation uses the same __riscv_vfsqrt for sqrt at the end and does nothing else, but it ~2 times slower. Looks very unexpected.

GenshinImpactStarts · 2025-03-06T06:38:37Z

Perhaps, just like multiplication and division are both single instructions but division is much slower, fsqrt is also quite slow. Additionally, the riscv-v-spec-1.0 mentions examples of approximate computation for both division and square root:

A.7. Division approximation example

# v1 = v1 / v2 to almost 23 bits of precision.

vfrec7.v v3, v2         # Estimate 1/v2
    li t0, 0x40000000
vmv.v.x v4, t0          # Splat 2.0
vfnmsac.vv v4, v2, v3   # 2.0 - v2 * est(1/v2)
vfmul.vv v3, v3, v4     # Better estimate of 1/v2
vmv.v.x v4, t0          # Splat 2.0
vfnmsac.vv v4, v2, v3   # 2.0 - v2 * est(1/v2)
vfmul.vv v3, v3, v4     # Better estimate of 1/v2
vfmul.vv v1, v1, v3     # Estimate of v1/v2

A.8. Square root approximation example

# v1 = sqrt(v1) to almost 23 bits of precision.

fmv.w.x ft0, x0             # Mask off zero inputs
vmfne.vf v0, v1, ft0        # to avoid div by zero
vfrsqrt7.v v2, v1, v0.t     # Estimate 1/sqrt(x)
vmfne.vf v0, v2, ft0, v0.t  # Additionally mask off +inf inputs
    li t0, 0xbf000000
fmv.w.x ft0, t0             # -0.5
vfmul.vf v3, v1, ft0, v0.t  # -0.5 * x
vfmul.vv v4, v2, v2, v0.t   # est * est
    li t0, 0x3fc00000
vmv.v.x v5, t0, v0.t        # Splat 1.5
vfmadd.vv v4, v3, v5, v0.t  # 1.5 - 0.5 * x * est * est
vfmul.vv v1, v1, v4, v0.t   # estimate to 14 bits
vfmul.vv v4, v1, v1, v0.t   # est * est
vfmadd.vv v4, v3, v5, v0.t  # 1.5 - 0.5 * x * est * est
vfmul.vv v1, v1, v4, v0.t   # estimate to 23 bits

BTW, when testing on MUSE-PI and CanMV K230, I found that frsqrt is surprisingly fast—so much so that even a bit-shift operation followed by a subtraction is not faster than frsqrt.

GenshinImpactStarts · 2025-03-06T09:22:46Z

perf result updated.

Strangely, simply adding two comparison operations, replacing a multiplication with a masked multiplication, and changing mask agnostic to mask undisturbed caused ~2× performance regression.

GenshinImpactStarts · 2025-03-06T18:13:00Z

perf result updated again. Note that clang has strange behavior when allocating vector registers, test for clang appended.

UPDATE: Due to strange vector registers allocation strategy in clang, for invSqrt, clang use LMUL m4 while gcc use LMUL m8, which leads to some performance loss in clang. So the test for clang is appended.

Very good mask agnostic/undisturbed, driving me crazy.

3rdparty/hal_rvv/hal_rvv_1p0/sqrt.hpp

fengyuentau · 2025-03-10T06:00:42Z

3rdparty/hal_rvv/hal_rvv_1p0/sqrt.hpp

+    // just to prevent the compiler from calculating mask before the invSqrt, which will run out
+    // of registers and cause memory access.


How does this happen?

In the for loop (calculating invSqrt), at most four variables, x, x2, y, t, need to be stored simultaneously, so it occupies at least 4 LMUL registers. However, after exiting the for loop, only x and y need to be retained. Therefore, when using mask registers, the number of registers used should not exceed the previous 4 LMUL.

However, without a memory barrier, both GCC and Clang from the riscv-collab/riscv-gnu-toolchain release Nightly: December 16, 2024 tend to do the mask calculation first, causing the for loop to require 4 LMUL registers plus one additional mask register at the same time.

GenshinImpactStarts · 2025-03-10T09:14:31Z

@fengyuentau So, should I move these reusable parts to a common file or leave them here? I also plan to replace these helper classes with the unified one that was just merged. I’d like to complete all of this in the last commit to avoid redundant CI checks.

fengyuentau · 2025-03-11T06:23:42Z

@GenshinImpactStarts Thank you for reminder. Here is the results (K1 vs. RK3568):

              Name of Test                 rk   patch-gcc patch-clang patch-gcc  patch-clang
                                                                          vs         vs
                                                                          rk         rk
                                                                      (x-factor) (x-factor)
Sqrt::SqrtFixture::(127x61, 5, false)    0.016    0.017      0.018       0.99       0.94
Sqrt::SqrtFixture::(127x61, 5, true)     0.026    0.016      0.018       1.60       1.44
Sqrt::SqrtFixture::(127x61, 6, false)    0.052    0.031      0.032       1.66       1.62
Sqrt::SqrtFixture::(127x61, 6, true)     0.090    0.031      0.035       2.96       2.55
Sqrt::SqrtFixture::(640x480, 5, false)   0.618    0.476      0.494       1.30       1.25
Sqrt::SqrtFixture::(640x480, 5, true)    1.027    0.472      0.510       2.18       2.01
Sqrt::SqrtFixture::(640x480, 6, false)   2.082    1.154      1.199       1.80       1.74
Sqrt::SqrtFixture::(640x480, 6, true)    3.626    1.140      1.329       3.18       2.73
Sqrt::SqrtFixture::(1280x720, 5, false)  1.918    1.479      1.525       1.30       1.26
Sqrt::SqrtFixture::(1280x720, 5, true)   3.146    1.488      1.579       2.11       1.99
Sqrt::SqrtFixture::(1280x720, 6, false)  6.209    3.470      3.613       1.79       1.72
Sqrt::SqrtFixture::(1280x720, 6, true)   10.839   3.454      3.999       3.14       2.71
Sqrt::SqrtFixture::(1920x1080, 5, false) 4.348    3.337      3.436       1.30       1.27
Sqrt::SqrtFixture::(1920x1080, 5, true)  7.042    3.347      3.594       2.10       1.96
Sqrt::SqrtFixture::(1920x1080, 6, false) 14.020   7.781      8.106       1.80       1.73
Sqrt::SqrtFixture::(1920x1080, 6, true)  24.665   7.746      8.970       3.18       2.75

fengyuentau · 2025-03-11T06:33:59Z

@fengyuentau So, should I move these reusable parts to a common file or leave them here? I also plan to replace these helper classes with the unified one that was just merged. I’d like to complete all of this in the last commit to avoid redundant CI checks.

I guess it is fine to leave as-is since we can also include this header in where it is needed.

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

asmorkalov added optimization platform: riscv labels Mar 5, 2025

asmorkalov added this to the 4.12.0 milestone Mar 5, 2025

GenshinImpactStarts force-pushed the sqrt branch from 478cd2e to 75da2a4 Compare March 6, 2025 09:18

This was referenced Mar 6, 2025

[HAL RVV] impl magnitude | add perf test #27002

Merged

[HAL RVV] reuse atan | impl cart_to_polar | add perf test #27000

Merged

fengyuentau self-requested a review March 7, 2025 07:15

This comment was marked as outdated.

Sign in to view

GenshinImpactStarts force-pushed the sqrt branch from 0407913 to 287cb77 Compare March 7, 2025 14:52

fengyuentau reviewed Mar 10, 2025

View reviewed changes

This comment was marked as off-topic.

Sign in to view

This comment was marked as resolved.

Sign in to view

GenshinImpactStarts and others added 6 commits March 11, 2025 07:02

impl sqrt and invSqrt

4035a8c

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

fix sqrt when input is 0 or inf

e87ec89

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

fix invSqrt when input is 0 or inf

2f6f825

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

fix input 0, inf | more reusable

c3f5fbc

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

replace pragma once with define

4083e00

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

replace helper class

890ea91

Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>

GenshinImpactStarts force-pushed the sqrt branch from 3f87050 to 890ea91 Compare March 11, 2025 08:10

fengyuentau approved these changes Mar 12, 2025

View reviewed changes

asmorkalov assigned fengyuentau Mar 12, 2025

asmorkalov merged commit 60de3ff into opencv:4.x Mar 12, 2025
26 of 28 checks passed

GenshinImpactStarts deleted the sqrt branch March 12, 2025 07:32

asmorkalov mentioned this pull request Apr 29, 2025

5.x merge 4.x #27265

Merged

		// just to prevent the compiler from calculating mask before the invSqrt, which will run out
		// of registers and cause memory access.

Uh oh!

Conversation

GenshinImpactStarts commented Mar 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Readiness Checklist

Uh oh!

GenshinImpactStarts commented Mar 5, 2025

Uh oh!

asmorkalov commented Mar 6, 2025

Uh oh!

GenshinImpactStarts commented Mar 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GenshinImpactStarts commented Mar 6, 2025

Uh oh!

GenshinImpactStarts commented Mar 6, 2025

Uh oh!

This comment was marked as outdated.

Uh oh!

fengyuentau Mar 10, 2025

Choose a reason for hiding this comment

Uh oh!

GenshinImpactStarts Mar 10, 2025

Choose a reason for hiding this comment

Uh oh!

This comment was marked as off-topic.

GenshinImpactStarts commented Mar 10, 2025

Uh oh!

This comment was marked as resolved.

fengyuentau commented Mar 11, 2025

Uh oh!

fengyuentau commented Mar 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

GenshinImpactStarts commented Mar 5, 2025 •

edited

Loading

GenshinImpactStarts commented Mar 6, 2025 •

edited

Loading