[HAL RVV] impl sqrt and invSqrt#27015
Conversation
|
perf result updated |
|
Hm, UI implementation uses the same __riscv_vfsqrt for sqrt at the end and does nothing else, but it ~2 times slower. Looks very unexpected. |
|
Perhaps, just like multiplication and division are both single instructions but division is much slower, BTW, when testing on MUSE-PI and CanMV K230, I found that |
478cd2e to
75da2a4
Compare
|
perf result updated. Strangely, simply adding two comparison operations, replacing a multiplication with a masked multiplication, and changing mask agnostic to mask undisturbed caused ~2× performance regression. |
|
perf result updated again. Note that clang has strange behavior when allocating vector registers, test for clang appended.
Very good mask agnostic/undisturbed, driving me crazy. |
This comment was marked as outdated.
This comment was marked as outdated.
0407913 to
287cb77
Compare
| // just to prevent the compiler from calculating mask before the invSqrt, which will run out | ||
| // of registers and cause memory access. |
There was a problem hiding this comment.
In the for loop (calculating invSqrt), at most four variables, x, x2, y, t, need to be stored simultaneously, so it occupies at least 4 LMUL registers. However, after exiting the for loop, only x and y need to be retained. Therefore, when using mask registers, the number of registers used should not exceed the previous 4 LMUL.
However, without a memory barrier, both GCC and Clang from the riscv-collab/riscv-gnu-toolchain release Nightly: December 16, 2024 tend to do the mask calculation first, causing the for loop to require 4 LMUL registers plus one additional mask register at the same time.
This comment was marked as off-topic.
This comment was marked as off-topic.
|
@fengyuentau So, should I move these reusable parts to a common file or leave them here? I also plan to replace these helper classes with the unified one that was just merged. I’d like to complete all of this in the last commit to avoid redundant CI checks. |
This comment was marked as resolved.
This comment was marked as resolved.
|
@GenshinImpactStarts Thank you for reminder. Here is the results (K1 vs. RK3568): |
I guess it is fine to leave as-is since we can also include this header in where it is needed. |
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
Co-authored-by: Liutong HAN <liutong2020@iscas.ac.cn>
3f87050 to
890ea91
Compare
Implement through the existing interfaces
cv_hal_sqrt32f,cv_hal_sqrt64f,cv_hal_invSqrt32f,cv_hal_invSqrt64f.Perf test done on MUSE-PI and CanMV K230. Because the performance of scalar is much worse than universal intrinsic, only ui and hal rvv is compared.
In RVV's UI,
invSqrtis computed using1 / sqrt(). This patch first usesfrsqrtand then applies the Newton-Raphson method to achieve higher precision. For the initial value, I tried using the famous fast inverse square root algorithm, which involves one bit shift and one subtraction. However, on both MUSE-PI and CanMV K230, the performance was slightly lower (about 3%), so I chose to usefrsqrtfor the initial value instead.BTW, I think this patch can directly replace RVV's UI.
UPDATE: Due to strange vector registers allocation strategy in clang, for
invSqrt, clang use LMUL m4 while gcc use LMUL m8, which leads to some performance loss in clang. So the test for clang is appended.CanMV K230:
MUSE-PI:
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.