Update universal intrinsics of RVV back-end. by hanliutong · Pull Request #21351 · opencv/opencv

hanliutong · 2021-12-27T13:07:30Z

The goal of this patch is to optimize the existing Universal Intrinsics implementation of RVV.
The initial version of RVV UI is #18228.

The main difference between this patch and the initial version is the design of the data structure: Using a pointer to the RVV Intrinsic Type instead of an array to hold the value. In this way, no more Intrinsic functions such as load & store needed in the struct, which provides opportunities for compilation optimization.

This is a comparison on assembly level, taking v_fma as an example, the implementation of this patch reduces the number of instructions by about 50%.

This patch is tested on QEMU with the core testcases and the dnn testcases, and the results are the same as master.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
The PR is proposed to proper branch
There is reference to original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

modules/core/include/opencv2/core/hal/intrin_rvv.hpp

asmorkalov · 2021-12-29T11:44:15Z

modules/core/include/opencv2/core/hal/intrin_rvv.hpp

+    inline v_uint8x16(const v_uint8x16& vec) {
+        *pval = *(vec.pval);
+    }
+    vuint8m1_t* pval = (vuint8m1_t*)malloc(nlanes*8);


It's bad idea. The vector registers are usually local variables in function or several functions in call chain. Each vector variable creation cases expensive call and potentially heap fragmentation. I propose to use vuint8m1_t[nlanes*8] for the field as soon as it's always allocated/

asmorkalov · 2021-12-29T11:44:53Z

modules/core/include/opencv2/core/hal/intrin_rvv.hpp

+    inline v_int8x16(const v_int8x16& vec) {
+        *pval = *(vec.pval);
+    }
+    vint8m1_t* pval = (vint8m1_t*)malloc(nlanes*8);


The same here.

hanliutong · 2021-12-30T10:44:33Z

Hi @asmorkalov, thank you for your review and I am going to explain the design of this patch in detail about your review comment.

I agree that calling malloc for each vector variable is expensive and may fragment the heap.

The reason I did this is to get a vector variable *pval in struct in order to remove unnecessary intrinsic functions for loading array and storing vector.

Normally, for a vector variable, we would try to declare it with intrinsic type directly. But we can not declare a RVV intrinsic vector type like vuint8m1_t val in the struct because it is sizeless and the compiler doesn't know how many bits did it needed on the stack. However, in OpenCV UI, we know exactly how much bit is needed, 128.

So I tried an scalar array and a RVV type pointer like below (but not a RVV type array like vuint8m1_t[nlanes*8], we may NOT need an array containing vector) , in this way, *pval is the vector variable that we want:

float val[nlanes];
vfloat32m1_t* pval = (vfloat32m1_t*)val;

it's work but generate unecessay store instructions. And I find that use the malloc way like below can eliminate unecessay instructions, same as the array way, *pval is the vector variable that we want:

vfloat32m1_t* pval = (vfloat32m1_t*)malloc(nlanes*32);
~v_float32x4() {free(pval);}

it also works as well as more concise assembly. And most importantly, although we call malloc in the source code, no malloc is actually called in assembly with the help of compiler optimization because (I guess) the parameters of the malloc function is a constant (128). But in more complex cases, the compiler may not be able to optimize.

https://godbolt.org/z/8sa6W8vf3 shows the difference between using a scalar array and using malloc.

Therefore, there is a tradeoff between more unnecessary store instructions (by using array) and possible malloc calls, what's your opinion?

And I would like to try to explain why I think this does not cause a memory leak:

Firstly, if there is no malloc called in assembly, then no memory in heap is created, also no way to leak.
Secondly, assuming there is a malloc called. In this case, we always operate as *pval and do not operate pval directly, which means we may change the value pointed by the pointer, but does not change the pointer itself. In other words, the pointer pval always points to the memory from malloc, and will be free in the destructor.

So there does not seem to be a memory leak, please correct me if I am wrong.

asmorkalov · 2022-01-10T06:07:19Z

Even if compiler is smart enough and does not generate real malloc calls in some cases, it's bad idea to drop pointer values and replace them with the new one. It bad logic and it stops working. Also, it's unsafe to call free for the cases like this:

        short v[] = {v0, v1, v2, v3, v4, v5, v6, v7};
        *pval = vle16_v_i16m1(v, nlanes);

The pval points to stack frame, but not to buffer on heap. And it's good luck again, that compiler helps you with UB.

asmorkalov

👍

asmorkalov · 2022-03-30T13:04:27Z

Tested with qemu64 and clang version 14.0.0 (https://github.com/llvm/llvm-project.git f5f0bd8e3d972d040827c88adf1a9880ebcb821e). Also I have to tune toolchain file and replace hardware configuration with -march=rv64gcv1p0.

* Update universal intrinsics of RVV back-end. * Use array instead of malloc.

Update universal intrinsics of RVV back-end.

dc5578a

asmorkalov requested changes Dec 29, 2021

View reviewed changes

Use array instead of malloc.

20011bd

asmorkalov approved these changes Mar 30, 2022

View reviewed changes

alalek assigned asmorkalov Mar 30, 2022

alalek merged commit 3e4a566 into opencv:4.x Mar 30, 2022

opencv-pushbot mentioned this pull request Apr 23, 2022

(5.x) Merge 4.x #21903

Merged

hanliutong deleted the rvv-clang branch May 12, 2022 03:57

a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023

Merge pull request opencv#21351 from hanliutong:rvv-clang

8fa16cb

* Update universal intrinsics of RVV back-end. * Use array instead of malloc.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update universal intrinsics of RVV back-end.#21351

Update universal intrinsics of RVV back-end.#21351
alalek merged 2 commits intoopencv:4.xfrom
hanliutong:rvv-clang

hanliutong commented Dec 27, 2021

Uh oh!

Uh oh!

asmorkalov Dec 29, 2021

Uh oh!

asmorkalov Dec 29, 2021

Uh oh!

hanliutong commented Dec 30, 2021 •

edited

Loading

Uh oh!

asmorkalov commented Jan 10, 2022

Uh oh!

asmorkalov left a comment

Uh oh!

asmorkalov commented Mar 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

hanliutong commented Dec 27, 2021

Pull Request Readiness Checklist

Uh oh!

Uh oh!

asmorkalov Dec 29, 2021

Choose a reason for hiding this comment

Uh oh!

asmorkalov Dec 29, 2021

Choose a reason for hiding this comment

Uh oh!

hanliutong commented Dec 30, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

asmorkalov commented Jan 10, 2022

Uh oh!

asmorkalov left a comment

Choose a reason for hiding this comment

Uh oh!

asmorkalov commented Mar 30, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hanliutong commented Dec 30, 2021 •

edited

Loading