Skip to content

Update universal intrinsics of RVV back-end.#21351

Merged
alalek merged 2 commits intoopencv:4.xfrom
hanliutong:rvv-clang
Mar 30, 2022
Merged

Update universal intrinsics of RVV back-end.#21351
alalek merged 2 commits intoopencv:4.xfrom
hanliutong:rvv-clang

Conversation

@hanliutong
Copy link
Copy Markdown
Contributor

The goal of this patch is to optimize the existing Universal Intrinsics implementation of RVV.
The initial version of RVV UI is #18228.

The main difference between this patch and the initial version is the design of the data structure: Using a pointer to the RVV Intrinsic Type instead of an array to hold the value. In this way, no more Intrinsic functions such as load & store needed in the struct, which provides opportunities for compilation optimization.

This is a comparison on assembly level, taking v_fma as an example, the implementation of this patch reduces the number of instructions by about 50%.

This patch is tested on QEMU with the core testcases and the dnn testcases, and the results are the same as master.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

inline v_uint8x16(const v_uint8x16& vec) {
*pval = *(vec.pval);
}
vuint8m1_t* pval = (vuint8m1_t*)malloc(nlanes*8);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's bad idea. The vector registers are usually local variables in function or several functions in call chain. Each vector variable creation cases expensive call and potentially heap fragmentation. I propose to use vuint8m1_t[nlanes*8] for the field as soon as it's always allocated/

inline v_int8x16(const v_int8x16& vec) {
*pval = *(vec.pval);
}
vint8m1_t* pval = (vint8m1_t*)malloc(nlanes*8);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same here.

@hanliutong
Copy link
Copy Markdown
Contributor Author

hanliutong commented Dec 30, 2021

Hi @asmorkalov, thank you for your review and I am going to explain the design of this patch in detail about your review comment.

I agree that calling malloc for each vector variable is expensive and may fragment the heap.

The reason I did this is to get a vector variable *pval in struct in order to remove unnecessary intrinsic functions for loading array and storing vector.

Normally, for a vector variable, we would try to declare it with intrinsic type directly. But we can not declare a RVV intrinsic vector type like vuint8m1_t val in the struct because it is sizeless and the compiler doesn't know how many bits did it needed on the stack. However, in OpenCV UI, we know exactly how much bit is needed, 128.

So I tried an scalar array and a RVV type pointer like below (but not a RVV type array like vuint8m1_t[nlanes*8], we may NOT need an array containing vector) , in this way, *pval is the vector variable that we want:

float val[nlanes];
vfloat32m1_t* pval = (vfloat32m1_t*)val;

it's work but generate unecessay store instructions. And I find that use the malloc way like below can eliminate unecessay instructions, same as the array way, *pval is the vector variable that we want:

vfloat32m1_t* pval = (vfloat32m1_t*)malloc(nlanes*32);
~v_float32x4() {free(pval);}

it also works as well as more concise assembly. And most importantly, although we call malloc in the source code, no malloc is actually called in assembly with the help of compiler optimization because (I guess) the parameters of the malloc function is a constant (128). But in more complex cases, the compiler may not be able to optimize.

https://godbolt.org/z/8sa6W8vf3 shows the difference between using a scalar array and using malloc.

Therefore, there is a tradeoff between more unnecessary store instructions (by using array) and possible malloc calls, what's your opinion?

And I would like to try to explain why I think this does not cause a memory leak:

Firstly, if there is no malloc called in assembly, then no memory in heap is created, also no way to leak.
Secondly, assuming there is a malloc called. In this case, we always operate as *pval and do not operate pval directly, which means we may change the value pointed by the pointer, but does not change the pointer itself. In other words, the pointer pval always points to the memory from malloc, and will be free in the destructor.

So there does not seem to be a memory leak, please correct me if I am wrong.

@asmorkalov
Copy link
Copy Markdown
Contributor

Even if compiler is smart enough and does not generate real malloc calls in some cases, it's bad idea to drop pointer values and replace them with the new one. It bad logic and it stops working. Also, it's unsafe to call free for the cases like this:

        short v[] = {v0, v1, v2, v3, v4, v5, v6, v7};
        *pval = vle16_v_i16m1(v, nlanes);

The pval points to stack frame, but not to buffer on heap. And it's good luck again, that compiler helps you with UB.

Copy link
Copy Markdown
Contributor

@asmorkalov asmorkalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@asmorkalov
Copy link
Copy Markdown
Contributor

Tested with qemu64 and clang version 14.0.0 (https://github.com/llvm/llvm-project.git f5f0bd8e3d972d040827c88adf1a9880ebcb821e). Also I have to tune toolchain file and replace hardware configuration with -march=rv64gcv1p0.

@alalek alalek merged commit 3e4a566 into opencv:4.x Mar 30, 2022
@opencv-pushbot opencv-pushbot mentioned this pull request Apr 23, 2022
@hanliutong hanliutong deleted the rvv-clang branch May 12, 2022 03:57
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
* Update universal intrinsics of RVV back-end.

* Use array instead of malloc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants