Update universal intrinsics of RVV back-end.#21351
Conversation
| inline v_uint8x16(const v_uint8x16& vec) { | ||
| *pval = *(vec.pval); | ||
| } | ||
| vuint8m1_t* pval = (vuint8m1_t*)malloc(nlanes*8); |
There was a problem hiding this comment.
It's bad idea. The vector registers are usually local variables in function or several functions in call chain. Each vector variable creation cases expensive call and potentially heap fragmentation. I propose to use vuint8m1_t[nlanes*8] for the field as soon as it's always allocated/
| inline v_int8x16(const v_int8x16& vec) { | ||
| *pval = *(vec.pval); | ||
| } | ||
| vint8m1_t* pval = (vint8m1_t*)malloc(nlanes*8); |
|
Hi @asmorkalov, thank you for your review and I am going to explain the design of this patch in detail about your review comment. I agree that calling malloc for each vector variable is expensive and may fragment the heap. The reason I did this is to get a vector variable Normally, for a vector variable, we would try to declare it with intrinsic type directly. But we can not declare a RVV intrinsic vector type like So I tried an scalar array and a RVV type pointer like below (but not a RVV type array like float val[nlanes];
vfloat32m1_t* pval = (vfloat32m1_t*)val;it's work but generate unecessay store instructions. And I find that use the malloc way like below can eliminate unecessay instructions, same as the array way, vfloat32m1_t* pval = (vfloat32m1_t*)malloc(nlanes*32);
~v_float32x4() {free(pval);}it also works as well as more concise assembly. And most importantly, although we call malloc in the source code, no malloc is actually called in assembly with the help of compiler optimization because (I guess) the parameters of the malloc function is a constant (128). But in more complex cases, the compiler may not be able to optimize. https://godbolt.org/z/8sa6W8vf3 shows the difference between using a scalar array and using malloc. Therefore, there is a tradeoff between more unnecessary store instructions (by using array) and possible malloc calls, what's your opinion? And I would like to try to explain why I think this does not cause a memory leak: Firstly, if there is no malloc called in assembly, then no memory in heap is created, also no way to leak. So there does not seem to be a memory leak, please correct me if I am wrong. |
|
Even if compiler is smart enough and does not generate real malloc calls in some cases, it's bad idea to drop pointer values and replace them with the new one. It bad logic and it stops working. Also, it's unsafe to call free for the cases like this: The |
|
Tested with qemu64 and clang version 14.0.0 (https://github.com/llvm/llvm-project.git f5f0bd8e3d972d040827c88adf1a9880ebcb821e). Also I have to tune toolchain file and replace hardware configuration with |
* Update universal intrinsics of RVV back-end. * Use array instead of malloc.
The goal of this patch is to optimize the existing Universal Intrinsics implementation of RVV.
The initial version of RVV UI is #18228.
The main difference between this patch and the initial version is the design of the data structure: Using a pointer to the RVV Intrinsic Type instead of an array to hold the value. In this way, no more Intrinsic functions such as load & store needed in the struct, which provides opportunities for compilation optimization.
This is a comparison on assembly level, taking
v_fmaas an example, the implementation of this patch reduces the number of instructions by about 50%.This patch is tested on QEMU with the core testcases and the dnn testcases, and the results are the same as master.
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.