
Optimization of DNN using native RISC-V vector intrinsics. #20287

Merged
alalek merged 10 commits into opencv:master from hanliutong:dev-rvv-0.10
Aug 10, 2021

Conversation

@hanliutong
Contributor

@hanliutong hanliutong commented Jun 21, 2021

PR for GSoC'21 project on Optimize OpenCV DNN for RISC-V.

This PR adds functions implemented with RVV intrinsics (v0.10) in layers_common.simd.hpp, covering the 4 functions listed below.

In this PR, we assume that vlen=128, which means that if the RVV vector registers are 256 bits or longer, the current implementation will use only part of them. We will make the implementation adjustable to different vector lengths in other PR(s).

Function (implemented && built, tails vectorized with no scalar code) | Max vector registers used (of v0-v31)
fastGEMM          | ≈ 14*2
fastGEMM1T        | ≈ 16*2
fastConv          | ≈ 4+14*2
fastDepthwiseConv | ≈ 23*2

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There are accuracy tests, performance tests and test data in the opencv_extra repository, if applicable
    The patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Contributor

@asmorkalov asmorkalov left a comment


The solution looks like a translated version of the AVX code. I propose:

  1. Do not compute tails with scalar code as is done for Intel and ARM; instead, set the vector length for the last loop iteration with vl = vsetvl_e32m2(tail_size);. This makes the code simpler, and it should be faster.
  2. There is no alternative vectorization path for now, unlike SSE/AVX/AVX2/AVX512, so I do not think we need checkHardwareSupport for the vectorized code.

@asmorkalov
Contributor

@vpisarev Please take a look and provide your feedback.

@vpisarev
Contributor

vpisarev commented Jul 7, 2021

@hanliutong, thanks for the patch. Please try it on the DNN tests under QEMU. Also, I do not see a port of the most important kernel, convolution. It looks like we need to accelerate a bit: so far the proposed patch is a little small for 1+ month of work, and there is a lot of work to do. As @asmorkalov said, the code looks like an almost direct translation of the SSE2 code. What I'd like to see:

  1. Using bigger blocks. Right now it's 4x2 for GEMM, which takes ~12-16 registers; RVV offers 32 registers, so we can use larger blocks.
  2. More intelligent processing of tails. I believe RVV provides a way to mask out some lanes of a vector register, so that a scalar tail in C is not needed.
  3. But please, before 1 and 2, add optimization of the convolution kernels as soon as possible.

bool tail = false;
if (j + FASCONV_BASE_VECSZ > blockSize)
{
    if (j == 0) {
Contributor


Why do you need this branch? In case blockSize is too small, the tail path should be taken without a regular loop iteration.

Contributor Author


I would like to explain why there is a branch on j == 0.

When j + 4 > blockSize, it usually means we have reached the tail; we then use a tail mask and set j = blockSize - 4 (compute the last 4 elements and store them with the mask).

Example:

  • Assume blockSize=5 and the output is 5*1, ignoring the output channel.
  • The mask will be [0, 0, 0, 1].
  • The first loop iteration computes the first 4 elements of the output matrix, so the output becomes [√, √, √, √, TBD]. Now j = 4.
  • The second and last loop iteration computes the last element with the help of the mask. We actually compute the last 4 elements but store only the last one because of the mask. In detail, at the beginning of this iteration j=4 and we find that j + 4 > blockSize, so we set j = blockSize - 4 = 1, then compute the last 4 elements and store only the last one.

However, there is another situation, when blockSize is smaller than 4. In that case we should not use the mask but use vl instead. I will explain with an example.

Example:

  • Assume blockSize=1 and the output is 1*1, again ignoring the output channel.
  • The mask would again be [0, 0, 0, 1], which can NOT be used (j = blockSize - 4 would be negative).
  • Instead, we set vl=1 and load-compute-store directly with the help of vl.

@hanliutong hanliutong marked this pull request as ready for review August 4, 2021 07:36
@hanliutong hanliutong changed the title WIP: Optimization of DNN using native RISC-V vector intrinsics. Optimization of DNN using native RISC-V vector intrinsics. Aug 9, 2021
@asmorkalov asmorkalov self-requested a review August 9, 2021 07:35
Contributor

@asmorkalov asmorkalov left a comment


Good job! 👍

@alalek alalek merged commit aaca498 into opencv:master Aug 10, 2021
@alalek alalek mentioned this pull request Oct 15, 2021
@hanliutong hanliutong deleted the dev-rvv-0.10 branch November 19, 2021 08:18
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
Optimization of DNN using native RISC-V vector intrinsics.

* Use RVV to optimize fastGEMM (FP32) in DNN.

* Use RVV to optimize fastGEMM1T in DNN.

* Use RVV to optimize fastConv in DNN.

* Use RVV to optimize fastDepthwiseConv in DNN.

* Vectorize tails using vl.

* Use "vl" instead of scalar to handle small block in fastConv.

* Fix memory access out of bound in "fastGEMM1T".

* Remove setvl.

* Remove useless initialization.

* Use loop unrolling to handle tail part instead of switch.

4 participants