Optimize int8 layers in DNN modules by using RISC-V Vector intrinsic. by hanliutong · Pull Request #25230 · opencv/opencv

hanliutong · 2024-03-19T04:17:38Z

This patch optimizes 3 functions in the int8 layers by using RVV native intrinsics.

The patch was tested on QEMU with VLEN=128 and VLEN=256 by running ./bin/opencv_test_dnn --gtest_filter="*Int8*".
On a real device (k230, VLEN=128), EfficientDet_int8 in opencv_perf_dnn showed a 1.46x speed-up.

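| Name of Test | Original | Optimized | Speed-up |
| ------------------------------------------ | -------- | -------- | -------- |
| EfficientDet_int8::DNNTestNetwork::OCV/CPU | 2843.467 | 1947.013 | 1.46 |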

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

mshabunin · 2024-03-19T20:04:39Z

Which compiler did you use? Currently OpenCV uses v0.10 of the RVV intrinsics plus compatibility layers for v0.11 and v0.12. It seems you've used the latest intrinsics version. Perhaps these code blocks should be guarded with #if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic>=12000. Alternatively, we could consider upgrading all our intrinsics to the latest version and dropping support for older compilers (an intrinsics version check would be added to the current compile check).

hanliutong · 2024-03-20T03:00:10Z

I'm using clang 16.0.6. This code works when the RVV intrinsics version is 0.11 or 0.12, so both clang 16.x and 17.x (and trunk) should work. With clang 15.x, which only supports version 0.10, this code will not compile.

Considering that v1.0-rc1 of the RVV intrinsics has been released and looks very close to stable, I think it is reasonable to upgrade all our intrinsics to v1.0 (once it is officially released).

For now, we can use #if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic>=11000 to temporarily guard this code.
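A minimal sketch of how such a guard might look, assuming the usual encoding of __riscv_v_intrinsic as major * 1,000,000 + minor * 1,000 + revision (which matches the 11000 and 12000 thresholds mentioned in this thread). The names SKETCH_HAVE_RVV_011, rvv_intrinsic_version and selected_int8_path are hypothetical and are not taken from the patch.

```cpp
#include <cstdint>

// Same condition as the guard proposed above. The macro name
// SKETCH_HAVE_RVV_011 is hypothetical, not an OpenCV macro.
#if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic >= 11000
#define SKETCH_HAVE_RVV_011 1
#else
#define SKETCH_HAVE_RVV_011 0
#endif

// Assumed encoding of the version macro:
// major * 1'000'000 + minor * 1'000 + revision.
constexpr long rvv_intrinsic_version(int major, int minor, int revision = 0) {
    return major * 1000000L + minor * 1000L + revision;
}

// v0.10 falls below the guard, while v0.11, v0.12 and v1.0 satisfy it.
static_assert(rvv_intrinsic_version(0, 10) < 11000, "v0.10 is rejected");
static_assert(rvv_intrinsic_version(0, 11) >= 11000, "v0.11 is accepted");
static_assert(rvv_intrinsic_version(1, 0) >= 11000, "v1.0 is accepted");

// Reports which code path this translation unit was compiled with.
// A real kernel would place its vectorized body inside the #if branch.
const char* selected_int8_path() {
#if SKETCH_HAVE_RVV_011
    return "rvv";
#else
    return "scalar";
#endif
}
```

With a guard of this shape, a compiler that only provides v0.10 intrinsics never sees the newer intrinsic names and builds the existing fallback path instead.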

hanliutong · 2024-03-20T03:26:59Z

Compiler compatibility: https://godbolt.org/z/ns9afhTae

asmorkalov · 2024-03-28T08:08:26Z

@mshabunin Could you take a look? Let's merge it.

mshabunin · 2024-03-28T10:00:59Z

> @mshabunin Could you take a look? Let's merge it.

I'll take a closer look today.

mshabunin · 2024-03-28T11:09:05Z

modules/dnn/src/int8layers/layers_common.simd.hpp

+            vint32m1_t zero = __riscv_vmv_v_x_i32m1(0, e8m1);
+            int sum0[FASCONV_BASE_VECSZ], sum1[FASCONV_BASE_VECSZ], sum2[FASCONV_BASE_VECSZ];
+            int vs[16] = {0};
+            __riscv_vse32(vs, vs00, e8m1);


I have a problem here when compiling with GCC 13.2+ (13.x release branch somewhere after 13.2.0):

/work/opencv/modules/dnn/src/int8layers/layers_common.simd.hpp:1370:26: error: no matching function for call to '__riscv_vse32(int [16], vint32m2_t&, const size_t&)'
 1370 |             __riscv_vse32(vs, vs00, e8m1);
      |             ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~

GCC version:

./riscv64-unknown-linux-gnu-g++ --version
riscv64-unknown-linux-gnu-g++ (g128d9cc0599) 13.2.1 20240220

I'm using this GCC because 13.2.0 cannot build OpenCV due to an error that was fixed after that release, both on the releases/gcc-13 branch and on trunk.

This overloaded intrinsic can be replaced with __riscv_vse32_v_i32m2.

BTW, is int vs[16] array actually used anywhere?

hanliutong · 2024-03-28

Sorry, it looks like the code I used for debugging was not removed. I have deleted it now.

mshabunin · 2024-03-28T11:27:11Z

For some reason some tests fail when built with GCC 13.2+ (qemu 8.2.1):

qemu-riscv64 \
	-L ${sysroot} \
	-cpu rv64,v=true,vext_spec=v1.0 \
./bin/opencv_test_$t --gtest_filter="*Int8_layers*"
...
[  FAILED  ] 11 tests, listed below:
[  FAILED  ] Test_Int8_layers.Convolution2D/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Padding/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.AvePooling/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.MaxPooling/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Softmax_slim_TF/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Concat/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.InnerProduct/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Reshape/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Slice_4d_tf/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Slice_strided_tf/0, where GetParam() = OCV/CPU
[  FAILED  ] Test_Int8_layers.Eltwise/0, where GetParam() = OCV/CPU

Examples of failures:

[ RUN      ] Test_Int8_layers.Convolution2D/0, where GetParam() = OCV/CPU
/work/opencv/modules/dnn/test/test_common.impl.hpp:76: Failure
Expected: (normL1) <= (l1), actual: 0.49239 vs 0.00413
single_conv  |ref| = 4.2324190139770508
/work/opencv/modules/dnn/test/test_common.impl.hpp:79: Failure
Expected: (normInf) <= (lInf), actual: 3.21996 vs 0.02201
single_conv  |ref| = 4.2324190139770508
/work/opencv/modules/dnn/test/test_common.impl.hpp:76: Failure
Expected: (normL1) <= (l1), actual: 1.6915 vs 0.0193
atrous_conv2d_valid  |ref| = 9.5424537658691406

With clang 17, these tests pass. I'm not sure whether the problem is with GCC or OpenCV. Perhaps we can merge it as-is and try to find the problem later.

hanliutong · 2024-03-28T12:05:18Z

I can also reproduce these test failures with GCC 14.0.1 (trunk). Working on it.

hanliutong · 2024-03-29T07:45:25Z

Fixed. The tests now pass with both clang and GCC. Thanks for testing on GCC, @mshabunin!

GCC is right; the tests passed on clang only by luck. When accumulating vectors, we should use "tail undisturbed" (tu) so that the elements past vl do not change (they keep the last accumulation result). I incorrectly used "tail agnostic" (ta) earlier.
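The difference can be illustrated with a small scalar model. This is only a sketch under stated assumptions, not the code from the patch: the register has a hypothetical 4 lanes, and the tail-agnostic case is modeled by overwriting tail lanes with all ones, which is one of the outcomes the RVV specification permits (the other is leaving them unchanged, which is presumably why the clang build happened to pass). VReg, TailPolicy, vadd_model and sum_model are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 4;  // hypothetical VLMAX for this model
using VReg = std::array<int32_t, kLanes>;

enum class TailPolicy { Undisturbed, Agnostic };

// Models acc[i] += x[i] for i < vl. Lanes at or past vl are the "tail":
// with Undisturbed they keep their old value; with Agnostic the hardware
// may overwrite them, modeled here as all bits set (-1).
VReg vadd_model(VReg acc, const int32_t* x, std::size_t vl, TailPolicy policy) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        if (i < vl) {
            acc[i] += x[i];
        } else if (policy == TailPolicy::Agnostic) {
            acc[i] = -1;
        }
    }
    return acc;
}

// Accumulates the data in chunks of up to kLanes elements, then reduces
// across every lane of the accumulator, as a vector kernel typically does.
int32_t sum_model(const std::vector<int32_t>& data, TailPolicy policy) {
    VReg acc{};  // all lanes start at zero
    for (std::size_t i = 0; i < data.size(); i += kLanes) {
        const std::size_t vl = std::min(kLanes, data.size() - i);
        acc = vadd_model(acc, data.data() + i, vl, policy);
    }
    int32_t total = 0;
    for (int32_t lane : acc) total += lane;
    return total;
}
```

For the input {1, 2, 3, 4, 5, 6}, the second iteration runs with vl = 2. In this model the undisturbed policy yields 21, while the agnostic policy clobbers lanes 2 and 3 and yields 12. In the intrinsics API the undisturbed variants carry a _tu suffix and take the destination register as an extra operand.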

mshabunin

Awesome! 👍

Optimize int8 layers in DNN modules by using RVV intrinsics.

d3b738e

opencv-alalek added the optimization, category: dnn, and platform: riscv labels Mar 19, 2024

opencv-alalek added this to the 4.10.0 milestone Mar 19, 2024

Eliminate register folded and reload.

8061bef

mshabunin self-assigned this Mar 19, 2024

Add macro guard.

978f50d

asmorkalov requested a review from mshabunin March 23, 2024 11:00

mshabunin reviewed Mar 28, 2024

Remove redundant code.

af35024

Use tail undisturbed (tu) instead.

65287a5

mshabunin approved these changes Mar 30, 2024

asmorkalov merged commit eba158f into opencv:4.x Mar 31, 2024

asmorkalov mentioned this pull request Apr 1, 2024

5.x merge 4.x #25305

Merged

hanliutong deleted the rvv-conv branch April 7, 2024 03:27

asmorkalov merged 5 commits into opencv:4.x from hanliutong:rvv-conv
