
DNN: Accelerating convolution #21910

Merged: alalek merged 6 commits into opencv:4.x from zihaomu:fast_conv_ARM on Jul 1, 2022.
Conversation

@zihaomu
Member

@zihaomu zihaomu commented Apr 25, 2022

The goal of this proposal is to speed up the convolution layer, and thereby the overall inference speed of the dnn module.

| Speed Up Branch | Status | Remarks |
| --- | --- | --- |
| Convolution 2D | ✔️ | Done |
| DepthWise 2D | ✔️ | AVX & universal intrinsics |
| Winograd_Conv2D with 3x3 stride 1 | ✔️ | NEON supported only |
| Cross Platform support | ✔️ | universal intrinsics, NEON, AVX2 |

Performance Test on ARM (Apple M1 chip, 8 threads)

On the ARM platform, it achieves about a 2.5x speedup on ResNet-50 and a 1.7x speedup on MobileNetV2.

| Model Name | Original | With Fast Conv | NCNN FP32 | NCNN FP16 |
| --- | --- | --- | --- | --- |
| ResNet-50 | 65 ms | 26.8 ms | 21.51 ms | 14.29 ms |
| MobileNetV2 | 9.2 ms | 5.43 ms | 3.01 ms | 1.75 ms |

Performance Test on X86 (AMD 5600X, 12 threads)

It achieves about a 15% speedup on the x86 platform.

| Model Name | Original | With Fast Conv | NCNN's benchmark (FP32) |
| --- | --- | --- | --- |
| ResNet-50 | 22.33 ms | 18.5 ms | 22 ms |
| MobileNetV2 | 3.9532 ms | 3.12 ms | 3.0 ms |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

WIP

force_builders=linux,docs,Win32,Linux Debug,Linux AVX2
test_modules=dnn

@zihaomu zihaomu requested a review from vpisarev April 25, 2022 14:14
@zihaomu zihaomu changed the title from "DNN: Accelerating convolution on ARM platforms - WIP" to "DNN: Accelerating convolution - WIP" May 4, 2022
@zihaomu zihaomu force-pushed the fast_conv_ARM branch 2 times, most recently from 4204f89 to e85f3d5 June 9, 2022
@zihaomu zihaomu force-pushed the fast_conv_ARM branch 2 times, most recently from 3281b3c to 67d4dc2 June 13, 2022
@zihaomu zihaomu marked this pull request as ready for review June 14, 2022 05:56
@zihaomu zihaomu closed this Jun 14, 2022
@zihaomu zihaomu reopened this Jun 14, 2022
@zihaomu

This comment was marked as resolved.

@zihaomu zihaomu changed the title from "DNN: Accelerating convolution - WIP" to "DNN: Accelerating convolution" Jun 15, 2022

#include "opencv2/core/hal/intrin.hpp"

#ifndef FAST_CONV_PRAM
Contributor

since you've added separate fast_convolution.simd.hpp, why not move all the new convolution kernels there? Please, move conv_block and depthWiseBlock there. At once, please, normalize the naming. If you use lowercase names with underscores for conv_block, use the same style for depthwise_block. Or make everything mixed case: convBlock, depthWiseBlock

Member Author

Fixed.

}
}

void runFastConv2d(InputArray _input, OutputArray _output,
Contributor
@vpisarev Jun 21, 2022

you have 2 functions with very similar names: doConvolution and runFastConv2d. What's the difference between 'do' and 'run'? Can you modify the name of one of the functions to make it more clear?

Member Author

fixed.

int Kstripes = Kg_nblocks*stripes_per_sample;
int nsubtasks = N*ngroups*Kstripes;

float* inpbuf_all = (float *)fastMalloc(inputbufsize * sizeof(float ));
Contributor
@vpisarev Jun 21, 2022

In Ficus engine I had to use C, because this is the native output language for Ficus compiler. In C++ code, please, never use plain "malloc" or its alternatives. Use std::vector<> instead.
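As a minimal sketch of the suggested change (the function name and size parameter below are hypothetical stand-ins, not the PR's actual code), the manually managed buffer can become a `std::vector` whose destructor releases the storage:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's buffer allocation. Instead of
//   float* inpbuf_all = (float*)fastMalloc(inputbufsize * sizeof(float));
//   ... fastFree(inpbuf_all);
// let std::vector own the storage (RAII), so it is freed automatically
// even if an exception is thrown before the function returns.
std::vector<float> makeInputBuffer(std::size_t inputbufsize)
{
    std::vector<float> inpbuf_all(inputbufsize, 0.f);
    return inpbuf_all;
}
```

Inner loops that expect a raw pointer can still use `inpbuf_all.data()`.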

Member Author

Thanks for the code review; I will fix these issues in the next update.

@zihaomu
Member Author

zihaomu commented Jun 24, 2022

Hi @vpisarev, the code has been updated, and some replies are left in the comments.

using namespace cv::dnn::cuda4dnn;
#endif

#include "./fast_convolution/fast_convolution.hpp"
Member

./

Relative prefix should not be used.

Member Author

fixed.

public:
enum { VEC_ALIGN = 8, DFT_TYPE = CV_32F };
Mat weightsMat;
Mat weightsMat, fastWeights;
Member

It is better to put this near fastConv2dImpl.

Also, it makes sense to add documentation about its layout and the difference from weightsMat.

Member Author

Done.

Comment on lines +1 to +5
/*
This file is a part of ficus language project.
See ficus/LICENSE for the licensing terms
*/
// This file is modified from the ficus (https://github.com/vpisarev/ficus/blob/master/lib/NN/OpConv.fx)
Member

@vpisarev Please verify, as this integration contradicts the conventions for 3rdparty original files and/or 3rdparty adopted files.

Member Author

Thanks for the code review. Any advice on this? I am not sure how to modify it.

Contributor

@alalek, could you please explain your comment? What's the contradiction? I can confirm that the code has been borrowed from Ficus, licensed under Apache 2 license.

Member

I expect a header similar to modules/dnn/src/layers/fast_convolution/winograd_3x3s1_f63.cpp, where we have the OpenCV header on top and then the original license header of the adapted code.
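The requested layout would look roughly like this (a sketch: the first three lines are OpenCV's standard short license header, followed by the original Ficus notice quoted verbatim):

```cpp
// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level
// directory of this distribution and at http://opencv.org/license.html.

// This file is modified from the ficus project
// (https://github.com/vpisarev/ficus/blob/master/lib/NN/OpConv.fx).
// Here is the original license:
/*
    This file is a part of ficus language project.
    See ficus/LICENSE for the licensing terms
*/
```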

Member Author

Done.

Comment on lines +123 to +124
CV_LOG_WARNING(NULL, "Runing at unoptimized code. The combination of FAST_CONV_MR and/or FAST_CONV_NR "
"is not supported in SIMD128 branch.");
Member

ISA-targeted/SIMD code should not emit warnings (or in general call any other non-optimized functions).

Contributor

@zihaomu, it should be compile-time error. If user changes FAST_CONV_MR/FAST_CONV_NR, he/she should also modify the optimized loop or explicitly disable it and switch to C implementation.
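A compile-time check along these lines would implement the suggestion (the concrete MR/NR values below are illustrative assumptions, not necessarily the ones the PR uses):

```cpp
// Illustrative values; the PR's actual FAST_CONV_MR/FAST_CONV_NR may differ.
#ifndef FAST_CONV_MR
#define FAST_CONV_MR 4
#endif
#ifndef FAST_CONV_NR
#define FAST_CONV_NR 24
#endif

// Fail the build, rather than warn at runtime, if someone changes the
// register-blocking factors without also rewriting the SIMD128 kernel.
static_assert(FAST_CONV_MR == 4 && FAST_CONV_NR == 24,
              "The SIMD128 conv kernel is written for FAST_CONV_MR=4 and "
              "FAST_CONV_NR=24; update the kernel or disable this branch.");
```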

Member Author

Ok, I will update it later.

@zihaomu zihaomu requested a review from vpisarev June 27, 2022 11:09
@zihaomu zihaomu requested a review from alalek June 29, 2022 09:06
using namespace cv::dnn::cuda4dnn;
#endif

#include "fast_convolution/fast_convolution.hpp"
Member

Need to take a look on:

  • test failures in Linux Debug configuration
  • test failures in Linux AVX2 configuration (-DCPU_BASELINE=AVX2)
  • looks like unconditional doubling of weights storage requires more memory and several Win32 tests started to fail with OOM message. @vpisarev

Member Author

Thanks for the code review. The failures in Linux Debug and Linux AVX2 only occur in the quantized models. Since the parameters of int8 layers rely on the output of fp32 models, we can adjust the test thresholds to solve this in the short term.

For Win32, I'm looking for a way around it.

Member Author

For Win32, I have changed the memory limitation of FasterRCNN_vgg16 from 1 GB to 2 GB. FasterRCNN_vgg16 has a large memory requirement (412 MB needed for one FC layer). The new patch needs to pack the weights in advance during the initialization phase, so for Conv we need twice as much memory to store the weights.
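To illustrate why the weight storage roughly doubles (a hypothetical sketch, not the PR's actual packing layout): the original weights stay alive while an additional, microkernel-friendly copy is built once at init time. A simple MR-blocked packing might look like:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: pack a K x C weight matrix (row-major, w[k*C + c])
// into row blocks of height MR, zero-padding the tail block, so the GEMM
// microkernel can read fixed-size tiles. The packed copy lives alongside
// the original weights, hence roughly 2x weight memory.
std::vector<float> packWeights(const std::vector<float>& w,
                               std::size_t K, std::size_t C, std::size_t MR)
{
    std::size_t Kpadded = (K + MR - 1) / MR * MR;
    std::vector<float> packed(Kpadded * C, 0.f);
    for (std::size_t blk = 0; blk < Kpadded; blk += MR)      // block of MR rows
        for (std::size_t c = 0; c < C; ++c)                  // each column
            for (std::size_t r = 0; r < MR; ++r)             // rows in block
            {
                std::size_t k = blk + r;
                if (k < K)
                    // interleave MR consecutive output channels per column
                    packed[blk * C + c * MR + r] = w[k * C + c];
            }
    return packed;
}
```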

Mat inp = blobFromImage(img, 1.0, Size(320, 240), Scalar(103.939, 116.779, 123.68), false, false);
// Output image has values in range [-143.526, 148.539].
float l1 = 4e-5, lInf = 2e-3;
float l1 = 5e-5, lInf = 2e-3;
Member

Please use x2-x5 values for test tolerance checks.

4e-5 => 1e-4 instead of 5e-5


Below:

1e-5 => 1e-4 instead of 1.01e-5

@alalek alalek merged commit 59b870a into opencv:4.x Jul 1, 2022
@asenyaev
Contributor
asenyaev commented Jul 1, 2022

After merging this PR, Android build fails: https://github.com/opencv/ci-gha-workflow/runs/7157789387?check_suite_focus=true#step:8:1639

@zihaomu
Member Author
zihaomu commented Jul 1, 2022

Thanks for that, I'm looking for a solution; it's estimated to take a day or two to resolve.

@zihaomu zihaomu mentioned this pull request Jul 3, 2022
@hanliutong hanliutong mentioned this pull request Jul 21, 2022
@alalek alalek mentioned this pull request Aug 21, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
DNN: Accelerating convolution

* Fast Conv of ARM, X86 and universal intrinsics.

* improve code style.

* error fixed.

* improve the License

* optimize memory allocated and Adjust the threshold.

* change FasterRCNN_vgg16 to 2GB memory.


4 participants