
DNN: Accelerating convolution #21910

Merged: alalek merged 6 commits into opencv:4.x from zihaomu:fast_conv_ARM on Jul 1, 2022.
Conversation

@zihaomu
Member

@zihaomu zihaomu commented Apr 25, 2022

The goal of this proposal is to speed up the convolution layer, and thereby the overall inference speed of the dnn module.

| Speed Up Branch | Status | Remarks |
| --- | --- | --- |
| Convolution 2D | ✔️ | Done |
| DepthWise 2D | ✔️ | AVX & universal intrinsics |
| Winograd_Conv2D with 3x3 stride 1 | ✔️ | NEON supported only |
| Cross Platform support | ✔️ | universal intrinsics, NEON, AVX2 |

Performance Test on ARM (Apple M1 chip, 8 threads)

On the ARM platform, it achieves about a 2.5x speedup on ResNet-50 and a 1.7x speedup on MobileNetV2.

| Model Name | Original | With Fast Conv | NCNN FP32 | NCNN FP16 |
| --- | --- | --- | --- | --- |
| ResNet-50 | 65 ms | 26.8 ms | 21.51 ms | 14.29 ms |
| MobileNetV2 | 9.2 ms | 5.43 ms | 3.01 ms | 1.75 ms |

Performance Test on X86 (AMD 5600X, 12 threads)

It achieves about a 15% speedup on the x86 platform.

| Model Name | Original | With Fast Conv | NCNN's benchmark (FP32) |
| --- | --- | --- | --- |
| ResNet-50 | 22.33 ms | 18.5 ms | 22 ms |
| MobileNetV2 | 3.9532 ms | 3.12 ms | 3.0 ms |

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

WIP

force_builders=linux,docs,Win32,Linux Debug,Linux AVX2
test_modules=dnn

@zihaomu zihaomu requested a review from vpisarev April 25, 2022 14:14
@zihaomu zihaomu changed the title from "DNN: Accelerating convolution on ARM platforms - WIP" to "DNN: Accelerating convolution - WIP" May 4, 2022
@zihaomu zihaomu force-pushed the fast_conv_ARM branch 2 times, most recently from 4204f89 to e85f3d5 June 9, 2022
@zihaomu zihaomu force-pushed the fast_conv_ARM branch 2 times, most recently from 3281b3c to 67d4dc2 June 13, 2022
@zihaomu zihaomu marked this pull request as ready for review June 14, 2022 05:56
@zihaomu zihaomu closed this Jun 14, 2022
@zihaomu zihaomu reopened this Jun 14, 2022
@zihaomu

This comment was marked as resolved.

@zihaomu zihaomu changed the title from "DNN: Accelerating convolution - WIP" to "DNN: Accelerating convolution" Jun 15, 2022

#include "opencv2/core/hal/intrin.hpp"

#ifndef FAST_CONV_PRAM
Contributor

since you've added separate fast_convolution.simd.hpp, why not move all the new convolution kernels there? Please, move conv_block and depthWiseBlock there. At once, please, normalize the naming. If you use lowercase names with underscores for conv_block, use the same style for depthwise_block. Or make everything mixed case: convBlock, depthWiseBlock

Member Author

Fixed.

}
}

void runFastConv2d(InputArray _input, OutputArray _output,
Contributor
@vpisarev Jun 21, 2022

you have 2 functions with very similar names: doConvolution and runFastConv2d. What's the difference between 'do' and 'run'? Can you modify the name of one of the functions to make it more clear?

Member Author

fixed.

int Kstripes = Kg_nblocks*stripes_per_sample;
int nsubtasks = N*ngroups*Kstripes;

float* inpbuf_all = (float *)fastMalloc(inputbufsize * sizeof(float ));
Contributor
@vpisarev Jun 21, 2022

In Ficus engine I had to use C, because this is the native output language for Ficus compiler. In C++ code, please, never use plain "malloc" or its alternatives. Use std::vector<> instead.
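As a minimal sketch of the suggested change (the function name and size parameter below are hypothetical stand-ins, not the PR's actual code), the manually managed buffer can become a `std::vector` whose destructor releases the storage:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's buffer allocation. Instead of
//   float* inpbuf_all = (float*)fastMalloc(inputbufsize * sizeof(float));
//   ... fastFree(inpbuf_all);
// let std::vector own the storage (RAII), so it is freed automatically
// even if an exception is thrown before the function returns.
std::vector<float> makeInputBuffer(std::size_t inputbufsize)
{
    std::vector<float> inpbuf_all(inputbufsize, 0.f);
    return inpbuf_all;
}
```

Inner loops that expect a raw pointer can still use `inpbuf_all.data()`.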

Member Author

Thanks for the code review; I will fix these issues in the next update.

@zihaomu
Member Author

zihaomu commented Jun 24, 2022

Hi @vpisarev, the code has been updated, and some replies are left in the comments.

using namespace cv::dnn::cuda4dnn;
#endif

#include "./fast_convolution/fast_convolution.hpp"
Member

./

Relative prefix should not be used.

Member Author

fixed.

public:
enum { VEC_ALIGN = 8, DFT_TYPE = CV_32F };
Mat weightsMat;
Mat weightsMat, fastWeights;
Member

It is better to put this near fastConv2dImpl.

Also, it makes sense to add documentation about its layout and the difference from weightsMat.

Member Author

Done.

Comment on lines +1 to +5
/*
This file is a part of ficus language project.
See ficus/LICENSE for the licensing terms
*/
// This file is modified from the ficus (https://github.com/vpisarev/ficus/blob/master/lib/NN/OpConv.fx)
Member

@vpisarev Please verify, as this integration contradicts the conventions for 3rdparty original files and/or 3rdparty adopted files.

Member Author

Thanks for the code review. Any advice on this? I am not sure how to modify it.

Contributor

@alalek, could you please explain your comment? What's the contradiction? I can confirm that the code has been borrowed from Ficus, licensed under Apache 2 license.

Member

I expect a header similar to modules/dnn/src/layers/fast_convolution/winograd_3x3s1_f63.cpp, where we have the OpenCV header on top and then the original license header of the adapted code.
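The requested layout would look roughly like this (a sketch: the first three lines are OpenCV's standard short license header, followed by the original Ficus notice quoted verbatim):

```cpp
// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level
// directory of this distribution and at http://opencv.org/license.html.

// This file is modified from the ficus project
// (https://github.com/vpisarev/ficus/blob/master/lib/NN/OpConv.fx).
// Here is the original license:
/*
    This file is a part of ficus language project.
    See ficus/LICENSE for the licensing terms
*/
```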

Member Author

Done.

Comment on lines +123 to +124
CV_LOG_WARNING(NULL, "Runing at unoptimized code. The combination of FAST_CONV_MR and/or FAST_CONV_NR "
"is not supported in SIMD128 branch.");
Member

ISA-targeted/SIMD code should not emit warnings (or in general call any other non-optimized functions).

Contributor

@zihaomu, it should be compile-time error. If user changes FAST_CONV_MR/FAST_CONV_NR, he/she should also modify the optimized loop or explicitly disable it and switch to C implementation.
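A compile-time check along these lines would implement the suggestion (the concrete MR/NR values below are illustrative assumptions, not necessarily the ones the PR uses):

```cpp
// Illustrative values; the PR's actual FAST_CONV_MR/FAST_CONV_NR may differ.
#ifndef FAST_CONV_MR
#define FAST_CONV_MR 4
#endif
#ifndef FAST_CONV_NR
#define FAST_CONV_NR 24
#endif

// Fail the build, rather than warn at runtime, if someone changes the
// register-blocking factors without also rewriting the SIMD128 kernel.
static_assert(FAST_CONV_MR == 4 && FAST_CONV_NR == 24,
              "The SIMD128 conv kernel is written for FAST_CONV_MR=4 and "
              "FAST_CONV_NR=24; update the kernel or disable this branch.");
```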

Member Author

Ok, I will update it later.

@zihaomu zihaomu requested a review from vpisarev June 27, 2022 11:09
@zihaomu zihaomu requested a review from alalek June 29, 2022 09:06
using namespace cv::dnn::cuda4dnn;
#endif

#include "fast_convolution/fast_convolution.hpp"
Member

Need to take a look on:

  • test failures in Linux Debug configuration
  • test failures in Linux AVX2 configuration (-DCPU_BASELINE=AVX2)
  • looks like unconditional doubling of weights storage requires more memory and several Win32 tests started to fail with OOM message. @vpisarev

Member Author

Thanks for the code review. The failures in Linux Debug and Linux AVX2 only occur in the quantized models. Since the parameters of int8 layers rely on the output of fp32 models, we can adjust the test thresholds to solve this in the short term.

For Win32, I'm looking for a way around it.

Member Author

For Win32, I have changed the memory limitation of FasterRCNN_vgg16 from 1 GB to 2 GB. FasterRCNN_vgg16 has a large memory requirement (412 MB needed for one FC layer). The new patch needs to pack the weights in advance during the initialization phase, so for Conv we need twice as much memory to store the weights.
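To illustrate why the weight storage roughly doubles (a hypothetical sketch, not the PR's actual packing layout): the original weights stay alive while an additional, microkernel-friendly copy is built once at init time. A simple MR-blocked packing might look like:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: pack a K x C weight matrix (row-major, w[k*C + c])
// into row blocks of height MR, zero-padding the tail block, so the GEMM
// microkernel can read fixed-size tiles. The packed copy lives alongside
// the original weights, hence roughly 2x weight memory.
std::vector<float> packWeights(const std::vector<float>& w,
                               std::size_t K, std::size_t C, std::size_t MR)
{
    std::size_t Kpadded = (K + MR - 1) / MR * MR;
    std::vector<float> packed(Kpadded * C, 0.f);
    for (std::size_t blk = 0; blk < Kpadded; blk += MR)      // block of MR rows
        for (std::size_t c = 0; c < C; ++c)                  // each column
            for (std::size_t r = 0; r < MR; ++r)             // rows in block
            {
                std::size_t k = blk + r;
                if (k < K)
                    // interleave MR consecutive output channels per column
                    packed[blk * C + c * MR + r] = w[k * C + c];
            }
    return packed;
}
```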

Mat inp = blobFromImage(img, 1.0, Size(320, 240), Scalar(103.939, 116.779, 123.68), false, false);
// Output image has values in range [-143.526, 148.539].
float l1 = 4e-5, lInf = 2e-3;
float l1 = 5e-5, lInf = 2e-3;
Member

Please use x2-x5 values for test tolerance checks.

4e-5 => 1e-4 instead of 5e-5


Below:

1e-5 => 1e-4 instead of 1.01e-5

@alalek alalek merged commit 59b870a into opencv:4.x Jul 1, 2022
@asenyaev
Contributor
asenyaev commented Jul 1, 2022

After merging this PR, Android build fails: https://github.com/opencv/ci-gha-workflow/runs/7157789387?check_suite_focus=true#step:8:1639

@zihaomu
Member Author
zihaomu commented Jul 1, 2022

Thanks for that, I'm looking for a solution; it's estimated to take a day or two to resolve.

@zihaomu zihaomu mentioned this pull request Jul 3, 2022
@hanliutong hanliutong mentioned this pull request Jul 21, 2022
@alalek alalek mentioned this pull request Aug 21, 2022
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
DNN: Accelerating convolution

* Fast Conv of ARM, X86 and universal intrinsics.

* improve code style.

* error fixed.

* improve the License

* optimize memory allocated and Adjust the threshold.

* change FasterRCNN_vgg16 to 2GB memory.


4 participants