DNN: clean old convolution and optimize depth-wise Conv, Conv1D and Conv3D #22905
opencv-pushbot merged 1 commit into opencv:4.x
Conversation
force-pushed from c7a5ccd to 81448a1
force-pushed from b2cf1d7 to 7e9d28c
force-pushed from 54cf528 to 82efdb5
Hi @alalek, it seems that the default CI failures are not related to this PR.
```cpp
#if CV_TRY_AVX2
            if(useAVX2)
                opt_AVX2::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                    stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                    biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
            else
#endif
#if CV_TRY_AVX
            if(useAVX)
                opt_AVX::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                    stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                    biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
            else
#endif
#if CV_TRY_RVV
            if(useRVV)
                opt_RVV::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                    stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                    biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
            else
#endif
#if CV_TRY_LASX
            if(useLASX)
                opt_LASX::fastDepthwiseConv(wptr, kernel_h, kernel_w,
                    stride_h, stride_w, dilation_h, dilation_w, pad_t, pad_l,
                    biasptr, relu, inptr_, height, width, outptr_, out_d, outH, outW);
```
We need a performance comparison report in the PR description for such massive changes.
It also makes sense to compare memory consumption (to avoid 3x regressions like the recent case).
Hi @alalek. Thanks for the reminder. The purpose of this PR is to let Conv3D and Conv1D execute in the new branch; it does not affect the speed or memory consumption of Conv2D (related Conv2D PR: #21910).
I will update the description with speed and memory-consumption results for Conv3D and Conv1D.
In theory, Conv3D and Conv1D require twice as much memory as before, since we need to repack the weights at the `fast_conv` initialization stage. It will not reach 7x, because Conv3D and Conv1D do not support Winograd.
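To make the "twice as much memory" estimate concrete, here is a back-of-the-envelope sketch. The layer shape below is hypothetical, and any alignment padding the real repacked layout may add is ignored:

```python
def conv_weight_bytes(out_ch, in_ch_per_group, *kernel_dims, elem_size=4):
    """Bytes taken by a float32 convolution weight tensor."""
    n = out_ch * in_ch_per_group * elem_size
    for k in kernel_dims:
        n *= k
    return n

# Hypothetical Conv3D layer: 64 output channels, 64 input channels, 3x3x3 kernel.
orig = conv_weight_bytes(64, 64, 3, 3, 3)
total = 2 * orig  # original weights plus the repacked copy
print(orig, total)  # 442368 884736
```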
Please use the OpenCV performance testing infrastructure instead of some kind of manual testing:
- A single result is not enough. We must ensure that there are no regressions in other cases.
- Resnet34_kinetics: there is no such test in opencv_perf_dnn.
- CPU/SIMD code optimization validation should be done with `--perf_threads=1`.
- i7-12700k has P and E cores, so the test process needs to be bound to selected CPUs.
Finally, we have regressions in the Conv1D cases:
| Name of Test | base | patch | x-factor |
|---|---|---|---|
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1), BIAS, OCV/CPU) | 0.001 | 0.001 | 0.74 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 25}, OCN=2, G=2, P=(2, 2), PM=SAME, OCV/CPU) | 0.001 | 0.001 | 0.72 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 6, 10}, OCN=6, PM=VALID, BIAS, OCV/CPU) | 0.001 | 0.001 | 0.74 |
(CPU: i7-12700k, Linux)
Test launch cmd:
```shell
taskset -c 0 ./bin/opencv_perf_dnn '--gtest_filter=*' --gtest_output=xml:../perf/pr22905/0-1th.xml --perf_threads=1
```
Report cmd (use `-o markdown` for a GitHub-ready report):
```shell
python3 ${OPENCV_SRC}/modules/ts/misc/summary.py -m min -f conv1d ../perf/pr22905/{0,1}-1th.xml
```
> i7-12700k has P and E cores. Need to bind test process to selected CPUs.

Thanks for the reminder. How about I add a new single-thread performance result? I'm not sure if that is enough.

> Resnet34_kinetics - no such test in opencv_perf_dnn

Resnet34_kinetics is an accuracy test, and the backbone of Resnet34_kinetics is a typical implementation of Conv3D.
A performance report is not intended to show a single, non-reproducible marketing number.
A performance report is required to track regressions and to avoid them. We need all numbers from the performance tests.
@vpisarev The pipeline used for optimization development has serious gaps. Ignoring the existing performance tests during optimization is not an acceptable flow. We need to fix that process.
I will update the results using the performance testing infrastructure.
Because the amount of computation in the Conv1D performance tests is so small, I am worried that the results cannot reflect models in real application scenarios.
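As a rough check of that concern, the FLOP count of one of the listed Conv1D cases can be estimated from the shapes in the test name (dilation = 1 and bias FLOPs are ignored in this sketch):

```python
def conv1d_flops(in_len, in_ch, out_ch, k, groups=1, stride=1, pad=(0, 0)):
    """Approximate FLOPs of one grouped Conv1D forward pass (2 per MAC)."""
    out_len = (in_len + pad[0] + pad[1] - k) // stride + 1
    macs = out_ch * out_len * (in_ch // groups) * k
    return 2 * macs

# Shapes from the perf case "K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1)"
print(conv1d_flops(19, 2, 2, 3, groups=2, stride=2, pad=(1, 1)))  # 120
```

At roughly a hundred FLOPs per forward pass, such cases sit near timer resolution, which supports the worry that they are not representative of real workloads.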
Hi @alalek, the default CI still fails, and it seems that these errors are not related to this PR.
Hi @alalek, I just checked the CI errors, and these three test regressions only contain the
Ignore them. Not related to this patch (just check the builder history for the same target branch).
The pull request improves speed and simplifies the implementation (removes old branches). I suggest merging it.
```cpp
int conv_dim = CONV_2D;
if (inputs[0].dims == 3)
    conv_dim = CONV_1D;
if (inputs[0].dims == 5)
    conv_dim = CONV_3D;
```
List of regressions for non-SIMD (fully scalar) mode:
```shell
cmake ... -DCMAKE_BUILD_TYPE=Release -DCV_DISABLE_OPTIMIZATION=ON
```
```shell
python3 ../../dev/modules/ts/misc/summary.py -m mean --regressions-only=0.97 ../perf/pr22905/nosimd-{0,1}-1th.xml -o markdown
```
| Name of Test | nosimd-0-1th | nosimd-1-1th | nosimd-1-1th vs nosimd-0-1th (x-factor) |
|---|---|---|---|
| YOLOv4_tiny_2::Layer_Slice::OCV/CPU | 0.016 | 0.017 | 0.97 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1), BIAS, OCV/CPU) | 0.001 | 0.001 | 0.78 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 25}, OCN=2, G=2, P=(2, 2), PM=SAME, OCV/CPU) | 0.001 | 0.001 | 0.77 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 6, 10}, OCN=6, PM=VALID, BIAS, OCV/CPU) | 0.001 | 0.001 | 0.91 |
| conv3d::Conv3D::(GFLOPS=0.000, K=[3 x 3 x 3], IN={1, 2, 19, 19, 19}, OCN=2, G=2, S=[2 x 2 x 2], P=(1, 1) x (1, 1) x (1, 1), BIAS, OCV/CPU) | 0.062 | 0.080 | 0.78 |
| conv3d::Conv3D::(GFLOPS=0.001, K=[3 x 3 x 3], IN={1, 2, 25, 19, 19}, OCN=2, G=2, S=[1 x 2 x 2], P=(2, 2) x (2, 2) x (2, 2), PM=SAME, OCV/CPU) | 0.169 | 0.197 | 0.86 |
(CPU: i7-12700k)
There are regressions since the 4.6.0 release:
| Name of Test | nosimd-4.6.0-1th | nosimd-1-1th | nosimd-1-1th vs nosimd-4.6.0-1th (x-factor) |
|---|---|---|---|
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 19}, OCN=2, G=2, S=2, P=(1, 1), BIAS, OCV/CPU) | 0.001 | 0.001 | 0.75 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 2, 25}, OCN=2, G=2, P=(2, 2), PM=SAME, OCV/CPU) | 0.001 | 0.001 | 0.74 |
| conv1d::Conv1D::(GFLOPS=0.000, K=[3], IN={1, 6, 10}, OCN=6, PM=VALID, BIAS, OCV/CPU) | 0.001 | 0.001 | 0.88 |
| conv3d::Conv3D::(GFLOPS=0.000, K=[3 x 3 x 3], IN={1, 2, 19, 19, 19}, OCN=2, G=2, S=[2 x 2 x 2], P=(1, 1) x (1, 1) x (1, 1), BIAS, OCV/CPU) | 0.063 | 0.080 | 0.79 |
| conv3d::Conv3D::(GFLOPS=0.001, K=[3 x 3 x 3], IN={1, 2, 25, 19, 19}, OCN=2, G=2, S=[1 x 2 x 2], P=(2, 2) x (2, 2) x (2, 2), PM=SAME, OCV/CPU) | 0.161 | 0.197 | 0.82 |
| conv::Conv::(GFLOPS=1.660, K=[3 x 3], IN={1, 128, 75, 75}, OCN=128, G=128, P=[1 x 1], BIAS, OCV/CPU) | 1.066 | 2.823 | 0.38 |
| conv::Conv::(GFLOPS=1.704, K=[3 x 3], IN={1, 256, 38, 38}, OCN=256, G=256, P=[1 x 1], BIAS, OCV/CPU) | 0.585 | 1.609 | 0.36 |
| conv::Conv::(GFLOPS=1.704, K=[3 x 3], IN={1, 512, 19, 19}, OCN=512, G=512, P=[1 x 1], BIAS, OCV/CPU) | 0.281 | 0.975 | 0.29 |
Thanks for your tests. I will try to optimize the performance of the new engine on x86; the fix will be delivered in a day or two.
force-pushed from b905d6d to 91fe85d
Performance comparison for armv7 with NEON: there are several significant regressions for both Conv3D and Conv1D.
Please use
force-pushed from 91fe85d to 13bede9
Hi @alalek, @asmorkalov and @vpisarev. I have tried my best to speed up all cases in the performance test. There are still some regressions for Conv1D, while for Conv3D and Conv2D we get a noticeable speedup. Regarding the Conv1D performance tests:
Why can't Conv1D get the same acceleration from the new engine as Conv2D or Conv3D?
force-pushed from 5084ff5 to 8f44578
@alalek, @asmorkalov, what is the final decision? I guess we get noticeable acceleration in most cases. Shall we merge it before the release?
Updated benchmark result on armv7+NEON (Jetson TK1):
force-pushed from f78f026 to b8a42fa
@zihaomu please squash the commits before merge.
force-pushed from b8a42fa to 71c6339
The purpose of this PR:
Speed performance test for Conv3D and Conv1D.

Speed test on Apple M1 (ARMv8):
- Conv1D, Conv2D and Conv3D SIMD performance test, Min (ms)
- Conv3D model performance test, manually

Test on i7-12700K (x86_64):
- Conv1D, Conv2D and Conv3D SIMD performance test, Min (ms)
- Conv1D, Conv2D and Conv3D no-SIMD performance test, Min (ms)
- Conv3D model performance test, manually

Run a thousand times, choose the shortest time.
Manually test the code
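The "run a thousand times, choose the shortest time" methodology can be sketched as a generic min-of-N timer (this is an illustrative stand-in, not the actual harness used for the numbers above):

```python
import time

def best_of(fn, runs=1000):
    """Time fn() `runs` times and return the minimum wall-clock time in ms,
    mirroring the 'run a thousand times, choose the shortest time' method."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best * 1e3

# Example: time a cheap dummy workload (stand-in for a model forward pass).
print(best_of(lambda: sum(range(1000)), runs=100))
```

Taking the minimum rather than the mean filters out scheduler noise, which is why it is a common choice for microbenchmarks.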
Memory usage increase

With the `fast_conv` implementation, we need to repack the convolution weights to run the convolution as fast as possible. This doubles the memory requirements for Conv1D and Conv3D compared to the previous implementation.

Note: this does not affect Conv2D memory consumption, since Conv2D has been supported with `fast_conv` before.

Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.