dnn: parallelize nary elementwise forward implementation & enable related conformance tests by fengyuentau · Pull Request #25630 · opencv/opencv

fengyuentau · 2024-05-23T10:13:11Z

This PR introduces the following changes:

Parallelize binary forward impl
Parallelize ternary forward impl (Where)
Parallelize nary forward impl (operators that can take >=1 operands)
Enable conformance tests if workable
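
For readers unfamiliar with the idea, the parallelization listed above can be pictured roughly as follows. This is only a minimal standalone sketch using std::thread, not the code in nary_eltwise_layers.cpp: the helper name parallelBinaryForward, the flat (non-broadcast) inputs and the default thread count are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not OpenCV API): applies `op` elementwise to `a` and `b`.
// The flat index range [0, n) is cut into contiguous chunks, one per worker thread.
template <typename T, typename Op>
std::vector<T> parallelBinaryForward(const std::vector<T>& a,
                                     const std::vector<T>& b,
                                     Op op,
                                     unsigned numThreads = 4)
{
    const std::size_t n = std::min(a.size(), b.size());
    std::vector<T> out(n);
    if (n == 0) return out;

    numThreads = std::max(1u, numThreads);
    const std::size_t chunk = (n + numThreads - 1) / numThreads;  // ceil(n / numThreads)

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        // Each worker writes a disjoint slice of `out`, so no locking is needed.
        workers.emplace_back([&a, &b, &out, &op, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                out[i] = op(a[i], b[i]);
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

For example, parallelBinaryForward(a, b, std::plus<int>(), 2) adds two equally sized vectors using two workers. Inside OpenCV the splitting would presumably go through parallel_for_ rather than raw threads.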

Performance

i7-12700K, RAM 64GB, Ubuntu 22.04

Geometric mean (ms)

Name of Test                                   opencv perf   opencv perf    x-factor
                                              core.x64.0606 core.x64.0606
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           16.116        11.161         1.44
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        17.469        11.446         1.53
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        17.531        11.469         1.53
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      28.653        13.682         2.09
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    21.899        13.422         1.63
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       21.738        13.185         1.65
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        16.172        11.473         1.41
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       16.309        11.565         1.41
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        16.166        11.454         1.41
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        16.157        11.443         1.41
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU        163.459       15.234         10.73
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    10.880        10.868         1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    10.947        11.058         0.99
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    10.948        10.910         1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    10.874        10.871         1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    10.971        10.920         1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        17.546        11.462         1.53
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        16.175        11.475         1.41
NHWC_C::Layer_NaryEltwise::OCV/CPU               11.339        11.333         1.00
NHWC_H::Layer_NaryEltwise::OCV/CPU               16.154        11.102         1.46
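
As a reading aid for the tables: the x-factor values are consistent with the first timing column divided by the second (e.g. 16.116 / 11.161 ≈ 1.44), so values above 1 mean the second run is faster. A small sketch of that arithmetic, with hypothetical helper names that are not part of the OpenCV test scripts:

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-run timings in milliseconds (all samples must be > 0).
double geometricMean(const std::vector<double>& samplesMs)
{
    if (samplesMs.empty()) return 0.0;
    double logSum = 0.0;
    for (double s : samplesMs) logSum += std::log(s);
    return std::exp(logSum / static_cast<double>(samplesMs.size()));
}

// x-factor as it appears in the tables: first timing column divided by the second.
double xFactor(double firstMs, double secondMs)
{
    return firstMs / secondMs;
}
```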

Apple M1, RAM 16GB, macOS 14.4.1

Geometric mean (ms)

Name of Test                                   opencv perf     opencv perf        x-factor
                                              core.m1.0606 core.m1.0606.patch (core.m1.0606.patch vs core.m1.0606)
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU           28.418          3.768               7.54       
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU        6.942           5.679               1.22       
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU        5.822           5.653               1.03       
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU      5.751           5.628               1.02       
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU    5.797           5.599               1.04       
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU       7.272           5.578               1.30       
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU        5.777           5.562               1.04       
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU       5.819           5.559               1.05       
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU        5.830           5.574               1.05       
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU        5.759           5.567               1.03       
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU       342.260          74.655              4.58       
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU    8.338           8.280               1.01       
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU    8.359           8.309               1.01       
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU    8.412           8.295               1.01       
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU    8.380           8.297               1.01       
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU    8.356           8.323               1.00       
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU        6.818           5.561               1.23       
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU        5.805           5.570               1.04       
NHWC_C::Layer_NaryEltwise::OCV/CPU               3.834           4.817               0.80       
NHWC_H::Layer_NaryEltwise::OCV/CPU               28.402          3.771               7.53

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

opencv-alalek · 2024-05-27T11:45:34Z

modules/dnn/src/layers/nary_eltwise_layers.cpp

+            double nstripes = getNumThreads();
+            parallel_for_(Range(0, nplanes), worker, nstripes);


nstripes = getNumThreads();

This should not be used.
It was already discussed several months ago, e.g. in #23047.

fengyuentau · 2024-05-27

Thank you for the review, but take it easy: this PR is still a draft. I still remember our discussion.

fengyuentau · 2024-06-06

Changed. The performance results are also updated.
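
For context on the objection above: tying nstripes to getNumThreads() makes the split depend on the thread pool of the machine rather than on the amount of work. One commonly seen alternative is to derive the stripe count from the workload. The sketch below only illustrates that idea; the helper name and the grain value are assumptions, and the value actually adopted in this PR is not shown in the thread.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: number of stripes derived from the amount of work
// rather than from the thread count. `grain` is an assumed minimum number
// of elements worth handing to one stripe; it is not a value from the PR.
double stripesForWork(std::size_t totalElements, std::size_t grain = 1 << 16)
{
    if (grain == 0) grain = 1;
    return std::max(1.0, static_cast<double>(totalElements) / static_cast<double>(grain));
}

// With OpenCV available, the result would be passed as the third argument:
//   cv::parallel_for_(cv::Range(0, nplanes), worker, stripesForWork(total));
```

Small inputs then end up with a single stripe, while large inputs get more stripes than threads, which leaves the scheduling to the parallel backend.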

asmorkalov · 2024-06-10T16:35:08Z

My results on a Jetson TK1 (ARMv7 + NEON):

ubuntu@jetson1:~/Projects/perf-dnn$ python3 ../opencv/modules/ts/misc/summary.py ./4.x-1.xml ./patched-1.xml | grep NaryEltwise
NCHW_C_sum::Layer_NaryEltwise::OCV/CPU                                                                                                          65.891   43.371      1.52   
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU                                                                                                       79.287   81.868      0.97   
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU                                                                                                      187.457   187.657     1.00   
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU                                                                                                     88.643   96.376      0.92   
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU                                                                                                   88.694   96.035      0.92   
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU                                                                                                      88.716   90.298      0.98   
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU                                                                                                       84.722   83.976      1.01   
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU                                                                                                      92.757   81.105      1.14   
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU                                                                                                       84.285   84.010      1.00   
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU                                                                                                       78.594   78.574      1.00   
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU                                                                                                      3407.037 3475.724     0.98   
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU                                                                                                  189.651   189.454     1.00   
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU                                                                                                   87.859   87.771      1.00   
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU                                                                                                   87.915   88.053      1.00   
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU                                                                                                   84.077   84.063      1.00   
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU                                                                                                   85.160   84.625      1.01   
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU                                                                                                       86.368   79.089      1.09   
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU                                                                                                       89.897   78.993      1.14   
NHWC_C::Layer_NaryEltwise::OCV/CPU                                                                                                              77.220   71.425      1.08   
NHWC_H::Layer_NaryEltwise::OCV/CPU                                                                                                              67.494   42.832      1.58

asmorkalov · 2024-06-11T12:39:39Z

My results for Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz (no AVX2):

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU                                                                                                          24.193   17.846      1.36   
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU                                                                                                       24.026   23.313      1.03   
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU                                                                                                       27.370   23.279      1.18   
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU                                                                                                     35.025   23.254      1.51   
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU                                                                                                   32.455   23.260      1.40   
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU                                                                                                      32.509   23.321      1.39   
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU                                                                                                       23.997   23.262      1.03   
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU                                                                                                      24.038   23.270      1.03   
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU                                                                                                       23.977   23.269      1.03   
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU                                                                                                       23.927   23.279      1.03   
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU                                                                                                      320.598   98.029      3.27   
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU                                                                                                   24.507   24.488      1.00   
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU                                                                                                   24.484   24.477      1.00   
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU                                                                                                   24.500   24.471      1.00   
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU                                                                                                   24.486   24.482      1.00   
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU                                                                                                   24.472   24.476      1.00   
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU                                                                                                       23.953   23.281      1.03   
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU                                                                                                       23.992   23.274      1.03   
NHWC_C::Layer_NaryEltwise::OCV/CPU                                                                                                              18.260   18.489      0.99   
NHWC_H::Layer_NaryEltwise::OCV/CPU                                                                                                              24.182   17.829      1.36

fengyuentau · 2024-06-12T04:20:13Z

Thank you @asmorkalov for adding more performance results :)

fengyuentau · 2024-06-14T09:24:47Z

Any review comments?

asmorkalov · 2024-06-19T07:44:20Z

The patch leads to a significant degradation of OpenCL pipelines, e.g.:

VIT_B_32::DNNTestNetwork::OCV/CPU 	149.576 	191.409 	0.78
VIT_B_32::DNNTestNetwork::OCV/OCL 	104.428 	445.013 	0.23
VIT_B_32::DNNTestNetwork::OCV/OCL_FP16 	102.505 	442.994 	0.23

I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some other inference optimization.
I am looking into the details to check whether it is really caused by the PR.

fengyuentau · 2024-06-19T08:30:42Z

> The patch leads to a significant degradation of OpenCL pipelines, e.g.:
> VIT_B_32::DNNTestNetwork::OCV/CPU 	149.576 	191.409 	0.78
> VIT_B_32::DNNTestNetwork::OCV/OCL 	104.428 	445.013 	0.23
> VIT_B_32::DNNTestNetwork::OCV/OCL_FP16 	102.505 	442.994 	0.23
> I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or some other inference optimization. I am looking into the details to check whether it is really caused by the PR.

Ok, I will take a look at the problem.

fengyuentau · 2024-06-24T07:55:29Z

@asmorkalov The performance "degradation" is due to a very out-of-date code base (>450 commits behind 4.x). I have rebased onto the latest 4.x. Performance tests (on Intel UHD 770) look fine on my side. Feel free to retest on your side.

On the positive side, those commits brought a large performance boost (OCL is ~4x faster and CPU is ~1.3x faster). Maybe I can add an OCL backend for this layer later :)

asmorkalov · 2024-06-28T18:35:01Z

perf-dnn.zip
The OpenCL-related degradation has disappeared. Perf numbers for the updated PR on the Core i5-2500:

NCHW_C_sum::Layer_NaryEltwise::OCV/CPU 	24.142 	17.999 	1.34
NCHW_NCHW_add::Layer_NaryEltwise::OCV/CPU 	23.860 	23.265 	1.03
NCHW_NCHW_div::Layer_NaryEltwise::OCV/CPU 	27.383 	23.282 	1.18
NCHW_NCHW_equal::Layer_NaryEltwise::OCV/CPU 	39.056 	23.292 	1.68
NCHW_NCHW_greater::Layer_NaryEltwise::OCV/CPU 	32.489 	23.290 	1.39
NCHW_NCHW_less::Layer_NaryEltwise::OCV/CPU 	32.435 	23.257 	1.39
NCHW_NCHW_max::Layer_NaryEltwise::OCV/CPU 	23.966 	23.269 	1.03
NCHW_NCHW_mean::Layer_NaryEltwise::OCV/CPU 	23.992 	23.276 	1.03
NCHW_NCHW_min::Layer_NaryEltwise::OCV/CPU 	23.951 	23.273 	1.03
NCHW_NCHW_mul::Layer_NaryEltwise::OCV/CPU 	23.862 	23.272 	1.03
NCHW_NCHW_pow::Layer_NaryEltwise::OCV/CPU 	320.265 	97.879 	3.27
NCHW_NCHW_ref_div::Layer_NaryEltwise::OCV/CPU 	24.491 	24.487 	1.00
NCHW_NCHW_ref_max::Layer_NaryEltwise::OCV/CPU 	24.463 	24.464 	1.00
NCHW_NCHW_ref_min::Layer_NaryEltwise::OCV/CPU 	24.472 	24.465 	1.00
NCHW_NCHW_ref_mul::Layer_NaryEltwise::OCV/CPU 	24.460 	24.453 	1.00
NCHW_NCHW_ref_sum::Layer_NaryEltwise::OCV/CPU 	24.463 	24.530 	1.00
NCHW_NCHW_sub::Layer_NaryEltwise::OCV/CPU 	23.870 	23.271 	1.03
NCHW_NCHW_sum::Layer_NaryEltwise::OCV/CPU 	23.964 	23.764 	1.01
NHWC_C::Layer_NaryEltwise::OCV/CPU 	18.083 	18.458 	0.98
NHWC_H::Layer_NaryEltwise::OCV/CPU 	24.140 	17.857 	1.35

asmorkalov · 2024-07-01T09:37:04Z

I also tried a Xiaomi Mi 10 phone. The results are volatile (maybe due to power management), but I do not see a significant performance gain, except for NCHW_C_sum and NCHW_NCHW_pow.
perf-dnn-xiaomi-mi10.zip

fengyuentau · 2024-07-02T08:13:34Z

> The result is volatile (m.b. power management), but I do not see significant performance gain

It is tuned to use multi-threading only if the input scale is large enough. Traditional convolutional nets do not have such large inputs for elementwise layers.
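
The size-based tuning described above can be pictured as a simple dispatch on the number of elements. This is a sketch under stated assumptions: the threshold constant and the names below are hypothetical, and the cut-off actually used by the layer is not given in this conversation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cut-off: below this many output elements, threading overhead
// is assumed to outweigh the gain. The PR's real threshold is not given here.
constexpr std::size_t kParallelThreshold = 1u << 16;

enum class ExecMode { Serial, Parallel };

// Total number of elements for a given tensor shape.
std::size_t totalElements(const std::vector<int>& shape)
{
    std::size_t total = 1;
    for (int d : shape) total *= static_cast<std::size_t>(d);
    return total;
}

// Picks the multi-threaded path only when the tensor is large enough.
ExecMode chooseExecMode(const std::vector<int>& shape,
                        std::size_t threshold = kParallelThreshold)
{
    return totalElements(shape) >= threshold ? ExecMode::Parallel : ExecMode::Serial;
}
```

With these assumed numbers, a 1x3x8x8 tensor (192 elements) stays on the serial path, while a 1x64x128x128 tensor (1,048,576 elements) takes the parallel one.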

…_thread
dnn: merge #25630 to 5.x #25900

Sync changes from #25630 to 5.x.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake

opencv-alalek reviewed May 27, 2024

fengyuentau added optimization category: dnn labels May 31, 2024

fengyuentau added this to the 4.11.0 milestone Jun 3, 2024

fengyuentau marked this pull request as ready for review June 6, 2024 10:07

fengyuentau requested a review from dkurt June 7, 2024 04:34

fengyuentau mentioned this pull request Jun 14, 2024

Fix parser for supporting mean operation from conformance tests #25761

Closed

6 tasks

fengyuentau changed the title ~~dnn: parallelize nary elementwise forward implementation~~ dnn: parallelize nary elementwise forward implementation & enable related conformance tests Jun 14, 2024

fengyuentau and others added 16 commits June 24, 2024 15:51

parallelize binary forward impl

d8d0498

fix bug and format

0742117

refactor dispatch logic; add doc

7fd7017

enable some conformance tests

700716c

use NaryEltwiseLayer for num_inputs=1

9c7f617

parallelize ternary forward impl

afa4ccc

filter some conformance tests for vulkan backend

5f37e82

suppport one input forward in cuda backend

7b1acfa

remove check of number of inputs

4205124

cuda: add pow

e307ab5

separate cuda fp16 filter list to make ci happy

bb522f7

ocl: fix when having only one input

7143030

ov: apply filters to some tests

60e54d6

make default ci happy

d98c148

quickfix for ov backend

09eb31a

parallelize nary forward impl

dba76e2

fengyuentau and others added 5 commits June 24, 2024 15:51

ov: quickfix for namespace

a107489

fix a bug where different threads can read and write ptrs

4649c50

fix ci

20d0d7e

tune threads

ad5be07

fix nary_forward_impl with mean operation; enable test_mean_example

f3adabe

fengyuentau force-pushed the nary-multi-thread branch from 4be1a1f to f3adabe Compare June 24, 2024 07:53

vpisarev self-requested a review June 27, 2024 21:29

vpisarev approved these changes Jun 27, 2024

asmorkalov approved these changes Jul 3, 2024

asmorkalov merged commit a7fd944 into opencv:4.x Jul 3, 2024

fengyuentau mentioned this pull request Jul 12, 2024

dnn: merge #25630 to 5.x #25900

Merged

6 tasks

asmorkalov added the port/backport done Label for maintainers. Authors of PR can ignore this label Jul 12, 2024

asmorkalov mentioned this pull request Jul 16, 2024

(5.x) Merge 4.x #25915

Merged

fengyuentau deleted the nary-multi-thread branch July 30, 2024 15:06

asmorkalov mentioned this pull request Aug 5, 2024

fix compilation errors caused by namespace #25987

Merged

6 tasks
