dnn: parallelize nary elementwise forward implementation & enable related conformance tests #25630
double nstripes = getNumThreads();
parallel_for_(Range(0, nplanes), worker, nstripes);
nstripes = getNumThreads();
This should not be used.
Already discussed several months ago - e.g. #23047
Thank you for the review, but take it easy; this PR is still a draft. I still remember our discussion.
Changed. Performance results are also updated.
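The thread does not show what replaced `nstripes = getNumThreads()`. One common alternative, offered here only as an assumption and not as the change made in this PR, is to derive the stripe hint from the amount of work instead of the thread count. A minimal sketch in plain C++; `stripesForWork` and `minElemsPerStripe` are hypothetical names, and the default of 65536 elements per stripe is an assumed tuning constant:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not from the PR): compute a stripe-count hint from the
// total number of elements to process, independent of the thread count.
// minElemsPerStripe is an assumed tuning constant, not a value from the PR.
double stripesForWork(std::size_t totalElems,
                      std::size_t minElemsPerStripe = 1 << 16)
{
    if (minElemsPerStripe == 0)
        minElemsPerStripe = 1;  // avoid division by zero
    const double n = static_cast<double>(totalElems) /
                     static_cast<double>(minElemsPerStripe);
    // Always request at least one stripe so that small inputs stay serial.
    return std::max(1.0, n);
}
```

With such a helper, the hint passed as the third argument of `parallel_for_` would scale with the input size, so small inputs would not be split across every available thread.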
My results with Jetson TK1 (armv7+neon):
My results for Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz (no AVX2):
Thank you @asmorkalov for adding more performance results :)
Any review comments?
The patch leads to significant performance degradation in OpenCL pipelines, e.g.:
I used an NVIDIA GF 1080 for the benchmark. It looks like the patch prevents some graph fusion or another inference optimization.
OK, I will take a look at the problem.
4be1a1f to f3adabe
@asmorkalov The performance "degradation" is due to a very out-of-date code base (more than 450 commits behind 4.x). I have updated the code base. Performance tests (on Intel UHD 770) look fine on my side. Feel free to retest on your side. On the positive side, those commits brought a substantial performance boost (OCL is ~4x faster and CPU is ~1.3x faster). Maybe I can add the OCL backend for this layer later :)
perf-dnn.zip
I also tried a Xiaomi Mi 10 phone. The results are volatile (maybe due to power management), but I do not see a significant performance gain, except for NCHW_C_sum and NCHW_NCHW_pow.
It is tuned to use multi-threading only if the input is large enough. Traditional convolutional nets do not have such large inputs for elementwise layers.
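The size-gated behaviour described above can be sketched as a small planning step: below a threshold the whole input stays in one range (serial), and above it the input is split into roughly equal ranges for the workers. This is an illustration in plain C++, not the PR's code; `planRanges`, `IndexRange`, and the `minElemsForParallel` parameter are hypothetical, and the thread does not state the actual threshold.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [begin, end). Hypothetical type for this sketch.
using IndexRange = std::pair<std::size_t, std::size_t>;

// Hypothetical planner: returns one range for small inputs (serial path) and
// up to maxWorkers roughly equal ranges once the input reaches the threshold.
std::vector<IndexRange> planRanges(std::size_t total,
                                   std::size_t maxWorkers,
                                   std::size_t minElemsForParallel)
{
    std::vector<IndexRange> ranges;
    if (total == 0)
        return ranges;  // nothing to do

    // Small input or a single worker: keep everything in one serial range.
    if (total < minElemsForParallel || maxWorkers <= 1) {
        ranges.emplace_back(std::size_t{0}, total);
        return ranges;
    }

    // Large input: split into chunks of ceil(total / workers) elements.
    const std::size_t workers = std::min(maxWorkers, total);
    const std::size_t chunk = (total + workers - 1) / workers;
    for (std::size_t begin = 0; begin < total; begin += chunk)
        ranges.emplace_back(begin, std::min(total, begin + chunk));
    return ranges;
}
```

Each returned range could then be handed to a worker. With a threshold set above the elementwise input sizes typical of convolutional nets, those layers would take the serial path, which is consistent with the lack of a gain reported above.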
…_thread
dnn: merge #25630 to 5.x #25900
Sync changes from #25630 to 5.x.
### Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
This PR introduces the following changes:
Performance
i7-12700K, RAM 64GB, Ubuntu 22.04
Apple M1, RAM 16GB, macOS 14.4.1
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.