optimization for parallelization when large core number #24280
Merged
asmorkalov merged 3 commits into opencv:4.x on Sep 27, 2023
Conversation
size_t max_loop = std::min(static_cast<size_t>(range.size()), threads.size());
for (size_t i = 0; i < max_loop; ++i)
{
    if (job->current_task >= job->range.size())
Contributor
/cc @vrabaud Could you check if your version of TSAN is happy with that change? (no data-race with current_task)
Contributor
I confirm this is fine on our end.
Contributor
@opencv-alalek Could the patch be merged?
opencv-alalek approved these changes on Sep 27, 2023
hanliutong pushed a commit to hanliutong/opencv that referenced this pull request on Oct 7, 2023
Optimization for parallelization when large core number opencv#24280
thewoz pushed a commit to thewoz/opencv that referenced this pull request on Jan 4, 2024
Optimization for parallelization when large core number opencv#24280
thewoz pushed a commit to thewoz/opencv that referenced this pull request on May 29, 2024
Optimization for parallelization when large core number opencv#24280
Problem description:
When the number of cores is large, OpenCV's threading library may lose performance when dispatching parallel jobs.
The reason for this problem:
The thread pool is initialized with as many threads as there are cores. When that number is large, the main thread spends too much time waking threads that are not needed. To execute a parallel job, the main thread wakes every pool thread in sequence and only then waits for the job-completion signal. When the thread count exceeds the number of job slices, the threads awakened first may finish the whole job while the main thread is still waking the rest. The threads woken after that point have nothing to do, and the broadcasts used to wake them cost significant time, which reduces performance.
Solution:
Reduce the time the main thread spends waking worker threads in two ways:
• Scale the number of threads the main thread wakes to the number of parallel job slices: if there are more pool threads than job slices, wake fewer threads.
• While waking threads in sequence, if the main thread finds that all parallel job slices have already been claimed, break out of the wake-up loop immediately and wait for the job-completion signal.
Performance Test:
The tests were run in the manner described by https://github.com/opencv/opencv/wiki/HowToUsePerfTests.
At core number = 160, there are large performance gains in some cases.
Take the following cases in the video module as examples:
OpticalFlowPyrLK_self::Path_Idx_Cn_NPoints_WSize_Deriv::("cv/optflow/frames/VGA_%02d.png", 2, 1, (9, 9), 11, true)
Performance improves 191%: 0.185405 ms -> 0.0636496 ms
perf::DenseOpticalFlow_VariationalRefinement::(320x240, 10, 10)
Performance improves 112%: 23.88938 ms -> 11.2562 ms
Among all the modules, the performance improvement is greatest in the video module, and the other modules also show gains.
At core number = 160, the times in the table below are the geometric mean of the average time over all cases in each module; every module benefits from the optimization.

module name | gapi | dnn | features2d | objdetect | core | imgproc | stitching | video
-- | -- | -- | -- | -- | -- | -- | -- | --
original (ms) | 0.185 | 1.586 | 9.998 | 11.846 | 0.205 | 0.215 | 164.409 | 0.803
optimized (ms) | 0.174 | 1.353 | 9.535 | 11.105 | 0.199 | 0.185 | 153.972 | 0.489
Performance improves | 6% | 17% | 5% | 7% | 3% | 16% | 7% | 64%
Meanwhile, it was found that changing the order of the test cases affects some results. For example, when opencv_perf_gapi was run with --gtest_shuffle, the TestPerformance::CmpWithScalarPerfTestFluid/CmpWithScalarPerfTest::(compare_f, CMP_GE, 1920x1080, 32FC1, { gapi.kernel_package }) case changed by about 30% compared to the unshuffled run. I would like to ask whether you have also encountered this situation and whether you could share your experience.
Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake