optimization for parallelization when large core number #24280
Merged
asmorkalov merged 3 commits into opencv:4.x on Sep 27, 2023
Conversation
size_t max_loop = std::min(static_cast<size_t>(range.size()), threads.size());
for (size_t i = 0; i < max_loop; ++i)
{
    if (job->current_task >= job->range.size())
Contributor
/cc @vrabaud Could you check if your version of TSAN is happy with that change? (no data-race with current_task)
Contributor
I confirm this is fine on our end.
Contributor
@opencv-alalek Could the patch be merged?
opencv-alalek approved these changes on Sep 27, 2023
hanliutong pushed a commit to hanliutong/opencv that referenced this pull request on Oct 7, 2023
Optimization for parallelization when large core number opencv#24280
thewoz pushed a commit to thewoz/opencv that referenced this pull request on Jan 4, 2024
Optimization for parallelization when large core number opencv#24280
thewoz pushed a commit to thewoz/opencv that referenced this pull request on May 29, 2024
Optimization for parallelization when large core number opencv#24280
Problem description:
When the number of cores is large, OpenCV's threading library may lose performance when dispatching parallel jobs.
The reason for this problem:
The thread pool is initialized with as many threads as there are cores. When that number is large, the main thread spends too much time waking threads that are not needed. To execute a parallel job, the main thread wakes every pool thread in sequence and only then waits for the job-completion signal. When the thread count exceeds the number of job slices, the threads awakened first may finish the whole job while the main thread is still waking the rest. The threads woken after that point have nothing to do, and the broadcasts used to wake them cost significant time, which reduces performance.
Solution:
Reduce the time the main thread spends waking worker threads in two ways:
• Scale the number of threads the main thread wakes to the number of parallel job slices: if there are more pool threads than job slices, wake fewer threads.
• While waking threads in sequence, if the main thread finds that all parallel job slices have already been claimed, break out of the wake-up loop immediately and wait for the job-completion signal.
Performance Test:
The tests were run in the manner described by https://github.com/opencv/opencv/wiki/HowToUsePerfTests.
At core number = 160, there are large performance gains in some cases.
Take the following cases in the video module as examples:
OpticalFlowPyrLK_self::Path_Idx_Cn_NPoints_WSize_Deriv::("cv/optflow/frames/VGA_%02d.png", 2, 1, (9, 9), 11, true)
Performance improves 191%: 0.185405 ms -> 0.0636496 ms
perf::DenseOpticalFlow_VariationalRefinement::(320x240, 10, 10)
Performance improves 112%: 23.88938 ms -> 11.2562 ms
Among all the modules, the performance improvement is greatest in the video module, and the other modules also show gains.
At core number = 160, the times in the table below are the geometric mean of the average time over all cases in each module; every module benefits from the optimization.

module name | gapi | dnn | features2d | objdetect | core | imgproc | stitching | video
-- | -- | -- | -- | -- | -- | -- | -- | --
original (ms) | 0.185 | 1.586 | 9.998 | 11.846 | 0.205 | 0.215 | 164.409 | 0.803
optimized (ms) | 0.174 | 1.353 | 9.535 | 11.105 | 0.199 | 0.185 | 153.972 | 0.489
Performance improves | 6% | 17% | 5% | 7% | 3% | 16% | 7% | 64%
Meanwhile, it was found that changing the order of the test cases affects some results. For example, when opencv_perf_gapi was run with --gtest_shuffle, the TestPerformance::CmpWithScalarPerfTestFluid/CmpWithScalarPerfTest::(compare_f, CMP_GE, 1920x1080, 32FC1, { gapi.kernel_package }) case changed by about 30% compared to the unshuffled run. I would like to ask whether you have also encountered this situation and whether you could share your experience.
Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake