fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using a mutex#24444
fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using a mutex#24444kallaballa wants to merge 9 commits intoopencv:4.xfrom
Conversation
| static thread_local UMat sdiv_data; | ||
| static thread_local UMat hdiv_data180; | ||
| static thread_local UMat hdiv_data256; | ||
| static thread_local int sdiv_table[256]; |
There was a problem hiding this comment.
Lets try to apply thread_local for UMat variables only. No need to touch other tables (there is no problem with them).
Use dedicated initializedUMat for that.
There was a problem hiding this comment.
initializedUMat
I am not sure what you mean by the last part
There was a problem hiding this comment.
Split initialization on 2 parts:
- regular CPU buffers (like
int sdiv_table[256];) - they don't needthread-localat all. - OpenCL related stuff which causes problems and needs thread-local variables (also put it under condition that this data is really needed - CPU-only processing doesn't need that at all).
There was a problem hiding this comment.
probably, even better solution would be to merge all 3 tables together (of course, 3 pointers can be initialized to point to different parts of the joint table) and then we can convert it to UMat on-fly:
int rgb2hsv_tab[256*3];
volatile bool rgb2hsv_initialized = false;
if (!rgb2hsv_initialized) { ... }
if (use_opencl) {
UMat rgb2hsv_utab = Mat(1, 256*3, CV_32S, rgb2hsv_tab).getUMat();
... // run opencl kernel with rgb2hsv_utab as one of the parameters
}
yes, it means extra copy, but compared to the processed data it should be a tiny overhead, and much easier to handle than static thread-local UMat's.
There was a problem hiding this comment.
I'm not experienced enough to say the roundtrip-lateny will be more than we'd like, but it think so. But I like the safety and simplicity of your approach, so I guess I'll profile it.
There was a problem hiding this comment.
I neither like unecessary round-trips nor messing with the memory model, so i came up with this:
if(_src.depth() == CV_8U)
{
static std::mutex mtx;
{
std::unique_lock<std::mutex> lock(mtx);
static UMat sdiv_data;
static UMat hdiv_data180;
static UMat hdiv_data256;
static int combined_table[256];
static volatile bool initialized180 = false, initialized256 = false;
static volatile bool initialized = hrange == 180 ? initialized180 : initialized256;
if (!initialized)
{
int * const hdiv_table = hrange == 180 ? combined_table : &combined_table[128], hsv_shift = 12;
UMat & hdiv_data = hrange == 180 ? hdiv_data180 : hdiv_data256;
combined_table[0] = 0;
int v = 255 << hsv_shift;
if (!initialized180 && !initialized256)
{
for(int i = 1; i < 256; i++ )
combined_table[i] = saturate_cast<int>(v/(1.*i));
Mat(1, 256, CV_32SC1, combined_table).copyTo(sdiv_data);
}
v = hrange << hsv_shift;
for (int i = 1; i < 256; i++ )
hdiv_table[i] = saturate_cast<int>(v/(6.*i));
Mat(1, 256, CV_32SC1, hdiv_table).copyTo(hdiv_data);
initialized = true;
}
h.setArg(ocl::KernelArg::PtrReadOnly(sdiv_data));
h.setArg(hrange == 256 ? ocl::KernelArg::PtrReadOnly(hdiv_data256) :
ocl::KernelArg::PtrReadOnly(hdiv_data180));
}
}There was a problem hiding this comment.
Proposed code locks mutex on each call. That should be properly avoided.
There was a problem hiding this comment.
@kallaballa, let's try the proposed by me approach without use of static UMat's at all.
There was a problem hiding this comment.
@opencv-alalek double checked pattern and lock_guard 0f290bc
@vpisarev I'll give it a try and provide numbers
There was a problem hiding this comment.
Test failed e.g. on MacOS: OCL_TEST_P(CvtColor8u32f, RGB2HSV_FULL) { performTest(3, 3, CVTCODE(RGB2HSV_FULL), IPP_EPS); }
I don't have a MacOS setup, but it stands to reason that it is a concurrency issue. So i checked on the memory model guarantees of volatile and there are none. So I guessed that is the problem. Let's see the test.
@vpisarev implementation without static still pending. want both version working.
|
@kallaballa Friendly reminder. |
|
Was locked out of my account. I am back on it :) |
…ovide guarantess about atomicity or ordering of operations.
|
this PR depends on #24326 to work correctly in mutli-threaded environments. I implemented a compromise in matters of static. |
|
Waiting for the tests. Apart from that i think the PR is ready. |
| static int hdiv_table256[256]; | ||
| static volatile bool initialized180 = false, initialized256 = false; | ||
| volatile bool & initialized = hrange == 180 ? initialized180 : initialized256; | ||
| static std::mutex mtx; |
There was a problem hiding this comment.
This breaks bare metal build configurations without any threading support (see OPENCV_DISABLE_THREAD_SUPPORT)
Use OpenCV wrappers instead.
Reuse cv::getInitializationMutex() instead of creation of new one.
There was a problem hiding this comment.
Oh i understand. thx for the info. is there a list of features i should usually test for? i started to test for WITH_OPENCL=OFF and i will keep in mind that there are targets without thread support... are there more configurations i should keep in mind?
There was a problem hiding this comment.
Done 031abea. Tested locally by running opencv_test_improc with following build configuration. Btw. i found cv::utils::lock_guard but it doesn't seem necessary since there is C++11 support.
cmake -DINSTALL_BIN_EXAMPLES=OFF -DBUILD_SHARED_LIBS=ON -DWITH_OPENCL=ON -DBUILD_opencv_java=OFF -DBUILD_opencv_js=OFF -DBUILD_opencv_python2=OFF -DBUILD_opencv_python3=OFF -DBUILD_EXAMPLES=OFF -DBUILD_PACKAGE=OFF -DBUILD_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_DOCS=OFF -DBUILD_opencv_videoio=ON -DBUILD_opencv_highgui=ON -DBUILD_opencv_ts=ON -DBUILD_opencv_imgcodecs=ON -DBUILD_opencv_plot=OFF -DBUILD_opencv_tracking=OFF -DBUILD_opencv_video=OFF -DBUILD_opencv_world=ON -DCMAKE_C_FLAGS=-DOPENCV_DISABLE_THREAD_SUPPORT -DCMAKE_CXX_FLAGS=-DOPENCV_DISABLE_THREAD_SUPPORT ../opencvThere was a problem hiding this comment.
std::lock_guard => cv::AutoLock
Did you measure performance changes due to added of .copyTo() calls?
There was a problem hiding this comment.
std::lock_guard=>cv::AutoLockDid you measure performance changes due to added of
.copyTo()calls?
I compared the current HEAD of 4.x with the branch of the PR (which i updated to the least 4.x HEAD) .
I used follwing build configuration for both branches:
cmake -DINSTALL_BIN_EXAMPLES=OFF -DBUILD_SHARED_LIBS=ON -DWITH_OPENCL=ON -DBUILD_opencv_java=OFF -DBUILD_opencv_js=OFF -DBUILD_opencv_python2=OFF -DBUILD_opencv_python3=OFF -DBUILD_EXAMPLES=OFF -DBUILD_PACKAGE=OFF -DBUILD_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_DOCS=OFF -DBUILD_opencv_videoio=ON -DBUILD_opencv_highgui=ON -DBUILD_opencv_ts=ON -DBUILD_opencv_imgcodecs=ON -DBUILD_opencv_plot=OFF -DBUILD_opencv_tracking=OFF -DBUILD_opencv_video=OFF -DBUILD_opencv_world=ON -DENABLE_PROFILING=ON ../opencvAnd the following test program: https://github.com/kallaballa/OpenCV-Issues/blob/main/src/oclCvtColorBGR2HSV-race/perf-test.cpp
Without patch it takes: ~14.8s
With patch it takes: ~15.2s
So there is a slow-down. I'll optimize a bit and measure again.
There was a problem hiding this comment.
When profiling i had suspicious results and so i increased the iterations by 10x.
This gave me
- 154.483s for patched without copy.
- 156.108s for patch with copy.
- 153.803s without patch.
I think the differences are still quite influenced by variance. anyway it is not much. should i continue?
There was a problem hiding this comment.
I meassured also the variant no-static/no-thread_local/no-locking: 155.512s
There was a problem hiding this comment.
Please always reuse OpenCV performance tests or add them if something is missing.
Avoid using of hand-written tests cases
suspicious results
Provided measurement code has bug due to async execution of cvtColor(UMat) (code should schedule OpenCL kernels, but do not wait the final result).
P.S. Mat to UMat coping is synchronous call in current API - forces wait of OpenCL execution queue.
…e \-DOPENCV_DISABLE_THREAD_SUPPORT builds fail.
|
@opencv-alalek @vpisarev please review. |
I would prefer the always-copy variant till diverging cl_context issues have been resolved. because the other variants need special handling in multi-context/multi-thread situations. and the performance difference is neglect-able. Which one should i commit before you review? |
Also i have changes to the current state of the PR. should i add them? |
|
@opencv-alalek @kallaballa What is the PR status? Is it ready for merge? |
|
I just checked. It isn't ready. Picking it back up. |
|
@kallaballa What is the PR status? Is it ready for review? |
Pertaining issue: #24443
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.