Skip to content

fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using a mutex#24444

Closed
kallaballa wants to merge 9 commits intoopencv:4.xfrom
kallaballa:color_hsv_dispatch_thread_local
Closed

fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using a mutex#24444
kallaballa wants to merge 9 commits intoopencv:4.xfrom
kallaballa:color_hsv_dispatch_thread_local

Conversation

@kallaballa
Copy link
Copy Markdown
Contributor

Pertaining issue: #24443

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

static thread_local UMat sdiv_data;
static thread_local UMat hdiv_data180;
static thread_local UMat hdiv_data256;
static thread_local int sdiv_table[256];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets try to apply thread_local for UMat variables only. No need to touch other tables (there is no problem with them).

Use dedicated initializedUMat for that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializedUMat

I am not sure what you mean by the last part

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split initialization on 2 parts:

  • regular CPU buffers (like int sdiv_table[256];) - they don't need thread-local at all.
  • OpenCL related stuff which causes problems and needs thread-local variables (also put it under condition that this data is really needed - CPU-only processing doesn't need that at all).

Copy link
Copy Markdown
Contributor

@vpisarev vpisarev Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, even better solution would be to merge all 3 tables together (of course, 3 pointers can be initialized to point to different parts of the joint table) and then we can convert it to UMat on-fly:

int rgb2hsv_tab[256*3];
volatile bool rgb2hsv_initialized = false;
if (!rgb2hsv_initialized) { ... }
if (use_opencl) {
UMat rgb2hsv_utab = Mat(1, 256*3, CV_32S, rgb2hsv_tab).getUMat();
... // run opencl kernel with rgb2hsv_utab as one of the parameters
}

yes, it means extra copy, but compared to the processed data it should be a tiny overhead, and much easier to handle than static thread-local UMat's.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not experienced enough to say the roundtrip-lateny will be more than we'd like, but it think so. But I like the safety and simplicity of your approach, so I guess I'll profile it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I neither like unecessary round-trips nor messing with the memory model, so i came up with this:

    if(_src.depth() == CV_8U)
    {
		static std::mutex mtx;
		{
			std::unique_lock<std::mutex> lock(mtx);
			static UMat sdiv_data;
			static UMat hdiv_data180;
			static UMat hdiv_data256;
			static int combined_table[256];
			static volatile bool initialized180 = false, initialized256 = false;
			static volatile bool initialized = hrange == 180 ? initialized180 : initialized256;

			if (!initialized)
			{
				int * const hdiv_table = hrange == 180 ? combined_table : &combined_table[128], hsv_shift = 12;
				UMat & hdiv_data = hrange == 180 ? hdiv_data180 : hdiv_data256;

				combined_table[0] = 0;

				int v = 255 << hsv_shift;
				if (!initialized180 && !initialized256)
				{
					for(int i = 1; i < 256; i++ )
						combined_table[i] = saturate_cast<int>(v/(1.*i));
					Mat(1, 256, CV_32SC1, combined_table).copyTo(sdiv_data);
				}

				v = hrange << hsv_shift;
				for (int i = 1; i < 256; i++ )
					hdiv_table[i] = saturate_cast<int>(v/(6.*i));

				Mat(1, 256, CV_32SC1, hdiv_table).copyTo(hdiv_data);
				initialized = true;
			}

			h.setArg(ocl::KernelArg::PtrReadOnly(sdiv_data));
			h.setArg(hrange == 256 ? ocl::KernelArg::PtrReadOnly(hdiv_data256) :
							 ocl::KernelArg::PtrReadOnly(hdiv_data180));
		}
	}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed code locks mutex on each call. That should be properly avoided.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kallaballa, let's try the proposed by me approach without use of static UMat's at all.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@opencv-alalek double checked pattern and lock_guard 0f290bc
@vpisarev I'll give it a try and provide numbers

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failed e.g. on MacOS: OCL_TEST_P(CvtColor8u32f, RGB2HSV_FULL) { performTest(3, 3, CVTCODE(RGB2HSV_FULL), IPP_EPS); }

I don't have a MacOS setup, but it stands to reason that it is a concurrency issue. So i checked on the memory model guarantees of volatile and there are none. So I guessed that is the problem. Let's see the test.

@vpisarev implementation without static still pending. want both version working.

@asmorkalov
Copy link
Copy Markdown
Contributor

@kallaballa Friendly reminder.

@kallaballa
Copy link
Copy Markdown
Contributor Author

Was locked out of my account. I am back on it :)

@kallaballa kallaballa changed the title fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using thread_local fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using <stike>thread_local</strike> a mutex Nov 10, 2023
@kallaballa kallaballa changed the title fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using <stike>thread_local</strike> a mutex fix race condition in oclCvtColorBGR2HSV (color_hsv.dispatch.cpp) by using a mutex Nov 10, 2023
@kallaballa
Copy link
Copy Markdown
Contributor Author

this PR depends on #24326 to work correctly in mutli-threaded environments.

I implemented a compromise in matters of static.

@kallaballa
Copy link
Copy Markdown
Contributor Author

Waiting for the tests. Apart from that i think the PR is ready.

static int hdiv_table256[256];
static volatile bool initialized180 = false, initialized256 = false;
volatile bool & initialized = hrange == 180 ? initialized180 : initialized256;
static std::mutex mtx;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks bare metal build configurations without any threading support (see OPENCV_DISABLE_THREAD_SUPPORT)

Use OpenCV wrappers instead.

Reuse cv::getInitializationMutex() instead of creation of new one.

Copy link
Copy Markdown
Contributor Author

@kallaballa kallaballa Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh i understand. thx for the info. is there a list of features i should usually test for? i started to test for WITH_OPENCL=OFF and i will keep in mind that there are targets without thread support... are there more configurations i should keep in mind?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 031abea. Tested locally by running opencv_test_improc with following build configuration. Btw. i found cv::utils::lock_guard but it doesn't seem necessary since there is C++11 support.

cmake -DINSTALL_BIN_EXAMPLES=OFF -DBUILD_SHARED_LIBS=ON -DWITH_OPENCL=ON -DBUILD_opencv_java=OFF -DBUILD_opencv_js=OFF -DBUILD_opencv_python2=OFF -DBUILD_opencv_python3=OFF -DBUILD_EXAMPLES=OFF -DBUILD_PACKAGE=OFF -DBUILD_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_DOCS=OFF -DBUILD_opencv_videoio=ON -DBUILD_opencv_highgui=ON -DBUILD_opencv_ts=ON -DBUILD_opencv_imgcodecs=ON -DBUILD_opencv_plot=OFF -DBUILD_opencv_tracking=OFF -DBUILD_opencv_video=OFF -DBUILD_opencv_world=ON -DCMAKE_C_FLAGS=-DOPENCV_DISABLE_THREAD_SUPPORT -DCMAKE_CXX_FLAGS=-DOPENCV_DISABLE_THREAD_SUPPORT ../opencv

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::lock_guard => cv::AutoLock

Did you measure performance changes due to added of .copyTo() calls?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: 8cbf9c7

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::lock_guard => cv::AutoLock

Did you measure performance changes due to added of .copyTo() calls?

I compared the current HEAD of 4.x with the branch of the PR (which i updated to the least 4.x HEAD) .
I used follwing build configuration for both branches:

cmake -DINSTALL_BIN_EXAMPLES=OFF -DBUILD_SHARED_LIBS=ON -DWITH_OPENCL=ON -DBUILD_opencv_java=OFF -DBUILD_opencv_js=OFF -DBUILD_opencv_python2=OFF -DBUILD_opencv_python3=OFF -DBUILD_EXAMPLES=OFF -DBUILD_PACKAGE=OFF -DBUILD_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_DOCS=OFF -DBUILD_opencv_videoio=ON -DBUILD_opencv_highgui=ON -DBUILD_opencv_ts=ON -DBUILD_opencv_imgcodecs=ON -DBUILD_opencv_plot=OFF -DBUILD_opencv_tracking=OFF -DBUILD_opencv_video=OFF -DBUILD_opencv_world=ON -DENABLE_PROFILING=ON ../opencv

And the following test program: https://github.com/kallaballa/OpenCV-Issues/blob/main/src/oclCvtColorBGR2HSV-race/perf-test.cpp

Without patch it takes: ~14.8s
With patch it takes: ~15.2s

So there is a slow-down. I'll optimize a bit and measure again.

Copy link
Copy Markdown
Contributor Author

@kallaballa kallaballa Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When profiling i had suspicious results and so i increased the iterations by 10x.
This gave me

  • 154.483s for patched without copy.
  • 156.108s for patch with copy.
  • 153.803s without patch.

I think the differences are still quite influenced by variance. anyway it is not much. should i continue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meassured also the variant no-static/no-thread_local/no-locking: 155.512s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always reuse OpenCV performance tests or add them if something is missing.

Avoid using of hand-written tests cases

suspicious results

Provided measurement code has bug due to async execution of cvtColor(UMat) (code should schedule OpenCL kernels, but do not wait the final result).

P.S. Mat to UMat coping is synchronous call in current API - forces wait of OpenCL execution queue.

@asmorkalov
Copy link
Copy Markdown
Contributor

@opencv-alalek @vpisarev please review.

@kallaballa
Copy link
Copy Markdown
Contributor Author

kallaballa commented Dec 14, 2023

@opencv-alalek @vpisarev please review.

I would prefer the always-copy variant till diverging cl_context issues have been resolved. because the other variants need special handling in multi-context/multi-thread situations. and the performance difference is neglect-able.

Which one should i commit before you review?

@kallaballa
Copy link
Copy Markdown
Contributor Author

@opencv-alalek @vpisarev please review.

I would prefer the always-copy variant till diverging cl_context issues have been resolved. because the other variants need special handling in multi-context/multi-thread situations. and the performance difference is neglect-able.

Which one should i commit before you review?

Also i have changes to the current state of the PR. should i add them?

@asmorkalov
Copy link
Copy Markdown
Contributor

@opencv-alalek @kallaballa What is the PR status? Is it ready for merge?

@opencv-alalek opencv-alalek modified the milestones: 4.9.0, 4.10.0 Dec 21, 2023
@asmorkalov asmorkalov modified the milestones: 4.10.0, 4.11.0 May 16, 2024
@asmorkalov asmorkalov modified the milestones: 4.11.0, 4.12.0 Dec 20, 2024
@kallaballa
Copy link
Copy Markdown
Contributor Author

I just checked. It isn't ready. Picking it back up.

@asmorkalov
Copy link
Copy Markdown
Contributor

@kallaballa What is the PR status? Is it ready for review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants