UMat usageFlags fixes opencv/opencv#19807 (#20027)
opencv-pushbot merged 1 commit into opencv:master from diablodale:fix19807-UMat-usageFlags
Conversation
I don't see any failures in the build related to my code changes.
@diablodale, only the default check is required for now, so other issues can be ignored. Could you please also add regression test(s) for this issue? Did you try to compare perf results before and after the patch?
**New test cases**

Yes, I will add regression tests for this issue. It will be in my next force-push (which might include any needed changes for 3.x).

**Performance**

This is a concern of mine that I discuss at the top of the original issue #19807.
The specific use case of non-DEFAULT very much depends on the CPU, the GPU, the GPU's cores and OpenCL driver, and the app code itself. This combination has interactions of PCIe buses, caches, memory timing, driver optimizations, OpenCL implementation, etc. The performance on a computer with a discrete NVidia 1080 is different than an integrated Intel HD. The NVidia almost always has to move memory across the PCIe bus. The Intel might be able to use the same RAM memory as the CPU. In both of those cases, the three different USAGE settings will have different interactions and likely some measurable difference in performance if run millions of times. https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/ is one article where NVidia writes about exactly this topic.
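To put rough numbers on the transfer cost discussed above, here is a back-of-envelope sketch. The bandwidth figure is an assumed theoretical PCIe 3.0 x16 number, not a measurement from this PR, so treat the result as an order-of-magnitude illustration only.

```python
# Rough estimate of why a discrete GPU pays a per-frame PCIe transfer
# cost that an integrated GPU sharing system RAM may avoid entirely.
# Bandwidth value is an assumed theoretical peak, not measured.

frame_bytes = 1920 * 1080 * 3          # one 1080p 8-bit BGR frame
pcie3_x16_bytes_per_s = 15.75e9        # ~theoretical PCIe 3.0 x16 peak

one_way_ms = frame_bytes / pcie3_x16_bytes_per_s * 1e3
round_trip_ms = 2 * one_way_ms         # upload + download

print(f"frame size:  {frame_bytes / 1e6:.1f} MB")
print(f"one-way:     {one_way_ms:.3f} ms")
print(f"round trip:  {round_trip_ms:.3f} ms per frame")
```

Under these assumptions a full round trip is well under a millisecond per 1080p frame, which is consistent with the observation below that 5 streams at 30 fps showed no consistent end-to-end difference.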
I will write some test cases that iterate millions of times and include a report. It might tell us a relative difference on my specific computer, specific GPUs, specific drivers, for a specific codepath in an app. In my app that processes 5 streams of visual data at 30fps, I saw no consistent, universally measurable perf changes. I thought I saw one case of measurable improvement, but it wasn't consistent.

**What data do we need from a performance test case?**

What data do you think we need to have? Specifically, what exact test case(s) do I need to write? I can imagine 3 groups, one group for each usage type. Then within a group, do something that tests the allocation and transfer of memory, and doesn't have GPU computation work that would be magnitudes more costly -- that cost would overwhelm the transfer metrics. For example, it would probably be bad to run a DNN model. Perhaps… Do I need to manipulate the OpenCV cache pool/allocator? Settings like…
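The "allocation and transfer only, no heavy compute" idea above can be sketched as a generic timing harness. This is plain Python, not the OpenCV perf framework; the `measure` helper and the `bytearray` workload are hypothetical stand-ins for a UMat allocate/upload/download cycle.

```python
import statistics
import time

def measure(op, iterations=1_000):
    """Time `op` many times; report median and spread rather than mean,
    since transfer benchmarks are dominated by outliers (cache warmup,
    driver queues, OS scheduling)."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        op()
        samples.append(time.perf_counter() - t0)
    return {
        "median_us": statistics.median(samples) * 1e6,
        "stdev_us": statistics.stdev(samples) * 1e6,
        "n": len(samples),
    }

# Stand-in workload: allocate and zero a 1080p-sized buffer -- a
# placeholder for the allocation/transfer path, with no GPU compute
# that would overwhelm the transfer metrics.
result = measure(lambda: bytearray(6_220_800))
print(result)
```

The key design point is isolating memory traffic from compute: the measured operation should cost microseconds-to-milliseconds, so that any usage-flag difference is not buried under kernel execution time.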
No, it is not necessary to write new performance tests. I meant running the existing test suite (at least for the core and imgproc modules): https://github.com/opencv/opencv/wiki/HowToUsePerfTests

```
export OPENCV_TEST_DATA_PATH=...
# without patch
mkdir -p before
python3 opencv/modules/ts/misc/run.py -t core,imgproc -w before build
# with patch
mkdir -p after
python3 opencv/modules/ts/misc/run.py -t core,imgproc -w after build
```

...and comparing the results with the summary.py script:

```
for m in core imgproc ; do
  python3 opencv/modules/ts/misc/summary.py -o html before/*${m}*.xml after/*${m}*.xml > report_${m}.html
done
```

The idea is to check whether this patch makes a significant performance difference (faster or slower by >30%) on iGPU and dGPU (if possible). I don't think advanced settings should be changed. We even have a perf test for usage flags: https://github.com/opencv/opencv/blob/master/modules/core/perf/opencl/perf_usage_flags.cpp

**Note**

There can be cases when performance results are not stable; you will see lines like this in the test log (stdev is >3-5%): … If you experience these issues, it is recommended to prepare the machine prior to testing: close all other applications, fixate CPU/GPU frequencies, disable turbo boost, etc.
Same chart and pivot table for the … If I filter for only the "OCL" test cases...
I don't have confidence in using the bulk test cases for this. Perhaps I am cautious. I generated a report to compare the before code (official 4.5.2) to itself: compiled and ran yesterday, compiled and ran today, then generated the report with those two datasets. I got similar results...some better...some worse. And the variation was large. Here is the report_core.zip and two lines from it.

I think the only thing I can learn from these bulk reports is whether all OCL tests were slower. If I saw a clear, consistent pattern like that, then I would say we have a code problem. But since the sample data varies by 50% with the same code run on two different days, I don't think much more can be learned.

In contrast, a focused test that runs many more than 13 samples (a very common number of samples the test harness runs) over a longer period of time might give us more meaningful information. Yes, the caches in the hardware (CPU, GPU) will be engaged. But the traffic across the PCIe bus (or not, for integrated) remains, as does CPU and GPU memory contention. I'm going to write an additional perf test case that runs for a long time to see if I can discern anything. The existing UsageFlags perf test case does show performance differences when the different combinations of memory types are used. I want to make those clearer.
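The claim that 13 samples is too few can be illustrated with a small simulation. The timing model here (a fixed true cost plus Gaussian run-to-run noise) is an assumption for illustration, not real benchmark data; the point is only that the standard error of the mean shrinks roughly as 1/sqrt(n).

```python
import random
import statistics

random.seed(42)  # deterministic for reproducibility

def noisy_runs(n, true_ms=10.0, noise_ms=2.0):
    # Simulated benchmark samples: fixed true cost + heavy run-to-run noise.
    return [true_ms + random.gauss(0, noise_ms) for _ in range(n)]

# Standard error of the mean shrinks ~1/sqrt(n): 13 samples (the harness
# default mentioned above) leaves far more uncertainty than a long run.
sem_13 = statistics.stdev(noisy_runs(13)) / 13 ** 0.5
sem_10k = statistics.stdev(noisy_runs(10_000)) / 10_000 ** 0.5
print(f"SEM with 13 samples:    {sem_13:.3f} ms")
print(f"SEM with 10000 samples: {sem_10k:.3f} ms")
```

With per-run noise of this magnitude, an effect smaller than the 13-sample standard error simply cannot be distinguished from day-to-day variation, which matches the 50% same-code swings described above.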
@diablodale, perf tests have parameters …
I cannot discern from the performance reports any significant and consistent improvements or slowdowns associated with the code fix. The same code run on two different days often has variations of 45% or more (xfactor 0.39, 0.52, etc.). However, it is a very small number of test cases that have these large variations. I believe they are outliers and test noise. All charts include data that are...

**Core**

The vast majority of test cases lies between 0.7 and 1.3 xfactor. All test cases are sorted by xfactor from most improvement to least; the Y axis is the xfactor for the infinitely thin "bar" in the chart. It shows a "long tail" distribution on each side. The box-and-whisker chart is sorted by xfactor of individual test cases from most to least while also grouped by test group. This chart suggests that the extreme improvements are outliers and likely test noise; the more consistent test results are the shaded bars that are usually around 1.0. The reverse-sorted box-and-whisker chart suggests that the extreme declines are also test noise.

**Imgproc**

The vast majority of test cases lies between 0.7 and 1.3 xfactor. All test cases are sorted by xfactor from most improvement to least; the Y axis is the xfactor for the infinitely thin "bar" in the chart. It shows a "long tail" distribution on each side. The box-and-whisker chart is sorted by xfactor of individual test cases from most to least while also grouped by test group. The imgproc module has more outliers, though this still does not concern me given the other research.
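The xfactor screening described above can be sketched as a tiny filter. The test names and timings below are illustrative placeholders, not real report output; the 0.7-1.3 band is the threshold used in the analysis above.

```python
# xfactor = time_before / time_after, so >1.0 means the patch improved
# that case. Values far outside 0.7-1.3 on a same-code A/A run are
# treated as noise. Data is illustrative, not real test output.
results = {
    "OCL_Add::1080p": (4.10, 4.02),
    "OCL_Dft::1080p": (39.4, 40.1),
    "OCL_Flip::VGA":  (0.90, 0.35),   # suspiciously large "improvement"
}

for name, (before_ms, after_ms) in results.items():
    xfactor = before_ms / after_ms
    tag = "ok" if 0.7 <= xfactor <= 1.3 else "OUTLIER?"
    print(f"{name:18s} x{xfactor:.2f}  {tag}")
```

A case like the third line, showing a ~2.6x "speedup" in isolation, is exactly the kind of long-tail point the box-and-whisker charts flag as noise rather than a real effect of the patch.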
@mshabunin, do you have any comments regarding the above perf test results?
**alalek** left a comment

Thank you for the contribution!
Please take a look at the comments below.
```cpp
OCL_TEST_SIZES,
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY), // USAGE_ALLOCATE_SHARED_MEMORY
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY), // USAGE_ALLOCATE_SHARED_MEMORY
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY)  // USAGE_ALLOCATE_SHARED_MEMORY
```
These tests are very long (>50% of the whole opencv_perf_core):

```
108 tests from OCL_SizeUsageFlagsFixture_UsageFlags_AllocMem (104914 ms total)
```

Perhaps we should replace `OCL_TEST_SIZES` with `::testing::Values(sz1080p)` and revert this change: `OCL_TEST_CYCLE_MULTIRUN(500)`
As I wrote in the OP, this PR is **not** yet ready to be merged. I have been seeking code and performance review. I have those two now, so I will start looking at 3.x and whether I can readily backport this.

These long perf tests will not be part of the final PR. They are only in the current changeset so that I can show you the tests that I ran to generate the above performance reports, and to have broader coverage on more platforms than I have locally. They will be used in my effort to backport to 3.x (if that is readily possible).

I will definitely remove the `OCL_TEST_CYCLE_MULTIRUN()` in the final code. At the same time, I will evaluate whether I can continue to use `OCL_TEST_SIZES`. We might even consider making this perf test disabled.
The updated PR no longer loops 500 times. On my test machine, the test you cite above is now smaller than other perf tests; e.g., the Dft test is 3 times longer:

```
144 tests from OCL_DftFixture_Dft (39435 ms total)
...
108 tests from OCL_SizeUsageFlagsFixture_UsageFlags_AllocMem (13932 ms total)
```

Is this length ok?

If you still want to remove sizes, then I recommend we keep only `OCL_SIZE_1` (which is szVGA) and `OCL_SIZE_4` (which is sz2160p) so that we have the smallest and the largest. That will probably halve the run time again.
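The szVGA/sz2160p suggestion can be quantified with quick arithmetic. The byte-per-pixel figure is an assumption (8-bit 3-channel) for illustration; the point is that keeping only the smallest and largest standard sizes still spans a wide range of data volume.

```python
# Approximate buffer sizes behind the szVGA / sz2160p suggestion,
# assuming 8-bit 3-channel data (illustrative).
sizes = {"szVGA": (640, 480), "sz1080p": (1920, 1080), "sz2160p": (3840, 2160)}
bytes_per_px = 3
for name, (w, h) in sizes.items():
    print(f"{name}: {w * h * bytes_per_px / 1e6:.1f} MB")

ratio = (3840 * 2160) / (640 * 480)
print(f"sz2160p / szVGA pixel ratio: {ratio:.0f}x")
```

So the two-size subset still exercises a 27x spread in allocation/transfer size, which is likely enough to expose size-dependent usage-flag effects while cutting the run time.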
The default CI on Linux x64 OpenCL and Windows OpenCL shows the perf test runs in single-digit seconds, or in less than one second. Dft is still larger on both.
I think we can keep the perf test as it is now. Agree?
If yes, then I have no further work to do and, from my evaluation, this PR is ready to merge into master.
@alalek, this PR cannot be applied to the 3.x branch. That branch has no support for move semantics; therefore, there are no methods and no operator= that use move semantics. This PR has changes which fix issues in those move methods, and if you try to apply it, it will fail because those methods don't exist. A subset of this PR could be applied to 3.x. So I recommend one of the following: …
What approach do you recommend?
Please continue working on the "master" branch.
Ok. I will now...
If you potentially cherry-pick to the 3.4 branch, I want to caution that...
- corrects code to support non-USAGE_DEFAULT settings
- accuracy, regression, perf test cases
- not tested on the 3.x branch












fixes #19807

I have been using this `cv::UMat` fix in my own code for one month, with OpenCL on and off, and with two GPUs (Intel and NVidia).
It passes all core unit tests when applied to the 4.5.2 branch.
No known issues.
Pull Request Readiness Checklist