UMat usageFlags fixes opencv/opencv#19807 #20027

Merged
opencv-pushbot merged 1 commit into opencv:master from diablodale:fix19807-UMat-usageFlags
Jun 7, 2021

Conversation

@diablodale
Contributor

fixes #19807

  • Request code review of these changes before merge.
  • I do not recommend merging this until after code review; after review I will try this on the 3.x branch and make any needed changes. I have not done any coding or testing of this on the legacy 3.x branch, so I request code review of this approach before I do anything in 3.x.
  • Clarified doxygen comments on some fields of cv::UMat
  • Please consider my notes and caveats at UMat constructors and UMat::create() discard UMatUsageFlags; resets to USAGE_DEFAULT #19807 (comment)

I have been using this in my own code for one month with opencl on/off and with two GPUs (intel and nvidia).
Passes all core unit tests when applied to the 4.5.2 branch.
No known issues.
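The underlying bug (see #19807) is that UMat constructors and UMat::create() discarded the requested UMatUsageFlags and reset them to USAGE_DEFAULT. A toy Python model of the before/after behavior (class, method, and field names are invented for illustration; this is not OpenCV code):

```python
# Toy model of the reported behavior; names are invented, not OpenCV's API.
USAGE_DEFAULT = 0
USAGE_ALLOCATE_HOST_MEMORY = 1

class ToyUMat:
    def __init__(self, usage_flags=USAGE_DEFAULT):
        self.usage_flags = usage_flags

    def create_buggy(self, rows, cols):
        # Pre-fix behavior: whatever flag the caller requested is discarded.
        self.usage_flags = USAGE_DEFAULT

    def create_fixed(self, rows, cols, usage_flags=None):
        # Post-fix behavior: an explicit flag is honored, otherwise the
        # existing flag is kept.
        if usage_flags is not None:
            self.usage_flags = usage_flags

m = ToyUMat(USAGE_ALLOCATE_HOST_MEMORY)
m.create_buggy(480, 640)
assert m.usage_flags == USAGE_DEFAULT                 # flag silently lost

m2 = ToyUMat(USAGE_ALLOCATE_HOST_MEMORY)
m2.create_fixed(480, 640)
assert m2.usage_flags == USAGE_ALLOCATE_HOST_MEMORY   # flag preserved
```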

[----------] Global test environment tear-down
[ SKIPSTAT ] 36 tests skipped
[ SKIPSTAT ] TAG='mem_6gb' skip 1 tests
[ SKIPSTAT ] TAG='skip_bigdata' skip 1 tests
[ SKIPSTAT ] TAG='skip_other' skip 34 tests
[==========] 11621 tests from 250 test cases ran. (764388 ms total)
[  PASSED  ] 11621 tests.
  YOU HAVE 19 DISABLED TESTS

Pull Request Readiness Checklist

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • The feature is well documented and sample code can be built with the project CMake

@diablodale
Contributor Author

I don't see any failures in the build related to my code changes.
Instead, the fails appear related to Flann code and to the test infrastructure itself.
If you think otherwise, please point me to the specific test case failure that is related to my code.

@mshabunin
Contributor

@diablodale , only default check is required for now, so other issues can be ignored.

Could you please also add regression test(s) for this issue? Did you try to compare perf results before and after the patch?

@diablodale
Contributor Author

New test cases

Yes, I will add regression tests for this issue. They will be in my next force-push (which might include any needed changes for 3.x).

Performance

This is a concern of mine that I discuss at the top of the original issue #19807
I believe we need to announce the fix in the next release. Something like

Fixed a bug in our OpenCL memory management. It will not affect most users since USAGE_DEFAULT has not changed. If you use OpenCL in your OpenCV app with a usage type other than USAGE_DEFAULT, we recommend you test your app for any performance changes. If the changes are negative, we recommend changing your code to USAGE_DEFAULT, since that was the actual usage before this fix.

The specific use case of non-DEFAULT very much depends on the CPU, GPU, the GPU's cores and OpenCL driver, and the app code itself. This combination has interactions of PCIe buses, caches, memory timing, driver optimizations, OpenCL implementation, etc. The performance on a computer with a discrete NVIDIA 1080 is different than on an integrated Intel HD. The NVIDIA almost always has to move memory across the PCIe bus. The Intel might be able to use the same RAM as the CPU. In both cases, the three different USAGE settings will have different interactions and likely some measurable difference in performance if run millions of times. https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/ is such a topic, where NVIDIA writes

You should not over-allocate [host] pinned memory....How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

I will write some test cases that iterate millions of times and include a report. It might tell us a relative difference, on my specific computer, specific GPUs, specific drivers, for a specific codepath in an app.
I have no problem doing this. 👍🙂
And...I don't think we will take any meaningful action on whatever result I get.

In my app that processes 5 streams of visual data at 30fps, I saw no consistent universal measurable perf changes. I thought I saw one case of measurable improvement, but it wasn't consistent.

What data do we need from a performance test case?

What data do you think we need to have? Specifically.

What exact test case(s) do I need to write?

I can imagine 3 groups; one group for each usage type: USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, and USAGE_ALLOCATE_DEVICE_MEMORY

Then within a group, do something that tests the allocation and transfer of memory...and doesn't involve GPU computation that would be orders of magnitude more costly -- that cost would overwhelm the transfer metrics. For example, it would probably be bad to run a DNN model. Perhaps a UMat::ones()*UMat::ones() operation? 🤷‍♀️
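A measurement harness along these lines could take many samples and report the median, so that one-off effects don't dominate the transfer metric. A sketch in plain Python, timing a placeholder allocation rather than any OpenCV call (all names here are invented):

```python
import statistics
import time

def bench(op, samples=100, inner=1000):
    """Time `op` inner-times per sample; return (median, stddev) in µs per call."""
    per_call_us = []
    for _ in range(samples):
        t0 = time.perf_counter()
        for _ in range(inner):
            op()
        t1 = time.perf_counter()
        per_call_us.append((t1 - t0) / inner * 1e6)
    return statistics.median(per_call_us), statistics.pstdev(per_call_us)

# Placeholder workload; in a real test this would be an allocation/transfer
# under a specific UMatUsageFlags setting.
median_us, stdev_us = bench(lambda: bytearray(1 << 16), samples=20, inner=100)
assert median_us > 0.0
```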

Do I need to manipulate the OpenCV cache pool/allocator? Settings like OPENCV_OPENCL_HOST_PTR_BUFFERPOOL_LIMIT or OPENCV_OPENCL_BUFFER_FORCE_MAPPING, etc. ? 🤷‍♂️

@mshabunin
Contributor

No, it is not necessary to write new performance tests; I meant running the existing test suite (at least for the core and imgproc modules): https://github.com/opencv/opencv/wiki/HowToUsePerfTests
Running...

export OPENCV_TEST_DATA_PATH=...
# without patch
mkdir -p before
python3 opencv/modules/ts/misc/run.py -t core,imgproc -w before build
# with patch
mkdir -p after
python3 opencv/modules/ts/misc/run.py -t core,imgproc -w after build

...and comparing results with summary.py script

for m in core imgproc ; do
  python3 opencv/modules/ts/misc/summary.py -o html before/*${m}*.xml after/*${m}*.xml > report_${m}.html
done

The idea is to check whether this patch makes significant performance difference (faster or slower by >30%) on iGPU and dGPU (if possible). I don't think advanced settings should be changed.
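The >30% criterion could be checked mechanically once per-test timings are extracted from the reports. A sketch in plain Python (the dict format and function name are invented, not summary.py's actual output):

```python
def significant_changes(before, after, threshold=0.30):
    """Return {test: xfactor} where xfactor = before/after deviates
    from 1.0 by more than `threshold` (i.e. >30% faster or slower)."""
    flagged = {}
    for name, t_before in before.items():
        t_after = after.get(name)
        if not t_after:
            continue
        xfactor = t_before / t_after   # >1 means the patched build is faster
        if abs(xfactor - 1.0) > threshold:
            flagged[name] = xfactor
    return flagged

# Invented example timings (milliseconds); only Dft crosses the 30% bar.
before = {"OCL_Dft": 10.0, "OCL_Add": 2.0}
after  = {"OCL_Dft":  7.0, "OCL_Add": 1.9}
assert set(significant_changes(before, after)) == {"OCL_Dft"}
```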

We even have perf test for usage flags: https://github.com/opencv/opencv/blob/master/modules/core/perf/opencl/perf_usage_flags.cpp

Note

There can be cases when performance results are not stable; you will see lines like this in the test log (stddev is >3-5%):

[ PERFSTAT ]    (samples=100   mean=2.02   median=1.44   min=0.34   stddev=1.68 (83.4%))

If you experience these issues, it is recommended to prepare the machine prior to testing: close all other applications, fix CPU/GPU frequencies, disable turbo boost, etc.

@diablodale
Contributor Author

diablodale commented May 5, 2021

Not seeing actionable data in the reports. Suggestions?
Before running run.py the env var was set "OPENCV_OPENCL_DEVICE": "NVIDIA:GPU:"

report1.zip

Processing all the data and test cases further in Excel with charts and pivot tables...

Overall, if % changes of all test cases are averaged, the fix is 1.2% faster.
image

If I filter and only include test cases that start with "OCL", then it is overall an average of 2.3% faster.

image

Row Labels Average of %change, negative is slower
OCL_AbsDiffFixture 3.8%
OCL_AddFixture 1.9%
OCL_AddWeightedFixture -1.8%
OCL_BitwiseAndFixture -0.6%
OCL_BitwiseNotFixture -5.5%
OCL_BitwiseOrFixture -6.9%
OCL_BitwiseXorFixture -0.8%
OCL_BufferPoolFixture 1.9%
OCL_CartToPolarFixture 2.5%
OCL_CompareFixture -0.7%
OCL_ConvertScaleAbsFixture -1.2%
OCL_ConvertToFixture 2.0%
OCL_CopyToFixture -0.6%
OCL_CountNonZeroFixture 3.2%
OCL_DftFixture 14.3%
OCL_DivFixture 3.8%
OCL_ExpFixture 2.5%
OCL_ExtractChannelFixture 5.1%
OCL_FlipFixture -3.1%
OCL_GemmFixture -0.3%
OCL_InRangeFixture 2.5%
OCL_InsertChannelFixture 6.7%
OCL_LogFixture 5.0%
OCL_LUTFixture 5.2%
OCL_MagnitudeFixture 2.8%
OCL_MaxFixture 3.3%
OCL_MeanStdDevFixture 0.6%
OCL_MergeFixture 1.4%
OCL_MinFixture 9.1%
OCL_MinMaxLocFixture -1.6%
OCL_MixChannelsFixture -0.8%
OCL_MulFixture 4.6%
OCL_MulSpectrumsFixture -1.7%
OCL_NormalizeFixture -0.7%
OCL_NormFixture -1.1%
OCL_PatchNaNsFixture 1.9%
OCL_PhaseFixture -0.7%
OCL_PolarToCartFixture 9.4%
OCL_PowFixture 2.2%
OCL_PSNRFixture -2.3%
OCL_ReduceAccFixture 9.0%
OCL_ReduceMinMaxFixture 5.0%
OCL_RepeatFixture 3.7%
OCL_ScaleAddFixture 2.3%
OCL_SetIdentityFixture -2.3%
OCL_SetToFixture 5.3%
OCL_SizeUsageFlagsFixture 10.6%
OCL_SplitFixture -3.2%
OCL_SqrtFixture 4.9%
OCL_SubtractFixture -1.2%
OCL_SumFixture 0.0%
OCL_TransformFixture 0.1%
OCL_TransposeFixture -1.0%
OCL_UMatDotFixture -1.4%
OCL_UMatTest 1.2%
Grand Total 2.3%

@diablodale
Contributor Author

Same chart and pivot table for the imgproc module...

Overall 2.1% faster
image

If I filter for only the "OCL" test cases...
overall 3.6% faster
image

Row Labels Average of %change, negative is slower
OCL 25.3%
OCL__3vs4_Fixture -0.5%
OCL_AccumulateFixture -3.6%
OCL_AccumulateProductFixture 1.1%
OCL_AccumulateSquareFixture -5.8%
OCL_AccumulateWeightedFixture -0.6%
OCL_BilateralFixture 1.2%
OCL_BlendLinearFixture -2.5%
OCL_BlurFixture 11.9%
OCL_BuildPyramidFixture 0.9%
OCL_CalcBackProjFixture 34.6%
OCL_CalcHistFixture 29.4%
OCL_CannyFixture 0.8%
OCL_CLAHEFixture -4.1%
OCL_CopyMakeBorderFixture 9.2%
OCL_CornerHarrisFixture -0.6%
OCL_CornerMinEigenValFixture 1.1%
OCL_CV_TM_CCORR_NORMEDFixture 5.6%
OCL_CV_TM_CCORRFixture 3.9%
OCL_CvtColorFixture -0.3%
OCL_DilateFixture 2.4%
OCL_EqualizeHistFixture 27.0%
OCL_ErodeFixture 2.6%
OCL_Filter2DFixture 24.4%
OCL_GaussianBlurFixture -16.7%
OCL_GoodFeaturesToTrackFixture 1.7%
OCL_HoughLinesFixture -1.6%
OCL_HoughLinesPFixture 27.5%
OCL_ImgSize_TmplSize_Method_MatType 2.5%
OCL_IntegralFixture -0.7%
OCL_LaplacianFixture 13.7%
OCL_MedianBlurFixture 0.8%
OCL_MomentsFixture -0.9%
OCL_MorphologyExFixture 3.2%
OCL_PreCornerDetectFixture 0.6%
OCL_PyrDownFixture 11.8%
OCL_PyrUpFixture 2.0%
OCL_RemapFixture 1.9%
OCL_ResizeAreaFixture 2.1%
OCL_ResizeFixture 0.5%
OCL_ResizeLinearExactFixture 0.0%
OCL_ScharrFixture -11.1%
OCL_SepFilter2D 20.5%
OCL_SobelFixture -9.0%
OCL_SqrBoxFilterFixture 9.7%
OCL_ThreshFixture 3.3%
OCL_WarpAffineFixture 6.4%
OCL_WarpPerspectiveFixture 3.2%
Grand Total 3.6%

@diablodale
Contributor Author

I don't have confidence in using the bulk test cases for this. Perhaps I am overly cautious.
The variability is too high. Today I ran the same tests again, both before and after the fix.
Once again I got some better, some worse.

I generated a report comparing the before code (official 4.5.2) to itself: I compiled and ran it yesterday, compiled and ran it today, and generated the report from those two datasets. I got similar results...some better...some worse. And the variation was large. Here is the report_core.zip and two lines from it:

norm_mask::Size_MatType_NormType::(1920x1080, 32FC1, NORM_L2)     64% faster
CUDA_GpuMat_CopyToMasked::Sz_Depth_Cn::(1920x1080, CV_64F, Gray) 49% slower

I think the only thing I can learn from these bulk reports is whether all OCL tests were slower. If I saw a clear, consistent pattern like that, then I would say we have a code problem. But since the sample data varies by 50% with the same code run on two different days...I don't think much more can be learned.

In contrast, a focused test that runs many more than 13 samples (a very common sample count in the test harness) and over a longer period of time might give us more meaningful information. Yes, the caches in the hardware (CPU, GPU) will be engaged. But the traffic across the PCIe bus (or not, for integrated GPUs) remains, as well as CPU and GPU memory contention.

I'm going to write an additional perf test case that runs for a long time to see if I can discern anything. The existing UsageFlags perf test case does show performance differences when the different combinations of memory types are used. I want to make those clearer.
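A long-running focused test could summarize its samples the same way the harness's PERFSTAT line does, using the relative stddev to decide whether a run is trustworthy. A minimal sketch in plain Python (the function name and sample values are invented, chosen only to echo the scale of the PERFSTAT line quoted earlier):

```python
import statistics

def perfstat(samples):
    """Summarize timing samples like the PERFSTAT log line:
    mean, median, min, and stddev as a percentage of the mean."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "min": min(samples),
        "stddev_pct": 100.0 * sd / mean if mean else 0.0,
    }

stable   = [1.00, 1.02, 0.99, 1.01, 1.00]   # tight cluster: trustworthy
unstable = [0.34, 1.44, 2.02, 0.50, 5.80]   # wild spread: noise-dominated
assert perfstat(stable)["stddev_pct"] < 5.0
assert perfstat(unstable)["stddev_pct"] > 50.0
```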

@mshabunin
Contributor

@diablodale , perf tests have parameters --perf_min_samples=10 and --perf_force_samples=100 to set boundaries for the number of samples; a test usually stops either when the timeout (--perf_time_limit=3.0) is reached or when the result deviation is less than defined (--perf_max_deviation=1.0). Also, the option --gtest_filter=*OCL* should enable only OpenCL tests. You can pass these parameters to the run.py wrapper script or directly to the test executables.

@diablodale
Contributor Author

I cannot discern from the performance reports any significant and consistent improvements or slowdowns associated with the code fix. The same code run on two different days often has variations of 45% or more (xfactor 0.39, 0.52, etc.). However, only a very small number of test cases have these large variations. I believe they are outliers and test noise.

All charts include data that are...

  • OpenCL enabled and my discrete NVIDIA RTX 2070 Super selected using the env var
  • release x64 windows build
  • windows 10
  • thousands of test cases, therefore labeling test cases on an axis is usually not possible
  • all data has equal weighting; therefore, when parameterized test cases run many combinations of parameters (e.g. matrix sizes), those test cases have a larger contribution to every aggregate -- likely distorting some analysis -- please remember this
  • Microsecond resolution is used to get non-zero data despite roundoff. Yet...changes in the microsecond range are difficult to distinguish from test noise and generate xfactors that are likely invalid. For example
    // I doubt this is actually 50% faster and is instead test noise
    bitwise_xor::Size_MatType::(127x61, 8UC1)   before=0.3µs   after=0.2µs   xfactor=1.5
    
  • report was created with python3 ../modules/ts/misc/summary.py --units us -o html module_before.xml module_after.xml > fullus.html.
  • this fullus.html report was imported into Excel and the charts generated
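The microsecond roundoff concern can be made concrete: the 0.3µs/0.2µs pair is from the report line above, while the "true" timings below are invented to show how rounding alone can manufacture a large xfactor:

```python
# Two hypothetical true timings that differ by only ~8%...
true_before_us, true_after_us = 0.26, 0.24
true_xfactor = true_before_us / true_after_us   # ~1.08, barely an improvement

# ...but after rounding to the report's 0.1µs resolution they become
# 0.3µs and 0.2µs, exactly the pair shown in the report line above:
reported = round(true_before_us, 1) / round(true_after_us, 1)

assert abs(true_xfactor - 13 / 12) < 1e-9   # real change is ~8%
assert abs(reported - 1.5) < 1e-9           # rounding reports "50% faster"
```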

Core

  • 4287 data points (due to parameters) from 153 test cases within 121 test groups
  • data point average is xfactor 1.015379053 (tiny improvement -- likely noise)
  • test case average is xfactor 1.002387016 (tiny improvement -- likely noise)
  • test group average is xfactor 1.002512992 (tiny improvement -- likely noise)
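The gap between the data-point average and the test-case average comes from the equal-weighting caveat noted earlier: parameterized cases contribute many data points. A toy illustration with invented numbers:

```python
from statistics import fmean

# xfactors per data point, grouped by test case (invented numbers).
cases = {
    "big_parameterized": [1.3] * 8,   # 8 parameter combinations, all "faster"
    "small_case_a": [0.9],
    "small_case_b": [1.0],
}

# Every data point weighted equally vs. every test case weighted equally.
flat = fmean(x for xs in cases.values() for x in xs)
per_case = fmean(fmean(xs) for xs in cases.values())

assert flat > per_case   # the 8-point case dominates the flat average
```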

The vast majority of test cases lies between 0.7 and 1.3 xfactor
image

All test cases sorted by xfactor from most improvement to least improvement. Y axis is the xfactor for the infinitely thin "bar" in the chart. It shows a "long tail" distribution on each side.
image

Box and whisker chart sorted by xfactor of individual test cases from most to least while also grouped by test group. This chart suggests that these extreme improvements are outliers and likely test noise. The more consistent test results are the shaded bars that are usually around 1.0.
image

Box and whisker chart reverse sorted. This suggests that these extreme declines are also test noise.
image

Imgproc

  • 5364 data points (due to parameters) from 147 test cases within 113 test groups
  • data point average is xfactor 1.014306488 (tiny improvement -- likely noise)
  • test case average is xfactor 1.011526064 (tiny improvement -- likely noise)
  • test group average is xfactor 1.010814858 (tiny improvement -- likely noise)

The vast majority of test cases lies between 0.7 and 1.3 xfactor
image

All test cases sorted by xfactor from most improvement to least improvement. Y axis is the xfactor for the infinitely thin "bar" in the chart. It shows a "long tail" distribution on each side.
image

Box and whisker sorted by xfactor of individual test cases from most to least while also grouped by test group. The imgproc module has more outliers, though this still does not concern me given the other research.
image

Box and whisker chart reverse sorted.
image

@diablodale
Contributor Author

@mshabunin do you have any comments regarding the above perf test results?
If no comments, then I will shift my focus to examining whether something quick can be done for the 3.x branch.

Member

@alalek alalek left a comment


Thank you for contribution!

Please take a look on comments below.

OCL_TEST_SIZES,
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY), // USAGE_ALLOCATE_SHARED_MEMORY
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY), // USAGE_ALLOCATE_SHARED_MEMORY
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY) // USAGE_ALLOCATE_SHARED_MEMORY
Member


These tests are very long (>50% of whole opencv_perf_core):

108 tests from OCL_SizeUsageFlagsFixture_UsageFlags_AllocMem (104914 ms total)

Perhaps we should replace OCL_TEST_SIZES with ::testing::Values(sz1080p)

and revert this change:

OCL_TEST_CYCLE_MULTIRUN(500)

Contributor Author


As I wrote in the OP...this PR is NOT NOT NOT ready to be merged. I have been seeking code and performance review. I have those two now...so I will start looking at 3.x and whether I can readily backport this.

These long perf tests will not be part of the final PR. They are only in the current changeset so that I can show you the tests that I run to generate the above performance reports, and to have broader coverage on more platforms than I have locally. They will be used in my effort to backport to 3.x (if that is readily possible).

I will definitely remove the OCL_TEST_CYCLE_MULTIRUN() in the final code. At the same time, I will evaluate whether I can continue to use OCL_TEST_SIZES. We might even consider disabling this perf test.

Contributor Author


The updated PR no longer loops 500 times. On my test machine, the test you quote above is shorter than other perf tests, e.g. the Dft test takes 3 times longer.

144 tests from OCL_DftFixture_Dft (39435 ms total)
...
108 tests from OCL_SizeUsageFlagsFixture_UsageFlags_AllocMem (13932 ms total)

Is this length ok?
If you still want to remove sizes, then I recommend we keep only OCL_SIZE_1 (which is szVGA) and OCL_SIZE_4 (which is sz2160p) so that we have the smallest and the largest. That will probably halve the run time again.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default CI on Linux x64 OpenCL and Windows OpenCL shows the perf test running in single-digit seconds, or even less than one second. Dft is still longer on both.

I think we can keep the perf test as it is now. Agree?
If yes, then I have no further work to do and, from my eval, this PR is ready to merge into master.

@diablodale
Contributor Author

diablodale commented Jun 2, 2021

@alalek, this PR cannot be applied to the 3.x branch. That branch has no support for move semantics, so there are no methods or operator= overloads that use move semantics. This PR includes changes that fix issues in those move methods, and if you try to apply it, it will fail because those methods don't exist.

A subset of this PR could be applied to 3.x. So I recommend one of the following:

  1. Don't apply the fix to 3.x. That old branch works as-is; it just can't support non-default OpenCL storage.
  2. Create two PRs, each one for each branch
  3. You cherry-pick and sort it out.

What approach do you recommend?

@alalek
Member

alalek commented Jun 2, 2021

Please continue working on the "master" branch.
I will cherry-pick changes after this PR is merged, if needed (the 3.4 branch will soon get low-profile support only).

@diablodale
Contributor Author

Ok. I will now...

  • remove the MULTIRUN(500) and experiment with its run time so it is not excessive, while also considering whether that specific perf test should be disabled
  • remove the python compatibility workaround
  • flatten, rebase, and force push again to master

If you potentially cherry-pick to 3.4 branch... I want to caution that...

  • I have only done a visual code review of this PR's changes on 3.4.
  • I think a subset of this PR will work. The subset would be all the changes except for the move-semantics methods/operators
  • I have not compiled these changes on 3.4 branch. They might need very minor adjustments when cherry-picked.
  • I have not done unit, functional, or performance testing on 3.4 branch. I do not recommend applying this PR to 3.4 without all this testing

- corrects code to support non- USAGE_DEFAULT settings
- accuracy, regression, perf test cases
- not tested on the 3.x branch
Member

@alalek alalek left a comment


Looks good! Thank you 👍

@opencv-pushbot opencv-pushbot merged commit 286ec92 into opencv:master Jun 7, 2021
@alalek alalek mentioned this pull request Jun 13, 2021
@alalek alalek mentioned this pull request Oct 15, 2021


Development

Successfully merging this pull request may close these issues.

UMat constructors and UMat::create() discard UMatUsageFlags; resets to USAGE_DEFAULT

4 participants