UMat usageFlags fixes opencv/opencv#19807 #20027

Merged
opencv-pushbot merged 1 commit into opencv:master from diablodale:fix19807-UMat-usageFlags
Jun 7, 2021

Conversation

@diablodale
Contributor

fixes #19807

  • Request code review of these changes before merge.
  • I do not recommend merging this until after code review; after review I will try this on the 3.x branch and make any needed changes. I have not done any coding or testing of this on the legacy 3.x branch, so I request code review of this approach before I do anything in 3.x.
  • Clarified doxygen comments on some fields of cv::UMat
  • Please consider my notes and caveats at UMat constructors and UMat::create() discard UMatUsageFlags; resets to USAGE_DEFAULT #19807 (comment)

I have been using this in my own code for one month with opencl on/off and with two GPUs (intel and nvidia).
Passes all core unit tests when applied to the 4.5.2 branch.
No known issues.
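The underlying bug (see #19807) is that UMat constructors and UMat::create() discarded the requested UMatUsageFlags and reset them to USAGE_DEFAULT. A toy Python model of the before/after behavior (class, method, and field names are invented for illustration; this is not OpenCV code):

```python
# Toy model of the reported behavior; names are invented, not OpenCV's API.
USAGE_DEFAULT = 0
USAGE_ALLOCATE_HOST_MEMORY = 1

class ToyUMat:
    def __init__(self, usage_flags=USAGE_DEFAULT):
        self.usage_flags = usage_flags

    def create_buggy(self, rows, cols):
        # Pre-fix behavior: whatever flag the caller requested is discarded.
        self.usage_flags = USAGE_DEFAULT

    def create_fixed(self, rows, cols, usage_flags=None):
        # Post-fix behavior: an explicit flag is honored, otherwise the
        # existing flag is kept.
        if usage_flags is not None:
            self.usage_flags = usage_flags

m = ToyUMat(USAGE_ALLOCATE_HOST_MEMORY)
m.create_buggy(480, 640)
assert m.usage_flags == USAGE_DEFAULT                 # flag silently lost

m2 = ToyUMat(USAGE_ALLOCATE_HOST_MEMORY)
m2.create_fixed(480, 640)
assert m2.usage_flags == USAGE_ALLOCATE_HOST_MEMORY   # flag preserved
```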

[----------] Global test environment tear-down
[ SKIPSTAT ] 36 tests skipped
[ SKIPSTAT ] TAG='mem_6gb' skip 1 tests
[ SKIPSTAT ] TAG='skip_bigdata' skip 1 tests
[ SKIPSTAT ] TAG='skip_other' skip 34 tests
[==========] 11621 tests from 250 test cases ran. (764388 ms total)
[  PASSED  ] 11621 tests.
  YOU HAVE 19 DISABLED TESTS

Pull Request Readiness Checklist

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • The feature is well documented and sample code can be built with the project CMake

@diablodale
Contributor Author

I don't see any failures in the build related to my code changes.
Instead, the fails appear related to Flann code and to the test infrastructure itself.
If you think otherwise, please point me to the specific test case failure that is related to my code.

@mshabunin
Contributor

@diablodale , only default check is required for now, so other issues can be ignored.

Could you please also add regression test(s) for this issue? Did you try to compare perf results before and after the patch?

@diablodale
Contributor Author

New test cases

Yes, I will add regression tests for this issue. They will be in my next force-push (which might include any needed changes for 3.x).

Performance

This is a concern of mine that I discuss at the top of the original issue #19807
I believe we need to announce the fix in the next release. Something like

Fixed a bug in our OpenCL memory management. It will not affect most users since USAGE_DEFAULT has not changed. If you use OpenCL in your OpenCV app with a usage type other than USAGE_DEFAULT, we recommend you test your app for any performance changes. If the changes are negative, we recommend changing your code to USAGE_DEFAULT, since that was the actual usage before this fix.

The specific use case of non-DEFAULT very much depends on the CPU, GPU, the GPU's cores and OpenCL driver, and the app code itself. This combination has interactions of PCIe buses, caches, memory timing, driver optimizations, OpenCL implementation, etc. The performance on a computer with a discrete NVIDIA 1080 is different than on an integrated Intel HD. The NVIDIA almost always has to move memory across the PCIe bus. The Intel might be able to use the same RAM as the CPU. In both cases, the three different USAGE settings will have different interactions and likely some measurable difference in performance if run millions of times. https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/ is such a topic, where NVIDIA writes

You should not over-allocate [host] pinned memory....How much is too much is difficult to tell in advance, so as with all optimizations, test your applications and the systems they run on for optimal performance parameters.

I will write some test cases that iterate millions of times and include a report. It might tell us a relative difference, on my specific computer, specific GPUs, specific drivers, for a specific codepath in an app.
I have no problem doing this. 👍🙂
And...I don't think we will take any meaningful action on whatever result I get.

In my app that processes 5 streams of visual data at 30fps, I saw no consistent universal measurable perf changes. I thought I saw one case of measurable improvement, but it wasn't consistent.

What data do we need from a performance test case?

What data do you think we need to have? Specifically.

What exact test case(s) do I need to write?

I can imagine 3 groups; one group for each usage type: USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, and USAGE_ALLOCATE_DEVICE_MEMORY

Then within a group, do something that tests the allocation and transfer of memory...and doesn't involve GPU computation that would be orders of magnitude more costly -- that cost would overwhelm the transfer metrics. For example, it would probably be bad to run a DNN model. Perhaps a UMat::ones()*UMat::ones() operation? 🤷‍♀️
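A measurement harness along these lines could take many samples and report the median, so that one-off effects don't dominate the transfer metric. A sketch in plain Python, timing a placeholder allocation rather than any OpenCV call (all names here are invented):

```python
import statistics
import time

def bench(op, samples=100, inner=1000):
    """Time `op` inner-times per sample; return (median, stddev) in µs per call."""
    per_call_us = []
    for _ in range(samples):
        t0 = time.perf_counter()
        for _ in range(inner):
            op()
        t1 = time.perf_counter()
        per_call_us.append((t1 - t0) / inner * 1e6)
    return statistics.median(per_call_us), statistics.pstdev(per_call_us)

# Placeholder workload; in a real test this would be an allocation/transfer
# under a specific UMatUsageFlags setting.
median_us, stdev_us = bench(lambda: bytearray(1 << 16), samples=20, inner=100)
assert median_us > 0.0
```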

Do I need to manipulate the OpenCV cache pool/allocator? Settings like OPENCV_OPENCL_HOST_PTR_BUFFERPOOL_LIMIT or OPENCV_OPENCL_BUFFER_FORCE_MAPPING, etc. ? 🤷‍♂️

@mshabunin
Contributor

No, it is not necessary to write new performance tests; I meant running the existing test suite (at least for the core and imgproc modules): https://github.com/opencv/opencv/wiki/HowToUsePerfTests
Running...

export OPENCV_TEST_DATA_PATH=...
# without patch
mkdir -p before
python3 opencv/modules/ts/misc/run.py -t core,imgproc -w before build
# with patch
mkdir -p after
python3 opencv/modules/ts/misc/run.py -t core,imgproc -w after build

...and comparing results with summary.py script

for m in core imgproc ; do
  python3 opencv/modules/ts/misc/summary.py -o html before/*${m}*.xml after/*${m}*.xml > report_${m}.html
done

The idea is to check whether this patch makes significant performance difference (faster or slower by >30%) on iGPU and dGPU (if possible). I don't think advanced settings should be changed.
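The >30% criterion could be checked mechanically once per-test timings are extracted from the reports. A sketch in plain Python (the dict format and function name are invented, not summary.py's actual output):

```python
def significant_changes(before, after, threshold=0.30):
    """Return {test: xfactor} where xfactor = before/after deviates
    from 1.0 by more than `threshold` (i.e. >30% faster or slower)."""
    flagged = {}
    for name, t_before in before.items():
        t_after = after.get(name)
        if not t_after:
            continue
        xfactor = t_before / t_after   # >1 means the patched build is faster
        if abs(xfactor - 1.0) > threshold:
            flagged[name] = xfactor
    return flagged

# Invented example timings (milliseconds); only Dft crosses the 30% bar.
before = {"OCL_Dft": 10.0, "OCL_Add": 2.0}
after  = {"OCL_Dft":  7.0, "OCL_Add": 1.9}
assert set(significant_changes(before, after)) == {"OCL_Dft"}
```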

We even have perf test for usage flags: https://github.com/opencv/opencv/blob/master/modules/core/perf/opencl/perf_usage_flags.cpp

Note

There can be cases when performance results are not stable; you will see lines like this in the test log (stddev is >3-5%):

[ PERFSTAT ]    (samples=100   mean=2.02   median=1.44   min=0.34   stddev=1.68 (83.4%))

If you experience these issues, it is recommended to prepare the machine prior to testing: close all other applications, fix CPU/GPU frequencies, disable turbo boost, etc.

@diablodale
Contributor Author

diablodale commented May 5, 2021

Not seeing actionable data in the reports. Suggestions?
Before running run.py the env var was set "OPENCV_OPENCL_DEVICE": "NVIDIA:GPU:"

report1.zip

Processing all the data and test cases further in Excel with charts and pivot tables...

Overall, if % changes of all test cases are averaged, the fix is 1.2% faster.
image

If I filter and only include test cases that start with "OCL", then it is overall an average of 2.3% faster.

image

Row Labels Average of %change, negative is slower
OCL_AbsDiffFixture 3.8%
OCL_AddFixture 1.9%
OCL_AddWeightedFixture -1.8%
OCL_BitwiseAndFixture -0.6%
OCL_BitwiseNotFixture -5.5%
OCL_BitwiseOrFixture -6.9%
OCL_BitwiseXorFixture -0.8%
OCL_BufferPoolFixture 1.9%
OCL_CartToPolarFixture 2.5%
OCL_CompareFixture -0.7%
OCL_ConvertScaleAbsFixture -1.2%
OCL_ConvertToFixture 2.0%
OCL_CopyToFixture -0.6%
OCL_CountNonZeroFixture 3.2%
OCL_DftFixture 14.3%
OCL_DivFixture 3.8%
OCL_ExpFixture 2.5%
OCL_ExtractChannelFixture 5.1%
OCL_FlipFixture -3.1%
OCL_GemmFixture -0.3%
OCL_InRangeFixture 2.5%
OCL_InsertChannelFixture 6.7%
OCL_LogFixture 5.0%
OCL_LUTFixture 5.2%
OCL_MagnitudeFixture 2.8%
OCL_MaxFixture 3.3%
OCL_MeanStdDevFixture 0.6%
OCL_MergeFixture 1.4%
OCL_MinFixture 9.1%
OCL_MinMaxLocFixture -1.6%
OCL_MixChannelsFixture -0.8%
OCL_MulFixture 4.6%
OCL_MulSpectrumsFixture -1.7%
OCL_NormalizeFixture -0.7%
OCL_NormFixture -1.1%
OCL_PatchNaNsFixture 1.9%
OCL_PhaseFixture -0.7%
OCL_PolarToCartFixture 9.4%
OCL_PowFixture 2.2%
OCL_PSNRFixture -2.3%
OCL_ReduceAccFixture 9.0%
OCL_ReduceMinMaxFixture 5.0%
OCL_RepeatFixture 3.7%
OCL_ScaleAddFixture 2.3%
OCL_SetIdentityFixture -2.3%
OCL_SetToFixture 5.3%
OCL_SizeUsageFlagsFixture 10.6%
OCL_SplitFixture -3.2%
OCL_SqrtFixture 4.9%
OCL_SubtractFixture -1.2%
OCL_SumFixture 0.0%
OCL_TransformFixture 0.1%
OCL_TransposeFixture -1.0%
OCL_UMatDotFixture -1.4%
OCL_UMatTest 1.2%
Grand Total 2.3%

@diablodale
Contributor Author

Same chart and pivot table for the imgproc module...

Overall 2.1% faster
image

If I filter for only the "OCL" test cases...
overall 3.6% faster
image

Row Labels Average of %change, negative is slower
OCL 25.3%
OCL__3vs4_Fixture -0.5%
OCL_AccumulateFixture -3.6%
OCL_AccumulateProductFixture 1.1%
OCL_AccumulateSquareFixture -5.8%
OCL_AccumulateWeightedFixture -0.6%
OCL_BilateralFixture 1.2%
OCL_BlendLinearFixture -2.5%
OCL_BlurFixture 11.9%
OCL_BuildPyramidFixture 0.9%
OCL_CalcBackProjFixture 34.6%
OCL_CalcHistFixture 29.4%
OCL_CannyFixture 0.8%
OCL_CLAHEFixture -4.1%
OCL_CopyMakeBorderFixture 9.2%
OCL_CornerHarrisFixture -0.6%
OCL_CornerMinEigenValFixture 1.1%
OCL_CV_TM_CCORR_NORMEDFixture 5.6%
OCL_CV_TM_CCORRFixture 3.9%
OCL_CvtColorFixture -0.3%
OCL_DilateFixture 2.4%
OCL_EqualizeHistFixture 27.0%
OCL_ErodeFixture 2.6%
OCL_Filter2DFixture 24.4%
OCL_GaussianBlurFixture -16.7%
OCL_GoodFeaturesToTrackFixture 1.7%
OCL_HoughLinesFixture -1.6%
OCL_HoughLinesPFixture 27.5%
OCL_ImgSize_TmplSize_Method_MatType 2.5%
OCL_IntegralFixture -0.7%
OCL_LaplacianFixture 13.7%
OCL_MedianBlurFixture 0.8%
OCL_MomentsFixture -0.9%
OCL_MorphologyExFixture 3.2%
OCL_PreCornerDetectFixture 0.6%
OCL_PyrDownFixture 11.8%
OCL_PyrUpFixture 2.0%
OCL_RemapFixture 1.9%
OCL_ResizeAreaFixture 2.1%
OCL_ResizeFixture 0.5%
OCL_ResizeLinearExactFixture 0.0%
OCL_ScharrFixture -11.1%
OCL_SepFilter2D 20.5%
OCL_SobelFixture -9.0%
OCL_SqrBoxFilterFixture 9.7%
OCL_ThreshFixture 3.3%
OCL_WarpAffineFixture 6.4%
OCL_WarpPerspectiveFixture 3.2%
Grand Total 3.6%

@diablodale
Contributor Author

I don't have confidence in using the bulk test cases for this. Perhaps I am overly cautious.
The variability is too high. Today I ran the same tests again, both before and after the fix.
Once again I got some better, some worse.

I generated a report comparing the before code (official 4.5.2) to itself: I compiled and ran it yesterday, compiled and ran it today, and generated the report from those two datasets. I got similar results...some better...some worse. And the variation was large. Here is the report_core.zip and two lines from it:

norm_mask::Size_MatType_NormType::(1920x1080, 32FC1, NORM_L2)     64% faster
CUDA_GpuMat_CopyToMasked::Sz_Depth_Cn::(1920x1080, CV_64F, Gray) 49% slower

I think the only thing I can learn from these bulk reports is whether all OCL tests were slower. If I saw a clear, consistent pattern like that, then I would say we have a code problem. But since the sample data varies by 50% with the same code run on two different days...I don't think much more can be learned.

In contrast, a focused test that runs many more than 13 samples (a very common sample count in the test harness) and over a longer period of time might give us more meaningful information. Yes, the caches in the hardware (CPU, GPU) will be engaged. But the traffic across the PCIe bus (or not, for integrated GPUs) remains, as well as CPU and GPU memory contention.

I'm going to write an additional perf test case that runs for a long time to see if I can discern anything. The existing UsageFlags perf test case does show performance differences when the different combinations of memory types are used. I want to make those clearer.
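A long-running focused test could summarize its samples the same way the harness's PERFSTAT line does, using the relative stddev to decide whether a run is trustworthy. A minimal sketch in plain Python (the function name and sample values are invented, chosen only to echo the scale of the PERFSTAT line quoted earlier):

```python
import statistics

def perfstat(samples):
    """Summarize timing samples like the PERFSTAT log line:
    mean, median, min, and stddev as a percentage of the mean."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "min": min(samples),
        "stddev_pct": 100.0 * sd / mean if mean else 0.0,
    }

stable   = [1.00, 1.02, 0.99, 1.01, 1.00]   # tight cluster: trustworthy
unstable = [0.34, 1.44, 2.02, 0.50, 5.80]   # wild spread: noise-dominated
assert perfstat(stable)["stddev_pct"] < 5.0
assert perfstat(unstable)["stddev_pct"] > 50.0
```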

@mshabunin
Contributor

@diablodale , perf tests have parameters --perf_min_samples=10 and --perf_force_samples=100 to set boundaries for the number of samples; a test usually stops either when the timeout (--perf_time_limit=3.0) is reached or when the result deviation is less than defined (--perf_max_deviation=1.0). Also, the option --gtest_filter=*OCL* should enable only OpenCL tests. You can pass these parameters to the run.py wrapper script or directly to the test executables.

@diablodale
Contributor Author

I cannot discern from the performance reports any significant and consistent improvements or slowdowns associated with the code fix. The same code run on two different days often has variations of 45% or more (xfactor 0.39, 0.52, etc.). However, only a very small number of test cases have these large variations. I believe they are outliers and test noise.

All charts include data that are...

  • OpenCL enabled and my discrete NVIDIA RTX 2070 Super selected using the env var
  • release x64 windows build
  • windows 10
  • thousands of test cases, therefore labeling test cases on an axis is usually not possible
  • all data has equal weighting; therefore, when parameterized test cases run many combinations of parameters (e.g. matrix sizes), those test cases have a larger contribution to every aggregate -- likely distorting some analysis -- please remember this
  • Microsecond resolution is used to get non-zero data despite roundoff. Yet...changes in the microsecond range are difficult to distinguish from test noise and generate xfactors that are likely invalid. For example
    // I doubt this is actually 50% faster and is instead test noise
    bitwise_xor::Size_MatType::(127x61, 8UC1)   before=0.3µs   after=0.2µs   xfactor=1.5
    
  • report was created with python3 ../modules/ts/misc/summary.py --units us -o html module_before.xml module_after.xml > fullus.html.
  • this fullus.html report was imported into Excel and the charts generated
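The microsecond roundoff concern can be made concrete: the 0.3µs/0.2µs pair is from the report line above, while the "true" timings below are invented to show how rounding alone can manufacture a large xfactor:

```python
# Two hypothetical true timings that differ by only ~8%...
true_before_us, true_after_us = 0.26, 0.24
true_xfactor = true_before_us / true_after_us   # ~1.08, barely an improvement

# ...but after rounding to the report's 0.1µs resolution they become
# 0.3µs and 0.2µs, exactly the pair shown in the report line above:
reported = round(true_before_us, 1) / round(true_after_us, 1)

assert abs(true_xfactor - 13 / 12) < 1e-9   # real change is ~8%
assert abs(reported - 1.5) < 1e-9           # rounding reports "50% faster"
```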

Core

  • 4287 data points (due to parameters) from 153 test cases within 121 test groups
  • data point average is xfactor 1.015379053 (tiny improvement -- likely noise)
  • test case average is xfactor 1.002387016 (tiny improvement -- likely noise)
  • test group average is xfactor 1.002512992 (tiny improvement -- likely noise)
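The gap between the data-point average and the test-case average comes from the equal-weighting caveat noted earlier: parameterized cases contribute many data points. A toy illustration with invented numbers:

```python
from statistics import fmean

# xfactors per data point, grouped by test case (invented numbers).
cases = {
    "big_parameterized": [1.3] * 8,   # 8 parameter combinations, all "faster"
    "small_case_a": [0.9],
    "small_case_b": [1.0],
}

# Every data point weighted equally vs. every test case weighted equally.
flat = fmean(x for xs in cases.values() for x in xs)
per_case = fmean(fmean(xs) for xs in cases.values())

assert flat > per_case   # the 8-point case dominates the flat average
```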

The vast majority of test cases lies between 0.7 and 1.3 xfactor
image

All test cases sorted by xfactor from most improvement to least improvement. Y axis is the xfactor for the infinitely thin "bar" in the chart. It shows a "long tail" distribution on each side.
image

Box and whisker chart sorted by xfactor of individual test cases from most to least while also grouped by test group. This chart suggests that these extreme improvements are outliers and likely test noise. The more consistent test results are the shaded bars that are usually around 1.0.
image

Box and whisker chart reverse sorted. This suggests that these extreme declines are also test noise.
image

Imgproc

  • 5364 data points (due to parameters) from 147 test cases within 113 test groups
  • data point average is xfactor 1.014306488 (tiny improvement -- likely noise)
  • test case average is xfactor 1.011526064 (tiny improvement -- likely noise)
  • test group average is xfactor 1.010814858 (tiny improvement -- likely noise)

The vast majority of test cases lies between 0.7 and 1.3 xfactor
image

All test cases sorted by xfactor from most improvement to least improvement. Y axis is the xfactor for the infinitely thin "bar" in the chart. It shows a "long tail" distribution on each side.
image

Box and whisker sorted by xfactor of individual test cases from most to least while also grouped by test group. The imgproc module has more outliers, though this still does not concern me given the other research.
image

Box and whisker chart reverse sorted.
image

@diablodale
Contributor Author

@mshabunin do you have any comments regarding the above perf test results?
If no comments, then I will shift my focus to examining whether something quick can be done for the 3.x branch.

Member

@alalek alalek left a comment


Thank you for contribution!

Please take a look on comments below.

OCL_TEST_SIZES,
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY), // USAGE_ALLOCATE_SHARED_MEMORY
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY), // USAGE_ALLOCATE_SHARED_MEMORY
testing::Values(USAGE_DEFAULT, USAGE_ALLOCATE_HOST_MEMORY, USAGE_ALLOCATE_DEVICE_MEMORY) // USAGE_ALLOCATE_SHARED_MEMORY
Member


These tests are very long (>50% of whole opencv_perf_core):

108 tests from OCL_SizeUsageFlagsFixture_UsageFlags_AllocMem (104914 ms total)

Perhaps we should replace OCL_TEST_SIZES with ::testing::Values(sz1080p)

and revert this change:

OCL_TEST_CYCLE_MULTIRUN(500)

Contributor Author


As I wrote in the OP...this PR is NOT NOT NOT ready to be merged. I have been seeking code and performance review. I have those two now...so I will start looking at 3.x and whether I can readily backport this.

These long perf tests will not be part of the final PR. They are only in the current changeset so that I can show you the tests that I run to generate the above performance reports, and to have broader coverage on more platforms than I have locally. They will be used in my effort to backport to 3.x (if that is readily possible).

I will definitely remove the OCL_TEST_CYCLE_MULTIRUN() in the final code. At the same time, I will evaluate whether I can continue to use OCL_TEST_SIZES. We might even consider disabling this perf test.

Contributor Author


The updated PR no longer loops 500 times. On my test machine, the test you quote above is shorter than other perf tests, e.g. the Dft test takes 3 times longer.

144 tests from OCL_DftFixture_Dft (39435 ms total)
...
108 tests from OCL_SizeUsageFlagsFixture_UsageFlags_AllocMem (13932 ms total)

Is this length ok?
If you still want to remove sizes, then I recommend we keep only OCL_SIZE_1 (which is szVGA) and OCL_SIZE_4 (which is sz2160p) so that we have the smallest and the largest. That will probably halve the run time again.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default CI on Linux x64 OpenCL and Windows OpenCL shows the perf test running in single-digit seconds, or even less than one second. Dft is still longer on both.

I think we can keep the perf test as it is now. Agree?
If yes, then I have no further work to do and, from my eval, this PR is ready to merge into master.

@diablodale
Contributor Author

diablodale commented Jun 2, 2021

@alalek, this PR cannot be applied to the 3.x branch. That branch has no support for move semantics, so there are no methods or operator= overloads that use move semantics. This PR includes changes that fix issues in those move methods, and if you try to apply it, it will fail because those methods don't exist.

A subset of this PR could be applied to 3.x. So I recommend one of the following:

  1. Don't apply the fix to 3.x. That old branch works as-is; it just can't support non-default OpenCL storage.
  2. Create two PRs, each one for each branch
  3. You cherry-pick and sort it out.

What approach do you recommend?

@alalek
Member

alalek commented Jun 2, 2021

Please continue working on the "master" branch.
I will cherry-pick changes after this PR is merged, if needed (the 3.4 branch will soon get low-profile support only).

@diablodale
Contributor Author

Ok. I will now...

  • remove the MULTIRUN(500) and experiment with its run time so it is not excessive, while also considering whether that specific perf test should be disabled
  • remove the python compatibility workaround
  • flatten, rebase, and force push again to master

If you potentially cherry-pick to 3.4 branch... I want to caution that...

  • I have only done a visual code review of this PR's changes on 3.4.
  • I think a subset of this PR will work. The subset would be all the changes except for the move-semantics methods/operators
  • I have not compiled these changes on 3.4 branch. They might need very minor adjustments when cherry-picked.
  • I have not done unit, functional, or performance testing on 3.4 branch. I do not recommend applying this PR to 3.4 without all this testing

- corrects code to support non- USAGE_DEFAULT settings
- accuracy, regression, perf test cases
- not tested on the 3.x branch
Member

@alalek alalek left a comment


Looks good! Thank you 👍

@opencv-pushbot opencv-pushbot merged commit 286ec92 into opencv:master Jun 7, 2021
@alalek alalek mentioned this pull request Jun 13, 2021
@alalek alalek mentioned this pull request Oct 15, 2021


Development

Successfully merging this pull request may close these issues.

UMat constructors and UMat::create() discard UMatUsageFlags; resets to USAGE_DEFAULT

4 participants