Skip to content

add thread-safe startup of fastMalloc and fastFree#19029

Merged
alalek merged 3 commits intoopencv:masterfrom
diablodale:fix19004-memthreadstart
Dec 8, 2020
Merged

add thread-safe startup of fastMalloc and fastFree#19029
alalek merged 3 commits intoopencv:masterfrom
diablodale:fix19004-memthreadstart

Conversation

@diablodale
Copy link
Copy Markdown
Contributor

Fixes program startup unsafe thread behavior of the core memory allocator fastMalloc() and
deallocator fastFree() from issue #19004 and adds performance test case.

Summary: The code works as expected on below tested platforms. I do not see any
consistent performance gains or losses.

🚨Please note: this PR is only for OpenCV 4 and newer. This fix requires
a C++11 compliant compiler. Do not apply this exact code to the 3.x codebase since
that code might be compiled on an older C++98 compiler. Only C++11 or newer
promises thread-safe initialization of static local variables
https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables.
Such a variable is low cost, will only initialize once, handles exceptions, and
threads block if they try to pass through the same declaration if another thread
is initializing.

Unit functionality testing

It is difficult with the current test harness to unit test this fix in isolation
since the easiest failure would only appear at startup and other failures would
occur randomly seconds, minutes, or hours later. Instead, most OpenCV test cases
will indirectly test this since most allocate memory. If random or widespread
memory violations appear after this PR, it suggests there is an issue that needs
investigation. If anyone has an idea how to reliably unit test this, please contact
me and I'll write the test.

Meanwhile, I have run the core unit tests opencv_test_core. Test was compiled
with code that includes this fix and the performance test case. All unit tests
passed.

[----------] Global test environment tear-down
[ SKIPSTAT ] 40 tests skipped
[ SKIPSTAT ] TAG='mem_6gb' skip 1 tests
[ SKIPSTAT ] TAG='skip_bigdata' skip 1 tests
[ SKIPSTAT ] TAG='skip_other' skip 38 tests
[==========] 11608 tests from 245 test cases ran. (58634 ms total)
[  PASSED  ] 11608 tests.

Performance testing

Code changes of this type have very small changes; often in the low micro or innanoseconds.
In an effort to see any patterns of poor performance, 100000 iterations are run and multiple
samples collected. Code was compiled on two separate computers and the test run afterwards with
run.py . -t core --configuration Release --gtest_filter=*Allocation* --perf_min_samples=50

Absolute results are not useful as they will different on all machines/vm. Relative results will
slightly vary due to variations of resource availability on the test machine/vm at the time of
the test.

  1. i7-3720QM 4C/8T w/ 16GB ram, Win 10, Windows x64 Release build via VS2019 v16.8.2
  2. i7-10875H 8C/16T w/ 32GB ram, Ubuntu 18.04, ELF64 Release build via g++ 7.5.0
    This Ubuntu is running in a Docker within WSL2 running on host Win10.

Code master is commit e726ff3 + the performance test case.
Code fix is above code + this fix.
Summary is generated with summary.py *.xml

Windows build

4 tests with the fix, 4 tests with master

MatType 1-fix 1-master 2-fix 2-master 3-fix 3-master 4-fix 4-master
8UC1 803.41 1108.649 1096.648 1069.492 1030.315 962.789 1071.768 1016.982
16SC1 1049.798 1451.937 1383.221 1377.024 1308.098 1293.132 1413.328 1155.163
8UC3 1231.156 1446.758 1433.7 1202.575 1361.586 1291.419 1217.747 1359.528
8UC4 1308.182 1528.312 1449.753 1276.084 1482.031 1405.05 1292.891 1458.997

image

Ubuntu build

3 tests with fix, 3 tests with master

MatType 1-fix 1-master 2-fix 2-master 3-fix 3-master
8UC1 3.719 3.772 3.715 3.681 3.714 3.643
16SC1 17.87 18.15 17.112 17.043 17.309 17.26
8UC3 17.93 17.894 17.282 17.481 17.357 17.615
8UC4 18.211 18.105 17.709 17.722 17.46 17.806

image

Pull Request Readiness Checklist

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@diablodale
Copy link
Copy Markdown
Contributor Author

I ran 100,000 times the test app in the related issue #15691
There was no crash with the code of this PR.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no crash with the code of this PR.

BTW, without this PR mentioned sample doesn't crash too.
(there is one thread until main() entry)

You can explore generated code for both cases here: https://godbolt.org/z/n1nhe6

@@ -100,25 +100,24 @@ static bool readMemoryAlignmentParameter()
// TODO add checks for valgrind, ASAN if value == false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// should not call fastMalloc() internally

Main assumption is here to avoid recursive deadlock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also holds for this PRs code. C++11 standard declares that reentrancy on the same thread that is in a static local initialization results in undefined behavior.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I didn't use proposed static bool useMemalign = readMemoryAlignmentParameter(); form initially even under #if CV_CXX11 on 3.4 branch.
OpenCV 3.4 supports C++98 but it allows to be compiled in C++11 mode (GCC 7.x+ uses C++14 mode by default).

In OpenCV 3.x there is own cv::String which uses fastMalloc() internally (OpenCV 4.x uses typedef on std::string).
getConfigurationParameterBool() uses cv::String which uses fastAlloc().

Adding scope-based fastMalloc()/fastFree() here causes hang with this PR (on master):

    void* p = fastMalloc(1024);
    fastFree(p);

(triggers undefined behavior of dynamic initialization)


So this comment about assumption was not correct initially...

  • for OpenCV 3.4 now
  • there are plans to implement configuration manager (but probably such manager is not usable in this place if it will use fastMalloc()).

Comment on lines +115 to +117
#if defined __GNUC__
__attribute__((unused))
#endif
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll make the change and spin up tests Monday.

@diablodale
Copy link
Copy Markdown
Contributor Author

There was no crash with the code of this PR.

BTW, without this PR mentioned sample doesn't crash too.
(there is one thread until main() entry)

You can explore generated code for both cases here: https://godbolt.org/z/n1nhe6

Sure. 👍 I'm highlighting there is no regression. That this PR's code also prevents the crashes that appeared in the sample code from that dev.

I would like to have that code as a test case itself and run on every build. Is there some way for the overall test harness to run an arbitrary program 10000 times? I need it to be a separate program, not something merged together like gtest currently does with the test cases in opencv, or how it puts the iterations inside the program instead of outside...I need repeating program startup 🤔

@alalek
Copy link
Copy Markdown
Member

alalek commented Dec 7, 2020

There is no support for such test cases (also there is no goal to add this feature).

@diablodale
Copy link
Copy Markdown
Contributor Author

Sidebar joke 😆 the windows builds are all the same. This PR has no effect on Windows. All the variations we see on Windows are unrelated fluctuations... the master and fix code is all the same.

Why? Because the current OpenCV codebase...on Windows...uses #ifdef to avoid all this code. I included support for Windows but it is not yet enabled due to the #ifdef. A later separate PR which implements native aligned memory on Windows will enable these codepaths and in that PR we can evaluate both functionality/performance.

It is the non-Windows test scenarios that are currently interesting. The reason I needed to setup my Ubuntu 18.04 test harness this weekend.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there easy (and cross-platform) way to detect multiple threads during initialization?

typedef perf::TestBaseWithParam<MatType> MatDepth_tb;

PERF_TEST_P(MatDepth_tb, Allocation_Aligned,
testing::Values(CV_8UC1, CV_16SC1, CV_8UC3, CV_8UC4))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change is harmless from performance perspective: https://godbolt.org/z/n1nhe6

"Fast code path" is the same (with/without patch) so there is no any measurable performance changes expected.
(some difference may appear only on the measurement of the first sample of the first iteration)

Please remove this perf test from test suite (you can put its code into PR's description under <details> ... </details> block).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of a way to "detect multiple threads during initialization". Because what does "during initialization" mean? Initialization is done by the C++ runtime, not user code. So the user code is not aware of where it is in the overall progress of initialization.

I'm uncomfortable in removing the perf test. Here's my thinking:

  1. A custom memory allocator like fastMalloc() is a bedrock critical function...a core feature on top of which other code components are used. I see it used 19 times in the OpenCV 4.5 codebase. And it it used in themost critical places like class Allocator, class Mat, and class UMat. Therefore, I believe it important to always run performance tests on core components and do it isolated.
  2. This specific performance test helps evaluate forthcoming updates. I have pending the native Windows aligned memory PR. I will need this exact same test in the code. And I need to run the same test cases yet this time to alter the envvar OPENCV_ENABLE_MEMALIGN.
  3. Performance testing is not a one-time thing. Performance testing is an ongoing process https://www.google.com/search?q=performance+testing+is+an+ongoing+process

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps leave the perf test cpp, and disable it. Then the dev/testers for the Windows PR, or a fix for the GNULIBC issue is made, they can re-enable it for their testing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and disable it

Agreed.

Comment on lines -115 to -121
// do not use variable directly, details: https://github.com/opencv/opencv/issues/15691
static const bool g_force_initialization_memalign_flag
#if defined __GNUC__
__attribute__((unused))
#endif
= isAlignedAllocationEnabled();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please restore this (to avoid issues like #17617 - it is unlikely, but I prefer to keep it)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will revert and force push the branch again.
I when changes are fixed, I'll also squash so it is all in one commit. It is my understand that is the preferred process.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like wrong commit is reverted (about __attribute__((unused))).

This request is to restore global static initializer g_force_initialization_memalign_flag.

@@ -100,25 +100,24 @@ static bool readMemoryAlignmentParameter()
// TODO add checks for valgrind, ASAN if value == false
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I didn't use proposed static bool useMemalign = readMemoryAlignmentParameter(); form initially even under #if CV_CXX11 on 3.4 branch.
OpenCV 3.4 supports C++98 but it allows to be compiled in C++11 mode (GCC 7.x+ uses C++14 mode by default).

In OpenCV 3.x there is own cv::String which uses fastMalloc() internally (OpenCV 4.x uses typedef on std::string).
getConfigurationParameterBool() uses cv::String which uses fastAlloc().

Adding scope-based fastMalloc()/fastFree() here causes hang with this PR (on master):

    void* p = fastMalloc(1024);
    fastFree(p);

(triggers undefined behavior of dynamic initialization)


So this comment about assumption was not correct initially...

  • for OpenCV 3.4 now
  • there are plans to implement configuration manager (but probably such manager is not usable in this place if it will use fastMalloc()).

@diablodale
Copy link
Copy Markdown
Contributor Author

I'm unclear something I can read above. I see a section where @alalek writes "Actually I didn't use proposed".
I don't understand any part of that post. And (strangely) I can not reply to it. It seems to be part of something else.
???😵

@alalek
Copy link
Copy Markdown
Member

alalek commented Dec 7, 2020

This was just a FYI, why this is not added initially (as part of #15544 to 3.4 branch, under "#if CV_CXX11" macro).
And why this approach may hang (hopefully it hangs "always" without sporadic effects).

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can merge that approach.

Please take a look on comments below.

@@ -0,0 +1,44 @@
#include <array>
#include "perf_precomp.hpp"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perf_precomp.hpp

Please put perf_precomp.hpp first.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll reverse and test.

@@ -0,0 +1,44 @@
#include <array>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add short license header: https://github.com/opencv/opencv/wiki/Coding_Style_Guide

// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.	

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

typedef perf::TestBaseWithParam<MatType> MatDepth_tb;

PERF_TEST_P(MatDepth_tb, Allocation_Aligned,
testing::Values(CV_8UC1, CV_16SC1, CV_8UC3, CV_8UC4))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and disable it

Agreed.

::perf::sz5MP, ::perf::sz2K, ::perf::szSmall128, ::perf::szODD, ::perf::szQVGA, \
::perf::szVGA, ::perf::szSVGA, ::perf::sz720p, ::perf::sz1080p, ::perf::sz2160p, \
::perf::sz4320p, ::perf::sz3MP, ::perf::szXGA, ::perf::szSXGA, ::perf::szWQHD, \
::perf::sznHD, ::perf::szqHD
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, It makes sense to make size as a test parameter and remove "types" (CV_8UC1, CV_16SC1, CV_8UC3, CV_8UC4).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can make this change.
I do want to test all sizes and all types. Unfortunately, the testing harness requires I pass one or more test parameters. And that parameter has a restricted list of possible c++ types. So I chose the CV_Types.

Copy link
Copy Markdown
Contributor Author

@diablodale diablodale Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh. You want that global static variable g_force_initialization_memalign_flag? That doesn't make sense. There is no logic to have it. Its confusing to have it when it does nothing and provides nothing.
If you force me to add it, then I request that I put a comment there that I disagree with it but that you specifically asked for it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And there is no "force" or negative from you, that's my poor choice of words. Better is, "if you require this as part of the PR, then I request...".

Here's my suggestion of the comment // need for this static const is disputed; retaining as it doesn't cause any harm

Comment on lines -115 to -121
// do not use variable directly, details: https://github.com/opencv/opencv/issues/15691
static const bool g_force_initialization_memalign_flag
#if defined __GNUC__
__attribute__((unused))
#endif
= isAlignedAllocationEnabled();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like wrong commit is reverted (about __attribute__((unused))).

This request is to restore global static initializer g_force_initialization_memalign_flag.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@alalek alalek merged commit ad94d8c into opencv:master Dec 8, 2020
@alalek alalek mentioned this pull request Apr 9, 2021
a-sajjad72 pushed a commit to a-sajjad72/opencv that referenced this pull request Mar 30, 2023
add thread-safe startup of fastMalloc and fastFree

* add perf test core memory allocation

* fix threading in isAlignedAllocationEnabled()

* tweaks requested by maintainer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants