Skip to content

core(tls): avoid destruction of TlsAbstraction singleton#20841

Merged
opencv-pushbot merged 1 commit intoopencv:3.4from
alalek:core_keep_TlsAbstraction_singleton_3.4
Oct 8, 2021
Merged

core(tls): avoid destruction of TlsAbstraction singleton#20841
opencv-pushbot merged 1 commit intoopencv:3.4from
alalek:core_keep_TlsAbstraction_singleton_3.4

Conversation

@alalek
Copy link
Copy Markdown
Member

@alalek alalek commented Oct 8, 2021

resolves #20606
replaces #20802

Debug code:

Problem is reproduced through detached but not joined thread: https://github.com/alalek/opencv/blob/debug_20606/apps/version/opencv_version.cpp#L134-L152

NOTE: TSAN and GLIBC provides different behavior of threads termination process.

@alalek alalek mentioned this pull request Oct 8, 2021
3 tasks
@alalek
Copy link
Copy Markdown
Member Author

alalek commented Oct 8, 2021

👍

@opencv-pushbot opencv-pushbot merged commit 17bd9a1 into opencv:3.4 Oct 8, 2021
@alalek alalek mentioned this pull request Oct 8, 2021
@seanm
Copy link
Copy Markdown
Contributor

seanm commented Oct 8, 2021

@alalek I should have time to test this today.

Looking over the code change, one question: in the master branch (where you can use C++11), why limit the use of std::atomic<bool> to inside #if __has_feature(thread_sanitizer)?

@alalek
Copy link
Copy Markdown
Member Author

alalek commented Oct 8, 2021

@seanm Great! Feel free to report related issues to #20606 if any.

std::atomic<bool> is not used for synchronization here, it just to suppress TSAN warning.

Copy link
Copy Markdown
Contributor

@diablodale diablodale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments in some trouble areas. Most are thread safety bugs. There are more threading bugs in the TLS code outside the changes in this PR. So it is important to review all the TLS code to get all the bugs.

#if defined(__clang__) && defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define OPENCV_WITH_THREAD_SANITIZER 1
#include <atomic> // assume C++11
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to test for C++11. Use CV defines like CV_CXX11 or known C++11 compiler flags.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TSAN without C++11 is out of scope.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clang is not the only compiler to support TSan, so does gcc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with (tsan && !c++11) being out of scope. But I'm not ok with assumptions.
Use CV_CXX11 like is already done several lines below to test for C++11. Don't assume.
Using the define is zero cost and definitively good.

{
#ifdef CV_CXX11
static TlsAbstraction* instance = &getTlsAbstraction_();
static TlsAbstraction *g_tls = new TlsAbstraction(); // memory leak is intended here to avoid disposing of TLS container
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always use const when possible, e.g. static TlsAbstraction *const g_tls =

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when possible

fixed: if needed

Compilers are smart enough to handle that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not about compilers. its about CODERS! Using const protects coders from making mistakes. Easy mistakes.

static TlsAbstraction* volatile instance = NULL;
if (instance == NULL)
static TlsAbstraction* volatile g_tls = NULL;
if (g_tls == NULL)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use double-checked locking. it is an anti-pattern and outdated https://en.wikipedia.org/wiki/Double-checked_locking

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wont fix until reproducer (e.g, on godbolt)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With respect, why would you need a reproducer to correct a well-known anti-pattern? Couldn't std::once_flag be used here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to create a fictious repo case for a technique WIDELY documented and declared as BAD.
And...creating repo cases for race conditions is EXTREMELY difficult and a time waste when the bad code is well known.

void* TlsAbstraction::getData_() const
void* TlsAbstraction::getData() const
{
if (disposed)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not thread safe. threading calling this function can be paused after the if test. This another thread leads to dispose=true. When former thread resumes it will fail when making the tls/fls api calls. I discussed this in depth in issuing including threading example.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considered, but performance has more importance here.

This object is cleaned after the main() exit.
It is a bad programming practice to left working threads on this point (but we need to fight with that here)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope! Wrong. If "performance" and unreliable is what you want...then remove all atomics, and mutex.
Performance that crashes is useless.
And you have no evidence or proof that performance will suffer.
This is bad coding.

void TlsAbstraction::setData_(void *pData)
void TlsAbstraction::setData(void *pData)
{
if (disposed)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not thread safe. threading calling this function can be paused after the if test. This another thread leads to dispose=true. When former thread resumes it will fail when making the tls/fls api calls. I discussed this in depth in issuing including threading example.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

void TlsAbstraction::setData_(void *pData)
void TlsAbstraction::setData(void *pData)
{
if (disposed)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not thread safe. threading calling this function can be paused after the if test. This another thread leads to dispose=true. When former thread resumes it will fail when making the tls/fls api calls. I discussed this in depth in issuing including threading example.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

#else // _WIN32
pthread_key_t tlsKey;
#if OPENCV_WITH_THREAD_SANITIZER
std::atomic<bool> disposed;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no benefit. adds useless memory and cpu overhead to this PRs proposed codebase. I already discussed this in issue.
To quiet thread-sanitizer either: 1) fix the race/threading bugs, 2) use the zero-cost thread-sanitizer ignore methods like https://clang.llvm.org/docs/ThreadSanitizer.html#ignorelist

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for TSAN suppression only. #if says about that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no benefit. adds useless memory and cpu overhead to this PRs proposed codebase.

Correctness is preferable to performance. Is this code even performance sensitive?

To quiet thread-sanitizer either: 1) fix the race/threading bugs, 2)...

(1) is the only correct option. If not with std::atomic then some other way.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TSAN warning is just that. A warning. Some condition setup BEFORE the declaration of disposed then was catch doing potentially unsafe things. The key word is BEFORE. When this PR changes all the code in unsafe and new ways, the conditions are changed. I have well documented over and over that disposed being atomic is false hope. It fixes nothing.
Use TSAN supression pragmas as well documented. Zero code.
Making this atomic is false hope, extra memory, extra CPU, extra thread handing.

Copy link
Copy Markdown
Member Author

@alalek alalek Oct 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data race is still here (in cleanup "no-op" code).
Crash possibility is reduced by merged changes.
This patch do not make existed code worse than it was before.


Making this atomic is false hope, extra memory, extra CPU, extra thread handing.

What platform behaves like that? (hint: dump https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free value)
We really should not care about "extra" resources in TSAN builds.


I see data race in TSAN builds only. I still don't have reproducer against normal GLIB run (as requested in issue multiple times).

If you can do better then do that (hopefully proposed patch didn't break performance of really important configurations).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TSAN warning is just that. A warning.

@diablodale No, that's not the case. As I've tried to convey a few times now, 99% of TSan messages are errors, not warnings. They indicate a bug.

TlsStorage() :
tlsSlotsSize(0)
{
(void)getTlsAbstraction(); // ensure singeton initialization (for correct order of atexit calls)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no call to atexit() in this cpp file. What are you trying to achieve or expecting to occur?
The atexit infrastructure requires explicit registration of functions https://en.cppreference.com/w/cpp/utility/program/atexit

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atexit queue, not atexit call (queue handling function is called something like __cxa_atexit)

we ensure here that validity scope of TlsAbstraction is larger than TlsStorage.

@@ -1834,11 +1865,11 @@ static void WINAPI opencv_fls_destructor(void* pData)
#endif // CV_USE_FLS
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also problems in... releaseThread() above near line 16744 where getting abstraction and check for NULL is not thread-safe. You must always load/check in an atomic operation to be thread-safe. Later calls to getData() can also fail by the abstraction being freed after the if test but before getting data. The getData() functions attempt to catch/handle these failures within their own function, but they have bugs within them. And the guard(mtxGlobalAccess) may protect some thread array, but the tls abstraction can again be freed during this for() loop aking the tls->setData(0) within it fail/crash.

Likely more threading issues but this is enough to help rethink TLS

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check for NULL

There is TODO comment left to revert references back for getTlsAbstraction().

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs work. This PR does not make the OpenCV project better. It instead introduces new threading bugs and changing existing behavior. All programs that use OpenCV might now crash because the bugs are not fixed and the previous behavior has changed.
The PR should be retracted and not applied until the race conditions are fixed.

#endif // _WIN32

static TlsAbstraction* const g_force_initialization_of_TlsAbstraction
static TlsStorage* const g_force_initialization_of_TlsStorage
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not thread safe, and no order guaranted for static initialization in the dynamic stage. The call to getTlsStorage() makes this static a dynamic initialization. There is no guarantee of order for dynamic init of statics across translation units. For example, some other code can call getTlsStorage() before this failed "force" is evaluated.

Therefore: a) this static is useless if getTlsStorage() has no threading bugs, or b) this static approach is itself unreliable due to no-guarantee on static init order. The latter can often be fixed in C++11 by using a static within the function.

g_isTlsStorageInitialized is also involved in these thread-safety issues. I suspect when the threading issues are resolved, this var will not be needed in its current form.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's value is not used, so it is not supposed to be thread-safe (in C++11 it is thread-safe anyway).

We just ensure that it is initialized before main() startup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Wrong thinking about main(). And your concern about main is only for the most basic, simple, single-threaded program. That is not thinking broadly and is one of the sources of your coding errors in this PR.
You have to think dynamic libraries, you have to think multiple threads.

I already describe the bug in my text.

not thread safe, and no order guaranted for static initialization in the dynamic stage. The call to getTlsStorage() makes this static a dynamic initialization. There is no guarantee of order for dynamic init of statics across translation units. For example, some other code can call getTlsStorage() before this failed "force" is evaluated.

Therefore: a) this static is useless if getTlsStorage() has no threading bugs, or b) this static approach is itself unreliable due to no-guarantee on static init order. The latter can often be fixed in C++11 by using a static within the function.

@seanm
Copy link
Copy Markdown
Contributor

seanm commented Oct 10, 2021

std::atomic<bool> is not used for synchronization here, it just to suppress TSAN warning.

Do you believe the TSan warning to be a false positive? Like all software, TSan can have bugs of course, but in general TSan does not have false positives. It's not a "warning" like when a compiler tells you int foo = 0;; has an extra semicolon, it's an error message. It's pointing out that the code is incorrect. Fixing the error would be better than just suppressing it.

@diablodale
Copy link
Copy Markdown
Contributor

@seanm I discussed this ad nauseam in the issue with proof. In addition, it may not be possible to reliably shutdown when dealing with ancient outdated OpenCV 3.x before C++11.

And even with C++11 the shutdown scenario probably needs a shutdown manager in which code registered itself...which none of the existing code using OpenCV has and few new programs written would do. Maybe...you, the single squeezy wheel, would call it but you're just not enough people to write all the code needed to manage that.

This whole PR effort is a gigantic waste of time that benefits, so far, only two people. 🤦‍♀️

#ifdef CV_CXX11
static TlsAbstraction* instance = &getTlsAbstraction_();
static TlsAbstraction *g_tls = new TlsAbstraction(); // memory leak is intended here to avoid disposing of TLS container
static TlsAbstractionReleaseGuard g_tlsReleaseGuard(*g_tls);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems useless to me. There is no "guard" or "guarding".
Passing *g_tls does not create ownership or increment any refcount. g_tls can die at any time and then the reference stored within TlsAbstractionReleaseGuard is invalid.

Perhaps you are attempting to get the release of "system resources" to happen before the destruction of TlsAbstraction. Well, that is done by calling tls_.releaseSystemResources(); within the destructor of ~TlsAbstraction. I see no benefit of splitting it out into a separate class and variable. You can do it...sure...but I do not see any change in code behavior. Instead a waste of disk space, memory space, and CPU time.

@seanm
Copy link
Copy Markdown
Contributor

seanm commented Oct 11, 2021

@seanm I discussed this ad nauseam in the issue with proof.

Proof of what? As I said already I agree with you that there are multiple threading bugs here. I agree with (at least most of) your review of this patch too. I agree that merely making the bool atomic is insufficient.

What I don't understand is your view of TSan. Perhaps you could answer directly:

  1. Do you believe the TSan warning to be a false positive? (My impression is that your answer is "yes".)
  2. Do you agree that if TSan is giving an error, that there is a bug? (My impression is that your answer is "no".)

In addition, it may not be possible to reliably shutdown when dealing with ancient outdated OpenCV 3.x before C++11.

Yes, this was discussed already also. Surely not everything needs to be done on both branches, otherwise why have branches? Better to fix only master than neither branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in TlsAbstraction

4 participants