core(tls): avoid destruction of TlsAbstraction singleton#20841
core(tls): avoid destruction of TlsAbstraction singleton#20841opencv-pushbot merged 1 commit intoopencv:3.4from
Conversation
|
👍 |
|
@alalek I should have time to test this today. Looking over the code change, one question: in the master branch (where you can use C++11), why limit the use of |
diablodale
left a comment
There was a problem hiding this comment.
Added comments in some trouble areas. Most are thread safety bugs. There are more threading bugs in the TLS code outside the changes in this PR. So it is important to review all the TLS code to get all the bugs.
| #if defined(__clang__) && defined(__has_feature) | ||
| #if __has_feature(thread_sanitizer) | ||
| #define OPENCV_WITH_THREAD_SANITIZER 1 | ||
| #include <atomic> // assume C++11 |
There was a problem hiding this comment.
Need to test for C++11. Use CV defines like CV_CXX11 or known C++11 compiler flags.
There was a problem hiding this comment.
TSAN without C++11 is out of scope.
There was a problem hiding this comment.
Clang is not the only compiler to support TSan, so does gcc.
There was a problem hiding this comment.
I am ok with (tsan && !c++11) being out of scope. But I'm not ok with assumptions.
Use CV_CXX11 like is already done several lines below to test for C++11. Don't assume.
Using the define is zero cost and definitively good.
| { | ||
| #ifdef CV_CXX11 | ||
| static TlsAbstraction* instance = &getTlsAbstraction_(); | ||
| static TlsAbstraction *g_tls = new TlsAbstraction(); // memory leak is intended here to avoid disposing of TLS container |
There was a problem hiding this comment.
always use const when possible, e.g. static TlsAbstraction *const g_tls =
There was a problem hiding this comment.
when possible
fixed: if needed
Compilers are smart enough to handle that.
There was a problem hiding this comment.
Its not about compilers. its about CODERS! Using const protects coders from making mistakes. Easy mistakes.
| static TlsAbstraction* volatile instance = NULL; | ||
| if (instance == NULL) | ||
| static TlsAbstraction* volatile g_tls = NULL; | ||
| if (g_tls == NULL) |
There was a problem hiding this comment.
don't use double-checked locking. it is an anti-pattern and outdated https://en.wikipedia.org/wiki/Double-checked_locking
There was a problem hiding this comment.
There was a problem hiding this comment.
There was a problem hiding this comment.
wont fix until reproducer (e.g, on godbolt)
There was a problem hiding this comment.
With respect, why would you need a reproducer to correct a well-known anti-pattern? Couldn't std::once_flag be used here?
There was a problem hiding this comment.
I'm not going to create a fictious repo case for a technique WIDELY documented and declared as BAD.
And...creating repo cases for race conditions is EXTREMELY difficult and a time waste when the bad code is well known.
| void* TlsAbstraction::getData_() const | ||
| void* TlsAbstraction::getData() const | ||
| { | ||
| if (disposed) |
There was a problem hiding this comment.
Not thread safe. threading calling this function can be paused after the if test. This another thread leads to dispose=true. When former thread resumes it will fail when making the tls/fls api calls. I discussed this in depth in issuing including threading example.
There was a problem hiding this comment.
Considered, but performance has more importance here.
This object is cleaned after the main() exit.
It is a bad programming practice to left working threads on this point (but we need to fight with that here)
There was a problem hiding this comment.
Nope! Wrong. If "performance" and unreliable is what you want...then remove all atomics, and mutex.
Performance that crashes is useless.
And you have no evidence or proof that performance will suffer.
This is bad coding.
| void TlsAbstraction::setData_(void *pData) | ||
| void TlsAbstraction::setData(void *pData) | ||
| { | ||
| if (disposed) |
There was a problem hiding this comment.
Not thread safe. threading calling this function can be paused after the if test. This another thread leads to dispose=true. When former thread resumes it will fail when making the tls/fls api calls. I discussed this in depth in issuing including threading example.
| void TlsAbstraction::setData_(void *pData) | ||
| void TlsAbstraction::setData(void *pData) | ||
| { | ||
| if (disposed) |
There was a problem hiding this comment.
Not thread safe. threading calling this function can be paused after the if test. This another thread leads to dispose=true. When former thread resumes it will fail when making the tls/fls api calls. I discussed this in depth in issuing including threading example.
| #else // _WIN32 | ||
| pthread_key_t tlsKey; | ||
| #if OPENCV_WITH_THREAD_SANITIZER | ||
| std::atomic<bool> disposed; |
There was a problem hiding this comment.
no benefit. adds useless memory and cpu overhead to this PRs proposed codebase. I already discussed this in issue.
To quiet thread-sanitizer either: 1) fix the race/threading bugs, 2) use the zero-cost thread-sanitizer ignore methods like https://clang.llvm.org/docs/ThreadSanitizer.html#ignorelist
There was a problem hiding this comment.
It is for TSAN suppression only. #if says about that.
There was a problem hiding this comment.
no benefit. adds useless memory and cpu overhead to this PRs proposed codebase.
Correctness is preferable to performance. Is this code even performance sensitive?
To quiet thread-sanitizer either: 1) fix the race/threading bugs, 2)...
(1) is the only correct option. If not with std::atomic then some other way.
There was a problem hiding this comment.
A TSAN warning is just that. A warning. Some condition setup BEFORE the declaration of disposed then was catch doing potentially unsafe things. The key word is BEFORE. When this PR changes all the code in unsafe and new ways, the conditions are changed. I have well documented over and over that disposed being atomic is false hope. It fixes nothing.
Use TSAN supression pragmas as well documented. Zero code.
Making this atomic is false hope, extra memory, extra CPU, extra thread handing.
There was a problem hiding this comment.
Data race is still here (in cleanup "no-op" code).
Crash possibility is reduced by merged changes.
This patch do not make existed code worse than it was before.
Making this atomic is false hope, extra memory, extra CPU, extra thread handing.
What platform behaves like that? (hint: dump https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free value)
We really should not care about "extra" resources in TSAN builds.
I see data race in TSAN builds only. I still don't have reproducer against normal GLIB run (as requested in issue multiple times).
If you can do better then do that (hopefully proposed patch didn't break performance of really important configurations).
There was a problem hiding this comment.
A TSAN warning is just that. A warning.
@diablodale No, that's not the case. As I've tried to convey a few times now, 99% of TSan messages are errors, not warnings. They indicate a bug.
| TlsStorage() : | ||
| tlsSlotsSize(0) | ||
| { | ||
| (void)getTlsAbstraction(); // ensure singeton initialization (for correct order of atexit calls) |
There was a problem hiding this comment.
I see no call to atexit() in this cpp file. What are you trying to achieve or expecting to occur?
The atexit infrastructure requires explicit registration of functions https://en.cppreference.com/w/cpp/utility/program/atexit
There was a problem hiding this comment.
atexit queue, not atexit call (queue handling function is called something like __cxa_atexit)
we ensure here that validity scope of TlsAbstraction is larger than TlsStorage.
| @@ -1834,11 +1865,11 @@ static void WINAPI opencv_fls_destructor(void* pData) | |||
| #endif // CV_USE_FLS | |||
There was a problem hiding this comment.
also problems in... releaseThread() above near line 16744 where getting abstraction and check for NULL is not thread-safe. You must always load/check in an atomic operation to be thread-safe. Later calls to getData() can also fail by the abstraction being freed after the if test but before getting data. The getData() functions attempt to catch/handle these failures within their own function, but they have bugs within them. And the guard(mtxGlobalAccess) may protect some thread array, but the tls abstraction can again be freed during this for() loop aking the tls->setData(0) within it fail/crash.
Likely more threading issues but this is enough to help rethink TLS
There was a problem hiding this comment.
check for NULL
There is TODO comment left to revert references back for getTlsAbstraction().
There was a problem hiding this comment.
needs work. This PR does not make the OpenCV project better. It instead introduces new threading bugs and changing existing behavior. All programs that use OpenCV might now crash because the bugs are not fixed and the previous behavior has changed.
The PR should be retracted and not applied until the race conditions are fixed.
| #endif // _WIN32 | ||
|
|
||
| static TlsAbstraction* const g_force_initialization_of_TlsAbstraction | ||
| static TlsStorage* const g_force_initialization_of_TlsStorage |
There was a problem hiding this comment.
not thread safe, and no order guaranted for static initialization in the dynamic stage. The call to getTlsStorage() makes this static a dynamic initialization. There is no guarantee of order for dynamic init of statics across translation units. For example, some other code can call getTlsStorage() before this failed "force" is evaluated.
Therefore: a) this static is useless if getTlsStorage() has no threading bugs, or b) this static approach is itself unreliable due to no-guarantee on static init order. The latter can often be fixed in C++11 by using a static within the function.
g_isTlsStorageInitialized is also involved in these thread-safety issues. I suspect when the threading issues are resolved, this var will not be needed in its current form.
There was a problem hiding this comment.
It's value is not used, so it is not supposed to be thread-safe (in C++11 it is thread-safe anyway).
We just ensure that it is initialized before main() startup.
There was a problem hiding this comment.
Nope. Wrong thinking about main(). And your concern about main is only for the most basic, simple, single-threaded program. That is not thinking broadly and is one of the sources of your coding errors in this PR.
You have to think dynamic libraries, you have to think multiple threads.
I already describe the bug in my text.
not thread safe, and no order guaranted for static initialization in the dynamic stage. The call to getTlsStorage() makes this static a dynamic initialization. There is no guarantee of order for dynamic init of statics across translation units. For example, some other code can call getTlsStorage() before this failed "force" is evaluated.
Therefore: a) this static is useless if getTlsStorage() has no threading bugs, or b) this static approach is itself unreliable due to no-guarantee on static init order. The latter can often be fixed in C++11 by using a static within the function.
Do you believe the TSan warning to be a false positive? Like all software, TSan can have bugs of course, but in general TSan does not have false positives. It's not a "warning" like when a compiler tells you |
|
@seanm I discussed this ad nauseam in the issue with proof. In addition, it may not be possible to reliably shutdown when dealing with ancient outdated OpenCV 3.x before C++11. And even with C++11 the shutdown scenario probably needs a shutdown manager in which code registered itself...which none of the existing code using OpenCV has and few new programs written would do. Maybe...you, the single squeezy wheel, would call it but you're just not enough people to write all the code needed to manage that. This whole PR effort is a gigantic waste of time that benefits, so far, only two people. 🤦♀️ |
| #ifdef CV_CXX11 | ||
| static TlsAbstraction* instance = &getTlsAbstraction_(); | ||
| static TlsAbstraction *g_tls = new TlsAbstraction(); // memory leak is intended here to avoid disposing of TLS container | ||
| static TlsAbstractionReleaseGuard g_tlsReleaseGuard(*g_tls); |
There was a problem hiding this comment.
This also seems useless to me. There is no "guard" or "guarding".
Passing *g_tls does not create ownership or increment any refcount. g_tls can die at any time and then the reference stored within TlsAbstractionReleaseGuard is invalid.
Perhaps you are attempting to get the release of "system resources" to happen before the destruction of TlsAbstraction. Well, that is done by calling tls_.releaseSystemResources(); within the destructor of ~TlsAbstraction. I see no benefit of splitting it out into a separate class and variable. You can do it...sure...but I do not see any change in code behavior. Instead a waste of disk space, memory space, and CPU time.
Proof of what? As I said already I agree with you that there are multiple threading bugs here. I agree with (at least most of) your review of this patch too. I agree that merely making the bool atomic is insufficient. What I don't understand is your view of TSan. Perhaps you could answer directly:
Yes, this was discussed already also. Surely not everything needs to be done on both branches, otherwise why have branches? Better to fix only master than neither branch. |
resolves #20606
replaces #20802
Debug code:
(put
#if 0beforestd::atomic<bool> disposedto trigger TSAN message)Problem is reproduced through detached but not joined thread: https://github.com/alalek/opencv/blob/debug_20606/apps/version/opencv_version.cpp#L134-L152
NOTE: TSAN and GLIBC provides different behavior of threads termination process.