Add recursion / thread interaction guard in LogInfoForFatalError#46561
Add recursion / thread interaction guard in LogInfoForFatalError#46561trylek merged 5 commits intodotnet:masterfrom
Conversation
src/coreclr/vm/eepolicy.cpp
Outdated
There was a problem hiding this comment.
This should be user-centric error message, without references to internal implementation details like LogInfoForFatalError
There was a problem hiding this comment.
Thanks Jan for the suggestion. I have reformulated it to Repeated fatal crash, please let me know if you have additional suggestions on improving the message.
There was a problem hiding this comment.
I would slightly prefer "Fatal error while logging another fatal error" or "Fatal error while logging fatal error". @jkotas what do you prefer?
There was a problem hiding this comment.
I agree with @janvorli suggestion if we still need error message after addressing the feedback,
src/coreclr/vm/eepolicy.cpp
Outdated
There was a problem hiding this comment.
Can we end up with Repeated fatal crash skipped to avoid infinite recursion and interaction between crashing threads. printed by this thread, and nothing printed by the primary crashing thread here?
Do we need to wait for the first thread to finish logging the error here, similar to what the stack overflow logging does?
There was a problem hiding this comment.
I have further refined the logging and locking and I've created a simple unit test for the parallel crash case. I'm not sure whether there's any easy way to unit test the recursive case, I guess that would require somehow artificially crash the stack unwinder or something. Can you please take another look when you have a chance?
|
It may be useful to test this with artificially induced crash to verify that it really works as intended. |
There was a problem hiding this comment.
Interlocked.Increment should be used to avoid race condition.
There was a problem hiding this comment.
Also I would move the increment below the string.Format call.
There was a problem hiding this comment.
I have changed the increment operator to Interlocked.Increment in 4th commit but I left it above the string.Format call because I need the value returned from the method as the thread ID to print.
There was a problem hiding this comment.
Copy & paste - removed in 4th commit, thanks for spotting.
There was a problem hiding this comment.
Since the threadMap array is not used anywhere, you could simplify to:
for (int threadIndex = ThreadCount; --threadIndex >= 0; )
{
new Thread(CrashInParallel).Start();
}There was a problem hiding this comment.
Good point, thanks, removed in 4th commit.
src/coreclr/vm/eepolicy.cpp
Outdated
There was a problem hiding this comment.
I would remove this message, it would just mess up the logged error details below.
src/coreclr/vm/eepolicy.cpp
Outdated
There was a problem hiding this comment.
We don't want to sleep forever. If this was the main thread, the application would hang. So we'll need some special value in the s_crashingThread or an extra static local to indicate that the logging has completed. If you prefer using the special value in s_crashingThread, then since the thread id can be anything, I would switch to using the Thread* instead (the result from GetCurrentThread) and use InterlockedCompareExchangeT. The special value can then be either some value that cannot be a pointer like "1", the original value with the lowest bit set to 1 or something along these lines.
I would also use much smaller sleep value so that the process exit is not visibly delayed once the logging is done. The stack overflow uses Sleep(50), so it would be nice to use the same wait here.
There was a problem hiding this comment.
Fixed in 4th commit, thanks for the detailed explanation.
src/coreclr/vm/eepolicy.cpp
Outdated
There was a problem hiding this comment.
Can you please change the message based on one of my suggestion?
There was a problem hiding this comment.
It would be good to also have another test to exercise the case when the failfast comes from several secondary threads.
There was a problem hiding this comment.
Well, as of the 3rd commit that's exactly what the test does - it hangs on the main thread and crashes all the worker threads in parallel. I have added two more variations - one to just crash the main thread and another to crash both the main thread and the worker threads. Please let me know if you believe we need to further improve the test coverage.
Based on local instrumentation I introduced while investigating issues related to the switch-over to compile System.Private.CoreLib using Crossgen2 I'm proposing to harden fatal exception handling against nested and concurrent fatal error occurrences. Thanks Tomas
(*) Replaced infinite wait in eepolicy with just a wait until the fatal error logging is done and reduced the wait quantum to 50 msecs; (*) Removed error message for the secondary threads entering the fatal error logging routine and reformulated the primary error per JanV's feedback; (*) Added the use of Interlocked.Increment and cleaned up the unit test per Anton's PR feedback; (*) Created three flavors of the test to (a) crash on the main thread, (b) crash on the worker threads, (c) both per JanV's PR feedback. Thanks Tomas
bf438a2 to
122452e
Compare
Based on local instrumentation I introduced while investigating
issues related to the switch-over to compile System.Private.CoreLib
using Crossgen2 I'm proposing to harden fatal exception handling
against nested and concurrent fatal error occurrences.
Thanks
Tomas