Skip to content

Add recursion / thread interaction guard in LogInfoForFatalError#46561

Merged
trylek merged 5 commits intodotnet:masterfrom
trylek:EHRecursionGuard
Jan 10, 2021
Merged

Add recursion / thread interaction guard in LogInfoForFatalError#46561
trylek merged 5 commits intodotnet:masterfrom
trylek:EHRecursionGuard

Conversation

@trylek
Copy link
Member

@trylek trylek commented Jan 5, 2021

Based on local instrumentation I introduced while investigating
issues related to the switch-over to compile System.Private.CoreLib
using Crossgen2 I'm proposing to harden fatal exception handling
against nested and concurrent fatal error occurrences.

Thanks

Tomas

@trylek trylek added this to the 6.0.0 milestone Jan 5, 2021
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be user-centric error message, without references to internal implementation details like LogInfoForFatalError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jan for the suggestion. I have reformulated it to Repeated fatal crash, please let me know if you have additional suggestions on improving the message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would slightly prefer "Fatal error while logging another fatal error" or "Fatal error while logging fatal error". @jkotas what do you prefer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @janvorli suggestion if we still need error message after addressing the feedback,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we end up with Repeated fatal crash skipped to avoid infinite recursion and interaction between crashing threads. printed by this thread, and nothing printed by the primary crashing thread here?

Do we need to wait for the first thread to finish logging the error here, similar to what the stack overflow logging does?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have further refined the logging and locking and I've created a simple unit test for the parallel crash case. I'm not sure whether there's any easy way to unit test the recursive case, I guess that would require somehow artificially crash the stack unwinder or something. Can you please take another look when you have a chance?

@jkotas
Copy link
Member

jkotas commented Jan 5, 2021

It may be useful to test this with artificially induced crash to verify that it really works as intended.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interlocked.Increment should be used to avoid race condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I would move the increment below the string.Format call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed the increment operator to Interlocked.Increment in 4th commit but I left it above the string.Format call because I need the value returned from the method as the thread ID to print.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy & paste - removed in 4th commit, thanks for spotting.

Comment on lines 21 to 38
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the threadMap array is not used anywhere, you could simplify to:

        for (int threadIndex = ThreadCount; --threadIndex >= 0; )
        {
            new Thread(CrashInParallel).Start();
        }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, thanks, removed in 4th commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this message, it would just mess up the logged error details below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 4th commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to sleep forever. If this was the main thread, the application would hang. So we'll need some special value in the s_crashingThread or an extra static local to indicate that the logging has completed. If you prefer using the special value in s_crashingThread, then since the thread id can be anything, I would switch to using the Thread* instead (the result from GetCurrentThread) and use InterlockedCompareExchangeT. The special value can then be either some value that cannot be a pointer like "1", the original value with the lowest bit set to 1 or something along these lines.

I would also use much smaller sleep value so that the process exit is not visibly delayed once the logging is done. The stack overflow uses Sleep(50), so it would be nice to use the same wait here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4th commit, thanks for the detailed explanation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please change the message based on one of my suggestion?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4th commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to also have another test to exercise the case when the failfast comes from several secondary threads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, as of the 3rd commit that's exactly what the test does - it hangs on the main thread and crashes all the worker threads in parallel. I have added two more variations - one to just crash the main thread and another to crash both the main thread and the worker threads. Please let me know if you believe we need to further improve the test coverage.

trylek added 4 commits January 8, 2021 20:06
Based on local instrumentation I introduced while investigating
issues related to the switch-over to compile System.Private.CoreLib
using Crossgen2 I'm proposing to harden fatal exception handling
against nested and concurrent fatal error occurrences.

Thanks

Tomas
(*) Replaced infinite wait in eepolicy with just a wait until the
fatal error logging is done and reduced the wait quantum to 50 msecs;

(*) Removed error message for the secondary threads entering the
fatal error logging routine and reformulated the primary error
per JanV's feedback;

(*) Added the use of Interlocked.Increment and cleaned up the
unit test per Anton's PR feedback;

(*) Created three flavors of the test to (a) crash on the main thread,
(b) crash on the worker threads, (c) both per JanV's PR feedback.

Thanks

Tomas
@trylek trylek merged commit af1e7ee into dotnet:master Jan 10, 2021
@trylek trylek deleted the EHRecursionGuard branch January 10, 2021 21:06
@ghost ghost locked as resolved and limited conversation to collaborators Feb 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants