Fix occasional crash during shutdown when explicitly calling ros::start but not ros::shutdown#2355
Conversation
|
@mjcarroll could you PTAL? |
rhaschke
left a comment
There was a problem hiding this comment.
Generally, this change looks reasonable. But I don't understand why you need to set the flag. The destruction order ensured by the compiler should already guarantee that the thread finishes before singleton destruction.
14abf18 to
b547358
Compare
b547358 to
334e290
Compare
|
@rhaschke could you take another look? |
rhaschke
left a comment
There was a problem hiding this comment.
Switching from boolean variables to static objects is probably the only way to correctly fix that issue. But I have one open question (see below).
If you take the effort to write a test, why don't you make a true unit test from it, i.e. adding some assertions?
I guess because of this?
Order of de-initialization used to be:
Now it is
|
|
I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in The ros_comm/clients/roscpp/src/libros/init.cpp Line 322 in 842f0f0 Actually, I would expect the call to |
The GitHub diff view is a little misleading here.
Sorry, I didn't respond to that comment yet. Yes, that is true. With this change,
My understanding was that Looking at the shutdown code in void shutdown()
{
boost::recursive_mutex::scoped_lock lock(g_shutting_down_mutex);
if (g_shutting_down)
return;
else
g_shutting_down = true;
ros::console::shutdown(); // ROSCONSOLE_AUTOINIT is called in ros::init
g_global_queue->disable(); // created in init()
g_global_queue->clear();
if (g_internal_queue_thread.get_id() != boost::this_thread::get_id())
{
g_internal_queue_thread.join();
} // created in start()
//ros::console::deregister_appender(g_rosout_appender);
delete g_rosout_appender; // created in start()
g_rosout_appender = 0;
if (g_started)
{
TopicManager::instance()->shutdown();
ServiceManager::instance()->shutdown();
PollManager::instance()->shutdown();
ConnectionManager::instance()->shutdown();
XMLRPCManager::instance()->shutdown();
}
g_started = false;
g_ok = false;
Time::shutdown(); // ros::Time::init(); is called in start()
} |
the assertions are made in the the missing assignment of |
|
I'll see if I can fix the mismatch between |
Yes, it's a mess and much more of a rabbit hole than I signed up for. I pushed another fix. |
|
oh what have I gotten myself into. This code is such a f-ing mess |
|
Alright. It seems like the current implementation preserves the previous behavior, except for joining the thread before singletons are destructed. I also tested that introducing various bugs does indeed fail the tests. |
|
@rhaschke Thanks a lot for the quick review cycle. would you mind taking another look? |
400900e to
61d3f42
Compare
I'm in a meeting now and then driving home. I will have a look into that later in the evening again. |
|
@rhaschke happy to also jump on a call if you want to talk through the changes here. I opened this PR on a whim, but now that I've spent a full day on it, I'm kind of invested in it. |
|
@rhaschke I also added a more readable explanation to the PR description. |
rhaschke
left a comment
There was a problem hiding this comment.
Looks good. Great to have unit tests for this now!
|
@mjcarroll could you merge this? |
|
@ros/ros_comm-maintainers could someone give this a final review and merge? |
|
This looks good to me, @sloretz do you want to take a look before we merge? |
|
Thank you for the PR and clear description! |
There was a problem hiding this comment.
This test case introduced a dependency on rosbash which provides the rosrun executable. So add
<test_depend>rosbash</test_depend> to package.xml? Or would it be possible to just use a relative path here (./) and make sure that the working directory is set accordingly in test/test_roscpp/test/launch/missing_call_to_shutdown.xml?
There was a problem hiding this comment.
Adding a <test_depend> could work as long as nothing in https://github.com/ros/ros depends on ros/ros_comm, which I think is the case. I'm cautious about hard coding a relative path and setting the test working directory, but that could work too.
Another option to remove the rosbash dependency:
- Remove
${PROJECT_NAME}-missing_call_to_shutdown - Rename
${PROJECT_NAME}-missing_call_to_shutdown_implto${PROJECT_NAME}-missing_call_to_shutdown - Make
test/test_roscpp/test/launch/missing_call_to_shutdown.xmlcalltest_roscpp-missing_call_to_shutdowntwice: once withargs="0"and once withargs="1".
…rt but not ros::shutdown (ros#2355) * Fix occasional crash during shutdown * add link to PR * comment * fix implementation * add missing hasError = true; * also call deInit * only deInit once * only deinit once * yet more fixes * add another test for init only * revert * preserve legacy behavior * add gtest wrapper * minimize code changes * add test * reduce changes even more * add comment * comment (cherry picked from commit 845f746)
…rt but not ros::shutdown (ros#2355) * Fix occasional crash during shutdown * add link to PR * comment * fix implementation * add missing hasError = true; * also call deInit * only deInit once * only deinit once * yet more fixes * add another test for init only * revert * preserve legacy behavior * add gtest wrapper * minimize code changes * add test * reduce changes even more * add comment * comment (cherry picked from commit 845f746) (cherry picked from commit 7d66154)
Summary
In the case that user code calls
ros::startexplicitly, theros::NodeHandleclass will not callros::shutdownwhen the last instance is destroyed.In order to make sure that the program can exit error-free nonetheless, currently
ros::initregisters theros::atexitCallbackcallback to be executed after the user'smainhas finished, which performs clean-up of resources that were set up ininit.However,
atexitCallbackis registered before the various ROS singletons are instantiated, which means that it is called after the singletons have been destroyed.This PR addressed most of the problem, however with one remaining issue:
ros::startcreates a thread, which will continue to execute after singletons have been destroyed and untilros::atexitCallbackis called. So, in rare circumstances, this thread can access singleton objects after they have been destroyed.This PR fixes the issue by ensuring that
ros::shutdownis called aftermainhas returned, but before singletons are destroyed.Detailed Description
We came across a quite rare crash of ROS nodes during shutdown at my company with the following backtrace:
main thread:
second thread:
After digging through the source code, I found that this is indeed a bug in ROS itself, which happens in the following circumstances:
ros::start, but notros::shutdown.main()has endedIn this case, the sequence that leads to the crash is the following:
ros::init(), theros::atexitCallbackis registered to be called aftermain()has ended.ros::start(), Singletons are created and a thread which callsinternalCallbackQueueThreadFuncis started, which calls callbacks fromros::g_internal_callback_queue.TransportPublisherLink::onRetryTimerand pushes this ontoros::g_internal_callback_queue, accessed via ros::getInternalCallbackQueue().Once the user's
main()function exits:internalCallbackQueueThreadFunccontinues to runatexitCallbackis called and callsros::shutdown, which setsg_shutting_downto true, signaling theinternalCallbackQueueThreadFuncto exit, and waits for the thread to join.internalCallbackQueueThreadFuncthread however has already entered the call toqueue->callAvailable, which in turn callsTransportPublisherLink::onRetryTimer.TransportPublisherLink::onRetryTimercallsPollManager::instance(), which returns the address of the destructed singleton objectTransportTCP::connectresults in an exception, since it tries to lock an already-destructed mutex.The fix in this PR works around this problem by making sure that the
internalCallbackQueueThreadFuncthread is ended before singletons are destroyed.It does this by stopping the thread in the destructor of a
InternalQueueJoiningThreadinstance which is declared as a static variable inside of theinitInternalQueueJoiningThreadfunction.This function is called after the singleton instances (which are also static function-scope variables) are created, which means that the compiler must generate code that destroys the
InternalQueueJoiningThreadobject, and thus joins the thread, before the other Singletons are destroyed.Note that, as mentioned above, init.cpp registers
atexitCallbackspecifically for the case thatros::shutdownwas not called by the user code. However, it fails to address the issue described above since it is called after singletons have been destroyed but before theinternalCallbackQueueThreadFuncthread has ended.This was meant to be addressed by this PR,, but the fix was incomplete.