Skip to content
This repository was archived by the owner on May 31, 2025. It is now read-only.

Fix occasional crash during shutdown when explicitly calling ros::start but not ros::shutdown#2355

Merged
sloretz merged 18 commits intoros:noetic-develfrom
dgossow:noetic-devel
Nov 21, 2023
Merged

Fix occasional crash during shutdown when explicitly calling ros::start but not ros::shutdown#2355
sloretz merged 18 commits intoros:noetic-develfrom
dgossow:noetic-devel

Conversation

@dgossow
Copy link
Copy Markdown
Contributor

@dgossow dgossow commented Oct 6, 2023

Summary

In the case that user code calls ros::start explicitly, the ros::NodeHandle class will not call ros::shutdown when the last instance is destroyed.

In order to make sure that the program can exit error-free nonetheless, currently ros::init registers the ros::atexitCallback callback to be executed after the user's main has finished, which performs clean-up of resources that were set up in init.

However, atexitCallback is registered before the various ROS singletons are instantiated, which means that it is called after the singletons have been destroyed.

This PR addressed most of the problem, however with one remaining issue: ros::start creates a thread, which will continue to execute after singletons have been destroyed and until ros::atexitCallback is called. So, in rare circumstances, this thread can access singleton objects after they have been destroyed.

This PR fixes the issue by ensuring that ros::shutdown is called after main has returned, but before singletons are destroyed.

Detailed Description

We came across a quite rare crash of ROS nodes during shutdown at my company with the following backtrace:

main thread:

#0  0x00007f0076baead3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x29708d8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x00007f0076baead3 in __pthread_cond_wait_common (abstime=0x0, mutex=0x2970888, cond=0x29708b0) at pthread_cond_wait.c:502
#2  0x00007f0076baead3 in __pthread_cond_wait (cond=0x29708b0, mutex=0x2970888) at pthread_cond_wait.c:655
#3  0x00007f00773f1b5d in boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) () at /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1
#4  0x00007f00773e9d14 in boost::thread::join_noexcept() () at /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1
#5  0x00007f00ae60784d in ros::shutdown() () at /opt/ros/melodic/lib/libroscpp.so
#6  0x00007f00ae607a49 in ros::atexitCallback() () at /opt/ros/melodic/lib/libroscpp.so
#7  0x00007f007362f031 in __run_exit_handlers (status=0, listp=0x7f00739d7718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#8  0x00007f007362f12a in __GI_exit (status=<optimized out>) at exit.c:139
#9  0x00007f007360dc8e in __libc_start_main (main=0x419a90 <main(int, char**)>, argc=27, argv=0x7ffdf94ccdc8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdf94ccdb8) at ../csu/libc-start.c:344
#10 0x00000000004176ca in _start () at /usr/include/boost/asio/ssl/impl/error.ipp:89

second thread:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::lock_error> >'
  what():  boost: mutex lock failed in pthread_mutex_lock: Invalid argument
Stack trace (most recent call last) in thread 245:
#18   Object "", at 0xffffffffffffffff, in 
#17   Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fae78dc561e, in clone
#16   Object "/lib/x86_64-linux-gnu/libpthread.so.0", at 0x7fae7c25f6da, in start_thread
#15   Object "/usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1", at 0x7fae7caa0bcc, in boost::this_thread::interruption_requested()
#14   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cbdef0, in ros::internalCallbackQueueThreadFunc()
#13   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3c7e2fa, in ros::CallbackQueue::callAvailable(ros::WallDuration)
#12   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3c7c558, in ros::CallbackQueue::callOneCB(ros::CallbackQueue::TLS*)
#11   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cb267a, in ros::TimerManager<ros::SteadyTime, ros::WallDuration, ros::SteadyTimerEvent>::TimerQueueCallback::call()
#10   Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cad644, in ros::TransportPublisherLink::onRetryTimer(ros::SteadyTimerEvent const&)
#9    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3ca32c2, in ros::TransportTCP::connect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
#8    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3ca1e34, in ros::TransportTCP::initializeSocket()
#7    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3cdbb31, in ros::PollSet::addSocket(int, boost::function<void (int)> const&, boost::shared_ptr<ros::Transport> const&)
#6    Object "/opt/ros/melodic/lib/libroscpp.so", at 0x7faeb3c2a368, in void boost::throw_exception<boost::lock_error>(boost::lock_error const&)
#5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796ddd53, in __cxa_throw
#4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796ddb20, in std::terminate()
#3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796ddae5, in std::rethrow_exception(std::__exception_ptr::exception_ptr)
#2    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", at 0x7fae796d7956, in 
#1    Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fae78ce47f0, in abort
#0    Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7fae78ce2e87, in gsignal
Aborted (Signal sent by tkill() 231 0)

After digging through the source code, I found that this is indeed a bug in ROS itself, which happens in the following circumstances:

  • The user code calls ros::start, but not ros::shutdown.
  • There is some TCP transport publisher link which is dropped and happens to try to reconnect after the program's main() has ended

In this case, the sequence that leads to the crash is the following:

Once the user's main() function exits:

The fix in this PR works around this problem by making sure that the internalCallbackQueueThreadFunc thread is ended before singletons are destroyed.

It does this by stopping the thread in the destructor of a InternalQueueJoiningThread instance which is declared as a static variable inside of the initInternalQueueJoiningThread function.

This function is called after the singleton instances (which are also static function-scope variables) are created, which means that the compiler must generate code that destroys the InternalQueueJoiningThread object, and thus joins the thread, before the other Singletons are destroyed.

Note that, as mentioned above, init.cpp registers atexitCallback specifically for the case that ros::shutdown was not called by the user code. However, it fails to address the issue described above since it is called after singletons have been destroyed but before the internalCallbackQueueThreadFunc thread has ended.

This was meant to be addressed by this PR,, but the fix was incomplete.

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 6, 2023

@mjcarroll could you PTAL?

Copy link
Copy Markdown
Contributor

@rhaschke rhaschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this change looks reasonable. But I don't understand why you need to set the flag. The destruction order ensured by the compiler should already guarantee that the thread finishes before singleton destruction.

@david-vay david-vay force-pushed the noetic-devel branch 4 times, most recently from 14abf18 to b547358 Compare October 23, 2023 10:20
@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

@rhaschke could you take another look?

Copy link
Copy Markdown
Contributor

@rhaschke rhaschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching from boolean variables to static objects is probably the only way to correctly fix that issue. But I have one open question (see below).
If you take the effort to write a test, why don't you make a true unit test from it, i.e. adding some assertions?

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

The destruction order ensured by the compiler should already guarantee that the thread finishes before singleton destruction.

I guess because of this?

d) For each local object obj with static storage duration, obj is destroyed as if a function calling the destructor of obj were registered with std::atexit at the completion of the constructor of obj.

c) If the completion of the initialization of a static object A was sequenced-before the call to std::atexit for some function F, the call to F during termination is sequenced-before the start of the destruction of A

Order of de-initialization used to be:

  • singleton destructors
  • atexitCallback
  • g_internal_queue_thread destructor

Now it is

  • ShutdownCaller destructor
  • singleton destructors
  • g_internal_queue_thread destructor

@rhaschke
Copy link
Copy Markdown
Contributor

I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in ros::init() at all anymore.
I would expect that you replace the atexit registration with ShutdownCaller instantiation, i.e. calling initShutdownCaller(), but you just removed the atexit registration w/o replacing it with something else.

The ShutdownCaller is only initialized in shutdownCallback(), which is only called on an external shutdown request:

XMLRPCManager::instance()->bind("shutdown", shutdownCallback);

Actually, I would expect the call to initShutdownCaller() only in ros::init() and not in the latter callback.

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

The ShutdownCaller is only initialized in shutdownCallback(), which is only called on an external shutdown request:

The GitHub diff view is a little misleading here. initShutdownCaller is now called in void start() and not shutdownCallback`

I'm afraid we're talking past each other. My point is that the shutdown handler isn't registered in ros::init() at all anymore.

Sorry, I didn't respond to that comment yet.

Yes, that is true. With this change, ros::init does not register any type of exit handler anymore. Only ros::start does.

atexitHandler used to call ros::shutdown (with g_started set to false).

My understanding was that shutdown only has to be called if start has been called. But you are right, some of the resources it frees are allocated in init().

Looking at the shutdown code in ros::shutdown:

void shutdown()
{
  boost::recursive_mutex::scoped_lock lock(g_shutting_down_mutex);
  if (g_shutting_down)
    return;
  else
    g_shutting_down = true;

  ros::console::shutdown(); // ROSCONSOLE_AUTOINIT is called in ros::init

  g_global_queue->disable(); // created in init()
  g_global_queue->clear();

  if (g_internal_queue_thread.get_id() != boost::this_thread::get_id())
  {
    g_internal_queue_thread.join();
  } // created in start()

  //ros::console::deregister_appender(g_rosout_appender);
  delete g_rosout_appender; // created in start()
  g_rosout_appender = 0;

  if (g_started)
  {
    TopicManager::instance()->shutdown();
    ServiceManager::instance()->shutdown();
    PollManager::instance()->shutdown();
    ConnectionManager::instance()->shutdown();
    XMLRPCManager::instance()->shutdown();
  }

  g_started = false;
  g_ok = false;
  Time::shutdown(); // ros::Time::init(); is called in start()
}

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

If you take the effort to write a test, why don't you make a true unit test from it, i.e. adding some assertions?

the assertions are made in the atexit callback, this is why gtest is not used, if this was your question.

the missing assignment of hasError was a bug, it's fixed now

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

I'll see if I can fix the mismatch between init, start and shutdown

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

Switching from boolean variables to static objects is probably the only way to correctly fix that issue.

Yes, it's a mess and much more of a rabbit hole than I signed up for.

I pushed another fix.

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

oh what have I gotten myself into. This code is such a f-ing mess

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

Alright.

It seems like the current implementation preserves the previous behavior, except for joining the thread before singletons are destructed.

I also tested that introducing various bugs does indeed fail the tests.

@dgossow dgossow requested a review from rhaschke October 23, 2023 14:29
@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

@rhaschke Thanks a lot for the quick review cycle. would you mind taking another look?

@rhaschke
Copy link
Copy Markdown
Contributor

would you mind taking another look?

I'm in a meeting now and then driving home. I will have a look into that later in the evening again.

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 23, 2023

@rhaschke happy to also jump on a call if you want to talk through the changes here.

I opened this PR on a whim, but now that I've spent a full day on it, I'm kind of invested in it.

@dgossow dgossow changed the title Fix occasional crash during shutdown Fix occasional crash during shutdown when using ros::start Oct 24, 2023
@dgossow dgossow changed the title Fix occasional crash during shutdown when using ros::start Fix occasional crash during shutdown when explicitly calling ros::start but not ros::shutdown Oct 24, 2023
@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 24, 2023

@rhaschke I also added a more readable explanation to the PR description.

Copy link
Copy Markdown
Contributor

@rhaschke rhaschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Great to have unit tests for this now!

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Oct 24, 2023

@mjcarroll could you merge this?

@dgossow
Copy link
Copy Markdown
Contributor Author

dgossow commented Nov 20, 2023

@ros/ros_comm-maintainers could someone give this a final review and merge?

@mjcarroll
Copy link
Copy Markdown
Member

This looks good to me, @sloretz do you want to take a look before we merge?

@sloretz
Copy link
Copy Markdown
Contributor

sloretz commented Nov 21, 2023

Thank you for the PR and clear description!

@sloretz sloretz merged commit 845f746 into ros:noetic-devel Nov 21, 2023
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case introduced a dependency on rosbash which provides the rosrun executable. So add
<test_depend>rosbash</test_depend> to package.xml? Or would it be possible to just use a relative path here (./) and make sure that the working directory is set accordingly in test/test_roscpp/test/launch/missing_call_to_shutdown.xml?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a <test_depend> could work as long as nothing in https://github.com/ros/ros depends on ros/ros_comm, which I think is the case. I'm cautious about hard coding a relative path and setting the test working directory, but that could work too.

Another option to remove the rosbash dependency:

  1. Remove ${PROJECT_NAME}-missing_call_to_shutdown
  2. Rename ${PROJECT_NAME}-missing_call_to_shutdown_impl to ${PROJECT_NAME}-missing_call_to_shutdown
  3. Make test/test_roscpp/test/launch/missing_call_to_shutdown.xml call test_roscpp-missing_call_to_shutdown twice: once with args="0" and once with args="1".

AadityaRavindran pushed a commit to locusrobotics/ros_comm that referenced this pull request Apr 4, 2025
…rt but not ros::shutdown (ros#2355)

* Fix occasional crash during shutdown

* add link to PR

* comment

* fix implementation

* add missing hasError = true;

* also call deInit

* only deInit once

* only deinit once

* yet more fixes

* add another test for init only

* revert

* preserve legacy behavior

* add gtest wrapper

* minimize code changes

* add test

* reduce changes even more

* add comment

* comment

(cherry picked from commit 845f746)
AadityaRavindran pushed a commit to locusrobotics/ros_comm that referenced this pull request Apr 4, 2025
…rt but not ros::shutdown (ros#2355)

* Fix occasional crash during shutdown

* add link to PR

* comment

* fix implementation

* add missing hasError = true;

* also call deInit

* only deInit once

* only deinit once

* yet more fixes

* add another test for init only

* revert

* preserve legacy behavior

* add gtest wrapper

* minimize code changes

* add test

* reduce changes even more

* add comment

* comment

(cherry picked from commit 845f746)
(cherry picked from commit 7d66154)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants