Skip to content

unsafe code in signal handlers #604

@wjwwood

Description

@wjwwood

We discovered this due to a new flakey test: ros2/build_farmer#152

Basically we're locking mutex's in the signal handler in many ways (one of which is by triggering guard conditions in rmw_fastrtps_cpp's code), which can lead to deadlocks if those functions are reentered during signal handling.

After a lot of tracing and digging, @hidmic and I are pretty certain this is the core issue and that it could be an issue in a lot of other places, but the above flakey tests were just unlucky enough to setup an environment in which it happens fairly regularly.

We need to fix this for Crystal, so it's a high priority, but we have a plan to fix it.

Basically we're going to take all of the code which may lock a mutex (and therefore may deadlock) in the signal handler and move it into a global class which will call that code from a dedicated thread. The thread will be started when the signal handler is installed and joined when it is uninstalled or when the global class is destructed. In the thread it will simply wait to be safely notified from the signal handler and when notified it will do the necessary work like triggering guard conditions and running shutdown() (and therefore on_shutdown() callbacks) for each context.

@hidmic is going to get started on that, and I'm going to start working on refactoring of the notification system for rclcpp::sleep_for to make it context specific, removing yet more global state of which the signal handler class no longer needs to keep track.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions