Skip to content

Failures in lifecycle transitions are not correctly handled #1880

@alsora

Description

@alsora

I recently realized that there is a bug in the current implementation of lifecycle transitions.
However, it's unclear to me what should be the expected behavior, so opening a ticket to discuss.

The problem is the following.
Let's assume that the node is currently in the active state and there is a "deactivate" transition request.

A lifecycle transition is requested, thus calling change_state function https://github.com/ros2/rclcpp/blob/master/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp#L379-L448

The code will then execute rcl_lifecycle_trigger_transition_by_id, trying to transition to the deactivating state.

In the RCL function _trigger_transition https://github.com/ros2/rcl/blob/master/rcl_lifecycle/src/rcl_lifecycle.c#L349-L371 we update the current state state_machine->current_state = transition->goal; before publishing notifications.
If the call to rcl_publish fails (and it can fail) the function will immediately return with an error code and this will result in immediately aborting the change_state function call https://github.com/ros2/rclcpp/blob/master/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp#L396-L401 with an error along the lines of "Unable to start transition %u from current state %s: Failed to publish.

The problem is that, although the transition failed to start, we already updated the state machine so the node is now in the deactivatingstate.
The on_deactivate user callback has not been invoked.
Future requests either to activate or to deactivate will be rejected because they are not valid transitions from this state.

The first easy solution I thought of was to just move the state_machine->current_state = transition->goal; line after publishing the notification.
In this way, if the transition really failed to start, the node would remain in the original state.

The same situation can also happen while finalizing the transition.
However, here we can't stay in the current state (which would be "deactivating"), we can't even go back to the initial one, as we already invoked the user callback.

What's your thought on this problem?
Given the fact that transitions are made of multiple stages and that they all can fail, how should a user deal with this?
Should we always bring the node to the error state?

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions