Failures in lifecycle transitions are not correctly handled

I recently realized that there is a bug in the current implementation of lifecycle transitions.
However, it's unclear to me what should be the expected behavior, so opening a ticket to discuss.

The problem is the following.
Let's assume that the node is currently in the active state and there is a "deactivate" transition request.

A lifecycle transition is requested, thus calling `change_state` function https://github.com/ros2/rclcpp/blob/master/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp#L379-L448

The code will then execute `rcl_lifecycle_trigger_transition_by_id`, trying to transition to the `deactivating` state.

In the RCL function `_trigger_transition` https://github.com/ros2/rcl/blob/master/rcl_lifecycle/src/rcl_lifecycle.c#L349-L371 we update the current state `state_machine->current_state = transition->goal;` before publishing notifications.
If the call to `rcl_publish` fails (and it can fail) the function will immediately return with an error code and this will result in immediately aborting the `change_state` function call https://github.com/ros2/rclcpp/blob/master/rclcpp_lifecycle/src/lifecycle_node_interface_impl.hpp#L396-L401 with an error along the lines of `"Unable to start transition %u from current state %s: Failed to publish`.

The problem is that, although the transition failed to start, we already updated the state machine so the node is now in the `deactivating`state.
The `on_deactivate` user callback has not been invoked.
Future requests either to activate or to deactivate will be rejected because they are not valid transitions from this state.

The first easy solution I thought of was to just move the `state_machine->current_state = transition->goal;` line after publishing the notification.
In this way, if the transition really failed to start, the node would remain in the original state.

The same situation can also happen while finalizing the transition.
However, here we can't stay in the current state (which would be "deactivating"), we can't even go back to the initial one, as we already invoked the user callback.

What's your thought on this problem?
Given the fact that transitions are made of multiple stages and that they all can fail, how should a user deal with this?
Should we always bring the node to the error state?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failures in lifecycle transitions are not correctly handled #1880

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Failures in lifecycle transitions are not correctly handled #1880

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions