
[Humble] Revert "Add diagnostics (backport #820) (#922)"#1015

Merged
bmagyar merged 1 commit into ros-controls:humble from schornakj:pr-revert-922
May 19, 2023

Conversation

@schornakj
Contributor

@schornakj schornakj commented May 9, 2023

This fixes the most frequent cause of the segfault described in #979.

In Humble, accessing the state of a lifecycle node is not thread-safe. Because the diagnostic updater added in #922 reads the node state from a timer callback, separately from the main update() cycle, it eventually causes the controller manager to crash.

Removing the diagnostic updater prevents this situation from happening. We can add it back in once this PR to resolve the underlying thread safety issue in Humble rclcpp has been merged.
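The race described above can be sketched without ROS. Below is a minimal, hedged illustration of the *fixed* access pattern: `GuardedState` and its field names are hypothetical stand-ins for the lifecycle node state, with one thread playing the role of `update()` and another playing the diagnostic timer. In Humble rclcpp the analogous read was not synchronized, so the reader thread could observe the state mid-modification.

```cpp
#include <mutex>
#include <string>
#include <thread>

// Hypothetical "node state" with two fields that must be observed together.
struct NodeState {
  int id = 0;
  std::string label = "inactive";
};

// Copy-out-under-lock wrapper: a reader always gets a consistent snapshot.
class GuardedState {
public:
  void set(int id, std::string label) {
    std::lock_guard<std::mutex> lock(mutex_);
    state_.id = id;
    state_.label = std::move(label);
  }
  NodeState get() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return state_;  // copied under the lock, so no torn reads
  }
private:
  mutable std::mutex mutex_;
  NodeState state_;
};

// Writer plays update(); reader plays the diagnostic timer callback.
// Returns true if every snapshot satisfied the invariant
// (label matches the parity of id).
bool demo() {
  GuardedState s;
  bool consistent = true;
  std::thread writer([&] {
    for (int i = 1; i <= 10000; ++i) {
      s.set(i, i % 2 ? "active" : "inactive");
    }
  });
  std::thread reader([&] {
    for (int i = 0; i < 10000; ++i) {
      const NodeState snap = s.get();
      if (snap.label != (snap.id % 2 ? "active" : "inactive")) {
        consistent = false;
      }
    }
  });
  writer.join();
  reader.join();
  return consistent;
}
```

Dropping either `lock_guard` re-creates the unsynchronized cross-thread read that this revert removes from the Humble branch.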

This reverts commit dca10f4.

Signed-off-by: Joe Schornak <joe.schornak@gmail.com>
@destogl
Member

destogl commented May 9, 2023

If this is just temporary, then we should just add a guard for this instead. Reverting changes the user-facing interface, and we would have to change it twice. Is there another solution?

@bmagyar
Member

bmagyar commented May 14, 2023

Given that things seem to be moving fairly fast on the referenced rclcpp PR, I think the simplest for us is to sit this one out. Syncs on the buildfarm don't happen overnight unless it's literally the day before the sync ;)
Anything we remove and release now will likely land the same time as the rclcpp fix.

@schornakj
Contributor Author

schornakj commented May 18, 2023

> Given that things seem to be moving fairly fast on the referenced rclcpp PR, I think the simplest for us is to sit this one out. Syncs on the buildfarm don't happen overnight unless it's literally the day before the sync ;)
> Anything we remove and release now will likely land the same time as the rclcpp fix.

While the rclcpp folks have started reviewing ros2/rclcpp#2183, they've also said that there isn't a firm timeline on when they'll be able to get it in because they're focusing on the Iron release for the next few weeks.

The issue this PR resolves is a critical failure, and I think it's important to get some kind of fix out even if it means users have to build the package from source. Without this fix the controller manager can segfault at random while the robot is running, and the longer the system runs, the more likely the segfault becomes. For the robots in our lab we see segfaults about once per day per robot without this fix.

> If this is just temporary, then we should just add a guard for this instead. Reverting changes the user-facing interface, and we would have to change it twice. Is there another solution?

Tracking down every place where the controller state is accessed and wrapping it in a mutex lock would be much more time-intensive and invasive than reverting the problematic commit.
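For a sense of the trade-off being discussed: if the shared state were a single trivially-copyable value, an atomic would make concurrent reads safe without touching every access site with a mutex. A hedged sketch, where `AtomicState` and the enum values are hypothetical and not the actual rclcpp types:

```cpp
#include <atomic>

// Hypothetical single-value lifecycle state. The difficulty in Humble is that
// the real lifecycle state is a larger object, so this shortcut does not apply
// and every access site would need a lock instead.
enum class LifecycleState { Unconfigured, Inactive, Active, Finalized };

class AtomicState {
public:
  // Called from the update() thread.
  void set(LifecycleState s) { state_.store(s, std::memory_order_release); }
  // Safe to call concurrently from a diagnostic timer thread.
  LifecycleState get() const { return state_.load(std::memory_order_acquire); }
private:
  std::atomic<LifecycleState> state_{LifecycleState::Unconfigured};
};
```

Since the real state object is not a single atomic-friendly value, locking every call site is the invasive option the comment above argues against, which is why the revert was preferred.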

More broadly, I think the diagnostic publisher feature should not have been backported to Humble, since it both breaks ABI and exposes a thread race condition that causes a segfault. I don't think it's worth keeping the diagnostic publisher if it causes my robots to crash every day.

@bmagyar
Member

bmagyar commented May 19, 2023

Points taken. Thanks for the thorough feedback!

We do intend to have some breaking changes in Humble from time to time since we'd really like it to be a feature-solid LTS, so let's not rule diagnostics out completely :)

@bmagyar bmagyar merged commit 26a027a into ros-controls:humble May 19, 2023
@schornakj
Contributor Author

> Points taken. Thanks for the thorough feedback!

Thank you!

> We do intend to have some breaking changes in Humble from time to time since we'd really like it to be a feature-solid LTS, so let's not rule diagnostics out completely :)

Understood and definitely reasonable -- I was talking to some of my colleagues today about how to achieve this type of balance.

@tylerjw
Contributor

tylerjw commented May 20, 2023

@bmagyar have you seen my post on how MoveIt approaches the LTS Humble? https://github.com/orgs/ros-planning/discussions/2190

I know in the past this project has been much more willing to break API on Humble than MoveIt has. If we included ros2_control in our apt repo, would it make sense for this project to be more stable on Humble and to encourage users who want the latest features to use that repo instead? Then you wouldn't be releasing breaking changes into the official buildfarm and going against ROS's guidelines for stability.

Also, @schornakj, amazing work here getting this issue sorted from all the different angles.

@bmagyar
Member

bmagyar commented May 20, 2023

FYI @tonynajjar

@bmagyar
Member

bmagyar commented May 20, 2023

Humble release without diagnostics: ros/rosdistro#37294

@tonynajjar
Contributor

Ugh, apologies for introducing this bug with the diagnostics PR; it felt like a harmless small improvement at the time. I see you've already got this figured out, but if there is some way I can still help, please let me know.
For the record, I've been using ros2_control on Humble with 3 controllers since that PR landed and have not encountered that crash 🤔

@bmagyar
Member

bmagyar commented May 23, 2023

Forgot to mention that I put the release out straight away that day: ros/rosdistro#37294

flochre pushed a commit to flochre/ros2_control that referenced this pull request Jul 5, 2023
This reverts commit dca10f4.

Signed-off-by: Joe Schornak <joe.schornak@gmail.com>
bmagyar added a commit that referenced this pull request Mar 26, 2025