Skip to content

prototype of lifecycle bond system#1869

Closed
SteveMacenski wants to merge 15 commits intomasterfrom
bond_lifecycle
Closed

prototype of lifecycle bond system#1869
SteveMacenski wants to merge 15 commits intomasterfrom
bond_lifecycle

Conversation

@SteveMacenski
Copy link
Copy Markdown
Member

@SteveMacenski SteveMacenski commented Jul 14, 2020


Basic Info

Info Please fill out this column
Ticket(s) this addresses #1754
Primary OS tested on Ubuntu,
Robotic platform tested on gazebo simulation

Description of contribution in a few bullet points

  • Added bond support to the major lifecycle servers
  • Kills all the nodes if one of the critical relevant nodes fails, then will wait for stimulus to restart up
  • Parameterized to be able to disable
  • If misconfigured (e.x. can't communicate to a non-existent server) will bring down everything for safety

Description of documentation updates required from your changes

  • Add to migration guide
  • Add capability docs to let people know this is happening and why

Future work that may be required in bullet points

Before end of WIP:

  • test working basic (timeouts work and settable/disableable, will actually bring system down, can be brought back up, clean shutdown and startup, no false alarms, delayed/no connection on startup clean warn)
  • should keep the bond option or always?
  • If always (and if misconfigured) how does it handle never connecting on startup (or delayed connection)?
  • Make sure all forms of shutdown have this triggered to break the bond
  • option for duration before killing (which could be set inf to disable)

@SteveMacenski SteveMacenski requested a review from naiveHobo July 18, 2020 00:05
@SteveMacenski
Copy link
Copy Markdown
Member Author

@Marwan99 please review. Its not quite working yet (one last quirk) & need to add tests, but look over the basics of what's going on

Copy link
Copy Markdown
Contributor

@Marwan99 Marwan99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in principle.

@SteveMacenski
Copy link
Copy Markdown
Member Author

SteveMacenski commented Jul 20, 2020

tested:

  • Mess with timeout values parameterization working
  • Clean exit
  • 0 valued timeout disables
  • is properly brought down if a server crashes while operating or breaks bone
  • All up, then transition down and back up again 2 times without issues
  • All up and stable for 1 hour
  • comes up 100% of the time without crashing or timing out

Needs testing / working

  • test coverage (manager: fake node work up and down; param set to 0, lifecycle wrapper: covered by integration tests)

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 21, 2020

Codecov Report

Merging #1869 into master will decrease coverage by 0.02%.
The diff coverage is 97.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1869      +/-   ##
==========================================
- Coverage   70.14%   70.11%   -0.03%     
==========================================
  Files         218      218              
  Lines       10586    10649      +63     
==========================================
+ Hits         7426     7467      +41     
- Misses       3160     3182      +22     
Impacted Files Coverage Δ
nav2_util/include/nav2_util/lifecycle_node.hpp 100.00% <ø> (ø)
nav2_lifecycle_manager/src/lifecycle_manager.cpp 93.64% <95.83%> (+0.84%) ⬆️
nav2_amcl/src/amcl_node.cpp 84.01% <100.00%> (-0.29%) ⬇️
nav2_bt_navigator/src/bt_navigator.cpp 80.25% <100.00%> (+0.12%) ⬆️
nav2_controller/src/nav2_controller.cpp 82.05% <100.00%> (-0.43%) ⬇️
nav2_map_server/src/map_server/map_server.cpp 90.36% <100.00%> (+0.11%) ⬆️
nav2_planner/src/planner_server.cpp 72.26% <100.00%> (+0.20%) ⬆️
nav2_recoveries/src/recovery_server.cpp 92.40% <100.00%> (+0.09%) ⬆️
nav2_util/src/lifecycle_node.cpp 100.00% <100.00%> (ø)
nav2_waypoint_follower/src/waypoint_follower.cpp 73.50% <100.00%> (+0.22%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69977cd...19e1dcc. Read the comment docs.

@naiveHobo
Copy link
Copy Markdown
Contributor

Looks good to me!

@SteveMacenski
Copy link
Copy Markdown
Member Author

I'm super irritated - so I have things generally working (if a server dies or breaks the bond, the lifecycle manager hears about it and triggers things down).

Now the issue is that I can't keep things up stably. I have printouts saying that both directions are sending regular heartbeats but then after the timeout period, every single time one of the servers triggers a failed heartbeat connection to the lifecycle manager checking if they're broken and the lifecycle manager brings down all the nodes.

I think bondcpp is broken or something really weird happening. I see the active regular heartbeats coming in at 4x the rate of the heartbeat timeout so everything should be working fine. I think it has to do with the ROS1 to ROS2 port trying to be fancy and ended up messing alot of things up.

@naiveHobo mind taking a look? See the repos file for my fork of bondcpp to enable lifecycle. Take particular attention at the heartbeat things & the ROS1 versions. I see the Heartbeat() method being triggered by the state machine every time we have a new message from the bond status callback (it calls SisterAlive() or something which in turns calls Heartbeat()). In Heartbeat() for ROS1 it resets the timer without triggering the heartbeat timeout callback but in this one for ROS2, it pulls a flag and resets the timer which calls it and triggers the timeout procedure to bring things down I think.

I think the timeout.cpp from ROS1 is not equivalent to the weird timer reset / cancel functions they added.

@naiveHobo
Copy link
Copy Markdown
Contributor

Yup sure, on it

@SteveMacenski
Copy link
Copy Markdown
Member Author

Awesome! I also filed a ticket in Bond itself since I got deep enough into it to believe that bond itself is broken. Today I'm going to scour the interwebs for another ROS2 application using it, but I'm not hopeful, to see if I'm just messing something up and its working for anyone else.

@SteveMacenski
Copy link
Copy Markdown
Member Author

new PR to replace later today

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants