Fix the flaky dynamic params test thats been causing CI to fail#568
Fix the flaky dynamic params test thats been causing CI to fail#56813 commits merged intomasterfrom unknown repository
Conversation
The ServiceClient class will be used Lifecycle helper functions that will be put in nav2_util, so we need it there. It seemed more of utility class than a Task anyhow.
Add some helper functions for generating node names. These will be used to fix some potential problems with the ServiceClient and the upcoming LifecycleServiceClient.
ServiceClient reuses the service name as the temporary node name. However, service names frequently have / characters in them and that is invalid for a node name. Also, it seems likely we'll have multiple clients using the same service. This would result in a name collision. This now adds a random suffix to the internal node name.
ServiceClient can now use a supplied node instead of always creating it's own. This allows us to create a higher level service client that interacts with multiple services using just one generated node.
The unit tests will be handled at a higher level by a subsequent checkin
I need to add another function to it, and it could stand some restructuring before hand.
Adding a function to split a string at delimiters. The intent is to use this later so tests can get ':' seperated string of lifecycle nodes from and environment variable, and spin them all up.
|
Sorry for all the noise on this PR. It seems Github doesn't present commits in the order they are in the branch. Instead it shows them according to "author date". |
Codecov Report
@@ Coverage Diff @@
## master #568 +/- ##
==========================================
+ Coverage 31.9% 32.04% +0.13%
==========================================
Files 220 230 +10
Lines 9070 9206 +136
Branches 3044 3112 +68
==========================================
+ Hits 2894 2950 +56
- Misses 4288 4310 +22
- Partials 1888 1946 +58
Continue to review full report at Codecov.
|
|
I'd also like feedback on whether breaking the large PR into individually reviewable commits like this worked to make the review more manageable. |
| #define NAV2_TASKS__COSTMAP_SERVICE_CLIENT_HPP_ | ||
|
|
||
| #include "nav2_tasks/service_client.hpp" | ||
| #include "nav2_util/service_client.hpp" |
There was a problem hiding this comment.
I agree that we should move this to util. Good idea.
| { | ||
|
|
||
| /// Transition the given lifecycle nodes to the ACTIVATED state in order | ||
| void bringupLifecycleNodes(const std::vector<std::string> & node_names); |
There was a problem hiding this comment.
I suppose this is here as a convenience to quickly get the lifecycle nodes to the ACTIVE state. However, it doesn't have the corresponding shutdownLifecycleNodes. Also, it seems a bit like test code rather than general code to be used with lifecycle nodes.
There was a problem hiding this comment.
I agree. I wanted to make it super easy to write a test that uses lifecycle nodes, because I expect we will want to do that a lot to improve test stability.
I also didn't want to invest too much time into managing lifecycle states until you'd had a chance to look at this. I figured you probably had a better way to do this in the work you'd been doing.
There was a problem hiding this comment.
As I was went through this to separate test code from production code, I found I didn't have any test code to put in the library, so I'm going to defer creating that library.
This bringupLifecycleNodes function ended up being used by the command line utility I created, which I consider production code as opposed to just test code.
| explicit ServiceClient(const std::string & name) | ||
| explicit ServiceClient( | ||
| const std::string & name, | ||
| const rclcpp::Node::SharedPtr & provided_node = rclcpp::Node::SharedPtr()) |
There was a problem hiding this comment.
I've noticed that the ROS2 code uses SharedPtr and not SharedPtr&. Seems to be that if one is going to hold on to this value using a local node_ member, it should be reference counted.
There was a problem hiding this comment.
I believe that when we do the copy assignment of the reference to the member variable at node_ = provided_node;, that is when the reference will get incremented.
By passing in a reference, we only avoid adding a temporary reference in the function parameter itself.
In other words, by passing the shared ptr by value instead of by ref, we create the parameter provided_node and increment the reference count. Then assign it to the member variable and increment the reference count again. When the function returns the parameter is destroyed and the reference counted is decremented again.
We'd end up taking the shared ptr lock an extra two times for the sole purpose of the temporary parameter object.
nav2_util/src/lifecycle_utils.cpp
Outdated
| LifecycleServiceClient sc(node_name); | ||
| sc.ChangeState(lifecycle_msgs::msg::Transition::TRANSITION_CONFIGURE); | ||
| sc.ChangeState(lifecycle_msgs::msg::Transition::TRANSITION_ACTIVATE); | ||
| while (sc.GetState() != lifecycle_msgs::msg::State::PRIMARY_STATE_ACTIVE) { |
There was a problem hiding this comment.
If the Configure or Activate failed, this code will be stuck in the loop. Perhaps instead of looping until ACTIVE the code could simply ensure that each state transition was successful (and therefore wouldn't require the GetState at the end).
There was a problem hiding this comment.
Yes. You're right. I was approaching this from the test mindset, where CMake or the user will Ctrl-C this if it takes too long.
I had to add the check for the current state being ACTIVE instead of just counting on the successful call to activate, but I don't remember why exactly. I'll retest to figure out why that was necessary.
There was a problem hiding this comment.
Adding a timeout could help.
nav2_util/src/node_utils.cpp
Outdated
| return sanitizeNodeName(prefix) + "_" + timeToString(8); | ||
| } | ||
|
|
||
| rclcpp::Node::SharedPtr generateInternalNode(const std::string & prefix) |
There was a problem hiding this comment.
Is this intended to be code generally available for the implementation to use or is it support code for testing (or both perhaps)? If it's testing-specific code, should we have a place for such code or is util the right place? I'm wondering if it's worth separating.
There was a problem hiding this comment.
I was thinking it could be used anywhere we need to create a node internal to a class without bothering the client about that detail. So it would get used by ServiceClient, maybe DynamicParamsClient, and a few other places. It wasn't meant to be just for test code.
With some of the talk on the logger exception thread about nodes have unique names, I was afraid that these classes could end up creating nodes with identical names. I figured the solution to that was to add a randomized string to the end of the name, and I was trying to encapsulate the algorithm for that in this function.
| { | ||
|
|
||
| class globalLocalizationServiceClient : public ServiceClient<std_srvs::srv::Empty> | ||
| class globalLocalizationServiceClient : public nav2_util::ServiceClient<std_srvs::srv::Empty> |
There was a problem hiding this comment.
Should global be uppercase?
There was a problem hiding this comment.
Yes. Fixed. As well as a bunch of other name violations in the nav2_utils package. I addressed all the functions in the main nav2_utils library, but left the libraries that came from AMCL and map server
orduno
left a comment
There was a problem hiding this comment.
Looks very good. Splitting into multiple commits did help with the review.
Service calls still hang for some unknown reason. The only known fix is to time it out and retry.
|
This change is necessary to make the dynamic params test stable, but is not enough in itself. The dynamic params test still fails sometimes when run concurrently with other tests. This will be addressed in another PR |
| string node_name(potential_node_name); | ||
| // read this as `replace` characters in `node_name` `if` not alphanumeric. | ||
| // replace with '_' | ||
| replace_if(begin(node_name), end(node_name), |
| { | ||
|
|
||
| void bringupLifecycleNode(const std::string & node_name) | ||
| #define RETRY(fn, retries) \ |
There was a problem hiding this comment.
There are other places in the code besides life-cycle that could benefit from this retry method. For example for Auto Localization currently I'm using the BT retry node for this purpose but we might need a simple retry method for other tasks.
There was a problem hiding this comment.
I agree. I'll do that as a separate PR. It seems like a big enough change that it should have a focused review and not just tacked on as an addendum to this one.
mhpanah
left a comment
There was a problem hiding this comment.
We can make the retry method in the utils to be available to not just the life-cycle node.
NOTE: I recommend reviewing this commit by commit rather than as one big PR
This is quite a big PR, but I cleaned up the commit history, so you can review it commit by commit. Each commit is an independent change if you follow it from oldest to newest. The newer ones build on the older ones.
Basic Info
Description of contribution in a few bullet points
Fixes dynamic params test that has been failing in CI. The test was failing because the "helper" nodes were not ready yet at the time the test started. I fixed this by converting the "helper" nodes to lifecycle nodes and waiting in the test till they were activated.
This refactors and adds a bunch of support code, so this mechanism can be easily used on other tests that need it