Skip to content

Fix flaky lifecycle node tests#1606

Merged
jacobperron merged 5 commits intomasterfrom
jacob/fix_1605
Mar 30, 2021
Merged

Fix flaky lifecycle node tests#1606
jacobperron merged 5 commits intomasterfrom
jacob/fix_1605

Conversation

@jacobperron
Copy link
Copy Markdown
Member

Apparently, the topics and services that LifecycleNode provides are not
available immediately after creating a node.

Fix flaky tests by accounting for some delay between the LifecycleNode
constructor and queries about its topics and services.

Fixes #1605

Apparently, the topics and services that LifecycleNode provides are not
available immediately after creating a node.

Fix flaky tests by accounting for some delay between the LifecycleNode
constructor and queries about its topics and services.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
@jacobperron jacobperron self-assigned this Mar 30, 2021
Copy link
Copy Markdown
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Contributor

@clalancette clalancette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit inline, but I don't feel strongly about it. Seems good to me with green CI.

Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
@jacobperron
Copy link
Copy Markdown
Member Author

jacobperron commented Mar 30, 2021

PR job failure looks unrelated (a test failure in rclcpp). Here's CI for rclcpp_lifecycle:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status (unrelated CMake warning)

@jacobperron jacobperron merged commit f986ca3 into master Mar 30, 2021
@delete-merged-branch delete-merged-branch bot deleted the jacob/fix_1605 branch March 30, 2021 21:42
@jacobperron
Copy link
Copy Markdown
Member Author

jacobperron commented Mar 30, 2021

I just noticed that this rclcpp test is also flaky due to a similar pattern of creating a ROS entity and then immediately checking that it exists:

auto publisher = node->create_publisher<test_msgs::msg::BasicTypes>(topic_name, qos);
// List should have one item
auto publisher_list = node->get_publishers_info_by_topic(fq_topic_name);
ASSERT_EQ(publisher_list.size(), (size_t)1);

Seems like this pattern is not reliable since the switch to CycloneDDS. I'm not sure we guarantee anywhere that the graph query API will contain entities that were created right before the query, even if they were created by the same node making the query.

@clalancette
Copy link
Copy Markdown
Contributor

Seems like this pattern is not reliable since the switch to CycloneDDS. I'm not sure we guarantee anywhere that the graph query API will contain entities that were created right before the query, even if they were created by the same node making the query.

Hm, good point. @eboasson if you get a chance, could you share some thoughts here? We don't have any guarantees around this, but the behavior is different than we used to see. So we're wondering if this is expected, and we should update the tests, or if it indicates an issue elsewhere.

@eboasson
Copy link
Copy Markdown
Contributor

eboasson commented Apr 1, 2021

@clalancette I checked, the reason is that the Cyclone RMW layer updates the graph layer asynchronously from the discovery thread. There's no particular reason the create/destroy subscription and publisher operations can't update the graph synchronously, one would just have to make sure to filter out any local entities in the discovery thread.

@clalancette
Copy link
Copy Markdown
Contributor

@clalancette I checked, the reason is that the Cyclone RMW layer updates the graph layer asynchronously from the discovery thread. There's no particular reason the create/destroy subscription and publisher operations can't update the graph synchronously, one would just have to make sure to filter out any local entities in the discovery thread.

Thanks for looking into it, it is appreciated. So it sounds this isn't unexpected, it is just different from what we had before. I guess in order to be robust to different implementations, the tests really shouldn't expect that the topics and services are immediately available. @jacobperron does that make sense to you?

@jacobperron
Copy link
Copy Markdown
Member Author

Makes sense to me. I'm working on adding API to rcl to make this kind of testing pattern easier, e.g. rcl_wait_for_publishers(). Stay tuned...

@jacobperron
Copy link
Copy Markdown
Member Author

Stay tuned...

ros2/rcl#907

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test_lifecycle_node tests

5 participants