Fix race in ros2action test_cli by pbaughman · Pull Request #454 · ros2/ros2cli

pbaughman · 2020-02-20T16:55:50Z

This test was trying to talk to the fibonacci action server before it
was ready to be talked to.

Fix the test by having the fibonacci action server write to stdout when
it's ready and having the test wait for the stdout to start

I believe this was the cause of the test failures in #376 but I've found the ros2cli tests are very fragile, so I'd like to try to merge this separately so there aren't confounding factors if the tests continue to fail.

How to reproduce the original race/failure:

If you delay the Fibonacci action server program's startup by adding

import time
time.sleep(5)

right before main is called here

. . . and also remove the retry_on_failure here you could make test_fibonacci_info fail every time.

Signed-off-by: Pete Baughman peter.c.baughman@gmail.com

This test was trying to talk to the fibonacci action server before it was ready to be talked to. Fix the test by having hte fibonacci action server write to stdout when it's ready and having the test wait for the stdout to start Signed-off-by: Pete Baughman <peter.c.baughman@gmail.com>

pbaughman · 2020-02-20T16:56:18Z

ros2action/test/test_cli.py

            strict=False
        )

-    @launch_testing.markers.retry_on_failure(times=5)


I believe this was a workaround for the race condition that I've fixed above and is no longer necessary

pbaughman · 2020-02-20T16:57:03Z

ros2action/test/fixtures/fibonacci_action_server.py

+        # Spend up to 2 seconds doing all the work we need to have the fibonacci
+        # action server actually talking over DDS before signaling to the test it's
+        # OK to start
+        for _ in range(20):


There may be a way to spin_until_future_complete here, but I just don't know. Is there a future I can wait on that indicates the action server is up, ready, and can be communicated with by other processes before I print out 'fibonacci_action_server has started' ??

pbaughman · 2020-02-20T17:08:05Z

. . . Now i'm seeing failures locally with rmw_cyclonedds_cpp on my branch and also when I build/test using 'master'. . . It feels totally impossible to make any progress 😭

Test failures on my machine on 'master':

[213.948s] test_fibonacci_info[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok                     
[213.948s] test_fibonacci_info_count[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok               
[213.949s] test_fibonacci_info_with_types[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok          
[213.949s] test_info_on_nonexistent_action[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok         
[213.949s] test_list[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                             
[213.949s] test_list_count[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                       
[213.949s] test_list_with_types[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                  
[213.949s] test_send_fibonacci_goal[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL              
[213.950s] test_send_fibonacci_goal_with_feedback[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL

pbaughman · 2020-02-20T17:18:34Z

@hidmic Side-note: I'm not sure if this is being tracked anywhere, but the pytest plugin for launch tests makes it really hard to see what has failed here. Instead of the results showing 9 test cases x however many DDS implementations there are, we just get one failing test called test_cli.py and you have to cat the logs to see what actually happened. You can't figure it out programmatically.

pbaughman · 2020-02-20T17:25:15Z

Looks like rmw_connext_cpp is also failing on my machine when I try to build/test the master branch:

[216.498s] test_fibonacci_info[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok                     
[216.499s] test_fibonacci_info_count[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok               
[216.499s] test_fibonacci_info_with_types[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok          
[216.499s] test_info_on_nonexistent_action[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok         
[216.499s] test_list[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                             
[216.499s] test_list_count[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                       
[216.500s] test_list_with_types[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                  
[216.500s] test_send_fibonacci_goal[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL              
[216.500s] test_send_fibonacci_goal_with_feedback[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL

Does

PRESCstReaderCollator_storeSampleData:deserialize sample error in topic 'DISCParticipant' with type 'DISCParticipantP...storeSampleData:deserialize sample error in topic 'DISCParticipant' with type 'DISCParticipantParameter'\n/fibonacci\n",

mean anything to anybody?

pbaughman · 2020-02-20T17:26:40Z

@sloretz Maybe we can run CI on this and see if we get lucky? I'm so sorry about all of the bad CI jobs we're spinning up for this issue. This package is extremely difficult for me to wrangle.

The changes here depend on ros2/launch#386 which was just merged, so we might need to wait for those changes to propegate into CI

dirk-thomas · 2020-02-20T18:47:10Z

Fix the test by having the fibonacci action server write to stdout when
it's ready and having the test wait for the stdout to start

I don't think this is the right way to fix this test. We couldn't recommend the same approach to a user trying to use actions in their own code.

pbaughman · 2020-02-20T18:51:11Z

@dirk-thomas Can you suggest something different? The Fibonacci action server program was created specifically as a test fixture for this test, and must provide some mechanism to indicate that it has finished starting up or else the test will race and fail.

We couldn't recommend the same approach to a user trying to use actions in their own code.

. . . but the current test starts the action server in a process and attempts to communicate with it before it's finished starting up with no feedback whatsoever and just fails with a cryptic error sometimes. Would you recommend that approach to a user?

pbaughman · 2020-02-20T18:53:45Z

@dirk-thomas Would you accept a PR where we disable the test because it's broken? That would satisfy my need to get this PR unstuck

pbaughman · 2020-02-21T23:36:43Z

@dirk-thomas Can you please further clarify your objection to this fix? The code in question is a test fixture meant to facilitate the testing of the ros2 actions cli tool. In order for the tests to start, the test fixture must be up and running. The mechanism I've used here is a message printed over stdout, which I don't think is very unreasonable.

The test fixture in question is not meant as a demo or a suggestion to any user about how to create a ROS2 action server - it's a dummy process for the test to run against.

I would be happy to use some other synchronization method suggested, but thus far no suggestion has been offered.

I see that this issue is also causing CI failures in this PR as well as my other PR. If we can't agree on a way to improve the test, I suggest disabling it so we can make progress elsewhere.

pbaughman · 2020-02-25T16:45:34Z

@sloretz Do you have any opinions about this? It should fix the test that's blocking #376 from being merged

jacobperron · 2020-02-25T23:00:14Z

@pbaughman I spoke with @wjwwood offline about this issue and we agree that the correct thing to do in this situation (with the CLI daemon) is to retry the affected tests until discovery happens. I believe the bug in this case is that the retry_on_failure decorator is not waiting between retries and so not enough time is left for discovery to happen. I've proposed a delay parameter in ros2/launch#390 that we should be able to use here to help with the flaky tests.

jacobperron · 2020-02-25T23:22:55Z

As an alternative to this PR, I've updated all the tests involving the CLI daemon in #459.

pbaughman · 2020-02-25T23:31:22Z

@jacobperron Sure. My only dissapointment is that the whole point of having a "Ready To Test" action as part of the launch description was that the test author would be forced to think about things like this and wouldn't have tests that try to run before all of their preconditions are met. This was a big problem with test actions in launch files using rostest in ROS1. If we're going to put "ReadyToTest" in the wrong spot and then fill the tests with retry logic, it doesn't feel like we've made much progress.

Having said that - let's do whatever we need to do to unstick this.

jacobperron · 2020-02-25T23:36:59Z

@pbaughman I agree with the sentiment, but I think these CLI tests are an edge case because of the underlying daemon that they use. For example, in a different scenario where we have access to the node objects we are testing, it's more feasible to condition a launch event on the nodes discovering each other.

wjwwood · 2020-02-25T23:37:03Z

If we're going to put "ReadyToTest" in the wrong spot and then fill the tests with retry logic, it doesn't feel like we've made much progress.

Well, this is still an improvement. It just so happens that the thing we're testing for is the precondition. We could write it like this:

run fixture
wait for ros2 action info to show right info or fail
ready_fn
assert true

But instead we're writing it like this:

run fixture
ready_fn
retry until ros2 action info shows the the right info or fail

Other tests will be able to do:

run fixture
wait for something like ros2 topic list/ros2 service list/ros2 action info .../etc... to show expected info, or fail
ready_fn
check something more complicated, like actually call the service/action

wjwwood · 2020-02-25T23:40:35Z

Also, we could additionally (to what @jacobperron proposes in #459) wait for the server to be available (using an action client), but that would influence the output of ros2 action info so we'd have to retry until the client disappears (which is asynchronous as well).

AFAICT, the only reason this pull request helps is that it takes some time to complete your new code in the action server and that allows discovery to happen. If you artificially added a lag to the daemon so it did not report new discovery information until sometime in the future you'd still see this problem with this pull request (imo).

pbaughman · 2020-02-25T23:42:05Z

@wjwwood just an aside:

wait for something like ros2 topic list/ros2 service list/ros2 action info .../etc... to show expected info, or fail

I had hoped to write some launch_testing event handlers to make it easy to wait for a certain topic to be available or for data to appear on a certain topic - as these are common things you want to wait for before starting a test, but I never got the chance when I was contracting at Apex. I'm hopeful that they or someone else will want to make that improvement at some point.

pbaughman · 2020-02-25T23:43:38Z

Closed in favor of #459

pbaughman commented Feb 20, 2020

View reviewed changes

pbaughman requested a review from hidmic February 20, 2020 17:08

pbaughman mentioned this pull request Feb 20, 2020

Warn that old-style ready_fn and test attributes will be deprecated ros2/launch#346

Merged

jacobperron self-requested a review February 25, 2020 21:51

jacobperron mentioned this pull request Feb 25, 2020

Add delay parameter to retry_on_failure decorator ros2/launch#390

Merged

jacobperron mentioned this pull request Feb 25, 2020

Add delay when retrying tests involving the CLI daemon #459

Merged

pbaughman closed this Feb 25, 2020

Conversation

pbaughman commented Feb 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to reproduce the original race/failure:

Uh oh!

pbaughman Feb 20, 2020

Choose a reason for hiding this comment

Uh oh!

pbaughman Feb 20, 2020

Choose a reason for hiding this comment

Uh oh!

pbaughman commented Feb 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pbaughman commented Feb 20, 2020

Uh oh!

pbaughman commented Feb 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pbaughman commented Feb 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dirk-thomas commented Feb 20, 2020

Uh oh!

pbaughman commented Feb 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pbaughman commented Feb 20, 2020

Uh oh!

pbaughman commented Feb 21, 2020

Uh oh!

pbaughman commented Feb 25, 2020

Uh oh!

jacobperron commented Feb 25, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jacobperron commented Feb 25, 2020

Uh oh!

pbaughman commented Feb 25, 2020

Uh oh!

jacobperron commented Feb 25, 2020

Uh oh!

wjwwood commented Feb 25, 2020

Uh oh!

wjwwood commented Feb 25, 2020

Uh oh!

pbaughman commented Feb 25, 2020

Uh oh!

pbaughman commented Feb 25, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pbaughman commented Feb 20, 2020 •

edited

Loading

pbaughman commented Feb 20, 2020 •

edited

Loading

pbaughman commented Feb 20, 2020 •

edited

Loading

pbaughman commented Feb 20, 2020 •

edited

Loading

pbaughman commented Feb 20, 2020 •

edited

Loading

jacobperron commented Feb 25, 2020 •

edited

Loading