Skip to content

Fix race in ros2action test_cli#454

Closed
pbaughman wants to merge 1 commit intoros2:masterfrom
pbaughman:fix_ros2action_test_race
Closed

Fix race in ros2action test_cli#454
pbaughman wants to merge 1 commit intoros2:masterfrom
pbaughman:fix_ros2action_test_race

Conversation

@pbaughman
Copy link
Copy Markdown
Contributor

@pbaughman pbaughman commented Feb 20, 2020

This test was trying to talk to the fibonacci action server before it
was ready to be talked to.

Fix the test by having the fibonacci action server write to stdout when
it's ready and having the test wait for the stdout to start

I believe this was the cause of the test failures in #376 but I've found the ros2cli tests are very fragile, so I'd like to try to merge this separately so there aren't confounding factors if the tests continue to fail.

How to reproduce the original race/failure:

If you delay the Fibonacci action server program's startup by adding

import time
time.sleep(5)

right before main is called here

. . . and also remove the retry_on_failure here you could make test_fibonacci_info fail every time.

Signed-off-by: Pete Baughman peter.c.baughman@gmail.com

This test was trying to talk to the fibonacci action server before it
was ready to be talked to.

Fix the test by having hte fibonacci action server write to stdout when
it's ready and having the test wait for the stdout to start

Signed-off-by: Pete Baughman <peter.c.baughman@gmail.com>
strict=False
)

@launch_testing.markers.retry_on_failure(times=5)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this was a workaround for the race condition that I've fixed above and is no longer necessary

# Spend up to 2 seconds doing all the work we need to have the fibonacci
# action server actually talking over DDS before signaling to the test it's
# OK to start
for _ in range(20):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a way to spin_until_future_complete here, but I just don't know. Is there a future I can wait on that indicates the action server is up, ready, and can be communicated with by other processes before I print out 'fibonacci_action_server has started' ??

@pbaughman
Copy link
Copy Markdown
Contributor Author

pbaughman commented Feb 20, 2020

. . . Now i'm seeing failures locally with rmw_cyclonedds_cpp on my branch and also when I build/test using 'master'. . . It feels totally impossible to make any progress 😭

Test failures on my machine on 'master':

[213.948s] test_fibonacci_info[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok                     
[213.948s] test_fibonacci_info_count[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok               
[213.949s] test_fibonacci_info_with_types[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok          
[213.949s] test_info_on_nonexistent_action[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... ok         
[213.949s] test_list[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                             
[213.949s] test_list_count[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                       
[213.949s] test_list_with_types[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                  
[213.949s] test_send_fibonacci_goal[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL              
[213.950s] test_send_fibonacci_goal_with_feedback[rmw_cyclonedds_cpp] (test_cli.TestROS2ActionCLI) ... FAIL

@pbaughman pbaughman requested a review from hidmic February 20, 2020 17:08
@pbaughman
Copy link
Copy Markdown
Contributor Author

@hidmic Side-note: I'm not sure if this is being tracked anywhere, but the pytest plugin for launch tests makes it really hard to see what has failed here. Instead of the results showing 9 test cases x however many DDS implementations there are, we just get one failing test called test_cli.py and you have to cat the logs to see what actually happened. You can't figure it out programmatically.

@pbaughman
Copy link
Copy Markdown
Contributor Author

pbaughman commented Feb 20, 2020

Looks like rmw_connext_cpp is also failing on my machine when I try to build/test the master branch:

[216.498s] test_fibonacci_info[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok                     
[216.499s] test_fibonacci_info_count[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok               
[216.499s] test_fibonacci_info_with_types[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok          
[216.499s] test_info_on_nonexistent_action[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... ok         
[216.499s] test_list[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                             
[216.499s] test_list_count[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                       
[216.500s] test_list_with_types[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL                  
[216.500s] test_send_fibonacci_goal[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL              
[216.500s] test_send_fibonacci_goal_with_feedback[rmw_connext_cpp] (test_cli.TestROS2ActionCLI) ... FAIL

Does

PRESCstReaderCollator_storeSampleData:deserialize sample error in topic 'DISCParticipant' with type 'DISCParticipantP...storeSampleData:deserialize sample error in topic 'DISCParticipant' with type 'DISCParticipantParameter'\n/fibonacci\n",

mean anything to anybody?

@pbaughman
Copy link
Copy Markdown
Contributor Author

pbaughman commented Feb 20, 2020

@sloretz Maybe we can run CI on this and see if we get lucky? I'm so sorry about all of the bad CI jobs we're spinning up for this issue. This package is extremely difficult for me to wrangle.

The changes here depend on ros2/launch#386 which was just merged, so we might need to wait for those changes to propegate into CI

@dirk-thomas
Copy link
Copy Markdown
Member

Fix the test by having the fibonacci action server write to stdout when
it's ready and having the test wait for the stdout to start

I don't think this is the right way to fix this test. We couldn't recommend the same approach to a user trying to use actions in their own code.

@pbaughman
Copy link
Copy Markdown
Contributor Author

pbaughman commented Feb 20, 2020

@dirk-thomas Can you suggest something different? The Fibonacci action server program was created specifically as a test fixture for this test, and must provide some mechanism to indicate that it has finished starting up or else the test will race and fail.

We couldn't recommend the same approach to a user trying to use actions in their own code.

. . . but the current test starts the action server in a process and attempts to communicate with it before it's finished starting up with no feedback whatsoever and just fails with a cryptic error sometimes. Would you recommend that approach to a user?

@pbaughman
Copy link
Copy Markdown
Contributor Author

@dirk-thomas Would you accept a PR where we disable the test because it's broken? That would satisfy my need to get this PR unstuck

@pbaughman
Copy link
Copy Markdown
Contributor Author

@dirk-thomas Can you please further clarify your objection to this fix? The code in question is a test fixture meant to facilitate the testing of the ros2 actions cli tool. In order for the tests to start, the test fixture must be up and running. The mechanism I've used here is a message printed over stdout, which I don't think is very unreasonable.

The test fixture in question is not meant as a demo or a suggestion to any user about how to create a ROS2 action server - it's a dummy process for the test to run against.

I would be happy to use some other synchronization method suggested, but thus far no suggestion has been offered.

I see that this issue is also causing CI failures in this PR as well as my other PR. If we can't agree on a way to improve the test, I suggest disabling it so we can make progress elsewhere.

@pbaughman
Copy link
Copy Markdown
Contributor Author

@sloretz Do you have any opinions about this? It should fix the test that's blocking #376 from being merged

@jacobperron
Copy link
Copy Markdown
Member

jacobperron commented Feb 25, 2020

@pbaughman I spoke with @wjwwood offline about this issue and we agree that the correct thing to do in this situation (with the CLI daemon) is to retry the affected tests until discovery happens. I believe the bug in this case is that the retry_on_failure decorator is not waiting between retries and so not enough time is left for discovery to happen. I've proposed a delay parameter in ros2/launch#390 that we should be able to use here to help with the flaky tests.

@jacobperron
Copy link
Copy Markdown
Member

As an alternative to this PR, I've updated all the tests involving the CLI daemon in #459.

@pbaughman
Copy link
Copy Markdown
Contributor Author

@jacobperron Sure. My only dissapointment is that the whole point of having a "Ready To Test" action as part of the launch description was that the test author would be forced to think about things like this and wouldn't have tests that try to run before all of their preconditions are met. This was a big problem with test actions in launch files using rostest in ROS1. If we're going to put "ReadyToTest" in the wrong spot and then fill the tests with retry logic, it doesn't feel like we've made much progress.

Having said that - let's do whatever we need to do to unstick this.

@jacobperron
Copy link
Copy Markdown
Member

@pbaughman I agree with the sentiment, but I think these CLI tests are an edge case because of the underlying daemon that they use. For example, in a different scenario where we have access to the node objects we are testing, it's more feasible to condition a launch event on the nodes discovering each other.

@wjwwood
Copy link
Copy Markdown
Member

wjwwood commented Feb 25, 2020

If we're going to put "ReadyToTest" in the wrong spot and then fill the tests with retry logic, it doesn't feel like we've made much progress.

Well, this is still an improvement. It just so happens that the thing we're testing for is the precondition. We could write it like this:

  • run fixture
  • wait for ros2 action info to show right info or fail
  • ready_fn
  • assert true

But instead we're writing it like this:

  • run fixture
  • ready_fn
  • retry until ros2 action info shows the the right info or fail

Other tests will be able to do:

  • run fixture
  • wait for something like ros2 topic list/ros2 service list/ros2 action info .../etc... to show expected info, or fail
  • ready_fn
  • check something more complicated, like actually call the service/action

@wjwwood
Copy link
Copy Markdown
Member

wjwwood commented Feb 25, 2020

Also, we could additionally (to what @jacobperron proposes in #459) wait for the server to be available (using an action client), but that would influence the output of ros2 action info so we'd have to retry until the client disappears (which is asynchronous as well).

AFAICT, the only reason this pull request helps is that it takes some time to complete your new code in the action server and that allows discovery to happen. If you artificially added a lag to the daemon so it did not report new discovery information until sometime in the future you'd still see this problem with this pull request (imo).

@pbaughman
Copy link
Copy Markdown
Contributor Author

@wjwwood just an aside:

  • wait for something like ros2 topic list/ros2 service list/ros2 action info .../etc... to show expected info, or fail

I had hoped to write some launch_testing event handlers to make it easy to wait for a certain topic to be available or for data to appear on a certain topic - as these are common things you want to wait for before starting a test, but I never got the chance when I was contracting at Apex. I'm hopeful that they or someone else will want to make that improvement at some point.

@pbaughman
Copy link
Copy Markdown
Contributor Author

Closed in favor of #459

@pbaughman pbaughman closed this Feb 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants