
Stop client component from hogging executor#187

Merged
dhood merged 8 commits into master from client_dont_retry on Nov 8, 2017

Conversation

@dhood (Member) commented Nov 3, 2017

Using the API composition demo, if the client component is started before the server, then a request to load the server component cannot be processed because the client is hogging the executor waiting for the service.

To reproduce:

ros2 run composition api_composition

In another terminal:

ros2 run composition api_composition_cli -- composition composition::Client
Sending request...
Waiting for response...
Result of load_node: success = true
ros2 run composition api_composition_cli -- composition composition::Server
Sending request...
Waiting for response...
<never loads>
  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

An alternative fix is to make the API composition demo use a multithreaded executor. That change may independently make sense, but I think the client component should be more co-operative regardless.

Rather than change the existing test, I added a new regression test with the client being started first. However, this PR should also address flakiness of the existing test, if the client waiting was the cause of the server sometimes not being loaded, e.g. http://ci.ros2.org/view/nightly/job/nightly_osx_repeated/883/testReport/junit/(root)/projectroot/test_composition__rmw_connext_cpp/

@dhood dhood added the in review Waiting for review (Kanban column) label Nov 3, 2017
@dhood dhood self-assigned this Nov 3, 2017
@wjwwood (Member) left a comment:

Seems like the best solution.

}
fprintf(stderr, "service not available, waiting again...\n");
fprintf(stderr, "service not available after waiting.\n");
return false;
Member:

I know you didn't write this function, but why does it return a bool? It looks like the timer callback is expected to return void: https://github.com/ros2/rclcpp/blob/989084b3de84000a15fb4bd5a7d63929001cd769/rclcpp/include/rclcpp/timer.hpp#L145-L168

dhood (Member Author):

Not sure, but I'll update it to return void.

dhood (Member Author):

Actually, for my understanding: is it an issue if the function returns bool? Even if the return code is just ignored when called from the timer, it might provide useful info in other contexts. What if instead we renamed the function to e.g. call_service, which in this demo would just be called by the timer, but could also be called manually where appropriate?

Member:

It's not a problem in itself, but it could be an issue if the code assumes the return value will be acted on, e.g. if it is written such that returning false should stop subsequent timer callbacks. That would be misleading to read, since the return value is actually ignored.

dhood (Member Author):

OK, I've made it return void to remove any chance of readers misinterpreting the return value as being considered by the timer: d8c8b40

@dirk-thomas (Member):

From reading the patch, removing the loop within the timer callback makes absolute sense. The client should check whether the service is there and, if not, return from the callback so as not to block a worker thread.

I don't see why the timing of the timer should be changed, though. Shouldn't it work just the same if the period stays at 1s (or even if it were 0.5s)?

(Anyway, it would be good to move the request creation further down, so that the request isn't created before the callback returns early when the service isn't ready yet.)

@dhood (Member Author) commented Nov 7, 2017

> I don't see why timing of the timer should be changed? Shouldn't it work just the same even if the time stays at 1s (or even if it would be 0.5s)?

Not if the API composition demo runs with a single-threaded executor: by the time the 1s wait_for_service has finished, another timer callback is already queued, and this goes on forever, so the demo never gets to respond to requests to load components. This makes intuitive sense to me (with a single-threaded executor); what behaviour were you expecting?

@dhood dhood added in progress Actively being worked on (Kanban column) and removed in review Waiting for review (Kanban column) labels Nov 7, 2017
@dirk-thomas (Member):

> Not if the API composition demo runs with a single threaded executor; by the time wait_for_service of 1s has finished another timer callback is queued, and this process goes on forever, so it won't respond to requests to load components. This makes intuitive sense to me (with a single threaded executor), what is the behaviour that you were expecting?

In an "ideal" world, if the request to load a component arrives while a component is handling its timer event, I would expect that after the timer event handler finishes, the request is handled before the component handles the next timer event (since that timer event "arrived" after the load request).

If the behavior is as you describe, it sounds like our implementation favors a specific kind of event over others. In the extreme case that means we can't guarantee that a given event is ever handled, since there could always be "something else" being handled before it. This is a classic starvation problem when synchronizing access to resources.

I would argue that this should not be worked around in user-land code. A single-threaded executor is logically perfectly able to handle this case. Instead we might want to investigate the implementation details and how we can avoid this starvation in the first place.

  {
    client_ = create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");
-   timer_ = create_wall_timer(1s, std::bind(&Client::on_timer, this));
+   timer_ = create_wall_timer(2s, std::bind(&Client::on_timer, this));
Member:

Can you add a comment here describing why the timer must have a larger period than the wait_for_service call below?

dhood (Member Author):

Yep: 894d918

@dhood (Member Author) commented Nov 8, 2017

Style-wise we agree that the client should not repeatedly call wait_for_service.

@dirk-thomas' question about the need for the client's timer duration to be higher than the wait_for_service duration has revealed a limitation of the single-threaded executor: ros2/rclcpp#392

While investigating that, I realised that the regression test only consistently reproduces the issue if the server component is started after the timer starts, so I added an option to api_composition_cli to delay its start.

Standard CI (only composition package):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

CI highlighting that the regression test fails if the timer period is less than the wait_for_service duration (regression test added, callback returning if the wait_for_service call fails, but timer set at 1s instead of 2s):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status (failed 4 times)
  • Windows Build Status

@dhood dhood added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Nov 8, 2017
@dirk-thomas (Member) left a comment:

LGTM

@dhood dhood merged commit 190d401 into master Nov 8, 2017
@dhood dhood deleted the client_dont_retry branch November 8, 2017 20:38
@dhood dhood removed the in review Waiting for review (Kanban column) label Nov 8, 2017