
Stop client component from hogging executor#187

Merged
dhood merged 8 commits into master from client_dont_retry on Nov 8, 2017

Conversation

@dhood (Member) commented Nov 3, 2017

Using the API composition demo, if the client component is started before the server, then a request to load the server component cannot be processed because the client is hogging the executor waiting for the service.

To reproduce:

ros2 run composition api_composition

In another terminal:

ros2 run composition api_composition_cli -- composition composition::Client
Sending request...
Waiting for response...
Result of load_node: success = true
ros2 run composition api_composition_cli -- composition composition::Server
Sending request...
Waiting for response...
<never loads>
  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

An alternative fix is to make the API composition demo use a multithreaded executor. That change may independently make sense, but I think the client component should be more co-operative regardless.

Rather than change the existing test, I added a new regression test with the client being started first. However, this PR should also address flakiness of the existing test, if the client waiting was the cause of the server sometimes not being loaded, e.g. http://ci.ros2.org/view/nightly/job/nightly_osx_repeated/883/testReport/junit/(root)/projectroot/test_composition__rmw_connext_cpp/

@dhood dhood added the in review Waiting for review (Kanban column) label Nov 3, 2017
@dhood dhood self-assigned this Nov 3, 2017
@wjwwood (Member) left a comment:

Seems like the best solution.

}
fprintf(stderr, "service not available, waiting again...\n");
fprintf(stderr, "service not available after waiting.\n");
return false;
Member:

I know you didn't write this function, but why does it return a bool? It looks like the timer callback is expected to return void: https://github.com/ros2/rclcpp/blob/989084b3de84000a15fb4bd5a7d63929001cd769/rclcpp/include/rclcpp/timer.hpp#L145-L168

dhood (Member Author):

Not sure, but I'll update it to return void.

dhood (Member Author):

Actually, for my understanding: is it an issue if the function returns bool? Even if the return code is just ignored when called from the timer, it might provide useful info in other contexts. What if instead we renamed the function to e.g. call_service, which in this demo would just be called by the timer, but could also be called manually where appropriate?

Member:

It's not a problem in itself, but it could be an issue if the code assumes the return value will be acted on, e.g. if it is written such that returning false should stop subsequent timer callbacks. That would be misleading to read, since the return value is actually ignored.

dhood (Member Author):

OK, I've made it return void to remove any chance of readers misinterpreting the return value as being considered by the timer: d8c8b40

@dirk-thomas (Member):

From reading the patch, removing the loop within the timer callback makes absolute sense. The client should check whether the service is there and, if not, return from the callback so as not to block a worker thread.

I don't see why the timing of the timer should be changed, though. Shouldn't it work just the same if the period stays at 1s (or even if it were 0.5s)?

(Anyway, it would be good to move the request creation further down, so that the request isn't created before the callback returns early when the service isn't ready yet.)

@dhood (Member Author) commented Nov 7, 2017

> I don't see why timing of the timer should be changed? Shouldn't it work just the same even if the time stays at 1s (or even if it would be 0.5s)?

Not if the API composition demo runs with a single-threaded executor: by the time the 1s wait_for_service has finished, another timer callback is already queued, and this goes on forever, so the demo never gets to respond to requests to load components. This makes intuitive sense to me (with a single-threaded executor); what behaviour were you expecting?

@dhood dhood added in progress Actively being worked on (Kanban column) and removed in review Waiting for review (Kanban column) labels Nov 7, 2017
@dirk-thomas (Member):

> Not if the API composition demo runs with a single threaded executor; by the time wait_for_service of 1s has finished another timer callback is queued, and this process goes on forever, so it won't respond to requests to load components. This makes intuitive sense to me (with a single threaded executor), what is the behaviour that you were expecting?

In an "ideal" world, if the request to load a component arrives while a component is handling its timer event, I would expect that after the timer event handler finishes, the request is handled before the component handles the next timer event (since that timer event "arrived" after the load request).

If the behavior is as you describe, it sounds like our implementation favors a specific kind of event over others. In the extreme case that means we can't guarantee that a given event is ever handled, since there could always be "something else" being handled before it. This is a classic starvation problem when synchronizing access to resources.

I would argue that this should not be worked around in user-land code. A single-threaded executor is logically perfectly able to handle this case. Instead we might want to investigate the implementation details and how we can avoid this starvation in the first place.

  {
    client_ = create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");
-   timer_ = create_wall_timer(1s, std::bind(&Client::on_timer, this));
+   timer_ = create_wall_timer(2s, std::bind(&Client::on_timer, this));
Member:

Can you add a comment here describing why the timer must have a larger period than the wait_for_service call below?

dhood (Member Author):

Yep: 894d918

@dhood (Member Author) commented Nov 8, 2017

Style-wise we agree that the client should not repeatedly call wait_for_service.

@dirk-thomas' question about the need for the client's timer duration to be higher than the wait_for_service duration has revealed a limitation of the single-threaded executor: ros2/rclcpp#392

While investigating that, I realised that the regression test only consistently reproduces the issue if the server component is started after the timer starts, so I added an option to api_composition_cli to delay its start.

Standard CI (only composition package):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

CI highlighting that the regression test fails if the timer period is less than the wait_for_service duration (regression test added, callback returning if the wait_for_service call fails, but timer set at 1s instead of 2s):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status (failed 4 times)
  • Windows Build Status

@dhood dhood added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Nov 8, 2017
@dirk-thomas (Member) left a comment:

LGTM

@dhood dhood merged commit 190d401 into master Nov 8, 2017
@dhood dhood deleted the client_dont_retry branch November 8, 2017 20:38
@dhood dhood removed the in review Waiting for review (Kanban column) label Nov 8, 2017