Stop client component from hogging executor#187
Conversation
wjwwood
left a comment
There was a problem hiding this comment.
Seems like the best solution.
composition/src/client_component.cpp
Outdated
| } | ||
| fprintf(stderr, "service not available, waiting again...\n"); | ||
| fprintf(stderr, "service not available after waiting.\n"); | ||
| return false; |
There was a problem hiding this comment.
I know you didn't write this function, but why does it return a bool? It looks like the timer callback is expected to return void: https://github.com/ros2/rclcpp/blob/989084b3de84000a15fb4bd5a7d63929001cd769/rclcpp/include/rclcpp/timer.hpp#L145-L168
There was a problem hiding this comment.
Not sure but I'll update it to be void.
There was a problem hiding this comment.
actually, for my understanding, is it an issue if the function returns bool? Even if the return code will just be ignored if called from the timer, it might provide useful info in other contexts. What if instead we rename the function to e.g. call_service, which in this demo would just be called by the timer, but it could be called manually if appropriate.
There was a problem hiding this comment.
It's not a problem, but it might be an issue if this code thinks that the return type will be acted on, e.g. if it is written such that returning false should stop subsequent timer callbacks, then it would be misleading to read, since the return value is being ignored.
There was a problem hiding this comment.
ok, I've made it return void to remove any chance of readers misinterpreting that the return type is considered by the timer: d8c8b40
|
From reading the patch removing the loop within the timer callback makes absolute sense. The client should check if the service is there and if not return from the callback in order to not block one worker thread. I don't see why timing of the timer should be changed? Shouldn't it work just the same even if the time stays at 1s (or even if it would be 0.5s)? (Anyway it would be good to move the |
Not if the API composition demo runs with a single threaded executor; by the time wait_for_service of 1s has finished another timer callback is queued, and this process goes on forever, so it won't respond to requests to load components. This makes intuitive sense to me (with a single threaded executor), what is the behaviour that you were expecting? |
In an "ideal" world if the request to load a component is arriving while one component is handling its timer event I would expect that after the timer event handler finished that the request is handled before the component can handle the next time event (since the timer event "arrived" after the load request). If the behavior is like you are describing it that sounds like our implementation favors a specific kind of event over other events. In the extreme case that means that we can't guarantee that a specific event is ever handled since there could always be "something else" which is being handled before. This is a classic starvation problem when synchronizing access to resources. I would argue that should not be worked around in the use land code. A single threaded executor is logically perfectly able to handle this case. Instead we might want to investigate the implementation details and how we can avoid this starvation in the first place. |
| { | ||
| client_ = create_client<example_interfaces::srv::AddTwoInts>("add_two_ints"); | ||
| timer_ = create_wall_timer(1s, std::bind(&Client::on_timer, this)); | ||
| timer_ = create_wall_timer(2s, std::bind(&Client::on_timer, this)); |
There was a problem hiding this comment.
Can you add a comment here describing why the timer must have a larger timeout than the wait for service call below.
|
Style-wise we agree that the client should not repeatedly call to wait_for_service. @dirk-thomas' question about the need for the client's timer duration to be higher than the wait_for_service duration has revealed a limitation of the single-threaded executor: ros2/rclcpp#392 While investigating that I realised that the regression test only consistently reproduces the issue if the server component is started after the timer starts, so I added an option to Standard CI (only CI highlighting that the regression test fails if the timer period is less than the wait_for_service duration. |
Using the API composition demo, if the client component is started before the server, then a request to load the server component cannot be processed because the client is hogging the executor waiting for the service.
To reproduce:
In another terminal:
An alternative fix is to make the API composition demo use a multithreaded executor. That change may independently make sense, but I think the client component should be more co-operative regardless.
Rather than change the existing test I added a new regression test with the client being started first. However, this PR should also address flakiness of the existing test if the client waiitng was the cause of the server not being loaded sometimes e.g. http://ci.ros2.org/view/nightly/job/nightly_osx_repeated/883/testReport/junit/(root)/projectroot/test_composition__rmw_connext_cpp/