Refactor launch service run_async loop to wait on futures and queued events#449
Refactor launch service run_async loop to wait on futures and queued events#449jacobperron merged 11 commits intomasterfrom
Conversation
Fixes ros2/launch_ros#169 Otherwise, it's possible to run into a race condition between checking for 'idle' and processing events from the queue. I.e. it's possible that an event is completed between executing the following two lines: https://github.com/ros2/launch/blob/657745c315ba6a61984b66f168f9c34b3a7e2108/launch/launch/launch_service.py#L158 https://github.com/ros2/launch/blob/657745c315ba6a61984b66f168f9c34b3a7e2108/launch/launch/launch_service.py#L272 To fix the race, this changes makes all accesses to the event queue thread safe and introduces a helper function get_next_event() that first checks if the queue is empty before potentially blocking forever with asynchio.Queue.get(). Signed-off-by: Jacob Perron <jacob@openrobotics.org>
|
See ros2/launch_ros#169 for steps to reproduce the issue. |
ivanpauno
left a comment
There was a problem hiding this comment.
I guess this is fine ...
I'm not convinced if launch code should be thread safe or not (it currently isn't).
Maybe it should not and we should only rely on running things asynchronously in an event loop, IDK.
|
Actually, I'm not sure if this is the right way to handle the problem. You can currently emit an event thread safely if you want, e.g.: launch/launch/launch/launch_service.py Lines 115 to 118 in 657745c So, I'm not sure if the event Queue should be thread safe. |
|
Reading the code I'm not sure how the two lines you pointed out @jacobperron could ever be run concurrently, but maybe I'm not using my imagination properly :) |
Looking again, I might be wrong about the reason for the hang. It looks like launch/launch/launch/launch_service.py Line 332 in 40f3be3 Which means either the queue is not empty or there are still futures to wait on. But when we get to processing events (from the queue) the queue is empty and we block forever on this line: launch/launch/launch/launch_service.py Line 357 in 40f3be3 |
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
|
It looks like the future created here is never marked as done, even though logging suggests the function is run to completion: |
mmmm, so run_in_executor is using the default executor, which seems to actually be multi-threaded (reference needed here). |
Reference found: https://github.com/python/cpython/blob/374d998b507d34a6c0a3816a163926a8ba0c483f/Lib/asyncio/base_events.py#L808-L812. |
|
Yeah, the docs say the default executor must be a ThreadPoolExecutor. |
|
What I think it's happening is the following:
Easy hack: Using get with a timeout. |
I like looking at the code 😂 |
Unfortunately, |
|
That was my initial theory too but the shutdown event didn’t clear the hang. A better solution would be to have a noop event like “check for idle” or “future completed” that would cause get to return and would be posted after the future is completed. |
|
I think I have figured out a fix (it works well in my mind at least): The issue: _is_iddle is returning true if there is either a future available or an event has not be processed. But we are then blocking on getting a new event (see 1 and 2). We could rather do the following: If we always do that, not only after shutdown has happened, we should never run into the race (as we would check again |
|
@ivanpauno how do you wait on events in the queue and the incomplete futures at the same time? |
Using launch/launch/launch/launch_service.py Lines 352 to 356 in 657745c IIUC, that allows you to wait on a collection of awaitables at the same time, and it will return when one of them is completed. We should take care of not processing another event if |
|
Ok, so the set of things being waited on in each loop would be |
|
Sounds good to me. I'll give it a try. |
|
Almost seems to work; now I get stuck in the inner shutdown loop. Maybe I've hit a different bug related to shutdown, but I don't understand why there is a launch/launch/launch/launch_service.py Line 340 in 657745c What I see is this condition evaluates to launch/launch/launch/launch_service.py Line 337 in 657745c Then we call launch/launch/launch/launch_service.py Line 336 in 657745c Then we get stuck waiting to get something from the queue and repeatedly see this debug: launch/launch/launch/launch_service.py Line 359 in 657745c I guess this logic is broken (I don't know under what circumstance this is True): launch/launch/launch/launch_service.py Lines 344 to 346 in 657745c If I remove the |
That continue ensure that launch/launch/launch/launch_service.py Line 342 in 657745c At least I think that's the case. Maybe that's no longer the case. |
|
@jacobperron can you push that code somewhere? |
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
|
This solves my original issue, but now the |
|
The test is emitting an event from a thread and it gets get stuck here: Launch service call stack ends up here: launch/launch/launch/launch_service.py Lines 115 to 117 in 19e0d68 and waiting forever for the result launch/launch/launch/launch_service.py Lines 124 to 125 in 19e0d68 |
|
I guess the issue is that we need to |
|
@ivanpauno @wjwwood Okay, I think I've got it now (1118883). PTAL at the latest changes. |
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Signed-off-by: Jacob Perron <jacob@openrobotics.org>
We don't need to have an inner loop or timeout when waiting on futures. Signed-off-by: Jacob Perron <jacob@openrobotics.org>
ivanpauno
left a comment
There was a problem hiding this comment.
LGTM! (with full CI testing packages above launch)
I would like somebody else reviewing this before merging
|
Great finding! LGTM too with green CI. |
This prevents spurious wake-ups when we're waiting for an event to finish processing. Signed-off-by: Jacob Perron <jacob@openrobotics.org>
wjwwood
left a comment
There was a problem hiding this comment.
The new logic makes sense (once I read it without the diff), lgtm
|
The Rpr job is failing due to an infrastructure issue. Since the failure is unrelated to this change and CI is green, I'll go ahead and merge. Thanks for the reviews everyone! |
…events (#449) Fixes ros2/launch_ros#169 Otherwise, it's possible to get into a hung state where we wait for an event, even though there are no more events. This is because the check for an "idle" state evaluates to "True" as we wait for some futures to complete. By waiting for futures and events concurrently, we can avoid this problem. Further, we don't have to wait for an event if there's nothing in the queue. Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Wait until one event is processed or nothing done (timeout) Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Fix bugs Always create a task to wait on and cancel it if there is a timeout. Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Avoid canceling an event mid-processing Signed-off-by: Jacob Perron <jacob@openrobotics.org> * minor refactor Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Guard against leaving a task pending Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Further refactoring We don't need to have an inner loop or timeout when waiting on futures. Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Only wait on futures if there are no events in the queue This prevents spurious wake-ups when we're waiting for an event to finish processing. Signed-off-by: Jacob Perron <jacob@openrobotics.org>
|
I've decided not to backport this to Eloquent or Dashing since there's been a lot of changes since they were last touched. |
…events (#449) (#455) Fixes ros2/launch_ros#169 Otherwise, it's possible to get into a hung state where we wait for an event, even though there are no more events. This is because the check for an "idle" state evaluates to "True" as we wait for some futures to complete. By waiting for futures and events concurrently, we can avoid this problem. Further, we don't have to wait for an event if there's nothing in the queue. Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Wait until one event is processed or nothing done (timeout) Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Fix bugs Always create a task to wait on and cancel it if there is a timeout. Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Avoid canceling an event mid-processing Signed-off-by: Jacob Perron <jacob@openrobotics.org> * minor refactor Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Guard against leaving a task pending Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Further refactoring We don't need to have an inner loop or timeout when waiting on futures. Signed-off-by: Jacob Perron <jacob@openrobotics.org> * Only wait on futures if there are no events in the queue This prevents spurious wake-ups when we're waiting for an event to finish processing. Signed-off-by: Jacob Perron <jacob@openrobotics.org>
Fixes ros2/launch_ros#169
Otherwise, it's possible to run into a race condition between checking for 'idle' and processing events from the queue.
Edit:
As described in #449 (comment), we can not be idle because we're waiting for futures to complete, but if there are no more events in the queue then we end up blocking forever.
To resolve this corner-case that results in deadlock, this PR refactors the run_async loop such that we concurrently wait on futures and events in the queue.
cc/ @wjwwood