This repository was archived by the owner on May 31, 2025. It is now read-only.

Fix for deadlock issue 1980 #2121

Merged
jacobperron merged 2 commits into ros:noetic-devel from
iwanders:CORE-18232-deadlock-fix
Mar 11, 2021


@iwanders
Contributor

We encountered the issue sketched in #1980 in a production system last week.

To understand the issue, I've made a minimal non-working example here. Looking at the source code, the callback queue appears to be at fault rather than the timer code: removeByID deadlocks when a callback with that ID is currently being executed.

The first commit introduces a unit test that demonstrates this deadlock.

The second commit is the proposed fix. removeByID now tries to obtain the lock; if it fails, removal is deferred until the next callback queue cycle using the already existing marked_for_removal boolean.
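
For readers who want to see the try-lock-then-defer idea in isolation, here is a minimal sketch. It is not the roscpp code: TinyQueue, Entry, callAvailable's body, and demoRemoveWhileExecuting are hypothetical names invented for this illustration, and std::mutex stands in for whatever lock the real callback queue uses. Only the names removeByID and marked_for_removal are taken from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a callback queue; NOT the roscpp CallbackQueue.
class TinyQueue {
public:
  void add(std::uint64_t id, std::function<void()> fn) {
    auto entry = std::make_shared<Entry>();
    entry->fn = std::move(fn);
    std::lock_guard<std::mutex> lock(mutex_);
    entries_[id] = entry;
  }

  // Returns true if the entry was erased immediately, false if the id is
  // unknown or removal was deferred because the callback is executing.
  bool removeByID(std::uint64_t id) {
    std::shared_ptr<Entry> entry;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = entries_.find(id);
      if (it == entries_.end()) return false;
      entry = it->second;
    }
    // Never call the callback again, whichever path is taken below.
    entry->marked_for_removal = true;
    // Try to take the per-callback lock instead of blocking on it.
    std::unique_lock<std::mutex> calling(entry->calling_mutex, std::try_to_lock);
    if (!calling.owns_lock()) return false;  // busy: defer to the next cycle
    std::lock_guard<std::mutex> lock(mutex_);
    entries_.erase(id);
    return true;
  }

  void callAvailable() {
    std::map<std::uint64_t, std::shared_ptr<Entry>> snapshot;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      // Honour removals that were deferred during an earlier cycle.
      for (auto it = entries_.begin(); it != entries_.end();) {
        if (it->second->marked_for_removal) it = entries_.erase(it);
        else ++it;
      }
      snapshot = entries_;
    }
    for (auto& kv : snapshot) {
      // Held for the duration of the callback, so removeByID can detect it.
      std::lock_guard<std::mutex> calling(kv.second->calling_mutex);
      if (!kv.second->marked_for_removal) kv.second->fn();
    }
  }

  std::size_t size() {
    std::lock_guard<std::mutex> lock(mutex_);
    return entries_.size();
  }

private:
  struct Entry {
    std::function<void()> fn;
    std::mutex calling_mutex;
    std::atomic<bool> marked_for_removal{false};
  };
  std::mutex mutex_;
  std::map<std::uint64_t, std::shared_ptr<Entry>> entries_;
};

// Removes callback 1 from another thread while it is executing.
bool demoRemoveWhileExecuting() {
  TinyQueue queue;
  std::promise<void> started, release;
  std::future<void> started_f = started.get_future();
  std::future<void> release_f = release.get_future();
  int calls = 0;
  queue.add(1, [&] { ++calls; started.set_value(); release_f.wait(); });

  std::thread worker([&] { queue.callAvailable(); });
  started_f.wait();                              // callback 1 is now running
  const bool removed_now = queue.removeByID(1);  // returns instead of blocking
  release.set_value();                           // let the callback finish
  worker.join();

  queue.callAvailable();                         // next cycle drops the entry
  return !removed_now && calls == 1 && queue.size() == 0;
}
```

In this sketch, demoRemoveWhileExecuting() returns true: the removal request comes back right away with "deferred", the callback runs exactly once, and the entry disappears on the following cycle. A blocking lock in removeByID would instead wait on a callback that is itself waiting on the remover, which is the shape of deadlock described above.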

FYI @guillaumeautran , @mikepurvis , @jasonimercer

This unit test fails until we can remove callbacks that are being executed.
This fixes a potential deadlock when a timer is being removed that's also being executed.
Contributor

@fujitatomoya fujitatomoya left a comment


Makes sense to me. I also confirmed that this PR fixes the deadlock problem introduced here.

@iwanders
Contributor Author

iwanders commented Mar 8, 2021

@fujitatomoya, is there anything else I can do to help get this PR merged and the issue resolved?

@fujitatomoya
Contributor

@jacobperron @mjcarroll @sloretz friendly ping.

Contributor

@jacobperron jacobperron left a comment


The fix looks good to me, thanks!

Also, thank you for the reproduction and unit test 🙇

@jacobperron jacobperron merged commit 70d4b5b into ros:noetic-devel Mar 11, 2021
jacobperron pushed a commit that referenced this pull request Apr 6, 2021
* Add unit test for removing a callback that's being executed.

This unit test fails until we can remove callbacks that are being executed.

* Use the marked_for_removal flag if the callback is being executed.

This fixes a potential deadlock when a timer is being removed that's also being executed.

3 participants