Fix deadlock between the flowcontroller thread and the matched_reader…#3468
Conversation
039ad57 to
f4d5887
Compare
a3b3e63 to
9013ce4
Compare
…_remove() function called from another thread Signed-off-by: Artem Shumov <agshumov@sberautotech.ru>
9013ce4 to
3073cc2
Compare
|
@richiprosima Please test this |
jsan-rt
left a comment
There was a problem hiding this comment.
Thanks for the PR @shumov-ag . Other than the comment, do you think you could add a test for this specific deadlock?
| endpoint_ = nullptr; | ||
| flush_and_reset(); |
There was a problem hiding this comment.
I think setting endpoint_ to nullptr before flush_and_reset makes flush behave as reset_to_header. Is this intentional?
There was a problem hiding this comment.
Hi. I used flush_and_reset, because in general, the behavior is different, at least due to this string.
There was a problem hiding this comment.
do you think you could add a test for this specific deadlock?
I think I can try to do that
There was a problem hiding this comment.
Is this intentional?
Yes. Since endpoint_ is nullptr, we only need to do complete reset.
|
We should make decision here on whether to go ahead without the specific test or not |
|
Hey @Mario-DL . Currently I have no idea how to make this integration test. It seems quite hard to do |
|
I've postponed the merging of this pull request until the next milestone. I don't feel confident merging it as-is without having a regression test. We'll see if we can come up with one in the coming weeks. Thanks again @shumov-ag for this PR. |
|
@shumov-ag Our thread sanitizer job is not reporting the issue you are trying to fix here. Could you check whether it is still an issue with release v2.11.2? @richiware This involves changes in the Flow Controllers implementation. Could you take a look? |
|
@MiguelCompany Hi. Unfortunately, now I don't have the opportunity to reproduce this problem in the environment in which I caught it. This problem occurred several times on an high loaded system after many hours of work, and as a result, I had a similar picture in core dumps. In my opinion the analysis of the core dumps clearly shows which mutexes are blocked in which threads. But I haven't been yet able to write a synthetic test that would reproduce it. |
|
I'm closing this for now |
…_remove() function called from another thread
This PR fixes a deadlock between the function FlowControllerImpl::run() working in one thread and the function StatefulWriter::matched_reader_remove() called from another thread.
1st thread locks:
Fast-DDS/src/cpp/rtps/flowcontrol/FlowControllerImpl.hpp
Line 1268 in 14de049
Fast-DDS/src/cpp/rtps/messages/RTPSMessageGroup.cpp
Line 277 in 14de049
2nd thread locks:
Fast-DDS/src/cpp/rtps/writer/StatefulWriter.cpp
Line 1226 in 14de049
Fast-DDS/src/cpp/rtps/flowcontrol/FlowControllerImpl.hpp
Line 1222 in 14de049
Here is more info from gdb:
gdb: thread 9 (LWP 322960) try to lock a mutex owned by 322965
gdb: thread 14 (LWP 322965) try to lock a mutex owned by 322960
gdb: info threads
Description
Contributor Checklist
versions.mdfile (if applicable).Reviewer Checklist