Skip to content

Fix deadlock between the flowcontroller thread and the matched_reader…#3468

Closed
shumov-ag wants to merge 1 commit intoeProsima:masterfrom
shumov-ag:fix-deadlock-between-flowcontroller-and-matched_reader_remove
Closed

Fix deadlock between the flowcontroller thread and the matched_reader…#3468
shumov-ag wants to merge 1 commit intoeProsima:masterfrom
shumov-ag:fix-deadlock-between-flowcontroller-and-matched_reader_remove

Conversation

@shumov-ag
Copy link
Copy Markdown
Contributor

…_remove() function called from another thread

This PR fixes a deadlock between the function FlowControllerImpl::run() working in one thread and the function StatefulWriter::matched_reader_remove() called from another thread.

1st thread locks:

std::unique_lock<std::mutex> lock(mutex_);

std::lock_guard<RTPSMessageSenderInterface> lock(*sender_);

2nd thread locks:

std::unique_lock<LocatorSelectorSender> guard_locator_selector_async(locator_selector_async_);

std::unique_lock<std::mutex> lock(mutex_);

Here is more info from gdb:

gdb: thread 9 (LWP 322960) try to lock a mutex owned by 322965
(gdb) thread 9
[Switching to thread 9 (Thread 0x7fa6249d3700 (LWP 322960))]
#0  0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa6291c40a3 in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa62861809c in __gthread_mutex_lock (__mutex=0x562012448608) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3  std::mutex::lock (this=0x562012448608) at /usr/include/c++/9/bits/std_mutex.h:100
#4  std::unique_lock<std::mutex>::lock (this=0x7fa6249d1200, this=0x7fa6249d1200) at /usr/include/c++/9/bits/unique_lock.h:141
#5  std::unique_lock<std::mutex>::unique_lock (__m=..., this=0x7fa6249d1200) at /usr/include/c++/9/bits/unique_lock.h:71
#6  eprosima::fastdds::rtps::FlowControllerImpl<eprosima::fastdds::rtps::FlowControllerSyncPublishMode, eprosima::fastdds::rtps::FlowControllerFifoSchedule>::remove_change_impl<eprosima::fastdds::rtps::FlowControllerSyncPublishMode> (change=0x5620127f9430, this=0x562012448600) at ./src/cpp/rtps/flowcontrol/FlowControllerImpl.hpp:1231
#7  eprosima::fastdds::rtps::FlowControllerImpl<eprosima::fastdds::rtps::FlowControllerSyncPublishMode, eprosima::fastdds::rtps::FlowControllerFifoSchedule>::remove_change (this=0x562012448600, 
    change=0x5620127f9430) at ./src/cpp/rtps/flowcontrol/FlowControllerImpl.hpp:1024
#8  0x00007fa6283a4363 in eprosima::fastrtps::rtps::StatefulWriter::change_removed_by_history (this=0x5620124a04d0, a_change=0x5620127f9430) at ./src/cpp/rtps/writer/StatefulWriter.cpp:534
#9  0x00007fa6283bf649 in eprosima::fastrtps::rtps::WriterHistory::remove_change_nts (this=0x5620124a0330, removal=0x5620127f9430, release=<optimized out>) at ./src/cpp/rtps/history/WriterHistory.cpp:185
#10 0x00007fa6283bdec3 in eprosima::fastrtps::rtps::History::remove_change (this=0x5620124a0330, ch=0x5620127f9430) at ./src/cpp/rtps/history/History.cpp:111
#11 0x00007fa6283a4cc8 in eprosima::fastrtps::rtps::StatefulWriter::check_acked_status (this=0x5620124a04d0) at ./include/fastdds/rtps/common/SequenceNumber.h:172
#12 0x00007fa6283a712e in eprosima::fastrtps::rtps::StatefulWriter::matched_reader_remove (this=0x5620124a04d0, reader_guid=...) at ./src/cpp/rtps/writer/StatefulWriter.cpp:1266
#13 0x00007fa6285de436 in eprosima::fastrtps::rtps::EDPSimple::removeRemoteEndpoints (this=0x562012495c20, pdata=<optimized out>) at ./src/cpp/rtps/builtin/discovery/endpoint/EDPSimple.cpp:943
#14 0x00007fa6285c2af2 in eprosima::fastrtps::rtps::PDP::remove_remote_participant (this=0x562012489560, partGUID=..., reason=eprosima::fastrtps::rtps::ParticipantDiscoveryInfo::REMOVED_PARTICIPANT)
    at ./include/fastdds/rtps/reader/ReaderDiscoveryInfo.h:56
#15 0x00007fa6285cf232 in eprosima::fastrtps::rtps::PDPListener::onNewCacheChangeAdded (this=0x5620123ffd50, reader=0x5620124925d0, change_in=0x562012493f50)
    at ./src/cpp/rtps/builtin/discovery/participant/PDPListener.cpp:202
#16 0x00007fa6283e1928 in eprosima::fastrtps::rtps::StatelessReader::change_received (this=this@entry=0x5620124925d0, change=0x562012493f50) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:779
#17 0x00007fa6283e688f in eprosima::fastrtps::rtps::StatelessReader::processDataMsg (this=0x5620124925d0, change=0x7fa6249d1e20) at ./src/cpp/rtps/reader/StatelessReader.cpp:516
#18 0x00007fa6283f8108 in eprosima::fastrtps::rtps::MessageReceiver::<lambda(eprosima::fastrtps::rtps::RTPSReader*)>::operator() (__closure=<synthetic pointer>, reader=<optimized out>)
    at ./src/cpp/rtps/messages/MessageReceiver.cpp:200
#19 eprosima::fastrtps::rtps::MessageReceiver::findAllReaders<eprosima::fastrtps::rtps::MessageReceiver::process_data_message_without_security(const eprosima::fastrtps::rtps::EntityId_t&, eprosima::fastrtps::rtps::CacheChange_t&)::<lambda(eprosima::fastrtps::rtps::RTPSReader*)> > (callback=<synthetic pointer>..., readerID=..., this=<optimized out>) at ./src/cpp/rtps/messages/MessageReceiver.cpp:664
#20 eprosima::fastrtps::rtps::MessageReceiver::process_data_message_without_security (this=<optimized out>, reader_id=..., change=...) at ./src/cpp/rtps/messages/MessageReceiver.cpp:203
#21 0x00007fa6283fde38 in std::function<void (eprosima::fastrtps::rtps::EntityId_t const&, eprosima::fastrtps::rtps::CacheChange_t&)>::operator()(eprosima::fastrtps::rtps::EntityId_t const&, eprosima::fastrtps::rtps::CacheChange_t&) const (__args#1=..., __args#0=..., this=0x562012407ab0) at /usr/include/c++/9/bits/std_function.h:683
#22 eprosima::fastrtps::rtps::MessageReceiver::proc_Submsg_Data (this=0x5620124079c0, msg=0x7fa6249d2370, smh=<optimized out>) at ./src/cpp/rtps/messages/MessageReceiver.cpp:837
#23 0x00007fa628400986 in eprosima::fastrtps::rtps::MessageReceiver::processCDRMsg (this=0x5620124079c0, source_locator=..., reception_locator=..., msg=msg@entry=0x7fa6249d2370)
    at ./src/cpp/rtps/messages/MessageReceiver.cpp:416
#24 0x00007fa62840408a in eprosima::fastrtps::rtps::ReceiverResource::OnDataReceived (this=0x5620123f9230, data=0x562012417f30 "RTPS\002\002\001\017\001\017W\202ڣ*\022\001", size=116, localLocator=..., 
    remoteLocator=...) at ./src/cpp/rtps/network/ReceiverResource.cpp:121
#25 0x00007fa6284c4b38 in eprosima::fastdds::rtps::UDPChannelResource::perform_listen_operation (this=0x562012407780, input_locator=...) at ./src/cpp/rtps/transport/UDPChannelResource.h:168
#26 0x00007fa6284c508a in std::__invoke_impl<void, void (eprosima::fastdds::rtps::UDPChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::UDPChannelResource*, eprosima::fastrtps::rtps::Locator_t> (__t=<optimized out>, __f=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:34
#27 std::__invoke<void (eprosima::fastdds::rtps::UDPChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::UDPChannelResource*, eprosima::fastrtps::rtps::Locator_t> (
    __fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#28 std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::UDPChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::UDPChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#29 std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::UDPChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::UDPChannelResource*, eprosima::fastrtps::rtps::Locator_t> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#30 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (eprosima::fastdds::rtps::UDPChannelResource::*)(eprosima::fastrtps::rtps::Locator_t), eprosima::fastdds::rtps::UDPChannelResource*, eprosima::fastrtps::rtps::Locator_t> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#31 0x00007fa629094de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#32 0x00007fa6291c1609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#33 0x00007fa628ece133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up 2
#2  0x00007fa62861809c in __gthread_mutex_lock (__mutex=0x562012448608) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
749	    return __gthrw_(pthread_mutex_lock) (__mutex);
(gdb) print __mutex.__data
$2 = {__lock = 2, __count = 0, __owner = 322965, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}
gdb: thread 14 (LWP 322965) try to lock a mutex owned by 322960
(gdb) thread 14
[Switching to thread 14 (Thread 0x7fa61dffb700 (LWP 322960))]
#0  0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa6291c4131 in pthread_mutex_lock () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa6283f1aa5 in __gthread_mutex_lock (__mutex=0x5620124a0b20) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
#3  __gthread_recursive_mutex_lock (__mutex=0x5620124a0b20) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811
#4  std::recursive_mutex::lock (this=0x5620124a0b20) at /usr/include/c++/9/mutex:106
#5  eprosima::fastrtps::rtps::LocatorSelectorSender::lock (this=0x5620124a09f8) at ./include/fastdds/rtps/writer/LocatorSelectorSender.hpp:87
#6  std::lock_guard<eprosima::fastrtps::rtps::RTPSMessageSenderInterface>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/9/bits/std_mutex.h:159
#7  eprosima::fastrtps::rtps::RTPSMessageGroup::send (this=0x562012448d30) at ./src/cpp/rtps/messages/RTPSMessageGroup.cpp:277
#8  0x00007fa6283f1b7d in eprosima::fastrtps::rtps::RTPSMessageGroup::flush (this=this@entry=0x562012448d30) at ./src/cpp/rtps/messages/RTPSMessageGroup.cpp:263
#9  0x00007fa6283f1c4d in eprosima::fastrtps::rtps::RTPSMessageGroup::flush_and_reset (this=this@entry=0x562012448d30) at ./src/cpp/rtps/messages/RTPSMessageGroup.cpp:316
#10 0x00007fa62861fbb7 in eprosima::fastrtps::rtps::RTPSMessageGroup::sender (msg_sender=0x0, endpoint=0x0, this=0x562012448d30) at ./include/fastdds/rtps/messages/RTPSMessageGroup.h:225
#11 eprosima::fastdds::rtps::FlowControllerImpl<eprosima::fastdds::rtps::FlowControllerSyncPublishMode, eprosima::fastdds::rtps::FlowControllerFifoSchedule>::run (this=0x562012448600)
    at ./src/cpp/rtps/flowcontrol/FlowControllerImpl.hpp:1381
#12 0x00007fa629094de4 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007fa6291c1609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007fa628ece133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up 2
#2  0x00007fa6283f1aa5 in __gthread_mutex_lock (__mutex=0x5620124a0b20) at /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749
749	    return __gthrw_(pthread_mutex_lock) (__mutex);
(gdb) print __mutex.__data
$3 = {__lock = 2, __count = 1, __owner = 322960, __nusers = 1, __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}
gdb: info threads
(gdb) info threads
  Id   Target Id                          Frame 
* 1    Thread 0x7fa611894700 (LWP 322985) 0x00007fa6291c8376 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7fa6289f9f80 (LWP 322890) 0x00007fa6291c87d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7fa6279d9700 (LWP 322940) 0x00007fa6291c87d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7fa6271d8700 (LWP 322941) 0x00007fa6291cb454 in do_futex_wait.constprop () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7fa6269d7700 (LWP 322943) 0x00007fa628ece46e in epoll_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7fa6261d6700 (LWP 322944) 0x00007fa6291c87d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  7    Thread 0x7fa6259d5700 (LWP 322957) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  8    Thread 0x7fa6251d4700 (LWP 322959) 0x00007fa6291cd0ed in recvmsg () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  9    Thread 0x7fa6249d3700 (LWP 322960) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7fa61ffff700 (LWP 322961) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7fa61f7fe700 (LWP 322962) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  12   Thread 0x7fa61effd700 (LWP 322963) 0x00007fa6291cd0ed in recvmsg () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  13   Thread 0x7fa61e7fc700 (LWP 322964) 0x00007fa6291cb678 in do_futex_wait.constprop () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  14   Thread 0x7fa61dffb700 (LWP 322965) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  15   Thread 0x7fa61d7fa700 (LWP 322966) 0x00007fa6291c8376 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  16   Thread 0x7fa61cff9700 (LWP 322983) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
  17   Thread 0x7fa612095700 (LWP 322984) 0x00007fa6291cc170 in __lll_lock_wait () from /usr/lib/x86_64-linux-gnu/libpthread.so.0

Description

Contributor Checklist

  • Commit messages follow the project guidelines.
  • The code follows the style guidelines of this project.
  • Tests that thoroughly check the new feature have been added/Regression tests checking the bug and its fix have been added; the added tests pass locally
  • Any new/modified methods have been properly documented using Doxygen.
  • Changes are ABI compatible.
  • Changes are API compatible.
  • New feature has been added to the versions.md file (if applicable).
  • New feature has been documented/Current behavior is correctly described in the documentation.
  • Applicable backports have been included in the description.

Reviewer Checklist

  • The PR has a milestone assigned.
  • Check contributor checklist is correct.
  • Check CI results: changes do not issue any warning.
  • Check CI results: failing tests are unrelated with the changes.

@shumov-ag shumov-ag force-pushed the fix-deadlock-between-flowcontroller-and-matched_reader_remove branch 2 times, most recently from 039ad57 to f4d5887 Compare April 17, 2023 13:53
@shumov-ag shumov-ag force-pushed the fix-deadlock-between-flowcontroller-and-matched_reader_remove branch 2 times, most recently from a3b3e63 to 9013ce4 Compare April 27, 2023 19:52
…_remove() function called from another thread

Signed-off-by: Artem Shumov <agshumov@sberautotech.ru>
@shumov-ag shumov-ag force-pushed the fix-deadlock-between-flowcontroller-and-matched_reader_remove branch from 9013ce4 to 3073cc2 Compare April 27, 2023 19:55
@MiguelCompany
Copy link
Copy Markdown
Member

@richiprosima Please test this

@JLBuenoLopez JLBuenoLopez added this to the v2.11.0 milestone May 8, 2023
Copy link
Copy Markdown
Contributor

@jsan-rt jsan-rt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @shumov-ag . Other than the comment, do you think you could add a test for this specific deadlock?

Comment on lines +262 to +263
endpoint_ = nullptr;
flush_and_reset();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think setting endpoint_ to nullptr before flush_and_reset makes flush behave as reset_to_header. Is this intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. I used flush_and_reset, because in general, the behavior is different, at least due to this string.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think you could add a test for this specific deadlock?

I think I can try to do that

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentional?

Yes. Since endpoint_ is nullptr, we only need to do complete reset.

@Mario-DL
Copy link
Copy Markdown
Contributor

We should make decision here on whether to go ahead without the specific test or not

@shumov-ag
Copy link
Copy Markdown
Contributor Author

Hey @Mario-DL . Currently I have no idea how to make this integration test. It seems quite hard to do

@jsan-rt jsan-rt modified the milestones: v2.11.0, v2.11.1 Jun 21, 2023
@jsan-rt
Copy link
Copy Markdown
Contributor

jsan-rt commented Jun 21, 2023

I've postponed the merging of this pull request until the next milestone. I don't feel confident merging it as-is without having a regression test. We'll see if we can come up with one in the coming weeks.

Thanks again @shumov-ag for this PR.

@MiguelCompany MiguelCompany modified the milestones: v2.11.1, v2.11.2 Jul 6, 2023
@MiguelCompany MiguelCompany modified the milestones: v2.11.2, v2.12.0 Aug 7, 2023
@MiguelCompany MiguelCompany modified the milestones: v2.12.0, v2.12.1 Sep 19, 2023
@MiguelCompany
Copy link
Copy Markdown
Member

@shumov-ag Our thread sanitizer job is not reporting the issue you are trying to fix here. Could you check whether it is still an issue with release v2.11.2?

@richiware This involves changes in the Flow Controllers implementation. Could you take a look?

@shumov-ag
Copy link
Copy Markdown
Contributor Author

@MiguelCompany Hi. Unfortunately, now I don't have the opportunity to reproduce this problem in the environment in which I caught it. This problem occurred several times on an high loaded system after many hours of work, and as a result, I had a similar picture in core dumps. In my opinion the analysis of the core dumps clearly shows which mutexes are blocked in which threads. But I haven't been yet able to write a synthetic test that would reproduce it.

@EduPonz EduPonz modified the milestones: v2.12.1, v2.13.0 Nov 8, 2023
@EduPonz
Copy link
Copy Markdown

EduPonz commented Nov 23, 2023

I'm closing this for now

@EduPonz EduPonz closed this Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants