@smithdh smithdh commented Nov 4, 2025

This reverts commit a85f1fd, in order to fix regression reported in #2629.

While investigating issue #2629 (in 5.9.0) by running more tests, I saw the following deadlock, which appears to be the cause of #2629. This would be a regression with respect to 5.8.4.

Thread 62 (Thread 0x7f5e2efde640 (LWP 3261741) "xrootd"):
#0  0x00007f5e468868ba in __futex_abstimed_wait_common () from target:/lib64/libc.so.6
#1  0x00007f5e46891d68 in __new_sem_wait_slow64.constprop.0 () from target:/lib64/libc.so.6
#2  0x00007f5e46f5648e in XrdSys::IOEvents::Poller::SendCmd(XrdSys::IOEvents::Poller::PipeData&) () from target:/lib64/libXrdUtils.so.3
#3  0x00007f5e46f5655e in XrdSys::IOEvents::PollE::Exclude(XrdSys::IOEvents::Channel*, bool&, bool) () from target:/lib64/libXrdUtils.so.3
#4  0x00007f5e46f55590 in XrdSys::IOEvents::Channel::Delete() () from target:/lib64/libXrdUtils.so.3
#5  0x00007f5e4410e948 in XrdCl::PollerBuiltIn::RemoveSocket(XrdCl::Socket*) () from target:/lib64/libXrdCl.so.3
#6  0x00007f5e44194694 in XrdCl::AsyncSocketHandler::Close() () from target:/lib64/libXrdCl.so.3
#7  0x00007f5e44115ced in XrdCl::Stream::ForceError(XrdCl::XRootDStatus, bool) () from target:/lib64/libXrdCl.so.3
#8  0x00007f5e441105b4 in XrdCl::Channel::ForceDisconnect(bool) () from target:/lib64/libXrdCl.so.3
#9  0x00007f5e4411206f in XrdCl::PostMaster::ForceDisconnect(XrdCl::URL const&, bool) () from target:/lib64/libXrdCl.so.3
#10 0x00007f5e44113775 in XrdCl::Stream::ForceDisconnectJob::Run(void*) () from target:/lib64/libXrdCl.so.3
#11 0x00007f5e4419dbbd in XrdCl::JobManager::RunJobs() () from target:/lib64/libXrdCl.so.3
#12 0x00007f5e4419dc0d in RunRunnerThread () from target:/lib64/libXrdCl.so.3
#13 0x00007f5e46889d22 in start_thread () from target:/lib64/libc.so.6
#14 0x00007f5e4690ed40 in clone3 () from target:/lib64/libc.so.6

and

Thread 32 (Thread 0x7f5e3b5fc640 (LWP 3261711) "xrootd"):
#0  0x00007f5e46886839 in __futex_abstimed_wait_common () from target:/lib64/libc.so.6
#1  0x00007f5e4688f84b in pthread_rwlock_rdlock@GLIBC_2.2.5 () from target:/lib64/libc.so.6
#2  0x00007f5e44111a17 in XrdCl::PostMaster::Send(XrdCl::URL const&, XrdCl::Message*, XrdCl::MsgHandler*, bool, long) () from target:/lib64/libXrdCl.so.3
#3  0x00007f5e44140a60 in XrdCl::XRootDMsgHandler::RetryAtServer(XrdCl::URL const&, XrdCl::RedirectEntry::Type) () from target:/lib64/libXrdCl.so.3
#4  0x00007f5e4413f509 in XrdCl::XRootDMsgHandler::HandleError(XrdCl::XRootDStatus) () from target:/lib64/libXrdCl.so.3
#5  0x00007f5e4413f786 in XrdCl::XRootDMsgHandler::OnStreamEvent(XrdCl::MsgHandler::StreamEvent, XrdCl::XRootDStatus) () from target:/lib64/libXrdCl.so.3
#6  0x00007f5e44125aee in XrdCl::InQueue::ReportStreamEvent(XrdCl::MsgHandler::StreamEvent, XrdCl::XRootDStatus) () from target:/lib64/libXrdCl.so.3
#7  0x00007f5e4411ad84 in XrdCl::Stream::OnError(unsigned short, XrdCl::XRootDStatus) () from target:/lib64/libXrdCl.so.3
#8  0x00007f5e4411b4c3 in XrdCl::Stream::OnReadTimeout(unsigned short) () from target:/lib64/libXrdCl.so.3
#9  0x00007f5e441a2b9b in XrdCl::AsyncSocketHandler::Event(unsigned char, XrdCl::Socket*) () from target:/lib64/libXrdCl.so.3
#10 0x00007f5e4410f3aa in (anonymous namespace)::SocketCallBack::Event(XrdSys::IOEvents::Channel*, void*, int) () from target:/lib64/libXrdCl.so.3
#11 0x00007f5e46f570dc in XrdSys::IOEvents::Poller::CbkXeq(XrdSys::IOEvents::Channel*, int, int, char const*) () from target:/lib64/libXrdUtils.so.3
#12 0x00007f5e46f576b6 in XrdSys::IOEvents::Poller::CbkTMO() () from target:/lib64/libXrdUtils.so.3
#13 0x00007f5e46f579b5 in XrdSys::IOEvents::Poller::TmoGet() () from target:/lib64/libXrdUtils.so.3
#14 0x00007f5e46f584da in XrdSys::IOEvents::PollE::Begin(XrdSysSemaphore*, int&, char const**) () from target:/lib64/libXrdUtils.so.3
#15 0x00007f5e46f543f1 in XrdSys::IOEvents::BootStrap::Start(void*) () from target:/lib64/libXrdUtils.so.3
#16 0x00007f5e46f65a8c in XrdSysThread_Xeq () from target:/lib64/libXrdUtils.so.3
#17 0x00007f5e46889d22 in start_thread () from target:/lib64/libc.so.6
#18 0x00007f5e4690ed40 in clone3 () from target:/lib64/libc.so.6

The same IOEvents::Channel is involved in both threads, but a different XrdCl::Stream in each. (This is a regression because a85f1fd made the call to ForceDisconnect asynchronous with respect to the IOEvents::Poller callback.)
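
For readers less familiar with the code, the cyclic wait suggested by the two backtraces can be sketched in isolation. This is a minimal illustration, not XRootD code: the function and variable names are hypothetical, and it assumes (the traces do not show lock ownership directly) that the job thread holds, in exclusive mode, the same rwlock that PostMaster::Send tries to take in shared mode. Timeouts replace the indefinite waits so the sketch terminates instead of hanging.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical stand-alone model of the cyclic wait; none of these names
// exist in XRootD. Returns true when both threads time out waiting on each
// other, i.e. when the real code would have deadlocked.
bool SimulateCyclicWait(std::chrono::milliseconds timeout)
{
  std::shared_timed_mutex registryLock;  // models the rwlock seen in PostMaster::Send
  std::promise<void> jobHoldsLock;       // test plumbing: job thread owns the lock
  std::promise<void> commandAck;         // models the semaphore waited on in SendCmd
  auto holdsFuture = jobHoldsLock.get_future();
  auto ackFuture   = commandAck.get_future();

  bool jobTimedOut    = false;
  bool pollerTimedOut = false;

  // Models Thread 62: the job thread takes the lock exclusively (assumed),
  // then blocks until the poller thread acknowledges its command.
  std::thread job([&] {
    std::unique_lock<std::shared_timed_mutex> exclusive(registryLock);
    jobHoldsLock.set_value();
    // Wait much longer than the poller does, so the lock stays held for
    // the whole duration of the poller's attempt.
    jobTimedOut = ackFuture.wait_for(4 * timeout) != std::future_status::ready;
  });

  // Models Thread 32: the poller thread is inside a callback that needs the
  // shared lock. It can only acknowledge commands after the callback returns.
  std::thread poller([&] {
    holdsFuture.wait();
    std::shared_lock<std::shared_timed_mutex> shared(registryLock, std::defer_lock);
    pollerTimedOut = !shared.try_lock_for(timeout);
    if (!pollerTimedOut)
      commandAck.set_value();  // never reached while the job thread holds the lock
  });

  job.join();
  poller.join();
  return jobTimedOut && pollerTimedOut;
}
```

Calling SimulateCyclicWait(std::chrono::milliseconds(50)) is expected to return true: neither side can make progress until the other does. When the disconnect instead runs synchronously from the poller callback, as before a85f1fd, the poller thread is never left waiting on another thread that in turn waits for the poller.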

I thought the best way to address #2629 is to initially revert a85f1fd in the next patch release, although doing so leaves the original reason for the commit, #2578, unresolved. However, that latter bug was only noticed by me and is presumably not currently being reported in any production situation. So I propose to make a new fix for #2578 later, allowing for more testing, to be included in R6 or a future 5.9.x patch release.

#2578 is more likely to be encountered if multiple streams (substreams) are in use by a long-lived client process, e.g. in xcache or in an application like the EOS fuse client. This situation could arise, e.g., if xcache was configured with the non-default XRD_TLSNODATA=1, as in many situations root:// origins are being configured to require TLS on the login stream, probably because of ZTN (and the client would introduce a bound sub-stream to support a non-TLS data channel).

…#2578"

This reverts commit a85f1fd, in order
to fix regression reported in xrootd#2629.
@smithdh smithdh marked this pull request as ready for review November 4, 2025 10:22
@amadio amadio linked an issue Nov 4, 2025 that may be closed by this pull request
@amadio amadio added this to the 5.9.1 milestone Nov 4, 2025
@amadio amadio merged commit 66ba9aa into xrootd:master Nov 4, 2025
12 checks passed
amadio commented Nov 17, 2025

Just for the record, Coverity reports the issues detected in the last release as fixed.

screenshot

Development

Successfully merging this pull request may close these issues.

xrootd process unresponsiveness
