
Conversation

@hebasto
Member

@hebasto hebasto commented Mar 31, 2021

This PR restricts the period during which the cs_vNodes mutex is locked, and prevents the only case in which cs_vNodes could be locked before ::cs_main.

This change makes explicitly locking these recursive mutexes in a fixed order redundant.

@DrahtBot DrahtBot added the P2P label Mar 31, 2021
@DrahtBot
Contributor

DrahtBot commented Apr 1, 2021

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@jnewbery
Contributor

jnewbery commented Apr 2, 2021

EDIT: This PR has substantially changed. The comment below refers to the old branch.

This does indeed seem useless. The mutex was added by @TheBlueMatt in commit d7c58ad:

Split CNode::cs_vSend: message processing and message sending

cs_vSend is used for two purposes - to lock the datastructures used
to queue messages to place on the wire and to only call
SendMessages once at a time per-node. I believe SendMessages used
to access some of the vSendMsg stuff, but it doesn't anymore, so
these locks do not need to be on the same mutex, and also make
deadlocking much more likely.

We're implicitly guaranteed to only call SendMessages() serially, since it's only ever called by the message handler thread. If we want to be explicit about it, it'd be better to add a global lock to PeerManagerImpl (or a per-Peer object), so that net_processing enforces its own synchronization internally.

@hebasto
Member Author

hebasto commented Apr 7, 2021

Rebased 8ca2ee6 -> b58097f (pr21563.02 -> pr21563.03) due to the conflict with #21571.

@jnewbery
Contributor

jnewbery commented Apr 9, 2021

This seems safe to me. The only thing that CNode.cs_sendProcessing enforces is preventing SendMessages() from being called for the same CNode concurrently. However, SendMessages() can only be called by threadMessageHandler, so there's no possibility of it being called concurrently.

I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the NetEventsInterface methods (InitializeNode, FinalizeNode, SendMessages, ProcessMessages) in PeerManagerImpl concurrently could lead to problems, so perhaps a global mutex in PeerManagerImpl should be added that's taken whenever any of those functions are called? If we ever want to add concurrency to PeerManagerImpl, we could look at loosening that restriction.

@hebasto
Member Author

hebasto commented Apr 9, 2021

@jnewbery

I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the NetEventsInterface methods (InitializeNode, FinalizeNode, SendMessages, ProcessMessages) in PeerManagerImpl concurrently could lead to problems, so perhaps a global mutex in PeerManagerImpl should be added that's taken whenever any of those functions are called? If we ever want to add concurrency to PeerManagerImpl, we could look at loosening that restriction.

Maybe postpone it until #19398 is fulfilled?

@jnewbery
Contributor

jnewbery commented Apr 9, 2021

Maybe postpone it until #19398 is fulfilled?

I think they can be done independently. But I think I'd like to see the new internal PeerManagerImpl lock being introduced in the same PR that removes cs_sendProcessing.

@hebasto
Member Author

hebasto commented Apr 12, 2021

Updated b58097f -> 4de7605 (pr21563.03 -> pr21563.04):

@maflcko
Member

maflcko commented Apr 12, 2021

 node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument 

https://cirrus-ci.com/task/5038649289998336?logs=ci#L3615

@hebasto
Member Author

hebasto commented Apr 12, 2021

 node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument 

https://cirrus-ci.com/task/5038649289998336?logs=ci#L3615

Looks like a mutex is prematurely destroyed. Weird.

@hebasto
Member Author

hebasto commented Apr 12, 2021

Updated 4de7605 -> b1e5ca2 (pr21563.04 -> pr21563.05, diff):

  • fixed bug

void FinalizeNode(const CNode& node) override;
bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override;
bool SendMessages(CNode* pto) override EXCLUSIVE_LOCKS_REQUIRED(pto->cs_sendProcessing);
void InitializeNode(CNode* pnode) override EXCLUSIVE_LOCKS_REQUIRED(!m_net_events_mutex);
Contributor

What do you think about excluding holding cs_main when calling into a NetEventsInterface method?

@hebasto
Member Author

hebasto commented Apr 13, 2021

It is not trivial due to the

static std::map<NodeId, CNodeState> mapNodeState GUARDED_BY(cs_main);

Going to keep the scope of this PR tight.

@hebasto
Member Author

hebasto commented Apr 13, 2021

Updated b1e5ca2 -> a366332 (pr21563.05 -> pr21563.06):

  • rebased on top of the recent CI changes
  • addressed @jnewbery's comments

@jnewbery
Contributor

This seems good to me.

@MarcoFalke - you introduced the LOCK2(::cs_main, ::g_cs_orphans); in #18458. Do you see any problem with locking cs_vNodes, grabbing a copy of vNodes, releasing the lock, and then deleting the nodes in CConnman::Stop()? This is similar to what happens in the socket handler and message handler threads.

@ajtowns
Contributor

ajtowns commented Apr 20, 2021

Removing a mutex that guards nothing then adding a mutex that also doesn't actually guard anything seems a bit backwards...

#21527 has it guarding the extratxns data structures (as part of getting rid of g_cs_orphans). It would be good for it to be able to guard the address data structures in struct Peer, but without it being a global, that will probably be awkward at best, since Peer doesn't have a reference back to PeerManagerImpl.

The cs_vNodes changes don't seem to be mentioned in the title or PR description?

@maflcko
Member

maflcko commented Apr 20, 2021

The first commit seems separate from the other changes?

@hebasto hebasto changed the title net: Drop cs_sendProcessing mutex that guards nothing net: Restrict period when cs_vNodes mutex is locked Apr 20, 2021
@hebasto
Member Author

hebasto commented Apr 20, 2021

Updated a366332 -> 9766b7f (pr21563.06 -> pr21563.07):

The cs_vNodes changes don't seem to be mentioned in the title or PR description?

The PR description has been updated.

@jnewbery
Contributor

I think this change is fine. It steals the CNode*s from vNodes, releases the mutex and then cleans up the nodes. That's very similar to the pattern in SocketHandler():

bitcoin/src/net.cpp

Lines 1481 to 1487 in 0180453

std::vector<CNode*> vNodesCopy;
{
LOCK(cs_vNodes);
vNodesCopy = vNodes;
for (CNode* pnode : vNodesCopy)
pnode->AddRef();
}

and in ThreadMessageHandler():

bitcoin/src/net.cpp

Lines 2185 to 2192 in 0180453

std::vector<CNode*> vNodesCopy;
{
LOCK(cs_vNodes);
vNodesCopy = vNodes;
for (CNode* pnode : vNodesCopy) {
pnode->AddRef();
}
}

The difference being that in those cases, an extra reference is taken and nRefCount is incremented. Here, the reference count is not incremented since the pointer is stolen from vNodes before clearing that vector.

Perhaps we could document our assumptions by calling Release() on the CNode object when stealing it from vNodes or immediately before calling DeleteNode(), and then asserting that nRefCount == 0 in the CNode destructor? Maybe in a follow up.

@ajtowns
Contributor

ajtowns left a comment

Seems sensible at first glance, and I think any bugs would be caught by CI. Need to have a more thorough look though.

src/net.cpp Outdated
{
LOCK(cs_vNodes);
nodes = std::move(vNodes);
vNodes.clear();
Contributor

Could write this as:

cs_vNodes.lock();
std::vector<CNode*> nodes(std::move(vNodes)); // move constructor clears vNodes
cs_vNodes.unlock();

If I'm reading the spec right, the move constructor is guaranteed to be constant time, while operator= is linear time; and the move constructor also guarantees the original ends up cleared. Since C++17 the move constructor is also marked noexcept, so lock and unlock in place of RAII locks should be sound. Alternatively, could use WAIT_LOCK(cs_vNodes, lock); ...; REVERSE_LOCK(lock);.

Member

Does performance matter here at all? Shutdown only happens once and flushing to disk will take magnitudes longer anyway.

@ajtowns
Contributor

ajtowns commented Apr 21, 2021

No I don't think so -- at worst it's just moving pointers around, and only a hundred or so of them in normal configurations. (I wasn't sure if std::move + clear was safe / necessary, so looked into the behaviours)

Contributor

the move constructor is guaranteed to be constant time, while operator= is linear time

This is the move assignment operator (number 2 in https://en.cppreference.com/w/cpp/container/vector/operator%3D). Its complexity is linear in the size of this - the vector being moved to. Presumably that's because the objects in this need to be destructed and the memory already owned by this needs to be deallocated. In our case, this is empty, and so the operation is constant time - it just needs to copy the start pointer/size/capacity from other to this.

Does performance matter here at all?

Absolutely not.

I wasn't sure if std::move + clear was safe / necessary

Moving from a vector leaves it in a "valid but unspecified state". The clear() is probably unnecessary since this is shutdown and we're not going to touch vNodes again, but I think it's good practice not to leave vectors in an unspecified state if we can help it.

@vasild
Contributor

vasild left a comment

ACK 9766b7f

Something to consider for further improvement:

It is not even necessary to lock cs_vNodes inside StopNodes(). Why? Because by the time StopNodes() is called the other threads have been shut down.

If it was necessary to protect the code in StopNodes() with a mutex, then this PR would be broken:

lock()
tmp = std::move(shared_vector)
unlock()
destroy each element in tmp
// What prevents new entries from being added to shared_vector
// here, since we have unlock()ed? Nothing. So when this function
// completes, shared_vector is still alive and kicking (with new
// elements being added to it in the meantime).

The LOCK(cs_vNodes); is only needed to keep the thread safety analysis happy. This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).

@jnewbery
Contributor

This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).

Maybe, but that's a much more invasive change. CConnman::Stop() calls PeerManager::DeleteNode() for all the nodes. The destructor for CConnman is called after the destructor for PeerManager (in the reverse order that they're constructed). That can be changed, but we'd need to be very careful.

@jnewbery
Contributor

utACK 9766b7f

Agree with @MarcoFalke that the move & clear is more elegantly expressed as std::vector::swap().

Diff
diff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..94029491ab 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2637,11 +2637,7 @@ void CConnman::StopNodes()
 
     // Delete peer connections
     std::vector<CNode*> nodes;
-    {
-        LOCK(cs_vNodes);
-        nodes = std::move(vNodes);
-        vNodes.clear();
-    }
+    WITH_LOCK(cs_vNodes, nodes.swap(vNodes));
 
     for (CNode* pnode : nodes) {
         pnode->CloseSocketDisconnect();

@hebasto
Member Author

hebasto commented Apr 22, 2021

Updated 9766b7f -> 8c8237a (pr21563.07 -> pr21563.08, diff).

Addressed @MarcoFalke's comments:

Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?

minor nit: would be nice to do style-fixups in a separate commit

@vasild
Contributor

vasild left a comment

ACK 8c8237a

@jnewbery
Contributor

utACK 8c8237a

@ajtowns
Contributor

ajtowns left a comment

utACK 8c8237a - logic seems sound

A couple of comment improvements would be good; longer term it would also be better to add missing guards to things in net so that the compiler can catch more errors, and to simplify the ownership/interaction between CConnman/PeerManager/NodeContext...

vNodes.clear();
vNodesDisconnected.clear();
vhListenSocket.clear();
semOutbound.reset();
Contributor

Might be worth adding a comment as to why vNodesDisconnected, vhListenSocket, semOutbound and semAddnode are all safe to reset here -- I think it's because:

  • vNodesDisconnected and vhListenSocket are only otherwise accessed by ThreadSocketHandler
  • semOutbound and semAddnode are only otherwise accessed via ThreadOpenConnections, Start(), Interrupt() (and Interrupt() is only safe if it's invoked in between Start() and StopNodes()), and AddConnection (which is called from rpc/net.cpp so also requires that we won't be adding connections via RPC -- might be good to add an if (semOutbound == nullptr) return false; to AddConnection())

Member

RPC must be (and is assumed to be) shut down before Stop is called. The other threads are also assumed to be shut down. Maybe reverting my commit (#21563 (comment)) could help to self-document that better?

}
SyncWithValidationInterfaceQueue();
LOCK2(::cs_main, g_cs_orphans); // See init.cpp for rationale for implicit locking order requirement
g_setup->m_node.connman->StopNodes();
Contributor

I think there probably should be a comment as to why explicitly calling StopNodes is still necessary

(I believe it's because if you don't do that, then m_node will get destructed, deleting its peerman, then attempting to delete its connman, which will see some entries in vNodes and try to call peerman->FinalizeNode() on them, but peerman is deleted at that point. With the lock order fixed, it may be possible to have ~NodeContext call if (peerman) { peerman->Stop(); } and remove those lines from the tests entirely. Not sure if that would also let you remove it from init.cpp)

Member

I think changing the destruction order can be done in a separate pull?

@vasild
Contributor

vasild commented Apr 23, 2021

(consider out of scope of this PR)

Could we assert that all threads are stopped when StopNodes() starts executing? Something like

assert(threadMessageHandler.get_id() == std::thread::id());
assert(threadSocketHandler.get_id() == std::thread::id());
...

Or even call StopThreads() at the start of StopNodes()?

@maflcko
Member

maflcko commented Apr 25, 2021

Or even call StopThreads() at the start of StopNodes()?

This used to be the case before commit fa36965. Maybe that commit should be reverted now?

@maflcko
Member

maflcko left a comment

I think the changes are nice and can be merged

review ACK 8c8237a 👢


@maflcko maflcko merged commit 8f80092 into bitcoin:master Apr 25, 2021
@maflcko
Member

maflcko commented Apr 25, 2021

There is a related pr, which is waiting for review donations: #21750

@hebasto hebasto deleted the 210331-send branch April 25, 2021 08:44
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Apr 25, 2021
8c8237a net, refactor: Fix style in CConnman::StopNodes (Hennadii Stepanov)
229ac18 net: Combine two loops into one, and update comments (Hennadii Stepanov)
a3d090d net: Restrict period when cs_vNodes mutex is locked (Hennadii Stepanov)

Pull request description:

  This PR restricts the period when the `cs_vNodes` mutex is locked, prevents the only case when `cs_vNodes` could be locked before the `::cs_main`.

  This change makes the explicit locking of recursive mutexes in the explicit order redundant.

ACKs for top commit:
  jnewbery:
    utACK 8c8237a
  vasild:
    ACK 8c8237a
  ajtowns:
    utACK 8c8237a - logic seems sound
  MarcoFalke:
    review ACK 8c8237a 👢

Tree-SHA512: a8277924339622b188b12d260a100adf5d82781634cf974320cf6007341f946a7ff40351137c2f5369aed0d318f38aac2d32965c9b619432440d722a4e78bb73
Fabcien added a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Apr 5, 2022
Summary:
```
This PR restricts the period when the cs_vNodes mutex is locked, prevents the only case when cs_vNodes could be locked before the ::cs_main.

This change makes the explicit locking of recursive mutexes in the explicit order redundant.
```

Backport of [[bitcoin/bitcoin#21563 | core#21563]].

Test Plan:
With clang and debug:
  ninja check-all

  ./contrib/teamcity/build-configurations.py build-tsan

Reviewers: #bitcoin_abc, PiRK

Reviewed By: #bitcoin_abc, PiRK

Differential Revision: https://reviews.bitcoinabc.org/D11301
gwillen pushed a commit to ElementsProject/elements that referenced this pull request Jun 1, 2022
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022