
Conversation

@hebasto
Member

@hebasto hebasto commented Mar 31, 2021

This PR restricts the period during which the cs_vNodes mutex is locked, and prevents the only case in which cs_vNodes could be locked before ::cs_main.

This change makes explicitly locking these recursive mutexes in a fixed order redundant.

@DrahtBot DrahtBot added the P2P label Mar 31, 2021
@DrahtBot
Contributor

DrahtBot commented Apr 1, 2021

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@jnewbery
Contributor

jnewbery commented Apr 2, 2021

EDIT: This PR has substantially changed. The comment below refers to the old branch.

This does indeed seem useless. The mutex was added by @TheBlueMatt in commit d7c58ad:

Split CNode::cs_vSend: message processing and message sending

cs_vSend is used for two purposes - to lock the datastructures used
to queue messages to place on the wire and to only call
SendMessages once at a time per-node. I believe SendMessages used
to access some of the vSendMsg stuff, but it doesn't anymore, so
these locks do not need to be on the same mutex, and also make
deadlocking much more likely.

We're implicitly guaranteed to only call SendMessages() serially, since it's only ever called by the message handler thread. If we want to be explicit about it, it'd be better to add a global lock to PeerManagerImpl (or a per-Peer object), so that net_processing enforces its own synchronization internally.

@hebasto
Member Author

hebasto commented Apr 7, 2021

Rebased 8ca2ee6 -> b58097f (pr21563.02 -> pr21563.03) due to the conflict with #21571.

@jnewbery
Contributor

jnewbery commented Apr 9, 2021

This seems safe to me. The only thing that CNode.cs_sendProcessing enforces is preventing SendMessages() from being called for the same CNode concurrently. However, SendMessages() can only be called by threadMessageHandler, so there's no possibility of it being called concurrently.

I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the NetEventsInterface methods (InitializeNode, FinalizeNode, SendMessages, ProcessMessages) in PeerManagerImpl concurrently could lead to problems, so perhaps a global mutex in PeerManagerImpl should be added that's taken whenever any of those functions are called? If we ever want to add concurrency to PeerManagerImpl, we could look at loosening that restriction.

@hebasto
Member Author

hebasto commented Apr 9, 2021

@jnewbery

I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the NetEventsInterface methods (InitializeNode, FinalizeNode, SendMessages, ProcessMessages) in PeerManagerImpl concurrently could lead to problems, so perhaps a global mutex in PeerManagerImpl should be added that's taken whenever any of those functions are called? If we ever want to add concurrency to PeerManagerImpl, we could look at loosening that restriction.

Maybe postpone it until #19398 is fulfilled?

@jnewbery
Contributor

jnewbery commented Apr 9, 2021

Maybe postpone it until #19398 is fulfilled?

I think they can be done independently. But I think I'd like to see the new internal PeerManagerImpl lock being introduced in the same PR that removes cs_sendProcessing.

@hebasto
Member Author

hebasto commented Apr 12, 2021

Updated b58097f -> 4de7605 (pr21563.03 -> pr21563.04):

@maflcko
Member

maflcko commented Apr 12, 2021

 node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument 

https://cirrus-ci.com/task/5038649289998336?logs=ci#L3615

@hebasto
Member Author

hebasto commented Apr 12, 2021

 node0 stderr libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument 

https://cirrus-ci.com/task/5038649289998336?logs=ci#L3615

Looks like a mutex is prematurely destroyed. Weird.

@hebasto
Member Author

hebasto commented Apr 12, 2021

Updated 4de7605 -> b1e5ca2 (pr21563.04 -> pr21563.05, diff):

  • fixed bug

void FinalizeNode(const CNode& node) override;
bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override;
bool SendMessages(CNode* pto) override EXCLUSIVE_LOCKS_REQUIRED(pto->cs_sendProcessing);
void InitializeNode(CNode* pnode) override EXCLUSIVE_LOCKS_REQUIRED(!m_net_events_mutex);
Contributor

What do you think about excluding holding cs_main when calling into a NetEventsInterface method?

@hebasto
Member Author

hebasto commented Apr 13, 2021

It is not trivial due to the

static std::map<NodeId, CNodeState> mapNodeState GUARDED_BY(cs_main);

Going to keep the scope of this PR tight.

@hebasto
Member Author

hebasto commented Apr 13, 2021

Updated b1e5ca2 -> a366332 (pr21563.05 -> pr21563.06):

  • rebased on top of the recent CI changes
  • addressed @jnewbery's comments

@jnewbery
Contributor

This seems good to me.

@MarcoFalke - you introduced the LOCK2(::cs_main, ::g_cs_orphans); in #18458. Do you see any problem with locking cs_vNodes, grabbing a copy of vNodes, releasing the lock, and then deleting the nodes in CConnman::Stop()? This is similar to what happens in the socket handler and message handler threads.

@ajtowns
Contributor

ajtowns commented Apr 20, 2021

Removing a mutex that guards nothing then adding a mutex that also doesn't actually guard anything seems a bit backwards...

#21527 has it guarding the extratxns data structures (as part of getting rid of g_cs_orphans). It would be good for it to be able to guard the address data structures in struct Peer, but without it being a global, that will probably be awkward at best, since Peer doesn't have a reference back to PeerManagerImpl.

The cs_vNodes changes don't seem to be mentioned in the title or PR description?

@maflcko
Member

maflcko commented Apr 20, 2021

The first commit seems separate from the other changes?

@hebasto hebasto changed the title net: Drop cs_sendProcessing mutex that guards nothing net: Restrict period when cs_vNodes mutex is locked Apr 20, 2021
@hebasto
Member Author

hebasto commented Apr 20, 2021

Updated a366332 -> 9766b7f (pr21563.06 -> pr21563.07):

The cs_vNodes changes don't seem to be mentioned in the title or PR description?

The PR description has been updated.

@jnewbery
Contributor

I think this change is fine. It steals the CNode*s from vNodes, releases the mutex and then cleans up the nodes. That's very similar to the pattern in SocketHandler():

bitcoin/src/net.cpp

Lines 1481 to 1487 in 0180453

std::vector<CNode*> vNodesCopy;
{
LOCK(cs_vNodes);
vNodesCopy = vNodes;
for (CNode* pnode : vNodesCopy)
pnode->AddRef();
}

and in ThreadMessageHandler():

bitcoin/src/net.cpp

Lines 2185 to 2192 in 0180453

std::vector<CNode*> vNodesCopy;
{
LOCK(cs_vNodes);
vNodesCopy = vNodes;
for (CNode* pnode : vNodesCopy) {
pnode->AddRef();
}
}

The difference being that in those cases, an extra reference is taken and nRefCount is incremented. Here, the reference count is not incremented since the pointer is stolen from vNodes before clearing that vector.

Perhaps we could document our assumptions by calling Release() on the CNode object when stealing it from vNodes or immediately before calling DeleteNode(), and then asserting that nRefCount == 0 in the CNode destructor? Maybe in a follow up.

@ajtowns
Contributor

ajtowns left a comment

Seems sensible at first glance, and I think any bugs would be caught by CI. Need to have a more thorough look though.

src/net.cpp Outdated
{
LOCK(cs_vNodes);
nodes = std::move(vNodes);
vNodes.clear();
Contributor

Could write this as:

cs_vNodes.lock();
std::vector<CNode*> nodes(std::move(vNodes)); // move constructor clears vNodes
cs_vNodes.unlock();

If I'm reading the spec right, the move constructor is guaranteed to be constant time, while operator= is linear time; and the move constructor also guarantees the original ends up cleared. Since C++17 the move constructor is also marked noexcept, so lock and unlock in place of RAII locks should be sound. Alternatively, could use WAIT_LOCK(cs_vNodes, lock); ...; REVERSE_LOCK(lock);.

Member

Does performance matter here at all? Shutdown only happens once and flushing to disk will take magnitudes longer anyway.

@ajtowns
Contributor

ajtowns commented Apr 21, 2021

No I don't think so -- at worst it's just moving pointers around, and only a hundred or so of them in normal configurations. (I wasn't sure if std::move + clear was safe / necessary, so looked into the behaviours)

Contributor

the move constructor is guaranteed to be constant time, while operator= is linear time

This is the move assignment operator (number 2 in https://en.cppreference.com/w/cpp/container/vector/operator%3D). Its complexity is linear in the size of this - the vector being moved to. Presumably that's because the objects in this need to be destructed and the memory already owned by this needs to be deallocated. In our case, this is empty, and so the operation is constant time - it just needs to copy the start pointer/size/capacity from other to this.

Does performance matter here at all?

Absolutely not.

I wasn't sure if std::move + clear was safe / necessary

Moving from a vector leaves it in a "valid but unspecified state". The clear() is probably unnecessary since this is shutdown and we're not going to touch vNodes again, but I think it's good practice not to leave vectors in an unspecified state if we can help it.

@vasild
Contributor

vasild left a comment

ACK 9766b7f

Something to consider for further improvement:

It is not even necessary to lock cs_vNodes inside StopNodes(). Why? Because by the time StopNodes() is called the other threads have been shut down.

If it was necessary to protect the code in StopNodes() with a mutex, then this PR would be broken:

lock()
tmp = std::move(shared_vector)
unlock()
destroy each element in tmp
// What prevents new entries from being added to shared_vector
// here, since we have unlock()ed? Nothing. So when this function
// completes, shared_vector is still alive and kicking (with new
// elements being added to it in the meantime).

The LOCK(cs_vNodes); is only needed to keep the thread safety analysis happy. This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).

@jnewbery
Contributor

This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).

Maybe, but that's a much more invasive change. CConnman::Stop() calls PeerManager::DeleteNode() for all the nodes. The destructor for CConnman is called after the destructor for PeerManager (in the reverse order that they're constructed). That can be changed, but we'd need to be very careful.

@jnewbery
Contributor

utACK 9766b7f

Agree with @MarcoFalke that the move & clear is more elegantly expressed as std::vector::swap().

Diff
diff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..94029491ab 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2637,11 +2637,7 @@ void CConnman::StopNodes()
 
     // Delete peer connections
     std::vector<CNode*> nodes;
-    {
-        LOCK(cs_vNodes);
-        nodes = std::move(vNodes);
-        vNodes.clear();
-    }
+    WITH_LOCK(cs_vNodes, nodes.swap(vNodes));
 
     for (CNode* pnode : nodes) {
         pnode->CloseSocketDisconnect();

@hebasto
Member Author

hebasto commented Apr 22, 2021

Updated 9766b7f -> 8c8237a (pr21563.07 -> pr21563.08, diff).

Addressed @MarcoFalke's comments:

Could achieve the same in one less line of code with https://en.cppreference.com/w/cpp/container/vector/swap ?

minor nit: would be nice to do style-fixups in a separate commit

@vasild
Contributor

vasild left a comment

ACK 8c8237a

@jnewbery
Contributor

utACK 8c8237a

@ajtowns
Contributor

ajtowns left a comment

utACK 8c8237a - logic seems sound

A couple of comment improvements would be good; longer term it would also be better to add missing guards to things in net so that the compiler can catch more errors, and to simplify the ownership/interaction between CConnman/PeerManager/NodeContext...

vNodes.clear();
vNodesDisconnected.clear();
vhListenSocket.clear();
semOutbound.reset();
Contributor

Might be worth adding a comment as to why vNodesDisconnected, vhListenSocket, semOutbound and semAddnode are all safe to reset here -- I think it's because:

  • vNodesDisconnected and vhListenSocket are only otherwise accessed by ThreadSocketHandler
  • semOutbound and semAddnode are only otherwise accessed via ThreadOpenConnections, Start(), Interrupt() (and Interrupt() is only safe if it's invoked in between Start() and StopNodes()), and AddConnection (which is called from rpc/net.cpp so also requires that we won't be adding connections via RPC -- might be good to add an if (semOutbound == nullptr) return false; to AddConnection())

Member

RPC must be (and is assumed to be) shut down before Stop is called. The other threads are also assumed to be shut down. Maybe reverting my commit (#21563 (comment)) could help to self-document that better?

}
SyncWithValidationInterfaceQueue();
LOCK2(::cs_main, g_cs_orphans); // See init.cpp for rationale for implicit locking order requirement
g_setup->m_node.connman->StopNodes();
Contributor

I think there probably should be a comment as to why explicitly calling StopNodes is still necessary

(I believe it's because if you don't do that, then m_node will get destructed, deleting its peerman, then attempting to delete its connman, which will see some entries in vNodes and try to call peerman->FinalizeNode() on them, but peerman is deleted at that point. With the lock order fixed, it may be possible to have ~NodeContext call if (peerman) { peerman->Stop(); } and remove those lines from the tests entirely. Not sure if that would also let you remove it from init.cpp)

Member

I think changing the destruction order can be done in a separate pull?

@vasild
Contributor

vasild commented Apr 23, 2021

(consider out of scope of this PR)

Could we assert that all threads are stopped when StopNodes() starts executing? Something like

assert(threadMessageHandler.get_id() == std::thread::id());
assert(threadSocketHandler.get_id() == std::thread::id());
...

Or even call StopThreads() at the start of StopNodes()?

@maflcko
Member

maflcko commented Apr 25, 2021

Or even call StopThreads() at the start of StopNodes()?

This used to be the case before commit fa36965. Maybe that commit should be reverted now?

@maflcko
Member

maflcko left a comment

I think the changes are nice and can be merged

review ACK 8c8237a 👢


@maflcko maflcko merged commit 8f80092 into bitcoin:master Apr 25, 2021
@maflcko
Member

maflcko commented Apr 25, 2021

There is a related pr, which is waiting for review donations: #21750

@hebasto hebasto deleted the 210331-send branch April 25, 2021 08:44
sidhujag pushed a commit to syscoin/syscoin that referenced this pull request Apr 25, 2021
8c8237a net, refactor: Fix style in CConnman::StopNodes (Hennadii Stepanov)
229ac18 net: Combine two loops into one, and update comments (Hennadii Stepanov)
a3d090d net: Restrict period when cs_vNodes mutex is locked (Hennadii Stepanov)

Pull request description:

  This PR restricts the period when the `cs_vNodes` mutex is locked, prevents the only case when `cs_vNodes` could be locked before the `::cs_main`.

  This change makes the explicit locking of recursive mutexes in the explicit order redundant.

ACKs for top commit:
  jnewbery:
    utACK 8c8237a
  vasild:
    ACK 8c8237a
  ajtowns:
    utACK 8c8237a - logic seems sound
  MarcoFalke:
    review ACK 8c8237a 👢

Tree-SHA512: a8277924339622b188b12d260a100adf5d82781634cf974320cf6007341f946a7ff40351137c2f5369aed0d318f38aac2d32965c9b619432440d722a4e78bb73
Fabcien added a commit to Bitcoin-ABC/bitcoin-abc that referenced this pull request Apr 5, 2022
Summary:
```
This PR restricts the period when the cs_vNodes mutex is locked, prevents the only case when cs_vNodes could be locked before the ::cs_main.

This change makes the explicit locking of recursive mutexes in the explicit order redundant.
```

Backport of [[bitcoin/bitcoin#21563 | core#21563]].

Test Plan:
With clang and debug:
  ninja check-all

  ./contrib/teamcity/build-configurations.py build-tsan

Reviewers: #bitcoin_abc, PiRK

Reviewed By: #bitcoin_abc, PiRK

Differential Revision: https://reviews.bitcoinabc.org/D11301
gwillen pushed a commit to ElementsProject/elements that referenced this pull request Jun 1, 2022
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Aug 16, 2022