-
Notifications
You must be signed in to change notification settings - Fork 38.7k
net: Restrict period when cs_vNodes mutex is locked #21563
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
EDIT: This PR has substantially changed. The comment below refers to the old branch. This does indeed seem useless. The mutex was added by @TheBlueMatt in commit d7c58ad:
We're implicitly guaranteed to only call |
|
Rebased 8ca2ee6 -> b58097f (pr21563.02 -> pr21563.03) due to the conflict with #21571. |
|
This seems safe to me. The only thing that I wonder if instead of simply removing this, we should replace it with something that more closely documents our expectations. I think that calling any of the |
Maybe postpone it until #19398 is fulfilled? |
I think they can be done independently. But I think I'd like to see the new internal PeerManagerImpl lock being introduced in the same PR that removes cs_sendProcessing. |
|
Updated b58097f -> 4de7605 (pr21563.03 -> pr21563.04): |
|
Looks like a mutex is prematurely destroyed. Weird. |
|
Updated 4de7605 -> b1e5ca2 (pr21563.04 -> pr21563.05, diff):
|
src/net_processing.cpp
Outdated
| void FinalizeNode(const CNode& node) override; | ||
| bool ProcessMessages(CNode* pfrom, std::atomic<bool>& interrupt) override; | ||
| bool SendMessages(CNode* pto) override EXCLUSIVE_LOCKS_REQUIRED(pto->cs_sendProcessing); | ||
| void InitializeNode(CNode* pnode) override EXCLUSIVE_LOCKS_REQUIRED(!m_net_events_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about excluding holding cs_main when calling into a NetEventsInterface method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not trivial due to the
bitcoin/src/net_processing.cpp
Line 662 in c1f480f
| static std::map<NodeId, CNodeState> mapNodeState GUARDED_BY(cs_main); |
Going to keep the scope of this PR tight.
|
Updated b1e5ca2 -> a366332 (pr21563.05 -> pr21563.06):
|
|
This seems good to me. @MarcoFalke - you introduced the |
|
Removing a mutex that guards nothing then adding a mutex that also doesn't actually guard anything seems a bit backwards... #21527 has it guarding the extratxns data structures (as part of getting rid of g_cs_orphans). It would be good for it to be able to guard the address data structures in struct Peer, but without it being a global, that will probably be awkward at best, since Peer doesn't have a reference back to PeerManagerImpl. The cs_vNodes changes don't seem to be mentioned in the title or PR description? |
|
The fist commit seems separate from the other changes? |
|
Updated a366332 -> 9766b7f (pr21563.06 -> pr21563.07):
The PR description updated. |
|
I think this change is fine. It steals the Lines 1481 to 1487 in 0180453
and in Lines 2185 to 2192 in 0180453
The difference being that in those cases, an extra reference is taken and Perhaps we could document our assumptions by calling |
ajtowns
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems sensible at first glance, and I think any bugs would be caught by CI. Need to have a more thorough look though.
src/net.cpp
Outdated
| { | ||
| LOCK(cs_vNodes); | ||
| nodes = std::move(vNodes); | ||
| vNodes.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could write this as:
cs_vNodes.lock();
std::vector<CNode*> nodes(std::move(vNodes)); // move constructor clears vNodes
cs_vNodes.unlock();If I'm reading the spec right, the move constructor is guaranteed to be constant time, while operator= is linear time; and the move constructor also guarantees the original ends up cleared. Since C++17 the move constructor is also marked noexcept, so lock and unlock in place of RAII locks should be sound. Alternatively, could use WAIT_LOCK(cs_vNodes, lock); ...; REVERSE_LOCK(lock);.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does performance matter here at all? Shutdown only happens once and flushing to disk will take magnitudes longer anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't think so -- at worst it's just moving pointers around, and only a hundred or so of them in normal configurations. (I wasn't sure if std::move + clear was safe / necessary, so looked into the behaviours)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the move constructor is guaranteed to be constant time, while operator= is linear time
This is the move assignment operator (number 2 in https://en.cppreference.com/w/cpp/container/vector/operator%3D). Its complexity is linear in the size of this - the vector being moved to. Presumably that's because the objects in this need to be destructed and the memory already owned by this needs to deallocated. In our case, this is empty, and so the operation is constant time - it just needs to copy start pointer/size/capacity from other to this.
Does performance matter here at all?
Absolutely not.
I wasn't sure if std::move + clear was safe / necessary
Moving from a vector leaves it in a "valid but unspecified state". The clear() is probably unnecessary since this is shutdown and we're not going to touch vNodes again, but I think it's good practice to not to leave vectors in an unspecified state if we can help it.
vasild
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 9766b7f
Something to consider for further improvement:
It is not even necessary to lock cs_vNodes inside StopNodes(). Why? Because by the time StopNodes() is called the other threads have been shut down.
If it was necessary to protect the code in StopNodes() with a mutex, then this PR would be broken:
lock()
tmp = std::move(shared_vector)
unlock()
destroy each element in tmp
// What prevents new entries from being added to shared_vector
// here, since we have unlock()ed? Nothing. So when this function
// completes, shared_vector is still alive and kicking (with new
// elements being added to it in the meantime).The LOCK(cs_vNodes); is only needed to keep the thread safety analysis happy. This code actually belongs to the destructor ~CConnman() where we can access vNodes without locking and without upsetting the TSA. I think it can be moved to the destructor with some further changes (outside of this PR).
Maybe, but that's a much more invasive change. |
|
utACK 9766b7f Agree with @MarcoFalke that the move & clear is more elegantly expressed as std::vector::swap(). Diffdiff --git a/src/net.cpp b/src/net.cpp
index b7c1b8c6c4..94029491ab 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2637,11 +2637,7 @@ void CConnman::StopNodes()
// Delete peer connections
std::vector<CNode*> nodes;
- {
- LOCK(cs_vNodes);
- nodes = std::move(vNodes);
- vNodes.clear();
- }
+ WITH_LOCK(cs_vNodes, nodes.swap(vNodes));
for (CNode* pnode : nodes) {
pnode->CloseSocketDisconnect(); |
|
Updated 9766b7f -> 8c8237a (pr21563.07 -> pr21563.08, diff). Addressed @MarcoFalke's comments:
|
vasild
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 8c8237a
|
utACK 8c8237a |
ajtowns
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 8c8237a - logic seems sound
Couple of comment improvements that would be good; longer term would also be better to add missing guards to things in net so that the compiler can catch more errors, and simplify the ownership/interaction between CConnman/PeerManager/NodeContext...
| vNodes.clear(); | ||
| vNodesDisconnected.clear(); | ||
| vhListenSocket.clear(); | ||
| semOutbound.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding a comment as to why vNodesDisconnected, vhListenSocket, semOutbound and semAddnode are all safe to reset here -- I think it's because:
vNodesDisconnectedandvhListenSocketare only otherwise accessed byThreadSocketHandlersemOutboundandsemAddnodeare only otherwise accessed viaThreadOpenConnections,Start(),Interrupt()(andInterrupt()is only safe if it's invoked in betweenStart()andStopNodes()), andAddConnection(which is called from rpc/net.cpp so also requires that we won't be adding connections via RPC -- might be good to add anif (semOutbound == nullptr) return false;toAddConnection())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RPC must be (and is assumed to be) shutdown before Stop is called. The other threads are also assumed to be shut down. Maybe reverting my commit (#21563 (comment)) could help to self-document that better?
| } | ||
| SyncWithValidationInterfaceQueue(); | ||
| LOCK2(::cs_main, g_cs_orphans); // See init.cpp for rationale for implicit locking order requirement | ||
| g_setup->m_node.connman->StopNodes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there probably should be a comment as to why explicitly calling StopNodes is still necessary
(I believe it's because if you don't do that, then m_node will get destructed, deleting its peerman, then attempting to delete it's connman which will see some entries in vNodes and try to call peerman->FinalizeNode() on them, but peerman is deleted at that point. With the lock order fixed, it may be possible to have ~NodeContext call if (peerman) { peerman->Stop(); } an remove those lines from the tests entirely. Not sure if that would also let you remove it from init.cpp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think changing the destruction order can be done in a separate pull?
|
(consider out of scope of this PR) Could we assert that all threads are stopped when assert(threadMessageHandler.get_id() == std::thread::id());
assert(threadSocketHandler.get_id() == std::thread::id());
...Or even call |
This used to be the case before commit fa36965. Maybe that commit should be reverted now? |
maflcko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes are nice and can be merged
review ACK 8c8237a 👢
Show signature and timestamp
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
review ACK 8c8237a4a10feb2ac9ce46f67b5d14bf879b670f 👢
-----BEGIN PGP SIGNATURE-----
iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUiYLgwAy9ZbgTGcc2V3HKj7fNh6KWOQ2DgAVsBLAmRP1D+443eCC3twNxELvtOe
Mq2gmwseUZK1oNKU0RnCHvg5bZOqcId1N1Sf6BEbJOmkGFPktaVLTz2TmQpLLE42
ZTMrqWEJreQ9DXqEDDKV222Zu4ucLmUFuNqs5MYj+moUTBdedFcbeF96cJK2hBc1
OgTJXL6MLoMAFeL4PKS6j+E2yHXOFuKHWE9WHPJ0wFKujTqlBPtntZxoWb7fby2A
ZHrA8wxjdNQbgovYRjToVe/bif2SNuFbfNwE2XYalQ50U8bY5n3hC2SUB/poGTaJ
GoSOWiv1jAUimiMUVyicP+JDfprAjPkB9hXKC7k4kvbluTpqYz4UGxhnBmKnU/qe
+gU1fPldReMkVGg5YjG3pg7Kd4izdq2YzyNMXdyIRHJaWn+gvsl4vQjPLUe7+WFL
DvX4dkpJ9fHF/DoHLsYi5/g2sG3G2G1Z4jrV/TFqAInAa0tuEqdWwYBomHi5QBQX
GzrRIkFr
=lXKU
-----END PGP SIGNATURE-----
Timestamp of file with hash 2744498e52ef3957483818d251ab7a217f1deba6d21ad39bc26dd3c078c33d07 -
|
There is a related pr, which is waiting for review donations: #21750 |
8c8237a net, refactor: Fix style in CConnman::StopNodes (Hennadii Stepanov) 229ac18 net: Combine two loops into one, and update comments (Hennadii Stepanov) a3d090d net: Restrict period when cs_vNodes mutex is locked (Hennadii Stepanov) Pull request description: This PR restricts the period when the `cs_vNodes` mutex is locked, prevents the only case when `cs_vNodes` could be locked before the `::cs_main`. This change makes the explicit locking of recursive mutexes in the explicit order redundant. ACKs for top commit: jnewbery: utACK 8c8237a vasild: ACK 8c8237a ajtowns: utACK 8c8237a - logic seems sound MarcoFalke: review ACK 8c8237a 👢 Tree-SHA512: a8277924339622b188b12d260a100adf5d82781634cf974320cf6007341f946a7ff40351137c2f5369aed0d318f38aac2d32965c9b619432440d722a4e78bb73
Summary: ``` his PR restricts the period when the cs_vNodes mutex is locked, prevents the only case when cs_vNodes could be locked before the ::cs_main. This change makes the explicit locking of recursive mutexes in the explicit order redundant. ``` Backport of [[bitcoin/bitcoin#21563 | core#21563]]. Test Plan: With clang and debug: ninja check-all ./contrib/teamcity/build-configurations.py build-tsan Reviewers: #bitcoin_abc, PiRK Reviewed By: #bitcoin_abc, PiRK Differential Revision: https://reviews.bitcoinabc.org/D11301
This PR restricts the period when the
cs_vNodesmutex is locked, prevents the only case whencs_vNodescould be locked before the::cs_main.This change makes the explicit locking of recursive mutexes in the explicit order redundant.