rpc: nodes fail to connect to peer even after the peer is up #44101
Open
Labels
- A-cc-enablement: Pertains to current CC production issues or short-term projects
- A-kv-server: Relating to the KV-level RPC server
- A-server-networking: Pertains to network addressing, routing, initialization
- C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- T-server-and-security: DB Server & Security
- X-server-triaged-202105
Description
I believe I've run into the following race:
- node 1 goes down
- some node 2 code path tries to dial node 1. Let's call it dial attempt 1.
- node 1 comes back, opens its ports
- another code path on node 2 tries to dial. This is dial attempt 2. Dial attempt 1 is still in progress, so dial attempt 2 blocks on conn.initOnce. Dial attempt 1 was in the process of failing, and eventually releases that lock
- every subsequent attempt (e.g. dial attempt 2) that was convoyed behind attempt 1 fails with the same error message
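The convoy described above can be reproduced in isolation. The following is a minimal sketch, not the actual rpc package code: the `conn` type, its `connect` method, `errPeerDown`, and `simulateRace` are hypothetical names, and the only thing taken from this issue is the idea of a dial result guarded by an `initOnce`. It assumes `initOnce` is a `sync.Once`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPeerDown is a hypothetical error returned by a dial that started while
// the peer was still down.
var errPeerDown = errors.New("connection refused")

// conn is a hypothetical stand-in for a cached connection whose dial result
// is computed once and shared by every caller.
type conn struct {
	initOnce sync.Once
	err      error
}

// connect runs dial at most once. Callers that arrive while the first dial is
// still running block inside initOnce.Do and then observe its result.
func (c *conn) connect(dial func() error) error {
	c.initOnce.Do(func() { c.err = dial() })
	return c.err
}

// simulateRace replays the sequence from the issue and returns the errors
// seen by dial attempt 1 and dial attempt 2.
func simulateRace() (err1, err2 error) {
	c := &conn{}
	started := make(chan struct{})
	release := make(chan struct{})

	// Dial attempt 1 starts while the peer is down and is slow to fail.
	done1 := make(chan error, 1)
	go func() {
		done1 <- c.connect(func() error {
			close(started)
			<-release
			return errPeerDown
		})
	}()
	<-started

	// The peer is back up by now, so a fresh dial would succeed. Attempt 2
	// never gets to run its own dial because attempt 1 owns the Once.
	done2 := make(chan error, 1)
	go func() {
		done2 <- c.connect(func() error { return nil })
	}()

	close(release)
	return <-done1, <-done2
}

func main() {
	err1, err2 := simulateRace()
	fmt.Println("attempt 1:", err1)
	fmt.Println("attempt 2:", err2)
}
```

In this sketch both attempts report `connection refused`, even though the dial function passed by attempt 2 would have succeeded.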
This is particularly unfortunate when dial attempt 2 is causally related to node 1 coming back. For example, node 1 may have sent a SetupFlow RPC, which asks node 2 to connect back to it. If node 2 fails to connect back, the respective SQL query will wait in vain for a long time.
A suggested fix is to have node 2 consider network availability signals (incoming connections, or successful heartbeats) and make sure that no error from a dial attempt that predates a signal is propagated to any dial attempt that follows the signal.
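One way to picture the suggested fix is a generation counter that is bumped on every availability signal. This is only a sketch under assumptions: `peerDialer`, `attempt`, `signalAvailable`, `errPeerDown`, and `simulateFix` are hypothetical names, not existing CockroachDB APIs. A caller joins an in-flight attempt only if that attempt started in the current generation; otherwise it starts a fresh dial.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPeerDown is a hypothetical error from a dial that began while the peer
// was down.
var errPeerDown = errors.New("connection refused")

// attempt is one dial, tagged with the generation in which it started.
type attempt struct {
	gen  uint64
	done chan struct{}
	err  error
}

// peerDialer is a hypothetical per-peer dialer that shares in-flight dials
// only among callers from the same generation.
type peerDialer struct {
	mu  sync.Mutex
	gen uint64
	cur *attempt
}

// signalAvailable records an availability signal, such as an incoming
// connection from the peer or a successful heartbeat.
func (d *peerDialer) signalAvailable() {
	d.mu.Lock()
	d.gen++
	d.mu.Unlock()
}

// connect joins the current attempt if it started in the current generation,
// and otherwise starts a new one so that stale errors are not inherited.
func (d *peerDialer) connect(dial func() error) error {
	d.mu.Lock()
	a := d.cur
	if a == nil || a.gen != d.gen {
		a = &attempt{gen: d.gen, done: make(chan struct{})}
		d.cur = a
		go func() {
			a.err = dial()
			if a.err != nil {
				// Do not let later callers reuse a failed attempt.
				d.mu.Lock()
				if d.cur == a {
					d.cur = nil
				}
				d.mu.Unlock()
			}
			close(a.done)
		}()
	}
	d.mu.Unlock()
	<-a.done
	return a.err
}

// simulateFix replays the sequence from the issue with a signal between the
// two attempts and returns the error each attempt observed.
func simulateFix() (err1, err2 error) {
	d := &peerDialer{}
	started := make(chan struct{})
	release := make(chan struct{})

	// Dial attempt 1 starts while the peer is down and is slow to fail.
	done1 := make(chan error, 1)
	go func() {
		done1 <- d.connect(func() error {
			close(started)
			<-release
			return errPeerDown
		})
	}()
	<-started

	// The peer comes back and is noticed, e.g. via an incoming connection.
	d.signalAvailable()

	// Dial attempt 2 belongs to a newer generation, so it dials on its own.
	err2 = d.connect(func() error { return nil })

	close(release)
	return <-done1, err2
}

func main() {
	err1, err2 := simulateFix()
	fmt.Println("attempt 1:", err1)
	fmt.Println("attempt 2:", err2)
}
```

Here attempt 1 still fails, but attempt 2 succeeds because it arrived after the signal. The trade-off in this design is that a signal can cause a second concurrent dial to the same peer while the older one is still failing.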
Epic: CRDB-8500
Jira issue: CRDB-5255