rpc: nodes fail to connect to peer even after the peer is up #44101

@andreimatei

I believe I've run into the following race:

  • node 1 goes down
  • some node 2 code path tries to dial node 1. Let's call it dial attempt 1.
  • node 1 comes back, opens its ports
  • another code path on node 2 tries to dial. This is dial attempt 2. Dial attempt 1 is still in progress, so dial attempt 2 blocks on conn.initOnce
  • dial attempt 1, which was already in the process of failing, eventually fails and releases that lock
  • every attempt that was convoyed behind attempt 1 (such as dial attempt 2) then fails with the same error, even though node 1 is reachable again
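The sequence above can be reproduced with a minimal sketch. Only `conn.initOnce` is named in this issue; the `conn` type below, its `dialErr` field, the `connect` method, and `simulateRace` are hypothetical stand-ins for illustration, not the actual rpc package code. The sketch assumes the dial result is cached behind a `sync.Once`, so any caller that arrives while the first dial is in flight inherits its error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a cached peer connection whose
// dial result is guarded by a sync.Once.
type conn struct {
	initOnce sync.Once
	dialErr  error
}

// connect runs dial at most once per conn. Every caller, including
// callers that were blocked on initOnce, gets the first dial's result.
func (c *conn) connect(dial func() error) error {
	c.initOnce.Do(func() { c.dialErr = dial() })
	return c.dialErr
}

// simulateRace replays the interleaving from the list above.
func simulateRace() (err1, err2 error, secondDialRan bool) {
	c := &conn{}
	started := make(chan struct{}) // closed once attempt 1 is in flight
	peerUp := make(chan struct{})  // closed when node 1 comes back
	done1 := make(chan error, 1)

	// Dial attempt 1: starts while node 1 is down and is doomed to fail.
	go func() {
		done1 <- c.connect(func() error {
			close(started)
			<-peerUp // still failing while the peer restarts
			return errors.New("connection refused")
		})
	}()

	<-started
	close(peerUp) // node 1 comes back and opens its ports

	// Dial attempt 2: the peer is up, so this dial would succeed,
	// but the function is never run because initOnce is already taken.
	err2 = c.connect(func() error {
		secondDialRan = true
		return nil
	})
	err1 = <-done1
	return err1, err2, secondDialRan
}

func main() {
	err1, err2, ran := simulateRace()
	fmt.Println("attempt 1:", err1)
	fmt.Println("attempt 2:", err2)
	fmt.Println("attempt 2 dialed:", ran)
}
```

In this sketch attempt 2 never gets to dial: it returns attempt 1's stale error regardless of whether it reached `initOnce` before or after attempt 1 finished.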

This is quite unfortunate, particularly when dial attempt 2 is causally related to node 1 coming back, for example when node 1 has sent a SetupFlow RPC that asks node 2 to connect back to it. If node 2 fails to connect back, the corresponding SQL query waits in vain for a long time.

A suggested fix is to have node 2 track network availability signals (incoming connections, or successful heartbeats) and make sure that no error from a dial attempt from before a signal is propagated to any dial attempt from after the signal.
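One possible shape for this, as a rough sketch only: keep a generation counter that is bumped on every availability signal, tag each dial attempt with the generation at which it started, and let a caller share an attempt only if that attempt is not older than the latest signal. All names here (`peerDialer`, `attempt`, `signalAvailable`, `connect`, `simulateWithSignal`) are hypothetical and do not correspond to existing code; retries within a generation and connection teardown are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// attempt is one dial, tagged with the generation in which it started.
type attempt struct {
	gen  uint64
	done chan struct{} // closed when the dial finishes
	err  error         // valid only after done is closed
}

// succeeded reports whether the attempt has finished without error.
func (a *attempt) succeeded() bool {
	select {
	case <-a.done:
		return a.err == nil
	default:
		return false
	}
}

// peerDialer is a hypothetical per-peer dialer.
type peerDialer struct {
	mu  sync.Mutex
	gen uint64   // bumped on each availability signal
	cur *attempt // most recent attempt, possibly still in flight
}

// signalAvailable records an incoming connection or successful heartbeat.
func (d *peerDialer) signalAvailable() {
	d.mu.Lock()
	d.gen++
	d.mu.Unlock()
}

// connect shares the current attempt unless it predates the latest
// signal and has not already succeeded; in that case it dials afresh.
func (d *peerDialer) connect(dial func() error) error {
	d.mu.Lock()
	a := d.cur
	if a == nil || (a.gen < d.gen && !a.succeeded()) {
		a = &attempt{gen: d.gen, done: make(chan struct{})}
		d.cur = a
		d.mu.Unlock()
		a.err = dial()
		close(a.done)
		return a.err
	}
	d.mu.Unlock()
	<-a.done // convoy only behind attempts from the current generation
	return a.err
}

// simulateWithSignal replays the race with a signal before attempt 2.
func simulateWithSignal() (err1, err2 error) {
	d := &peerDialer{}
	started := make(chan struct{})
	release := make(chan struct{})
	done1 := make(chan error, 1)

	go func() { // dial attempt 1, started while node 1 is down
		done1 <- d.connect(func() error {
			close(started)
			<-release
			return errors.New("connection refused")
		})
	}()

	<-started
	d.signalAvailable()                             // node 1 connects to node 2
	err2 = d.connect(func() error { return nil })   // dial attempt 2
	close(release)
	err1 = <-done1
	return err1, err2
}

func main() {
	err1, err2 := simulateWithSignal()
	fmt.Println("attempt 1:", err1)
	fmt.Println("attempt 2:", err2)
}
```

Here attempt 2 does not wait for attempt 1 at all, because attempt 1 started before the signal. The trade-off is an extra dial after each signal whenever the previous attempt has not yet succeeded.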

cc @ajwerner @bdarnell

Epic: CRDB-8500

Jira issue: CRDB-5255


Labels

  • A-cc-enablement: Pertains to current CC production issues or short-term projects
  • A-kv-server: Relating to the KV-level RPC server
  • A-server-networking: Pertains to network addressing, routing, initialization
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • T-server-and-security: DB Server & Security
  • X-server-triaged-202105
