rpc: grpc-gateway loopback conn mistakenly uses onlyOnceDialer and causes sticky permanent RPC errors #103762
Description
Describe the problem
The grpc-gateway connection (used by our HTTP interfaces) uses a loopback connector. In v23.1, this connector is mistakenly configured to use "onlyOnceDialer", a mechanism that prevents a connection from being re-attempted once it has failed.
As a result, when a cluster is overloaded, the loopback connection may fail once (due to a timeout), after which every subsequent attempt also fails, making most of the HTTP interfaces unusable.
xref #103692 (comment)
xref #99261 (comment)
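To illustrate the failure mode described above, here is a minimal, self-contained sketch in Go. It is not the actual CockroachDB code: `onceDialer`, `retryingDialer`, `dialFunc`, `flakyDial`, and `errRedialNotPermitted` are hypothetical names invented for this example, and the "connection" is just a string. It only shows why a dialer that permits a single attempt turns one transient failure into a permanent one, while a plain dialer recovers.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// dialFunc is a hypothetical stand-in for whatever opens the loopback
// connection. It returns a string instead of a real net.Conn for brevity.
type dialFunc func() (string, error)

// errRedialNotPermitted is a hypothetical sentinel error for this sketch.
var errRedialNotPermitted = errors.New("redial not permitted")

// onceDialer permits exactly one dial attempt. Every later call is
// rejected, whether or not the first attempt succeeded.
type onceDialer struct {
	mu     sync.Mutex
	dialed bool
	dial   dialFunc
}

func (d *onceDialer) Dial() (string, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.dialed {
		return "", errRedialNotPermitted
	}
	d.dialed = true
	return d.dial()
}

// retryingDialer forwards every call to the underlying dial function,
// so a transient failure does not poison later attempts.
type retryingDialer struct {
	dial dialFunc
}

func (d *retryingDialer) Dial() (string, error) {
	return d.dial()
}

// flakyDial simulates an overloaded node: the first attempt times out
// and later attempts succeed. It is not safe for concurrent use.
func flakyDial() dialFunc {
	calls := 0
	return func() (string, error) {
		calls++
		if calls == 1 {
			return "", errors.New("simulated dial timeout")
		}
		return "loopback-conn", nil
	}
}

func main() {
	once := &onceDialer{dial: flakyDial()}
	retry := &retryingDialer{dial: flakyDial()}
	for i := 1; i <= 3; i++ {
		_, errOnce := once.Dial()
		_, errRetry := retry.Dial()
		fmt.Printf("attempt %d: once-only err=%v, retrying err=%v\n", i, errOnce, errRetry)
	}
}
```

In this sketch the once-only dialer keeps returning an error on every attempt after the first timeout, while the retrying dialer succeeds from the second attempt onward, which mirrors the expected behavior requested in this issue.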
To Reproduce
Overload a v23.1 cluster and use the HTTP interface until a request fails once.
The failure then persists until the node is restarted.
Expected behavior
The loopback connection should be retried if it fails (i.e. it should not use onlyOnceDialer).
Jira issue: CRDB-28178
Epic: CRDB-28893