rpc: grpc-gateway loopback conn mistakenly uses onlyOnceDialer and causes sticky permanent RPC errors #103762

@knz

Describe the problem

The grpc-gateway connection (used by our HTTP interfaces) uses a loopback connector. In v23.1, this connector is mistakenly configured to use "onlyOnceDialer", a mechanism that prevents a connection from being re-attempted once it has failed.

As a result, when a cluster is overloaded, the loopback connection may fail once (due to a timeout) and then keep failing on every later attempt, leaving most of the HTTP interfaces unusable.

xref #103692 (comment)
xref #99261 (comment)

To Reproduce

Overload a v23.1 cluster and use the HTTP connection until it fails once.
The failure then persists until the node is restarted.

Expected behavior

The loopback connection should be retried if it fails (i.e. it should not use onlyOnceDialer).

Jira issue: CRDB-28178
Epic: CRDB-28893

Labels

A-observability-inf
A-server-networking: Pertains to network addressing, routing, initialization
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
backport-23.1.x: PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1
regression: Regression from a release.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
v23.1.2
