distsql: make physical planning during upgrade bullet-proof

While working on #87154, we realized that the way DistSQL version and draining information is currently propagated through the cluster (via gossip) can lead to unexpected errors during query execution when nodes go through a major upgrade.

Quoting Tobi for relevant concerns:
> I think it's okay that nodes will have outdated information about the version of peers for a short time after they hard-cycle, but is this error behavior "sticky" until the gossip update arrives? In other words, why does this error message reach the client? Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use? I know this is all sort of tricky and since it "only" happens around node upgrades and unclear restarts it could be considered problematic, but there might be an issue to file still.

We should improve things here. In particular, I think we should examine the errors received from remote nodes, and if an error is a "version mismatch", cache the fact that that particular node is DistSQL-incompatible so that physical planning won't consider it the next time the query is executed.

Jira issue: CRDB-19210

distsql: make physical planning during upgrade bullet-proof #87199

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

distsql: make physical planning during upgrade bullet-proof #87199

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions