distsql: make physical planning during upgrade bullet-proof #87199

@yuzefovich

While working on #87154, we realized that the way DistSQL version and draining information is currently propagated through the cluster (via gossip) during major-version upgrades of nodes can lead to unexpected errors during query execution.

Quoting Tobi for relevant concerns:

I think it's okay that nodes will have outdated information about the version of peers for a short time after they hard-cycle, but is this error behavior "sticky" until the gossip update arrives? In other words, why does this error message reach the client? Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use? I know this is all sort of tricky and since it "only" happens around node upgrades and unclean restarts it could be considered unproblematic, but there might be an issue to file still.

We should improve things here. In particular, I think we should examine the errors received from remote nodes and, if an error is a "version mismatch", cache the fact that the node in question is DistSQL-incompatible so that it is not considered during physical planning the next time the query is executed.

Jira issue: CRDB-19210

Metadata

Assignees

No one assigned

    Labels

    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    T-sql-queries: SQL Queries Team

    Projects

    Status

    Backlog
