-
Notifications
You must be signed in to change notification settings - Fork 4.1k
distsql: consider not blocking on SetupFlow RPCs before starting the flow on the gateway #87669
Description
Currently, during the physical planning of a distributed query we will perform SetupFlow RPCs in parallel (subject to the available distsql runners) while setting up the flow on the gateway, but we won't start the execution of that gateway flow until all RPCs come back. In multi-region clusters that wait will be on the order of 100ms. If the gateway flow is such that all remote flows depend on data coming from the gateway, this blocking behavior introduces the "stall" in the execution pipeline across the whole plan which can significantly increase the latency of the query. Furthermore, this latency increase is non-deterministic (depends on the placement of ranges for the TableReaders) making it harder to investigate the performance.
This blocking is performed in order to make sure that remote nodes can execute their part of the plan. I believe originally it was envisioned that if the SetupFlow RPC fails for any reason, then the gateway would take over that part of the plan - but this was never implemented. Additionally, it seems far more likely that those RPCs are successful (we only consider nodes that are "healthy" and distsql-version-compatible for physical planning), so this blocking doesn't seem to be that useful. Also, we will soon remove the queueing of the remote flows due to max_running_flows limit which will increase the likelihood of the success of RPCs even further.
This all suggests that it would beneficial to remove this blocking and, instead, opportunistically proceed with the execution on the gateway while spinning up an async task that would make sure to cancel the execution if an RPC comes back with an error.
Jira issue: CRDB-19465
Metadata
Metadata
Assignees
Labels
Type
Projects
Status