sql: implement locality-aware DistSQL planning in multi-tenant setup #84781
craig[bot] merged 3 commits into cockroachdb:master
Conversation
Here is the link to the Google doc that Rachael prepared before going on parental leave, which might be helpful for reviewers to get some more background.
Force-pushed 5c458f7 to 681158d.
msirek
left a comment
The code looks good. I just had a couple questions/comments for possible improvement.
Reviewed 2 of 3 files at r1, 3 of 3 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
// Pick a random instance in this region in order to spread the
// load.
return instancesInRegion[dsp.rng.Intn(len(instancesInRegion))]
If doing local execution, would it be possible to use an instance in the same availability zone as the gateway instance? I'm just wondering if that would help latency vs. using a random instance (or maybe still pick the instance randomly, but with a higher probability of using an instance in the same AZ).
pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):
dsp.rng.Shuffle(len(instances), func(i, j int) {
	instances[i], instances[j] = instances[j], instances[i]
})
Would it be worth checking if we have a soft limit, and favoring an order which has the local region appearing first in the list, followed by other close regions?
yuzefovich
left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, msirek (Mark Sirek) wrote…
If doing local execution, would it be possible to use an instance in the same availability zone as the gateway instance? I'm just wondering if that would help latency vs. using a random instance (or maybe still pick the instance randomly, but with a higher probability of using an instance in the same AZ).
By "doing local execution" do you mean that the plan is not distributed? If that's the case, we don't use PartitionSpans at all and assign all spans to the gateway.
Or do you mean that - when the plan is distributed - for the local region we'd use a non-uniform probability distribution where the gateway would be favored? That's an interesting idea, I left a TODO about this. I think it's not as clear-cut a decision as it might seem at first glance because PartitionSpans is used by use cases other than query execution (e.g. CDC, BulkIO), and in those cases the cost of an extra hop to another instance in the local region seems negligible in comparison to the work each instance will have to do, so it seems beneficial to spread the spans as evenly as possible.
pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):
Previously, msirek (Mark Sirek) wrote…
Would it be worth checking if we have a soft limit, and favoring an order which has the local region appearing first in the list, followed by other close regions?
It looks like this comment refers to the case when len(regionToSQLInstanceIDs) == 0 (meaning that we could not determine the region information for any SQL instance), so it doesn't seem applicable since we can't really distinguish instances being in the local or a remote region.
cucaroach
left a comment
Reviewed 2 of 3 files at r1, 4 of 4 files at r7, 1 of 1 files at r8, 4 of 4 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @msirek and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1017 at r9 (raw file):
// CheckInstanceHealthAndVersion returns a information about a node's health and
// compatibility. The info is also recorded in planCtx.Nodes.
nit: planCtx.NodeStatuses
nit: "returns information" don't need the a
yuzefovich
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
By "doing local execution" do you mean that the plan is not distributed? If that's the case, we don't use PartitionSpans at all and assign all spans to the gateway.

Or do you mean that - when the plan is distributed - for the local region we'd use a non-uniform probability distribution where the gateway would be favored? That's an interesting idea, I left a TODO about this. I think it's not as clear-cut a decision as it might seem at first glance because PartitionSpans is used by use cases other than query execution (e.g. CDC, BulkIO), and in those cases the cost of an extra hop to another instance in the local region seems negligible in comparison to the work each instance will have to do, so it seems beneficial to spread the spans as evenly as possible.
@msirek curious about your thoughts here to make sure I understood your comment.
msirek
left a comment
Reviewed 3 of 4 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @msirek, and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
@msirek curious about your thoughts here to make sure I understood your comment.
I misunderstood the point of this code in my initial comment, thanks for the clarification. I guess we don't currently track availability zone information. But if we maintained a table of the average latencies between the different availability zones of the cluster nodes, we could try to favor selection of a node here (still randomized though in case all SQLs are executed from a single region) which has a lower latency when communicating with the gateway region (or gateway AZ, if we start tracking this). I'm not sure how much this would matter. It's maybe just something to make a note of to investigate in the future. If there are lots of messages between distributed nodes and the gateway, maybe the latencies would start to add up and this could have a measurable benefit.
pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
It looks like this comment refers to the case when
len(regionToSQLInstanceIDs) == 0(meaning that we could not determine the region information for any SQL instance), so it doesn't seem applicable since we can't really distinguish instances being in the local of a remote region.
Right, thanks. I'm curious under what situations we can't find the region information of instances.
This commit cleans up PartitionSpans implementations a bit:
- it pulls out some of the code shared between partitionSpansSystem and partitionSpansTenant into PartitionSpans to avoid the duplication
- it refactors the way we handle spans with no end keys (such spans are converted into GetRequests later on). This required a minor fix to the fake span resolver used in the fakedist config
- it converts a panic about "no spans" into an assertion error.

Release note: None
This commit moves a few functions around in order to get better grouping of related functions. This is a pure mechanical change. Release note: None
yuzefovich
left a comment
TFTRs!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @msirek, and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, msirek (Mark Sirek) wrote…
I misunderstood the point of this code in my initial comment, thanks for the clarification. I guess we don't currently track availability zone information. But if we maintained a table of the average latencies between the different availability zones of the cluster nodes, we could try to favor selection of a node here (still randomized though in case all SQLs are executed from a single region) which has a lower latency when communicating with the gateway region (or gateway AZ, if we start tracking this). I'm not sure how much this would matter. It's maybe just something to make a note of to investigate in the future. If there are lots of messages between distributed nodes and the gateway, maybe the latencies would start to add up and this could have a measurable benefit.
Makes sense, expanded the TODO.
Build failed (retrying...):
Seems like some of the tests can be flaky. bors r-
Canceled. |
This commit implements the region-aware DistSQL physical planning in the multi-tenant setup. Previously, we would use a round-robin strategy of assigning single-range spans to all available SQL instances; now we make a much better decision.

The algorithm is as follows:
1. find all available SQL pods and determine their locality (via the sqlinstance.Provider)
2. iterate over all spans and break them down into single-range spans using the range cache that the gateway SQL instance maintains
3. for each single-range span, find the KV leaseholder and determine its locality (via the kvtenantccl.Connector), then pick a random SQL instance in the same locality
4. if there is no SQL instance in that locality, then assign the single-range span to the gateway SQL instance.

At the moment, the only locality "tier" that we look at is the "region", but in the future we can easily extend it (say, to also include the "availability zone"). The choice of randomly picking the SQL instance in the same locality in step 3 is done so that we distribute the load evenly among multiple pods.

Step 2 was already being performed for the system tenant, so this commit extracts out some helpers to reuse the code as much as possible. Additionally, it fixes an omission in the physical planning in multi-tenancy when the query has a LIMIT (getInstanceIDForScan).

If in step 1 we couldn't determine the region of any available SQL pod, then we fall back to the old naive locality-ignorant strategy of assigning pods in round-robin fashion.

Release note: None (I don't think that any of these changes are user-visible since serverless doesn't yet support multi-region clusters.)
I think I de-flaked the test but will wait for green CI this time.
Looks good. bors r+
Build succeeded: |
sql: clean up PartitionSpans a bit
This commit cleans up PartitionSpans implementations a bit:
- it pulls out some of the code shared between partitionSpansSystem and partitionSpansTenant into PartitionSpans to avoid the duplication
- it refactors the way we handle spans with no end keys (such spans are converted into GetRequests later on). This required a minor fix to the fake span resolver used in the fakedist config
- it converts a panic about "no spans" into an assertion error.

Release note: None
sql: move some functions around
This commit moves a few functions around in order to get better grouping
of related functions. This is a pure mechanical change.
Release note: None
sql: implement locality-aware DistSQL planning in multi-tenant setup
This commit implements the region-aware DistSQL physical planning in
the multi-tenant setup. Previously, we would use a round-robin strategy
of assigning single-range spans to all available SQL instances and now
we'll make a much better decision.
The algorithm is as follows:
1. find all available SQL pods and determine their locality (via the sqlinstance.Provider)
2. iterate over all spans and break them down into single-range spans using the range cache that the gateway SQL instance maintains
3. for each single-range span, find the KV leaseholder and determine its locality (via the kvtenantccl.Connector), then pick a random SQL instance in the same locality
4. if there is no SQL instance in that locality, then assign the single-range span to the gateway SQL instance.
At the moment, the only locality "tier" that we look at is the "region",
but in the future we can easily extend it (say, to also include the
"availability zone").
The choice of randomly picking the SQL instance in the same locality in step 3 is done so that we distribute the load evenly among multiple pods.
Step 2 was already being performed for the system tenant, so this commit extracts out some helpers to reuse the code as much as possible.
Additionally, it fixes an omission in the physical planning in multi-tenancy when the query has a LIMIT (getInstanceIDForScan).

If in step 1 we couldn't determine the region of any available SQL pod, then we fall back to the old naive locality-ignorant strategy of assigning pods in round-robin fashion.
Fixes: #80678.
Release note: None (I don't think that any of these changes are user-visible since serverless doesn't yet support multi-region clusters.)
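The round-robin fallback mentioned above (used when no region information is available for any SQL pod) could be sketched like this; the helper name and shape are illustrative, not the PR's actual implementation.

```go
package main

import "fmt"

// assignRoundRobin hands out single-range spans to the available SQL
// instances in round-robin fashion, ignoring locality entirely - the
// pre-existing naive strategy that remains as a fallback.
func assignRoundRobin(numSpans int, instanceIDs []int) []int {
	assignments := make([]int, numSpans)
	for i := range assignments {
		assignments[i] = instanceIDs[i%len(instanceIDs)]
	}
	return assignments
}

func main() {
	// Five single-range spans spread across three SQL instances.
	fmt.Println(assignRoundRobin(5, []int{1, 2, 3})) // [1 2 3 1 2]
}
```

Round-robin keeps the load even but can assign a span to an instance far from the span's leaseholder, which is exactly what the locality-aware algorithm avoids.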