sql: implement locality-aware DistSQL planning in multi-tenant setup #84781
craig[bot] merged 3 commits into cockroachdb:master
Conversation
Here is the link to the Google doc that Rachael prepared before going on parental leave, which might be helpful for reviewers to get some more background.
Force-pushed 5c458f7 to 681158d.
msirek
left a comment
The code looks good. I just had a couple questions/comments for possible improvement.
Reviewed 2 of 3 files at r1, 3 of 3 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
// Pick a random instance in this region in order to spread the
// load.
return instancesInRegion[dsp.rng.Intn(len(instancesInRegion))]
If doing local execution, would it be possible to use an instance in the same availability zone as the gateway instance? I'm just wondering if that would help latency vs. using a random instance (or maybe still pick the instance randomly, but with a higher probability of using an instance in the same AZ).
pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):
dsp.rng.Shuffle(len(instances), func(i, j int) {
	instances[i], instances[j] = instances[j], instances[i]
})
Would it be worth checking if we have a soft limit, and favoring an order which has the local region appearing first in the list, followed by other close regions?
yuzefovich
left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, msirek (Mark Sirek) wrote…
If doing local execution, would it be possible to use an instance in the same availability zone as the gateway instance? I'm just wondering if that would help latency vs. using a random instance (or maybe still pick the instance randomly, but with a higher probability of using an instance in the same AZ).
By "doing local execution" do you mean that the plan is not distributed? If that's the case, we don't use PartitionSpans at all and assign all spans to the gateway.
Or do you mean that - when the plan is distributed - for the local region we'd use a non-uniform probability distribution where the gateway would be favored? That's an interesting idea, I left a TODO about this. I think it's not as clear-cut a decision as it might seem at first glance because PartitionSpans is used by use cases other than query execution (e.g. CDC, BulkIO), and in those cases the cost of an extra hop to another instance in the local region seems negligible in comparison to the work each instance will have to do, so it seems beneficial to spread the spans as evenly as possible.
pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):
Previously, msirek (Mark Sirek) wrote…
Would it be worth checking if we have a soft limit, and favoring an order which has the local region appearing first in the list, followed by other close regions?
It looks like this comment refers to the case when len(regionToSQLInstanceIDs) == 0 (meaning that we could not determine the region information for any SQL instance), so it doesn't seem applicable since we can't really distinguish instances being in the local or a remote region.
cucaroach
left a comment
Reviewed 2 of 3 files at r1, 4 of 4 files at r7, 1 of 1 files at r8, 4 of 4 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @msirek and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1017 at r9 (raw file):
// CheckInstanceHealthAndVersion returns a information about a node's health and
// compatibility. The info is also recorded in planCtx.Nodes.
nit: planCtx.NodeStatuses
nit: "returns information" don't need the a
yuzefovich
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
By "doing local execution" do you mean that the plan is not distributed? If that's the case, we don't use PartitionSpans at all and assign all spans to the gateway.

Or do you mean that - when the plan is distributed - for the local region we'd use a non-uniform probability distribution where the gateway would be favored? That's an interesting idea, I left a TODO about this. I think it's not as clear-cut a decision as it might seem at first glance because PartitionSpans is used by use cases other than query execution (e.g. CDC, BulkIO), and in those cases the cost of an extra hop to another instance in the local region seems negligible in comparison to the work each instance will have to do, so it seems beneficial to spread the spans as evenly as possible.
@msirek curious about your thoughts here to make sure I understood your comment.
msirek
left a comment
Reviewed 3 of 4 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @msirek, and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
@msirek curious about your thoughts here to make sure I understood your comment.
I misunderstood the point of this code in my initial comment, thanks for the clarification. I guess we don't currently track availability zone information. But if we maintained a table of the average latencies between the different availability zones of the cluster nodes, we could try to favor selection of a node here (still randomized though in case all SQLs are executed from a single region) which has a lower latency when communicating with the gateway region (or gateway AZ, if we start tracking this). I'm not sure how much this would matter. It's maybe just something to make a note of to investigate in the future. If there are lots of messages between distributed nodes and the gateway, maybe the latencies would start to add up and this could have a measurable benefit.
pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
It looks like this comment refers to the case when
len(regionToSQLInstanceIDs) == 0(meaning that we could not determine the region information for any SQL instance), so it doesn't seem applicable since we can't really distinguish instances being in the local of a remote region.
Right, thanks. I'm curious under what situations we can't find the region information of instances.
This commit cleans up PartitionSpans implementations a bit:
- it pulls out some of the code shared between partitionSpansSystem and partitionSpansTenant into PartitionSpans to avoid the duplication
- it refactors the way we handle spans with no end keys (such spans are converted into GetRequests later on). This required a minor fix to the fake span resolver used in the fakedist config
- it converts a panic about "no spans" into an assertion error.

Release note: None
This commit moves a few functions around in order to get better grouping of related functions. This is a pure mechanical change. Release note: None
yuzefovich
left a comment
TFTRs!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @msirek, and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):
Previously, msirek (Mark Sirek) wrote…
I misunderstood the point of this code in my initial comment, thanks for the clarification. I guess we don't currently track availability zone information. But if we maintained a table of the average latencies between the different availability zones of the cluster nodes, we could try to favor selection of a node here (still randomized though in case all SQLs are executed from a single region) which has a lower latency when communicating with the gateway region (or gateway AZ, if we start tracking this). I'm not sure how much this would matter. It's maybe just something to make a note of to investigate in the future. If there are lots of messages between distributed nodes and the gateway, maybe the latencies would start to add up and this could have a measurable benefit.
Makes sense, expanded the TODO.
Build failed (retrying...):
Seems like some of the tests can be flaky. bors r-
Canceled. |
This commit implements the region-aware DistSQL physical planning in the multi-tenant setup. Previously, we would use a round-robin strategy of assigning single-range spans to all available SQL instances; now we make a much better decision.

The algorithm is as follows:
1. find all available SQL pods and determine their locality (via the sqlinstance.Provider)
2. iterate over all spans and break them down into single-range spans using the range cache that the gateway SQL instance maintains
3. for each single-range span, find the KV leaseholder and determine its locality (via the kvtenantccl.Connector), then pick a random SQL instance in the same locality
4. if there is no SQL instance in that locality, then assign the single-range span to the gateway SQL instance.

At the moment, the only locality "tier" that we look at is the "region", but in the future we can easily extend it (say, to also include the "availability zone"). The choice of randomly picking the SQL instance in the same locality in step 3 is done so that we distribute the load evenly among multiple pods.

Step 2 was already being performed for the system tenant, so this commit extracts out some helpers to reuse the code as much as possible. Additionally, it fixes an omission in the physical planning in multi-tenancy when the query has a LIMIT (getInstanceIDForScan).

If in step 1 we couldn't determine the region of any available SQL pod, then we fall back to the old naive locality-ignorant strategy of assigning pods in round-robin fashion.

Release note: None (I don't think that any of these changes are user-visible since serverless doesn't yet support multi-region clusters.)
I think I de-flaked the test but will wait for green CI this time.
Looks good. bors r+
Build succeeded: |
sql: clean up PartitionSpans a bit
This commit cleans up PartitionSpans implementations a bit:
- it pulls out some of the code shared between partitionSpansSystem and partitionSpansTenant into PartitionSpans to avoid the duplication
- it refactors the way we handle spans with no end keys (such spans are converted into GetRequests later on). This required a minor fix to the fake span resolver used in the fakedist config
- it converts a panic about "no spans" into an assertion error.

Release note: None
sql: move some functions around
This commit moves a few functions around in order to get better grouping
of related functions. This is a pure mechanical change.
Release note: None
sql: implement locality-aware DistSQL planning in multi-tenant setup
This commit implements the region-aware DistSQL physical planning in
the multi-tenant setup. Previously, we would use a round-robin strategy
of assigning single-range spans to all available SQL instances and now
we'll make a much better decision.
The algorithm is as follows:
1. find all available SQL pods and determine their locality (via the sqlinstance.Provider)
2. iterate over all spans and break them down into single-range spans using the range cache that the gateway SQL instance maintains
3. for each single-range span, find the KV leaseholder and determine its locality (via the kvtenantccl.Connector), then pick a random SQL instance in the same locality
4. if there is no SQL instance in that locality, then assign the single-range span to the gateway SQL instance.
At the moment, the only locality "tier" that we look at is the "region",
but in the future we can easily extend it (say, to also include the
"availability zone").
The choice of randomly picking the SQL instance in the same locality in step 3 is done so that we distribute the load evenly among multiple pods.
Step 2 was already being performed for the system tenant, so this commit extracts out some helpers to reuse the code as much as possible.
Additionally, it fixes an omission in the physical planning in multi-tenancy when the query has a LIMIT (getInstanceIDForScan).

If in step 1 we couldn't determine the region of any available SQL pod, then we fall back to the old naive locality-ignorant strategy of assigning pods in round-robin fashion.
Fixes: #80678.
Release note: None (I don't think that any of these changes are user-visible since serverless doesn't yet support multi-region clusters.)
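The round-robin fallback mentioned above (used when no region information is available for any SQL pod) could be sketched like this; the helper name and shape are illustrative, not the PR's actual implementation.

```go
package main

import "fmt"

// assignRoundRobin hands out single-range spans to the available SQL
// instances in round-robin fashion, ignoring locality entirely - the
// pre-existing naive strategy that remains as a fallback.
func assignRoundRobin(numSpans int, instanceIDs []int) []int {
	assignments := make([]int, numSpans)
	for i := range assignments {
		assignments[i] = instanceIDs[i%len(instanceIDs)]
	}
	return assignments
}

func main() {
	// Five single-range spans spread across three SQL instances.
	fmt.Println(assignRoundRobin(5, []int{1, 2, 3})) // [1 2 3 1 2]
}
```

Round-robin keeps the load even but can assign a span to an instance far from the span's leaseholder, which is exactly what the locality-aware algorithm avoids.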