
sql: implement locality-aware DistSQL planning in multi-tenant setup#84781

Merged
craig[bot] merged 3 commits into cockroachdb:master from yuzefovich:mt
Aug 3, 2022

Conversation

@yuzefovich
Member

@yuzefovich yuzefovich commented Jul 21, 2022

sql: clean up PartitionSpans a bit

This commit cleans up PartitionSpans implementations a bit:

  • it pulls out some of the code shared between partitionSpansSystem
    and partitionSpansTenant into PartitionSpans to avoid the
    duplication
  • it refactors the way we handle spans with no end keys (such spans are
    converted into GetRequests later on). This required a minor fix to the
    fake span resolver used in fakedist config
  • it converts a panic about "no spans" into an assertion error.

Release note: None

sql: move some functions around

This commit moves a few functions around in order to get better grouping
of related functions. This is a pure mechanical change.

Release note: None

sql: implement locality-aware DistSQL planning in multi-tenant setup

This commit implements region-aware DistSQL physical planning in
the multi-tenant setup. Previously, we would assign single-range spans
to all available SQL instances in a round-robin fashion; now we make
a locality-aware choice.

The algorithm is as follows:

  1. find all available SQL pods and determine their locality (via the
    sqlinstance.Provider)
  2. iterate over all spans and break them down into single-range spans
    using the range cache that the gateway SQL instance maintains
  3. for each single-range span, find the KV leaseholder and determine its
    locality (via the kvtenantccl.Connector), then pick a random SQL
    instance in the same locality
  4. if there is no SQL instance in that locality, then assign the
    single-range span to the gateway SQL instance.

At the moment, the only locality "tier" that we look at is the "region",
but in the future we can easily extend it (say, to also include the
"availability zone").

The choice of randomly picking a SQL instance in the same locality in
step 3 is made so that we distribute the load evenly among multiple
pods.

Step 2 was already performed for the system tenant, so this commit
extracts some helpers to reuse the code as much as possible.
Additionally, it fixes an omission in the physical planning in
multi-tenancy when the query has a LIMIT (getInstanceIDForScan).

If in step 1 we couldn't determine the region of any available SQL pod,
then we fall back to the old naive locality-ignorant strategy of
assigning pods in round-robin fashion.

Fixes: #80678.

Release note: None (I don't think that any of these changes are
user-visible since Serverless doesn't yet support multi-region
clusters.)

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich marked this pull request as ready for review July 21, 2022 03:09
@yuzefovich yuzefovich requested a review from a team as a code owner July 21, 2022 03:09
@yuzefovich yuzefovich requested review from cucaroach and msirek July 21, 2022 03:09
@yuzefovich
Member Author

Here is the link to the Google doc that Rachael prepared before going on parental leave, which might give reviewers some more background.

@yuzefovich yuzefovich force-pushed the mt branch 2 times, most recently from 5c458f7 to 681158d July 21, 2022 20:51
Contributor

@msirek msirek left a comment


The code looks good. I just had a couple questions/comments for possible improvement.
:lgtm:

Reviewed 2 of 3 files at r1, 3 of 3 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):

			// Pick a random instance in this region in order to spread the
			// load.
			return instancesInRegion[dsp.rng.Intn(len(instancesInRegion))]

If doing local execution, would it be possible to use an instance in the same availability zone as the gateway instance? I'm just wondering if that would help latency vs. using a random instance (or maybe still pick the instance randomly, but with a higher probability of using an instance in the same AZ).


pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):

		dsp.rng.Shuffle(len(instances), func(i, j int) {
			instances[i], instances[j] = instances[j], instances[i]
		})

Would it be worth checking if we have a soft limit, and favoring an order which has the local region appearing first in the list, followed by other close regions?

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)


pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):

Previously, msirek (Mark Sirek) wrote…

If doing local execution, would it be possible to use an instance in the same availability zone as the gateway instance? I'm just wondering if that would help latency vs. using a random instance (or maybe still pick the instance randomly, but with a higher probability of using an instance in the same AZ).

By "doing local execution" do you mean that the plan is not distributed? If that's the case, we don't use PartitionSpans at all and assign all spans to the gateway.

Or do you mean that - when the plan is distributed - for the local region we'd use a non-uniform probability distribution where the gateway would be favored? That's an interesting idea, and I left a TODO about this. I think it's not as clear-cut a decision as it might seem at first glance because PartitionSpans is used in use cases other than query execution (e.g. CDC, BulkIO), and in those cases the cost of an extra hop to another instance in the local region seems negligible in comparison to the work each instance will have to do, so it seems beneficial to spread the spans as evenly as possible.


pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):

Previously, msirek (Mark Sirek) wrote…

Would it be worth checking if we have a soft limit, and favoring an order which has the local region appearing first in the list, followed by other close regions?

It looks like this comment refers to the case when len(regionToSQLInstanceIDs) == 0 (meaning that we could not determine the region information for any SQL instance), so it doesn't seem applicable since we can't really distinguish instances being in the local or a remote region.

Contributor

@cucaroach cucaroach left a comment


Reviewed 2 of 3 files at r1, 4 of 4 files at r7, 1 of 1 files at r8, 4 of 4 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @msirek and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1017 at r9 (raw file):

// CheckInstanceHealthAndVersion returns a information about a node's health and
// compatibility. The info is also recorded in planCtx.Nodes.

nit: planCtx.NodeStatuses
nit: "returns information" don't need the a

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)


pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

By "doing local execution" do you mean that the plan is not distributed? If that's the case, we don't use PartitionSpans at all and assign all spans to the gateway.

Or do you mean that - when the plan is distributed - for the local region we'd use a non-uniform probability distribution where the gateway would be favored? That's an interesting idea, and I left a TODO about this. I think it's not as clear-cut a decision as it might seem at first glance because PartitionSpans is used in use cases other than query execution (e.g. CDC, BulkIO), and in those cases the cost of an extra hop to another instance in the local region seems negligible in comparison to the work each instance will have to do, so it seems beneficial to spread the spans as evenly as possible.

@msirek curious about your thoughts here to make sure I understood your comment.

Contributor

@msirek msirek left a comment


Reviewed 3 of 4 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @msirek, and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

@msirek curious about your thoughts here to make sure I understood your comment.

I misunderstood the point of this code in my initial comment, thanks for the clarification. I guess we don't currently track availability zone information. But if we maintained a table of the average latencies between the different availability zones of the cluster nodes, we could try to favor selection of a node here (still randomized though in case all SQLs are executed from a single region) which has a lower latency when communicating with the gateway region (or gateway AZ, if we start tracking this). I'm not sure how much this would matter. It's maybe just something to make a note of to investigate in the future. If there are lots of messages between distributed nodes and the gateway, maybe the latencies would start to add up and this could have a measurable benefit.


pkg/sql/distsql_physical_planner.go line 1335 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

It looks like this comment refers to the case when len(regionToSQLInstanceIDs) == 0 (meaning that we could not determine the region information for any SQL instance), so it doesn't seem applicable since we can't really distinguish instances being in the local or a remote region.

Right, thanks. I'm curious under what situations we can't find the region information of instances.

Member Author

@yuzefovich yuzefovich left a comment


TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @msirek, and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1324 at r6 (raw file):

Previously, msirek (Mark Sirek) wrote…

I misunderstood the point of this code in my initial comment, thanks for the clarification. I guess we don't currently track availability zone information. But if we maintained a table of the average latencies between the different availability zones of the cluster nodes, we could try to favor selection of a node here (still randomized though in case all SQLs are executed from a single region) which has a lower latency when communicating with the gateway region (or gateway AZ, if we start tracking this). I'm not sure how much this would matter. It's maybe just something to make a note of to investigate in the future. If there are lots of messages between distributed nodes and the gateway, maybe the latencies would start to add up and this could have a measurable benefit.

Makes sense, expanded the TODO.

@craig
Contributor

craig bot commented Aug 2, 2022

Build failed (retrying...):

@yuzefovich
Member Author

Seems like some of the tests can be flaky.

bors r-

@craig
Contributor

craig bot commented Aug 2, 2022

Canceled.

@yuzefovich
Member Author

I think I de-flaked the test but will wait for green CI this time.

@yuzefovich
Member Author

Looks good.

bors r+

@craig
Contributor

craig bot commented Aug 3, 2022

Build succeeded:

@craig craig bot merged commit 5cd0309 into cockroachdb:master Aug 3, 2022
@yuzefovich yuzefovich deleted the mt branch August 3, 2022 00:38

Development

Successfully merging this pull request may close these issues.

distsql: add support for locality-awareness in distributed queries in multi-tenant environments

4 participants