kv: switching from ZONE to REGION survival causes unexpected data movement

When switching from ZONE to REGION survival in a 3-node cluster, only a single snapshot should be necessary per range. This is because we switch from a topology that looks like:
```
region 1: voter (leaseholder), voter, voter
region 2: non-voter
region 3: non-voter
```
to a topology that looks like:
```
region 1: voter (leaseholder), voter
region 2: voter, voter
region 3: voter
```
So if both non-voters are promoted to voters, only one snapshot should be necessary: the one that gives region 2 its second voter. Furthermore, we could do something smart about how we send that snapshot to avoid the WAN traffic (see https://github.com/cockroachdb/cockroach/issues/42491). But let's ignore that for now.
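To make the arithmetic concrete, here is a rough sketch that counts the minimum number of snapshots implied by the two topologies above. It assumes that a replica already present in a region (voter or non-voter) can be reused without a snapshot, and that every additional voter a region needs costs exactly one snapshot. The `min_snapshots` helper and the dictionary layout are made up for illustration and are not CockroachDB code.

```python
from typing import Dict, Tuple

# region -> (voters, non_voters), copied from the ZONE survival topology above.
before: Dict[str, Tuple[int, int]] = {
    "region 1": (3, 0),
    "region 2": (0, 1),
    "region 3": (0, 1),
}

# region -> voters wanted, copied from the REGION survival topology above.
after: Dict[str, int] = {"region 1": 2, "region 2": 2, "region 3": 1}


def min_snapshots(before: Dict[str, Tuple[int, int]], after: Dict[str, int]) -> int:
    """Count the replicas that must be created from scratch.

    Assumption: existing replicas in a region are reused (non-voters are
    promoted in place), so only the shortfall per region needs a snapshot.
    """
    total = 0
    for region, wanted in after.items():
        voters, non_voters = before.get(region, (0, 0))
        total += max(0, wanted - (voters + non_voters))
    return total


print(min_snapshots(before, after))  # prints 1
```

Only region 2 has a shortfall (it wants 2 voters but holds 1 replica), which is where the single expected snapshot comes from.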

In one of my tests, this is not what I saw. After switching from ZONE to REGION survival, each range took the following steps:
```
1. add new voter in region 2
2. add new voter in region 3
3. remove non-voter in region 2
4. remove non-voter in region 3
5. move voter from region 1 to region 2
```
This resulted in a total of 3 range snapshots, all sent over the WAN. This is a decent amount of wasted data movement, given that we had two perfectly good non-voting replicas that we could have promoted. Do we understand why we made these decisions?
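For comparison, the observed steps and the promotion-based plan can be tallied side by side. This is only a sketch: the operation names (`add_voter`, `promote_non_voter`, and so on) are hypothetical labels for the steps listed above, not identifiers from the allocator, and it assumes that adding or moving a voter to a new node costs one snapshot while promotions and removals cost none.

```python
from typing import List, Tuple

# Assumption: only operations that place a replica on a new node need a snapshot.
SNAPSHOT_OPS = {"add_voter", "move_voter"}

# The five steps observed for each range in the test.
observed: List[Tuple[str, str]] = [
    ("add_voter", "region 2"),
    ("add_voter", "region 3"),
    ("remove_non_voter", "region 2"),
    ("remove_non_voter", "region 3"),
    ("move_voter", "region 1 -> region 2"),
]

# The plan that promotes the existing non-voters instead.
promoted: List[Tuple[str, str]] = [
    ("promote_non_voter", "region 2"),
    ("promote_non_voter", "region 3"),
    ("move_voter", "region 1 -> region 2"),
]


def count_snapshots(steps: List[Tuple[str, str]]) -> int:
    """Count the steps that need a snapshot under the assumption above."""
    return sum(1 for op, _ in steps if op in SNAPSHOT_OPS)


print(count_snapshots(observed))  # prints 3
print(count_snapshots(promoted))  # prints 1
```

Under these assumptions, the observed sequence sends two more snapshots per range than the promotion-based plan.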

[r6455_manual_enqueue_logs.txt](https://github.com/cockroachdb/cockroach/files/6328735/r6455_manual_enqueue_logs.txt)
[r6455 Range _ Debug _ Cockroach Console Before.pdf](https://github.com/cockroachdb/cockroach/files/6328734/r6455.Range._.Debug._.Cockroach.Console.pdf)
[r6455 Range _ Debug _ Cockroach Console After.pdf](https://github.com/cockroachdb/cockroach/files/6328738/r6455.Range._.Debug._.Cockroach.Console.After.pdf)

Here's the log from a second instance that hurts even more, because it includes a non-voter that is deleted and then quickly replaced by a voter on the same node.

[r6456_manual_enqueue_logs.txt](https://github.com/cockroachdb/cockroach/files/6328751/r6456_manual_enqueue_logs.txt)

_Note: this is an inefficiency, but certainly nothing that we need to rush to fix for v21.1.0. Everything still worked, it was just not as optimal as I was hoping._

Jira issue: CRDB-6780

kv: switching from ZONE to REGION survival causes unexpected data movement #63810
