kvserver: unexpected replication change from 3 to 2 voters #64064
Describe the problem
When @aliher1911 and I looked into unhappy restore2TB runs in the context of #61396, we noticed unexpected replication changes. From a three-voter configuration, for some reason we're demoting a voter:
[n7,s7,r39158/4:x] change replicas (add ‹[]› remove ‹[(n5,s5):1VOTER_DEMOTING_LEARNER]›): existing descriptor r39158:‹/Table/54/1/52061454{-/0}› [(n5,s5):1, (n7,s7):4, (n10,s10):3, next=5, gen=744, sticky=1617706884.328040814,0]
At that point, in the full snippet, the range goes unavailable because one of the two remaining voters (s10) is waiting for a snapshot.
To Reproduce
Run the restore2TB/nodes=10 roachtest.
This should reproduce on any SHA preceding #64060, such as
d85d49d, when running restore2TB.
It may not always happen but we saw it frequently, at least in
"unhappy" runs (as characterized by large pending snapshot counts).
Expected behavior
With 10 live nodes and atomic replication changes, there should never be a
reason to move from a three-voter to a two-voter configuration. The only
explanation I have is that n5 might have been considered dead for 5 minutes,
which could possibly trigger this issue (?!), but this is essentially ruled
out by the full snippet, which indicates that n5 was live a minute after the
botched replication change (and it is thus unlikely to have been non-live for
the preceding minutes).
@aliher1911 if you have full logs from any of these experiments, mind going
through them to see if you have other examples of such replication changes,
and if so, posting the complete log directory (Google Drive)?
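To help with that scan, here is a minimal sketch (not part of CockroachDB's tooling) that greps log lines for replication changes demoting a voter. The regex is an assumption based only on the single log line quoted above and may need adjusting for other log formats:

```python
import re

# Hypothetical helper: match "change replicas" log lines that demote a
# voter, like the one quoted above. The pattern is inferred from that
# single example and is not an official CockroachDB log format spec.
DEMOTION_RE = re.compile(
    r"change replicas \(add .*? remove .*?VOTER_DEMOTING_LEARNER.*?\): "
    r"existing descriptor (r\d+)"
)

def find_demotions(lines):
    """Return (range_id, full_line) pairs for voter-demotion changes."""
    hits = []
    for line in lines:
        m = DEMOTION_RE.search(line)
        if m:
            hits.append((m.group(1), line.rstrip()))
    return hits
```

Running this over each `cockroach.log` (e.g. feeding it the file's lines) would surface the range IDs involved, which can then be cross-referenced against liveness records.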
cc @cockroachdb/kv