kvserver: unexpected replication change from 3 to 2 voters #64064

@tbg

Describe the problem

When @aliher1911 and I looked into unhappy restore2TB runs in the context of #61396, we noticed unexpected replication changes. From a three-voter configuration, for some reason we're demoting a voter:

[n7,s7,r39158/4:x] change replicas (add ‹[]› remove ‹[(n5,s5):1VOTER_DEMOTING_LEARNER]›): existing descriptor r39158:‹/Table/54/1/52061454{-/0}› [(n5,s5):1, (n7,s7):4, (n10,s10):3, next=5, gen=744, sticky=1617706884.328040814,0]

At that point, in the full snippet, the range goes unavailable because one of the two remaining voters (s10) is waiting for a snapshot.
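
To spell out why the demotion matters, here is a minimal Go sketch of the majority-quorum arithmetic. The names `quorum` and `canCommit` are made up for illustration and are not CockroachDB APIs, and the sketch assumes that a voter waiting for a snapshot cannot acknowledge new log entries.

```go
package main

import "fmt"

// quorum returns the size of a strict majority of the given voter count.
func quorum(voters int) int {
	return voters/2 + 1
}

// canCommit reports whether a range can commit new entries when only
// caughtUp of its voters are able to acknowledge them. Hypothetical helper;
// it assumes a voter waiting for a snapshot does not count as caught up.
func canCommit(voters, caughtUp int) bool {
	return caughtUp >= quorum(voters)
}

func main() {
	// Before the change: voters s5, s7, s10, with s10 waiting for a snapshot.
	fmt.Println(canCommit(3, 2)) // true
	// After demoting s5: voters s7, s10, with s10 still waiting.
	fmt.Println(canCommit(2, 1)) // false
}
```

With three voters the range tolerates s10 being behind; with two voters the quorum is still two, so the same lagging replica stalls the range.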

To Reproduce

Run the restore2TB/nodes=10 roachtest.

This should reproduce on any SHA preceding #64060, such as
d85d49d, when running restore2TB.
It may not always happen but we saw it frequently, at least in
"unhappy" runs (as characterized by large pending snapshot counts).

Expected behavior

With 10 live nodes and atomic replication changes, there should never be a
reason to move from a three-voter to a two-voter configuration. The only
explanation I have is that n5 might have been considered dead for 5 minutes,
which would possibly trigger this issue (?!), but this is essentially ruled out
by the full snippet, which indicates that n5 was live a minute after the
botched replication change (and it is thus unlikely to have been non-live for
the preceding minutes).

@aliher1911 if you have full logs from any of these experiments, mind going
through them to see if you have other examples of such replication changes
and, if so, posting the complete log directory (Google Drive)?
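
For going through the logs, a rough Go sketch of a filter follows. It reads log lines on stdin and prints those that demote a voter without adding one while the existing descriptor lists exactly three full voters. The regular expressions are assumptions derived solely from the single log line quoted above, and `isUnpairedDemotion` is a hypothetical helper, not something that exists in CockroachDB; other log formats or replica types may need adjustments.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// Assumed line shape, modeled on the one log line quoted in this issue.
var (
	// Matches a change with an empty add list; captures the remove list
	// and everything after "existing descriptor".
	changeRE = regexp.MustCompile(`change replicas \(add \[\] remove \[([^\]]*)\]\): existing descriptor (.*)$`)
	// Matches one replica such as "(n5,s5):1" plus an optional type suffix.
	replicaRE = regexp.MustCompile(`\(n\d+,s\d+\):\d+([A-Z_]*)`)
	// Drops the redaction markers that wrap parts of the log line.
	stripMarkers = strings.NewReplacer("‹", "", "›", "")
)

// isUnpairedDemotion reports whether line logs a voter demotion with no
// accompanying addition, starting from a descriptor with exactly three
// full voters (replicas printed without a type suffix).
func isUnpairedDemotion(line string) bool {
	m := changeRE.FindStringSubmatch(stripMarkers.Replace(line))
	if m == nil || !strings.Contains(m[1], "VOTER_DEMOTING_LEARNER") {
		return false
	}
	voters := 0
	for _, r := range replicaRE.FindAllStringSubmatch(m[2], -1) {
		if r[1] == "" {
			voters++
		}
	}
	return voters == 3
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	// Allow long log lines (up to 1 MiB).
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for sc.Scan() {
		if isUnpairedDemotion(sc.Text()) {
			fmt.Println(sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Assuming the program is saved as `scan.go` (a placeholder name), it could be run as `go run scan.go < some.log` against each node's log file.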

cc @cockroachdb/kv

Labels

A-kv-replication: Relating to Raft, consensus, and coordination.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
GA-blocker
branch-master: Failures and bugs on the master branch.