Skip to content

storage: must never reuse state from old replicaIDs #40367

@tbg

Description

@tbg

I was looking at the range merge tests and in particular
TestStoreRangeMergeAddReplicaRace (at its time of introduction; it has since
changed due to learners and lost a lot of its context), see

ce93650

The worry is roughly that when a snapshot that does not contain a merge is
sitting on some store (which is not part of the range) and this store is then
added to the range, the replica will likely blow up while applying the merge
trigger, is the subsumed replica is very unlikely to be present - this colocation
is only achieved for the duration of the merge txn and for the stores that
are members of the range.

At the time of the below comment, the way such a situation would come to be is
because preemptive snapshots were sent before starting the replica change txn.
So you'd send a snapshot, then do a merge, and (before we introduced the range
desc generation for that purpose) the upreplication could succeed, boom.

However, this sequence of events might unfold just the same without any preemptive snaps:

  1. add replica on store
  2. remove replica from store; replicaGC is delayed
  3. run a merge
  4. readd store

Here, in 4 I don't mean "re-add as a learner and promote to voter"! It's enough to
readd it as a learner. Raft will start catching up the old snapshot and boom, we
hit the problem above.

This actually seems like a likely problem to occur in practice if replicas are
removed and re-added frequently while merges occur, though the fact that we still
send explicit learner snapshots could mask it somewhat (that snapshot will postdate
the merge, so it can hide the problem by fast-forwarding the old replica past the
merge).

Fixing this is probably not easy. We want to "not reuse old snapshots", i.e. when the
replicaID changes we want to delete all our state and start anew. This is far from a
trivial change, but it's in the spirit of changes that we want to make anyway to lower
the complexity in the storage package. Anyway the first step here is reproducing the
problem.

cc @danhhz @nvanbenschoten

// TestStoreRangeMergeAddReplicaRace verifies that an add replica request that
// occurs concurrently with a merge is aborted.
//
// To see why aborting the add replica request is necessary, consider two
// adjacent and collocated ranges, Q and R. Say the replicate queue decides to
// rebalance Q onto store S4. It will initiate a ChangeReplicas command that
// will send S4 a preemptive snapshot, then launch a replica-change transaction
// to update R's range descriptor with the new replica. Now say the merge queue
// decides to merge Q and R after the preemptive snapshot of Q has been sent to
// S4 but before the replica-change transaction has started. The merge can
// succeed because the ranges are still collocated. (The new replica of Q is
// only considered added once the replica-change transaction commits.) If the
// replica-change transaction were to commit, the new replica of Q on S4 would
// have a snapshot of Q that predated the merge. In order to catch up, it would
// need to apply the merge trigger, but the merge trigger will explode because
// S4 does not have a replica of R.
//
// To avoid this scenario, ChangeReplicas commands will abort if they discover
// the range descriptor has changed between when the snapshot is sent and when
// the replica-change transaction starts.
//
// There is a particularly diabolical edge case here. Consider the same
// situation as above, except that Q and R merge together and then split at
// exactly the same key, all before the replica-change transaction starts. Q's
// range descriptor will have the same start key, end key, and next replica ID
// that it did when the preemptive snapshot started. That is, it will look
// unchanged! To protect against this, range descriptors contain a generation
// counter, which is incremented on every split or merge. The presence of this
// counter means that ChangeReplicas commands can detect and abort if any merges
// have occurred since the preemptive snapshot, even if the sequence of splits
// or merges left the keyspan of the range unchanged. This diabolical edge case
// is tested here.

Metadata

Metadata

Assignees

Labels

A-kv-distributionRelating to rebalancing and leasing.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions