
kv: range merge can result in serializability violation if applied through Raft snapshot on leaseholder #60520

@nvb

We prevent the post-merge range from serving writes below any reads previously served by the RHS by bumping the LHS’s timestamp cache to the RHS’s freeze time. That all works great when the LHS’s leaseholder applies the range merge trigger through normal Raft log application. But what if the LHS’s leaseholder applies the range merge trigger through a Raft snapshot? In this case, where the RHS is "subsumed" during the snapshot, the LHS's leaseholder does not bump its timestamp cache. This allows the post-merge range to accept writes at timestamps below reads already served by the pre-merge RHS range, invalidating those reads.

For context, at one point, we had a bug related to learning about becoming the leaseholder through a snapshot, and so we now handle that case in Replica.applySnapshot by calling leasePostApply.

The reason why we've likely never actually seen this before is that it is extremely difficult to get a leaseholder to apply a range merge through a Raft snapshot. Here is a comment from a Slack thread that explains what is needed to create this situation:

Getting a LHS leaseholder to apply a merge through a snapshot is not easy. It requires a leader/leaseholder split, of course. But then it also requires the leaseholder to be partitioned from the leader and then the log to be truncated ahead of the leaseholder’s log. But for all intents and purposes, the leaseholder is the only one that ever makes the decision to truncate. So we need to hit one of the few edge cases where the log is truncated ahead of the leaseholder’s last log index.

But that’s not all. The LHS leaseholder is intimately involved in the merge process. So it needs to have been part of the quorum and properly applying log entries right up to the point of the merge trigger. Otherwise, the merge txn would have gotten stuck. But if it somehow gets to the merge trigger, forwards the proposal to the leader, gets partitioned, the log gets truncated, and then it receives a snapshot that includes the merge, it won’t properly bump its timestamp cache (if the LHS and RHS leaseholders were on different nodes) and it can then serve writes that rewrite history.

Metadata

Labels

A-kv-distribution: Relating to rebalancing and leasing.
A-kv-replication: Relating to Raft, consensus, and coordination.
A-kv-transactions: Relating to MVCC and the transactional model.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-0-visible-logical-error: Database stores inconsistent data in some cases, or queries return invalid results silently.
