kv: range merge can result in serializability violation if applied through Raft snapshot on leaseholder #60520
Description
We prevent the post-merged range from serving writes below any reads previously served by the RHS by bumping the LHS’s timestamp cache to the RHS’s freeze time. That all works great when the LHS’s leaseholder applies the range merge trigger through normal Raft log application. But what if the LHS’s leaseholder applies the range merge trigger through a Raft snapshot? In these cases where the RHS is "subsumed" during the snapshot, the LHS's leaseholder does not bump its timestamp cache. This allows the post-merged range to invalidate reads served by the pre-merge RHS range.
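The difference between the two application paths can be sketched with a toy model. This is only an illustration under stated assumptions: `tsCache`, `mergeViaLog`, and `mergeViaSnapshot` are hypothetical names invented here, not the real CockroachDB timestamp cache or replica code, and timestamps are plain integers rather than HLC timestamps.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// span records that reads may have been served on [start, end) up to ts.
type span struct {
	start, end string
	ts         Timestamp
}

// tsCache is a toy timestamp cache (hypothetical, not the real API).
type tsCache struct {
	spans []span
}

// bump raises the read high-water mark for every key in [start, end).
func (c *tsCache) bump(start, end string, ts Timestamp) {
	c.spans = append(c.spans, span{start, end, ts})
}

// highWater returns the highest read timestamp recorded for key.
func (c *tsCache) highWater(key string) Timestamp {
	var hw Timestamp
	for _, s := range c.spans {
		if s.start <= key && key < s.end && s.ts > hw {
			hw = s.ts
		}
	}
	return hw
}

// acceptsWriteAt reports whether a write at ts would be accepted as-is,
// meaning it lies strictly above every read recorded for key.
func (c *tsCache) acceptsWriteAt(key string, ts Timestamp) bool {
	return ts > c.highWater(key)
}

// mergeViaLog models applying the merge trigger through normal log
// application: the LHS cache is bumped over the RHS span to the freeze time.
func mergeViaLog(lhs *tsCache, rhsStart, rhsEnd string, freeze Timestamp) {
	lhs.bump(rhsStart, rhsEnd, freeze)
}

// mergeViaSnapshot models the problematic path described above: the RHS is
// subsumed by the snapshot, but the LHS cache is left untouched.
func mergeViaSnapshot(lhs *tsCache, rhsStart, rhsEnd string, freeze Timestamp) {
	// Intentionally empty: no bump happens on this path.
}

func main() {
	// Assume the RHS [k, z) served a read of "m" at ts=10 on another node
	// and was then frozen at ts=12.
	const freeze Timestamp = 12
	viaLog, viaSnap := &tsCache{}, &tsCache{}
	mergeViaLog(viaLog, "k", "z", freeze)
	mergeViaSnapshot(viaSnap, "k", "z", freeze)

	fmt.Println("log application accepts write at ts=5:", viaLog.acceptsWriteAt("m", 5))
	fmt.Println("snapshot accepts write at ts=5:", viaSnap.acceptsWriteAt("m", 5))
}
```

In this model the snapshot path accepts a write at ts=5 on a key that the RHS already served a read for at ts=10, which is the history rewrite the issue describes.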
For context, at one point, we had a bug related to learning about becoming the leaseholder through a snapshot, and so we now handle that case in Replica.applySnapshot by calling leasePostApply.
The reason why we've likely never actually seen this before is that it is extremely difficult to get a leaseholder to apply a range merge through a Raft snapshot. Here is a comment from a Slack thread that explains what is needed to create this situation:
Getting a LHS leaseholder to apply a merge through a snapshot is not easy. It requires a leader/leaseholder split, of course. But then it also requires the leaseholder to be partitioned from the leader and then the log to be truncated ahead of the leaseholder’s log. But for all intents and purposes, the leaseholder is the only one that ever makes the decision to truncate. So we need to hit one of the few edge cases where the log is truncated ahead of the leaseholder’s last log index.
But that’s not all. The LHS leaseholder is intimately involved in the merge process. So it needs to have been part of the quorum and properly applying log entries right up to the point of the merge trigger. Otherwise, the merge txn would have gotten stuck. But if it somehow gets to the merge trigger, forwards the proposal to the leader, gets partitioned, the log gets truncated, then it receives a snapshot that includes the merge, it won’t properly bump its timestamp cache (if the LHS and RHS leaseholders were on different nodes) and it can then serve writes that rewrite history.
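The index conditions in the comment above can be written down as a small predicate. This is a simplified sketch of general Raft behavior, not CockroachDB code: the function names and the four index parameters are assumptions made for illustration, and it ignores the leader/leaseholder split and the partition that are also required.

```go
package main

import "fmt"

// needsSnapshot reports whether a follower can no longer be caught up from
// the leader's log. The follower needs entry followerLast+1, but after
// truncation the leader's log starts at truncatedIndex+1.
func needsSnapshot(followerLast, truncatedIndex uint64) bool {
	return followerLast < truncatedIndex
}

// mergeArrivesViaSnapshot reports whether the merge trigger at mergeIndex
// reaches the follower inside a snapshot taken at snapshotIndex rather than
// through log application.
func mergeArrivesViaSnapshot(followerLast, truncatedIndex, snapshotIndex, mergeIndex uint64) bool {
	return needsSnapshot(followerLast, truncatedIndex) &&
		mergeIndex > followerLast &&
		mergeIndex <= snapshotIndex
}

func main() {
	// The leaseholder applied up to index 99, the merge trigger committed at
	// index 100, and the log was then truncated through index 105.
	fmt.Println(mergeArrivesViaSnapshot(99, 105, 110, 100))

	// Without truncation past the leaseholder's last index, the merge
	// trigger is applied from the log as usual.
	fmt.Println(mergeArrivesViaSnapshot(99, 90, 110, 100))
}
```

The first call models the rare case from the comment, where truncation has moved ahead of the leaseholder's last log index. The second models the common case, where it has not.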