-
Notifications
You must be signed in to change notification settings - Fork 4.1k
storage: tryReproposeWithNewLeaseIndex may violate closed timestamp #42821
Description
Consider proposing a write and this code path:
cockroach/pkg/storage/replica_write.go
Lines 149 to 165 in 4850974
| ch, abandon, maxLeaseIndex, pErr := r.evalAndPropose(ctx, lease, ba, spans, ec.move()) | |
| if pErr != nil { | |
| if maxLeaseIndex != 0 { | |
| log.Fatalf( | |
| ctx, "unexpected max lease index %d assigned to failed proposal: %s, error %s", | |
| maxLeaseIndex, ba, pErr, | |
| ) | |
| } | |
| return nil, pErr | |
| } | |
| // A max lease index of zero is returned when no proposal was made or a lease was proposed. | |
| // In both cases, we don't need to communicate a MLAI. Furthermore, for lease proposals we | |
| // cannot communicate under the lease's epoch. Instead the code calls EmitMLAI explicitly | |
| // as a side effect of stepping up as leaseholder. | |
| if maxLeaseIndex != 0 { | |
| untrack(ctx, ctpb.Epoch(lease.Epoch), r.RangeID, ctpb.LAI(maxLeaseIndex)) | |
| } |
Let's say the maxLeaseIndex assigned here is 100 and the timestamp of the command is 1000. The closed timestamp tracker now knows 100 as the MLAI of a command that has an effect at ts=1000.
The proposal now hits this code path
cockroach/pkg/storage/replica_application_result.go
Lines 145 to 151 in 27c3d49
| case proposalIllegalLeaseIndex: | |
| // If we failed to apply at the right lease index, try again with a | |
| // new one. This is important for pipelined writes, since they don't | |
| // have a client watching to retry, so a failure to eventually apply | |
| // the proposal would be a user-visible error. | |
| pErr = r.tryReproposeWithNewLeaseIndex(ctx, cmd) | |
| if pErr != nil { |
and gets reproposed with a new lease index, let's say 200:
cockroach/pkg/storage/replica_application_result.go
Lines 210 to 215 in 27c3d49
| // Some tests check for this log message in the trace. | |
| log.VEventf(ctx, 2, "retry: proposalIllegalLeaseIndex") | |
| maxLeaseIndex, pErr := r.propose(ctx, p) | |
| if pErr != nil { | |
| log.Warningf(ctx, "failed to repropose with new lease index: %s", pErr) | |
| return pErr |
From the perspective of closed timestamps, this is problematic. A CT update might have been sent out stating that "if you've caught up to lease index 100, you've seen all the proposals that will have an effect on timestamp 1000". But it turns out that in this example, such a command will pop out of raft at lease index 200.
It seems that this can cause a violation of the invariant that any closed out MVCC snapshot is immutable.
As for fixing this, the "natural" attempt is to go through the tracker again. However, if we find that the tracker pushes the proposal (i.e. the original timestamp has closed out, as it has in this example), we need someone to reevaluate it.
It's possible that we would be better off removing this one-off below-raft-reproposal path completely and instead opting for having the client reevaluate the command unconditionally. The one-off path was originally introduced to accommodate pipelined writes (where the client returns early and isn't around to reevaluate the command) but this doesn't mean we can't reevaluate the batch, just that we have to create a goroutine to do so. This approach is less performant, but this is a rare case where simplicity is more important than performance.