-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kvserver: remove special-casing for lease index 0 in refreshProposalsLocked #74711
Description
Touches #33007.
Is your feature request related to a problem? Please describe.
We have this code that essentially nukes any lease-applied-index-zero proposal that shows up in a refresh:
cockroach/pkg/kv/kvserver/replica_raft.go
Lines 1097 to 1109 in f68a5a7
| if p.command.MaxLeaseIndex == 0 { | |
| // Commands without a MaxLeaseIndex cannot be reproposed, as they might | |
| // apply twice. We also don't want to ask the proposer to retry these | |
| // special commands. | |
| r.cleanupFailedProposalLocked(p) | |
| log.VEventf(p.ctx, 2, "refresh (reason: %s) returning AmbiguousResultError for command "+ | |
| "without MaxLeaseIndex: %v", reason, p.command) | |
| p.finishApplication(ctx, proposalResult{Err: roachpb.NewError( | |
| roachpb.NewAmbiguousResultError( | |
| fmt.Sprintf("unknown status for command without MaxLeaseIndex "+ | |
| "at refreshProposalsLocked time (refresh reason: %s)", reason)))}) | |
| continue | |
| } |
The only such proposals are in fact leases:
cockroach/pkg/kv/kvserver/replica_proposal_buf.go
Lines 603 to 609 in 95a7671
| // Request a new max lease applied index for any request that isn't itself | |
| // a lease request. Lease requests don't need unique max lease index values | |
| // because their max lease indexes are ignored. See checkForcedErr. | |
| if !p.Request.IsLeaseRequest() { | |
| b.assignedLAI++ | |
| } | |
| lai := b.assignedLAI |
Some problems with terminating these proposals early:
- the proposal's age isn't checked, so this might nuke a perfectly good lease proposal that was literally just inserted
- it's getting in the way in kvserver: circuit-break requests to unavailable ranges #71806: we are trying to use the existence of old proposals as a trigger for the per-replica circuit breaker, but if there isn't a lease, then the only proposal is going to be a lease, and it's going to be terminated every ~3s which is much more aggressive than the timeout we're setting for what it means to be "old". So the breaker never trips but it should. (This is why I am filing this issue)
Describe the solution you'd like
I believe that leases have their own replay protection through the lease
sequence number. There is a surprising amount of logic in the lease code and so
I can't be entirely sure. Perhaps there are issues where multiple leases might
be proposed with the same sequence number (due to being Equivalent) but get
reordered? Equivalent is not commutative and it checks that the "later" lease
has a longer expiration (in the case of expiration-based leases) so there it
seems fine. And for epoch-based leases, basically only the epoch and start time
matter, which are always the same, so there isn't a problem either.
If we believe lease-related requests can safely be reproposed, since they are
the only requests special-cased in this way in refreshProposalsLocked, the
simplest solution would be to remove the special casing, or to defer their
removal until the breaker is tripped.