storage: 4.5 seconds after a cluster creation is a bad time to run your transactions #32495
Description
A transaction running around 4.5s after cluster creation (think tests) can easily catch a TransactionAbortedError. This is because all ranges start out with expiration-based leases (likely a path dependency / accident: they all come from the first range, and splits maintain the original expiration-based lease on both sides). 4.5s later, these leases become eligible for a refresh. The new leases are generally epoch-based, so a new lease is not Equivalent() to the old one and gets a new Sequence, which in turn causes the new lease acquisition to trigger this code, which resets the timestamp cache.
After that timestamp cache wipe, a concurrent BeginTxn can fail its timestamp cache check, resulting in TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY).
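To make the chain of events concrete, here is a minimal toy model in Go. It is not the actual storage code: the `Lease`, `tsCache`, `nextLease`, `onLeaseChange`, and `canBegin` names are hypothetical, timestamps are plain integers (milliseconds since cluster start), and the equivalence check is reduced to "same replica and same lease type". It only illustrates how a lease type change bumps the sequence, how that raises the timestamp cache low-water mark, and how a transaction that started just before the refresh is then rejected.

```go
package main

import "fmt"

// LeaseType distinguishes the two lease flavors discussed in this issue.
type LeaseType int

const (
	Expiration LeaseType = iota
	Epoch
)

// Lease is a hypothetical, heavily simplified stand-in for a range lease.
type Lease struct {
	Replica  int
	Type     LeaseType
	Sequence int
}

// equivalent is a simplified assumption: same holder and same lease type.
func equivalent(a, b Lease) bool {
	return a.Replica == b.Replica && a.Type == b.Type
}

// nextLease builds the refreshed lease; a non-equivalent lease gets a new sequence.
func nextLease(prev Lease, typ LeaseType) Lease {
	next := Lease{Replica: prev.Replica, Type: typ, Sequence: prev.Sequence}
	if !equivalent(prev, next) {
		next.Sequence++
	}
	return next
}

// tsCache models only the low-water mark of the timestamp cache.
type tsCache struct {
	lowWater int64
}

// onLeaseChange raises the low-water mark when the lease sequence changes.
func (c *tsCache) onLeaseChange(prev, next Lease, now int64) {
	if next.Sequence != prev.Sequence {
		c.lowWater = now
	}
}

// canBegin reports whether a transaction with the given timestamp passes the check.
func (c *tsCache) canBegin(txnTS int64) bool {
	return txnTS > c.lowWater
}

func main() {
	cache := &tsCache{}
	prev := Lease{Replica: 1, Type: Expiration, Sequence: 1}

	// Around 4.5s after cluster creation the lease is refreshed as epoch-based.
	next := nextLease(prev, Epoch)
	cache.onLeaseChange(prev, next, 4500)

	// A transaction that picked its timestamp just before the refresh is rejected.
	fmt.Println("sequence bumped:", next.Sequence != prev.Sequence)
	fmt.Println("txn at 4400 accepted:", cache.canBegin(4400))
	fmt.Println("txn at 4600 accepted:", cache.canBegin(4600))
}
```

In this model, refreshing the lease without changing its type leaves the sequence and the low-water mark untouched, which is why the problem only shows up on the expiration-to-epoch switch.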
I believe we've seen this be a cause of flakiness for multiple tests.
Discussing with @bdarnell, it seems that we have a couple of options:
- if the rhs of a split wants epoch-based leases, have it not inherit the expiration-based lease from the lhs. Exactly how to do that is not yet clear. Can a range not have a lease at all? Perhaps we can give the rhs an expired lease.
- make Lease.Equivalent() understand this transition from expiration-based to epoch-based, and have it consider the two equivalent.
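As a rough sketch of the second option, the comparison could treat a switch from an expiration-based to an epoch-based lease by the same holder as equivalent. The Go snippet below is a toy under stated assumptions, not the real `Lease.Equivalent()`: the `Lease` struct, the `strictEquivalent` and `relaxedEquivalent` helpers, and the "same holder" criterion are all hypothetical simplifications, and the real method compares more than these two fields.

```go
package main

import "fmt"

// LeaseType distinguishes expiration-based from epoch-based leases.
type LeaseType int

const (
	Expiration LeaseType = iota
	Epoch
)

// Lease is a hypothetical, simplified lease: just a holder and a type.
type Lease struct {
	Holder int
	Type   LeaseType
}

// strictEquivalent models the current behavior described in this issue:
// a change of lease type makes the two leases non-equivalent.
func strictEquivalent(prev, next Lease) bool {
	return prev.Holder == next.Holder && prev.Type == next.Type
}

// relaxedEquivalent additionally accepts the one-way transition from an
// expiration-based lease to an epoch-based lease held by the same replica.
func relaxedEquivalent(prev, next Lease) bool {
	if prev.Holder != next.Holder {
		return false
	}
	if prev.Type == next.Type {
		return true
	}
	return prev.Type == Expiration && next.Type == Epoch
}

func main() {
	prev := Lease{Holder: 1, Type: Expiration}
	next := Lease{Holder: 1, Type: Epoch}

	fmt.Println("strict: ", strictEquivalent(prev, next))
	fmt.Println("relaxed:", relaxedEquivalent(prev, next))
}
```

With the relaxed check the refreshed lease would keep its Sequence, so the lease acquisition would not reset the timestamp cache. Whether the reverse transition (epoch-based to expiration-based) should also count as equivalent is left open here; the sketch deliberately rejects it.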