kv/concurrency: clear lock holder in tryFreeLockOnReplicatedAcquire #50173

craig[bot] merged 1 commit into `cockroachdb:master`.
Conversation
**sumeerbhola** left a comment
> it was possible for requests with existing snapshots of the tree to begin waiting on the partially cleared but not empty `lockState`. We would then remove the `lockState` from the lock table's main tree, abandoning any waiter that managed to slip in and begin waiting on the removed `lockState`.
It sounds like it should be easy to add a data-driven test case for lockTable that would fail prior to this fix. If so, can you add one?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
This commit fixes a fairly bad bug introduced by cockroachdb#49980. The bug was that we were not clearing `lockState.holder` while under lock in `tryFreeLockOnReplicatedAcquire`, so the lock was never marked as "empty" (see `isEmptyLock`). This meant that even though we held the `tree.mu` in `AcquireLock` continuously between the calls to `tryFreeLockOnReplicatedAcquire` and `tree.Delete`, it was possible for requests with existing snapshots of the tree to begin waiting on the partially cleared **but not empty** `lockState`. We would then remove the `lockState` from the lock table's main tree, abandoning any waiter that managed to slip in and begin waiting on the removed `lockState`. These waiters would eventually push the original holder of the lock, but even after their push succeeded and they resolved the corresponding intent, they would not receive any update on their `lockTableGuard`'s channel.

This should have been caught by some of the YCSB roachtests, but they haven't actually run for the past week due to various CI flakes. It would have also been caught by the new assertion the commit adds to `lockState.lockIsFree`.
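To illustrate the shape of the fix, here is a minimal, self-contained Go sketch. All names (`lockState`, `isEmptyLock`, `tryFreeLock`, the `holder`/`queued` fields) are simplified stand-ins, not CockroachDB's actual lock table code; the point is only the invariant: the holder must be cleared while the per-lock mutex is held, so that a request arriving through a stale tree snapshot observes an empty lock rather than parking itself on a `lockState` that is about to be deleted.

```go
package main

import (
	"fmt"
	"sync"
)

// lockState is a simplified stand-in for the lock table's per-key state.
type lockState struct {
	mu     sync.Mutex
	holder string // empty means unheld
	queued int    // number of requests waiting on this lock
}

// isEmptyLock reports whether the lock can be garbage collected. It must
// only return true when no waiter could still be parked here.
func (l *lockState) isEmptyLock() bool {
	return l.holder == "" && l.queued == 0
}

// tryFreeLock models the fixed behavior: clear the holder while holding
// l.mu. A request that reached this lockState via a snapshot of the tree
// will then see an empty lock and retry against the main tree instead of
// waiting on soon-to-be-deleted state.
func (l *lockState) tryFreeLock() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.queued > 0 {
		return false // a waiter slipped in; cannot free the lock
	}
	l.holder = "" // the step that was missing in the original bug
	return l.isEmptyLock()
}

func main() {
	l := &lockState{holder: "txn1"}
	fmt.Println(l.tryFreeLock()) // true: safe to delete from the tree
	fmt.Println(l.isEmptyLock()) // true: snapshot readers see it empty
}
```

In the buggy version, `tryFreeLock` returned without clearing `holder`, so `isEmptyLock` stayed false even after the lock was removed from the main tree, and late-arriving waiters queued on an abandoned `lockState` whose channel would never be signaled.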
Force-pushed from `711e0a1` to `c90702a`.
Good idea, done. I realized late into doing this that the tree snapshots made it much easier to hit than I had thought. I was thinking it required a race between the call to […]

TFTR! bors r+
Build succeeded.