Skip to content

kv/concurrency: clear lock holder in tryFreeLockOnReplicatedAcquire#50173

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/clearLockFix
Jun 15, 2020
Merged

kv/concurrency: clear lock holder in tryFreeLockOnReplicatedAcquire#50173
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/clearLockFix

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Jun 13, 2020

This commit fixes a fairly bad bug introduced by #49980. The bug was
that we were not clearing lockState.holder while under lock in
tryFreeLockOnReplicatedAcquire, so the lock was never marked as
"empty" (see isEmptyLock). This meant that even though we held the
tree.mu in AcquireLock continuously between the calls to
tryFreeLockOnReplicatedAcquire and tree.Delete, it was possible for
requests with existing snapshots of the tree to begin waiting on the
partially cleared but not empty lockState. We would then remove
the lockState from the lock table's main tree, abandoning any waiter
that managed to slip in and begin waiting on the removed lockState.
These waiters would eventually push the original holder of the lock, but
even after their push succeeded and they resolved the corresponding
intent, they would not receive any update on their lockTableGuard's
channel.

This should have been caught by some of the YCSB roachtests, but they
haven't actually run for the past week due to various CI flakes. It would
have also been caught by the new assertion the commit adds to
lockState.lockIsFree.

@nvb nvb requested a review from sumeerbhola June 13, 2020 02:01
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was possible for
requests with existing snapshots of the tree to begin waiting on the
partially cleared but not empty lockState. We would then remove
the lockState from the lock table's main tree, abandoning any waiter
that managed to slip in and begin waiting on the removed lockState.

It sounds like it should be easy to add a data-driven test case for lockTable that will fail prior to this fix. If yes, can you add one.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

This commit fixes a fairly bad bug introduced by cockroachdb#49980. The bug was
that we were not clearing `lockState.holder` while under lock in
`tryFreeLockOnReplicatedAcquire`, so the lock was never marked as
"empty" (see `isEmptyLock`). This meant that even though we held the
`tree.mu` in `AcquireLock` continuously between the calls to
`tryFreeLockOnReplicatedAcquire` and `tree.Delete`, it was possible for
requests with existing snapshots of the tree to begin waiting on the
partially cleared **but not empty** `lockState`. We would then remove
the `lockState` from the lock table's main tree, abandoning any waiter
that managed to slip in and begin waiting on the removed `lockState`.
These waiters would eventually push the original holder of the lock, but
even after their push succeeded and they resolved the corresponding
intent, they would not receive any update on their lockTableGuard's
channel.

This should have been caught by some of the YCSB roachtests, but they
haven't actually run for the past week due to various CI flakes. It would
have also been caught by the new assertion the commit adds to
`lockState.lockIsFree`.
@nvb nvb force-pushed the nvanbenschoten/clearLockFix branch from 711e0a1 to c90702a Compare June 15, 2020 16:32
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jun 15, 2020

It sounds like it should be easy to add a data-driven test case for lockTable that will fail prior to this fix. If yes, can you add one.

Good idea, done. I realized late into doing this that the tree snapshots made it much easier to hit than I had thought. I was thinking it required a race between the call to tryFreeLockOnReplicatedAcquire and the call to tree.Delete, but it didn't.

TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 15, 2020

Build succeeded

@craig craig bot merged commit c1d0b4a into cockroachdb:master Jun 15, 2020
@nvb nvb deleted the nvanbenschoten/clearLockFix branch June 17, 2020 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants