storage: Fix deadlock in TestStoreRangeMergeSlowWatcher #37477
Merged
craig[bot] merged 1 commit into cockroachdb:master on May 15, 2019
tbg (Member) approved these changes on May 15, 2019 and left a comment:
I usually try to avoid sync.Cond when possible, but I don't see how to do that here without also significantly rewriting the test, which isn't going to be good use of anyone's time at this point.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @darinpp)
This test is occasionally flaking under heavy race stress in CI runs. Here is the probable sequence of events:

1. A<-B merge starts and Subsume request locks down B.
2. Watcher on B sends PushTxn request, which is intercepted by the TestingRequestFilter in the test.
3. The merge txn aborts due to interference with the replica GC, since it's concurrently reading range descriptors.
4. The merge txn is retried, but the Watcher on B is locking the range, so it aborts again.
5. Step 4 repeats until the allowPushTxn channel fills up (it has capacity of 10).

This causes a deadlock because the merge txn can't continue. Meanwhile, the watcher is blocked waiting for the results of the PushTxn request, which gets blocked waiting for the merge txn.

The fix is to get rid of the arbitrarily limited channel size of 10 and use sync.Cond synchronization instead. Multiple retries of the merge txn will repeatedly signal the Cond, rather than fill up the channel. One of the problems with the channel was that there can be an imbalance between the number of items sent to the channel (by merge txns) and the number of items received from the channel (by the watcher). This imbalance meant the channel gradually filled up until finally the right sequence of events caused deadlock.

Using a sync.Cond also fixes a race condition I saw several times, in which the merge transaction tries to send to the channel while it is being concurrently closed.

Fixes cockroachdb#37477

Release note: None
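To make the difference between the two approaches concrete, here is a minimal sketch. It is not the code from this PR: `pushGate`, `allow`, `wait`, and `countBlockedSends` are hypothetical names, and the boolean predicate guarded by the Cond is an assumption about how such a gate could be built. It only illustrates why repeated signals cannot "fill up" a sync.Cond the way repeated sends fill a buffered channel.

```go
package main

import (
	"fmt"
	"sync"
)

// countBlockedSends reports how many of n sends to a buffered channel of the
// given capacity would block when nothing is receiving from it.
func countBlockedSends(capacity, n int) int {
	ch := make(chan struct{}, capacity)
	blocked := 0
	for i := 0; i < n; i++ {
		select {
		case ch <- struct{}{}:
		default:
			blocked++ // a plain send would hang here
		}
	}
	return blocked
}

// pushGate is a hypothetical stand-in for the allowPushTxn mechanism.
// It records "pushes are allowed" as a boolean instead of queued items.
type pushGate struct {
	mu      sync.Mutex
	cond    *sync.Cond
	allowed bool
}

func newPushGate() *pushGate {
	g := &pushGate{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// allow never blocks, no matter how many times it is called without a
// matching wait: extra calls just set the flag again.
func (g *pushGate) allow() {
	g.mu.Lock()
	g.allowed = true
	g.mu.Unlock()
	g.cond.Signal()
}

// wait blocks until allow has been called, then consumes the permission.
func (g *pushGate) wait() {
	g.mu.Lock()
	defer g.mu.Unlock()
	for !g.allowed {
		g.cond.Wait()
	}
	g.allowed = false
}

func main() {
	// Many retries against a capacity-10 channel: most sends cannot proceed.
	fmt.Println("sends that would block:", countBlockedSends(10, 100))

	// The same number of retries against the Cond-based gate never blocks.
	g := newPushGate()
	for i := 0; i < 100; i++ {
		g.allow()
	}
	done := make(chan struct{})
	go func() {
		g.wait() // stands in for the watcher's intercepted PushTxn
		close(done)
	}()
	<-done
	fmt.Println("watcher released")
}
```

Because the gate holds state rather than a queue, an imbalance between the number of `allow` calls and `wait` calls has nothing to accumulate in, and there is no channel to close concurrently with a send.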
Contributor
Author
bors r+
craig bot pushed a commit that referenced this pull request on May 15, 2019
37477: storage: Fix deadlock in TestStoreRangeMergeSlowWatcher r=andy-kimball a=andy-kimball
Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Contributor
Build succeeded
tbg pushed a commit to tbg/cockroach that referenced this pull request on May 27, 2019