
storage: Fix deadlock in TestStoreRangeMergeSlowWatcher#37477

Merged
craig[bot] merged 1 commit into cockroachdb:master from andy-kimball:flake on May 15, 2019

Conversation

@andy-kimball
Contributor

This test is occasionally flaking under heavy race stress in CI runs. Here
is the probable sequence of events:

1. A<-B merge starts and Subsume request locks down B.
2. Watcher on B sends PushTxn request, which is intercepted by the
   TestingRequestFilter in the test.
3. The merge txn aborts due to interference with the replica GC, since it's
   concurrently reading range descriptors.
4. The merge txn is retried, but the Watcher on B is locking the range, so it
   aborts again.
5. #4 repeats until the allowPushTxn channel fills up (it has capacity of 10).
   This causes a deadlock because the merge txn can't continue. Meanwhile, the
   watcher is blocked waiting for the results of the PushTxn request, which gets
   blocked waiting for the merge txn.

The fix is to get rid of the arbitrarily limited channel size of 10 and use
sync.Cond synchronization instead. Multiple retries of the merge txn will
repeatedly signal the Cond, rather than fill up the channel. One of the problems
with the channel was that there could be an imbalance between the number of items
sent to the channel (by merge txns) and the number of items received from the
channel (by the watcher). This imbalance meant the channel gradually filled up
until finally the right sequence of events caused deadlock.

Using a sync.Cond also fixes a race condition I saw several times, in which the
merge transaction tries to send to the channel while it is being concurrently
closed.

Release note: None

@andy-kimball andy-kimball requested review from a team, andreimatei and tbg May 11, 2019 15:54
@cockroach-teamcity
Member

This change is Reviewable

@andy-kimball andy-kimball requested a review from darinpp May 13, 2019 18:13
Member

@tbg tbg left a comment


:lgtm: & thanks!

I usually try to avoid sync.Cond when possible, but I don't see how to do that here without also significantly rewriting the test, which isn't going to be good use of anyone's time at this point.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @darinpp)

@andy-kimball
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request May 15, 2019
37477: storage: Fix deadlock in TestStoreRangeMergeSlowWatcher r=andy-kimball a=andy-kimball


Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
@craig
Contributor

craig bot commented May 15, 2019

Build succeeded

@craig craig bot merged commit aa605bf into cockroachdb:master May 15, 2019
@andy-kimball andy-kimball deleted the flake branch May 15, 2019 16:15
tbg pushed a commit to tbg/cockroach that referenced this pull request May 27, 2019
