
storage: Fix deadlock in TestStoreRangeMergeSlowWatcher#37477

Merged
craig[bot] merged 1 commit into cockroachdb:master from andy-kimball:flake on May 15, 2019

Conversation

@andy-kimball
Contributor

This test is occasionally flaking under heavy race stress in CI runs. Here
is the probable sequence of events:

1. A<-B merge starts and Subsume request locks down B.
2. Watcher on B sends PushTxn request, which is intercepted by the
   TestingRequestFilter in the test.
3. The merge txn aborts due to interference with the replica GC, since it's
   concurrently reading range descriptors.
4. The merge txn is retried, but the Watcher on B is locking the range, so it
   aborts again.
5. #4 repeats until the allowPushTxn channel fills up (it has capacity of 10).
   This causes a deadlock because the merge txn can't continue. Meanwhile, the
   watcher is blocked waiting for the results of the PushTxn request, which gets
   blocked waiting for the merge txn.

The fix is to get rid of the arbitrarily limited channel size of 10 and use
sync.Cond synchronization instead. Multiple retries of the merge txn will
repeatedly signal the Cond, rather than fill up the channel. One of the problems
with the channel was that there could be an imbalance between the number of items
sent to the channel (by merge txns) and the number of items received from the
channel (by the watcher). This imbalance meant the channel gradually filled up
until finally the right sequence of events caused deadlock.

Using a sync.Cond also fixes a race condition I saw several times, in which the
merge transaction tries to send to the channel while it is being concurrently
closed.

Release note: None

@andy-kimball andy-kimball requested review from a team, andreimatei and tbg May 11, 2019 15:54
@cockroach-teamcity
Member

This change is Reviewable

@andy-kimball andy-kimball requested a review from darinpp May 13, 2019 18:13
Member

@tbg tbg left a comment


:lgtm: & thanks!

I usually try to avoid sync.Cond when possible, but I don't see how to do that here without also significantly rewriting the test, which isn't going to be good use of anyone's time at this point.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @darinpp)

@andy-kimball
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request May 15, 2019
37477: storage: Fix deadlock in TestStoreRangeMergeSlowWatcher r=andy-kimball a=andy-kimball


Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
@craig
Contributor

craig bot commented May 15, 2019

Build succeeded

@craig craig bot merged commit aa605bf into cockroachdb:master May 15, 2019
@andy-kimball andy-kimball deleted the flake branch May 15, 2019 16:15
tbg pushed a commit to tbg/cockroach that referenced this pull request May 27, 2019
