Skip to content

storage: avoid timer deadlock in quotaPool + add lint check#17796

Merged
nvb merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/fix_deadlock
Aug 22, 2017
Merged

storage: avoid timer deadlock in quotaPool + add lint check#17796
nvb merged 2 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/fix_deadlock

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Aug 22, 2017

Maybe fixes #17524.

When a timeutil.Timer is read from, it needs to set timer.Read = true
before calling timer.Reset or it risks deadlocking itself. A deadlock
in the quotaPool was possible because it missed this step. Once the
quotaPool deadlocked, it would block the CommandQueue and wreak havoc
on a cluster.

To prevent against this happening again, this change adds a new lint check that
assures that timeutil.Timer is used correctly.

checkSetTimeutilTimerRead assures that timeutil.Timer objects are used
correctly, to avoid race conditions and deadlocks. These timers require
callers to set their Read field to true when their channel has been received
on. If this field is not set and the timer's Reset method is called, we will
deadlock. This lint assures that the Read field is set in the most common
case where Reset is used, within a for-loop where each iteration blocks
on a select statement. The timers are usually used as timeouts on these
select statements, and need to be reset after each iteration.

for {
  timer.Reset(...)
  select {
    case <-timer.C:
      timer.Read = true   <--  lint verifies that this line is present
    case ...:
  }
}

Note that the lint is triggered even if Reset is not present in each iteration.
Even though this case will not deadlock without the timer.Read = true, I
think it's safer to mandate that all select cases set the Read flag anyway.
This prevents against issues calling Reset sometime after the for loop, which
could still deadlock.

@tschottdorf for the first commit, @tamird for the second.

Maybe fixes cockroachdb#17524.

When a `timeutil.Timer` is read from, it needs to set `timer.Read = true`
before calling `timer.Reset` or it risks deadlocking itself. A deadlock
in the `quotaPool` was possible because it missed this step. Once the
`quotaPool` deadlocked, it would block the `CommandQueue` and wreak havoc
on a cluster.
@nvb nvb requested review from a team, tamird and tbg August 22, 2017 00:58
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg
Copy link
Copy Markdown
Member

tbg commented Aug 22, 2017

Wow, nice lint! LGTM, but @tamird should probably take a look at your AST craziness as well. Either way, hope we can merge this tomorrow.


Reviewed 1 of 1 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/cmd/metacheck/main.go, line 160 at r2 (raw file):

	selectorIsTimer := func(s *ast.SelectorExpr) bool {
		typ := j.Program.Info.TypeOf(s.X).String()
		return strings.HasSuffix(typ, "pkg/util/timeutil.Timer")

This is silly, but if we renamed that package, your lint would silently fail, right? Can we somehow import that package to make sure we explode if we move it?


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm: Nice catch. I didn't look at the lint check, but even trying to add one is going above and beyond.


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/quota_pool.go, line 147 at r2 (raw file):

	for {
		slowTimer.Reset(base.SlowRequestThreshold)

Is there a reason you moved this Reset inside the loop? Ditto for the one below. That will make the loop warning multiple times which I don't think is desirable. Or maybe it is.


Comments from Reviewable

@tbg
Copy link
Copy Markdown
Member

tbg commented Aug 22, 2017

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/quota_pool.go, line 147 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is there a reason you moved this Reset inside the loop? Ditto for the one below. That will make the loop warning multiple times which I don't think is desirable. Or maybe it is.

I think it is desirable. We were staring at logs today that had zero activity and we had to grep through the history to figure out if there was ever a message. Starting to repetitively log loads of messages here has its own problems, but we should work really hard to never get into that state. Since each of these repetitive messages would be bound to a client's context, I think we should try this out.


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/quota_pool.go, line 147 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think it is desirable. We were staring at logs today that had zero activity and we had to grep through the history to figure out if there was ever a message. Starting to repetitively log loads of messages here has its own problems, but we should work really hard to never get into that state. Since each of these repetitive messages would be bound to a client's context, I think we should try this out.

Ok, maybe we should change the other slowTimers for consistency.


Comments from Reviewable

@nvb nvb force-pushed the nvanbenschoten/fix_deadlock branch from ee4b01b to 631aa79 Compare August 22, 2017 04:49
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Aug 22, 2017

Review status: 3 of 4 files reviewed at latest revision, 2 unresolved discussions.


pkg/cmd/metacheck/main.go, line 160 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is silly, but if we renamed that package, your lint would silently fail, right? Can we somehow import that package to make sure we explode if we move it?

Good point. @tamird, is there a standard way we've addressed this in the past?


pkg/storage/quota_pool.go, line 147 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok, maybe we should change the other slowTimers for consistency.

It was already inside the loop, just in the <-slowTimer.C case body instead of directly in the for loop, so this shouldn't change behavior. Moving it up just makes it consistent with other places where we want multiple warnings.

The other three instances of slow timers are in:

  • DistSender.sendToReplicas
  • Replica.redirectOnOrAcquireLease
  • Replica.tryExecuteWriteBatch

As you point out, none of these warn multiple times. These should probably be consistent with quotaPool.acquire, but I'm wary of introducing log spam that could clutter the logs even further. That said, the timers should only go off once a minute. You two have probably spent the most time debugging issues by looking at logs, so I'll defer to your judgment.


Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Aug 22, 2017

:lgtm: nice job!


Reviewed 1 of 1 files at r1, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/cmd/metacheck/main.go, line 160 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Good point. @tamird, is there a standard way we've addressed this in the past?

Yep, go/build.Import is the standard way to address this; see for instance https://github.com/cockroachdb/cockroach/eebbd7a/master/pkg/cmd/benchmark/main.go#L55


pkg/cmd/metacheck/main.go, line 195 at r3 (raw file):

				}
				selectorName := fmt.Sprint(selector.X)
				if selector.Sel.String() != "C" {

this will also silently break if the name of this field changes. probably fine.


Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Aug 22, 2017

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/cmd/metacheck/main.go, line 160 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Yep, go/build.Import is the standard way to address this; see for instance https://github.com/cockroachdb/cockroach/eebbd7a/master/pkg/cmd/benchmark/main.go#L55

Oh, and I imagine you can use reflection to confirm the type name (and the field name, "C", below).


Comments from Reviewable

This change adds a new metacheck that assures that `timeutil.Timer`
is used correctly.

> checkSetTimeutilTimerRead assures that timeutil.Timer objects are used
> correctly, to avoid race conditions and deadlocks. These timers require
> callers to set their Read field to true when their channel has been received
> on. If this field is not set and the timer's Reset method is called, we will
> deadlock. This lint assures that the Read field is set in the most common
> case where Reset is used, within a for-loop where each iteration blocks
> on a select statement. The timers are usually used as timeouts on these
> select statements, and need to be reset after each iteration.
>
> for {
>   timer.Reset(...)
>   select {
>     case <-timer.C:
>       timer.Read = true   <--  lint verifies that this line is present
>     case ...:
>   }
> }

Note that the lint is triggered even if `Reset` is not present in each iteration.
Even though this case will not deadlock without the `timer.Read = true`, I
think it's safer to mandate that all select cases set the `Read` flag anyway.
This prevents against issues calling `Reset` sometime after the `for` loop, which
could still deadlock.
@nvb nvb force-pushed the nvanbenschoten/fix_deadlock branch from 631aa79 to bcfb816 Compare August 22, 2017 18:26
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Aug 22, 2017

Review status: 3 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/cmd/metacheck/main.go, line 160 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Oh, and I imagine you can use reflection to confirm the type name (and the field name, "C", below).

Done.


pkg/cmd/metacheck/main.go, line 195 at r3 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this will also silently break if the name of this field changes. probably fine.

Done.


Comments from Reviewable

@tamird
Copy link
Copy Markdown
Contributor

tamird commented Aug 22, 2017

Reviewed 2 of 3 files at r2, 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

@nvb nvb merged commit ca37ad6 into cockroachdb:master Aug 22, 2017
@nvb nvb deleted the nvanbenschoten/fix_deadlock branch August 22, 2017 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

user cluster has three wedged timeseries ranges

5 participants