kvserver: reject requests when quiescing #52843
andreimatei wants to merge 2 commits into cockroachdb:master
Conversation
This patch makes the Store reject requests once its stopper is quiescing.

Before this patch, we didn't seem to have good protection against requests running after the stopper has been stopped. We've seen this in some tests, where requests were racing with the engine closing. Running after the stopper has stopped is generally undefined behavior, so let's avoid it. I think the reason we didn't see many errors from such races is that we stop the gRPC server even before we start quiescing, so at least for requests coming from remote nodes we had some draining window.

This is a lighter version of cockroachdb#51566. That patch tried to run requests as tasks, so that they properly synchronize with server shutdown. This patch still allows races between requests that started evaluating before the server started quiescing, and server shutdown.

Touches cockroachdb#51544

Release note: None
This patch runs, as tasks, some infrequent operations that might use the storage engine, thus synchronizing them with server shutdown. In cockroachdb#51544 we've seen one of these cause a crash by executing after Pebble was shut down.

Release note: None
nvb left a comment:
Thanks for simplifying this. This is better, but unfortunately even this seems too heavyweight to me. Forcing every request in the system to synchronize on a single mutex just so we can have a best-effort check for quiescence does not seem like a particularly good trade-off. Even these short-lived mutex acquisitions impose a non-negligible performance cost, especially if they're globally scoped. We see this today in profiles, and we're nowhere near as efficient as we eventually will be.
As a general design principle when approaching process draining, I think we need to heavily optimize for the request serving path, potentially even at the expense of the draining path. A process will serve millions/billions of requests over its lifetime and will only drain once. So we should make things as cheap as possible for requests. If we want a best-effort check, we should use an atomic read, ideally on a variable isolated in its own cache line. If we want to wait for requests to drain, we should use an atomic counter, which is already what we're doing with the sync.WaitGroup in Stopper. If we need to do something fancier, let's at least try to piggyback on top of existing draining mechanisms in gRPC.
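The best-effort check described above can be sketched with a single atomic load on the request path. This is an illustrative model, not the Store's real code; the field and method names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errQuiescing = errors.New("store is quiescing; request rejected")

// Store models a request-serving component. The quiescing flag is read
// with one atomic load per request; in a real implementation one might
// pad it onto its own cache line to avoid false sharing with other hot
// fields.
type Store struct {
	quiescing int32
}

// SetQuiescing marks the store as draining. Called once, at shutdown.
func (s *Store) SetQuiescing() { atomic.StoreInt32(&s.quiescing, 1) }

// Send performs a best-effort quiescence check before serving. A
// request can still race past the check, but the common path costs a
// single wait-free atomic read instead of a global mutex.
func (s *Store) Send(req string) error {
	if atomic.LoadInt32(&s.quiescing) == 1 {
		return errQuiescing
	}
	fmt.Println("served:", req)
	return nil
}

func main() {
	s := &Store{}
	_ = s.Send("get(a)")
	s.SetQuiescing()
	if err := s.Send("get(b)"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The trade-off matches the comment above: the drain path gets no hard guarantee from this check alone, but the serving path pays essentially nothing.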
…eation

This commit optimizes the Stopper for task creation by ripping out the existing heavyweight task tracking in production builds. I realized that my biggest concern with most of the proposals (cockroachdb#52843 and cockroachdb#51566) being floated to address cockroachdb#51544 was that they bought more into the inefficient tracking in the Stopper, not that they were doing anything inherently wrong themselves.

Before this change, creating a task acquired an exclusive mutex and then wrote to a hashmap. At high levels of concurrency, this would have become a performance chokepoint. After this change, the cost of launching a task is three atomic increments: one to acquire a read lock, one to register with a WaitGroup, and one to release the read lock. When no one is draining the Stopper, these are all wait-free operations, which means that task creation becomes wait-free.

With a change like this, I would feel much more comfortable pushing on Stopper tasks to solve cockroachdb#51544.
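The scheme described in this commit message can be modeled as below: task creation takes an RWMutex read lock, registers with a WaitGroup, and releases the read lock, while draining takes the write lock once and then waits. This is a simplified sketch, not the actual Stopper implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// Stopper models wait-free task creation: when nobody holds the write
// lock, RLock, WaitGroup.Add, and RUnlock are each a single atomic
// operation with no waiting.
type Stopper struct {
	mu        sync.RWMutex // write-locked only once, by Stop
	quiescing bool         // guarded by mu
	tasks     sync.WaitGroup
}

// RunAsyncTask launches f as a tracked goroutine, returning false if
// the Stopper is already quiescing.
func (s *Stopper) RunAsyncTask(f func()) bool {
	s.mu.RLock()
	if s.quiescing {
		s.mu.RUnlock()
		return false
	}
	s.tasks.Add(1) // registered before releasing the read lock
	s.mu.RUnlock()
	go func() {
		defer s.tasks.Done()
		f()
	}()
	return true
}

// Stop flips the quiescing flag under the write lock, then waits for
// every in-flight task. New tasks are refused from this point on.
func (s *Stopper) Stop() {
	s.mu.Lock()
	s.quiescing = true
	s.mu.Unlock()
	s.tasks.Wait()
}

func main() {
	s := &Stopper{}
	done := make(chan struct{})
	accepted := s.RunAsyncTask(func() { close(done) })
	<-done
	s.Stop()
	refused := !s.RunAsyncTask(func() {})
	fmt.Println(accepted, refused) // true true
}
```

Taking `WaitGroup.Add` under the read lock is what makes the scheme sound: `Stop` cannot observe the flag as set and call `Wait` while a creator is between checking the flag and registering the task.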
This patch makes the Store reject requests once its stopper is
quiescing.
Before this patch, we didn't seem to have good protection against
requests running after the stopper has been stopped. We've seen this
in some tests, where requests were racing with the engine closing.
Running after the stopper has stopped is generally pretty undefined
behavior, so let's avoid it.
I think the reason why we didn't see a lot of errors from such races is
that we're stopping the gRPC server even before we start quiescing, so
at least for requests coming from remote nodes we had some draining
window.
This is a lighter version of #51566. That patch was trying to run
requests as tasks, so that they properly synchronize with server
shutdown. This patch still allows races between requests that started
evaluating before the server started quiescing, and server shutdown.
Touches #51544
Release note: None