kvserver: default raft scheduler concurrency can cause cascading failures on beefy machines #56851
Description
The default number of worker goroutines in the Raft scheduler is 8*runtime.NumCPU(). We have observed, at least on v20.1, that this can cause pathological behavior, most likely when both the CPU count and the range count are high (32 CPUs and 55k ranges triggered it in one recent example).
The pathological behavior entails a full breakdown of the system. The UI and all ranges stop working. It becomes nearly impossible to extract debugging information from the system.
From a goroutine dump (via kill -ABRT), we see many of the worker goroutines with the following stack:
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueueN(0xc0011ea900, 0x8, 0xc041aea000, 0x844d, 0x9800, 0xc038b821c0)
(enqueue1 similarly shows up). These are contending on a mutex, which is thought to be the root cause of the pathological behavior. This all looks like golang/go#33747, which was fixed in go1.14. CRDB v20.1 and v20.2 are both built with go1.13, which makes them susceptible to this bug. v21.1 will be built with go1.15, which has the fix.
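To make the contention pattern concrete, here is a simplified model of it: many worker goroutines serializing on one mutex-guarded queue. The type and method names mimic the scheduler's, but this is an illustrative sketch, not the CockroachDB implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scheduler is a toy stand-in for the raftScheduler: a single mutex
// guards the work queue that every worker and producer touches.
type scheduler struct {
	mu    sync.Mutex
	queue []int
}

func (s *scheduler) enqueueN(ids ...int) {
	s.mu.Lock()
	s.queue = append(s.queue, ids...)
	s.mu.Unlock()
}

func (s *scheduler) dequeue() (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.queue) == 0 {
		return 0, false
	}
	id := s.queue[0]
	s.queue = s.queue[1:]
	return id, true
}

func main() {
	s := &scheduler{}
	for i := 0; i < 10000; i++ {
		s.enqueueN(i)
	}
	// 8 workers * 32 CPUs = 256 goroutines, as in the incident. Every
	// dequeue takes the same lock, so adding workers adds contention
	// rather than throughput.
	var processed int64
	var wg sync.WaitGroup
	for w := 0; w < 256; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, ok := s.dequeue(); !ok {
					return
				}
				atomic.AddInt64(&processed, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(processed) // 10000: correct, but fully serialized on s.mu
}
```

Under go1.13's mutex behavior (golang/go#33747), this kind of heavy contention could additionally starve other goroutines, which matches the full-system breakdown observed.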
Following the contention in the scheduler, we see outgoing raft message streams that are backed up because the recipient’s raft scheduler is unable to keep up. These have been seen stuck for dozens of minutes ([select, 17 minutes] etc):
goroutine 2707 [select]:
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc014642600, 0xc000000052, 0x4d, 0x5)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/flowcontrol.go:59 +0xaa
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*http2Client).Write(0xc0120461c0, 0xc0042dab00, 0xc02dbaa480, 0x5, 0x60, 0xc01dd44a80, 0x4d, 0xbb, 0xc03e70ecf5, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_client.go:840 +0x1ae
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc013e8cd80, 0x416e3e0, 0xc014642700, 0xc03e70ecf0, 0x5, 0x5, 0xc01dd44a80, 0x4d, 0xbb, 0xc00bf3ccc0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:828 +0x128
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc013e8cd80, 0x4d, 0xbb)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:693 +0xb3
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc000139e60, 0xc018462c30, 0xc019b036f0, 0xc000263840, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:573 +0x360
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc000139e60, 0x416e3e0, 0xc014642700, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:699 +0x399
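The shape of this backup can be modeled with a bounded send window: once the receiver stops draining (because its scheduler is gridlocked), senders block indefinitely. This is a toy sketch of the flow-control mechanism, not gRPC's actual implementation; the quota size is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// window plays the role of gRPC's writeQuota: a bounded budget of
	// in-flight messages. The capacity of 4 is illustrative.
	window := make(chan struct{}, 4)

	sent := 0
	for i := 0; i < 10; i++ {
		select {
		case window <- struct{}{}: // quota available: message goes out
			sent++
		case <-time.After(10 * time.Millisecond):
			// Nobody drains the window, so the sender is stuck in a
			// select, like the [select, 17 minutes] goroutines above.
			fmt.Println("blocked after", sent, "messages")
			return
		}
	}
}
```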
To resolve the gridlock, we added the environment variable COCKROACH_SCHEDULER_CONCURRENCY=64 to all nodes in the cluster and restarted.
We verified the problem was solved by letting the cluster come together, watching the metrics for Raft leaders to be elected on all ranges, and then gradually adding load back to the cluster while continuing to monitor.
We need to set better defaults for the Raft scheduler worker pool. Additionally, we should understand whether the extent of the degradation was expected given the misconfiguration, or whether there are further resilience improvements to make. This will likely entail reproducing the problem locally.
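One possible direction for a better default (the concrete cap of 96 is an assumption for illustration, not a decided value) is to keep scaling with CPU count but bound the total:

```go
package main

import "fmt"

// cappedConcurrency sketches a hypothetical better default: scale with
// CPU count, but cap the total so large machines don't spawn hundreds
// of workers contending on a single mutex. The cap of 96 is
// illustrative, not a decided value.
func cappedConcurrency(numCPU int) int {
	n := 8 * numCPU
	if n > 96 {
		n = 96
	}
	return n
}

func main() {
	fmt.Println(cappedConcurrency(4))  // 32: small machines unchanged
	fmt.Println(cappedConcurrency(32)) // 96 instead of 256
}
```

A cap like this keeps small deployments at the current behavior while avoiding the pathological worker counts seen on beefy machines.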
gz#8824