kvserver: default raft scheduler concurrency can cause cascading failures on beefy machines #56851
Description
The default number of worker goroutines in the Raft scheduler is 8*runtime.NumCPU(). We have observed, at least on v20.1, that this can cause pathological behavior, most likely when both the CPU count and the range count are high (32 CPUs and 55k ranges triggered it in one recent example).
The pathological behavior entails a full breakdown of the system. The UI and all ranges stop working. It becomes nearly impossible to extract debugging information from the system.
From a goroutine dump (via kill -ABRT), we see many of the worker goroutines with the following stack:
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueueN(0xc0011ea900, 0x8, 0xc041aea000, 0x844d, 0x9800, 0xc038b821c0)
(enqueue1 similarly shows up). These are contending on a mutex, which is thought to be the root cause of the pathological behavior. This all looks like golang/go#33747, which was fixed in go1.14. CRDB v20.1 and v20.2 are both built with go1.13, which makes them susceptible to this bug. v21.1 will be built with go1.15, which has the fix.
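To make the contention pattern concrete, here is a simplified model of it: many worker goroutines serializing on one mutex-guarded queue. The type and method names mimic the scheduler's, but this is an illustrative sketch, not the CockroachDB implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scheduler is a toy stand-in for the raftScheduler: a single mutex
// guards the work queue that every worker and producer touches.
type scheduler struct {
	mu    sync.Mutex
	queue []int
}

func (s *scheduler) enqueueN(ids ...int) {
	s.mu.Lock()
	s.queue = append(s.queue, ids...)
	s.mu.Unlock()
}

func (s *scheduler) dequeue() (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.queue) == 0 {
		return 0, false
	}
	id := s.queue[0]
	s.queue = s.queue[1:]
	return id, true
}

func main() {
	s := &scheduler{}
	for i := 0; i < 10000; i++ {
		s.enqueueN(i)
	}
	// 8 workers * 32 CPUs = 256 goroutines, as in the incident. Every
	// dequeue takes the same lock, so adding workers adds contention
	// rather than throughput.
	var processed int64
	var wg sync.WaitGroup
	for w := 0; w < 256; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, ok := s.dequeue(); !ok {
					return
				}
				atomic.AddInt64(&processed, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(processed) // 10000: correct, but fully serialized on s.mu
}
```

Under go1.13's mutex behavior (golang/go#33747), this kind of heavy contention could additionally starve other goroutines, which matches the full-system breakdown observed.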
Following the contention in the scheduler, we see outgoing raft message streams that are backed up because the recipient’s raft scheduler is unable to keep up. These have been seen stuck for dozens of minutes ([select, 17 minutes] etc):
goroutine 2707 [select]:
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc014642600, 0xc000000052, 0x4d, 0x5)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/flowcontrol.go:59 +0xaa
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*http2Client).Write(0xc0120461c0, 0xc0042dab00, 0xc02dbaa480, 0x5, 0x60, 0xc01dd44a80, 0x4d, 0xbb, 0xc03e70ecf5, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_client.go:840 +0x1ae
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc013e8cd80, 0x416e3e0, 0xc014642700, 0xc03e70ecf0, 0x5, 0x5, 0xc01dd44a80, 0x4d, 0xbb, 0xc00bf3ccc0, ...)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:828 +0x128
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc013e8cd80, 0x4d, 0xbb)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:693 +0xb3
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc000139e60, 0xc018462c30, 0xc019b036f0, 0xc000263840, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:573 +0x360
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc000139e60, 0x416e3e0, 0xc014642700, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:699 +0x399
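The shape of this backup can be modeled with a bounded send window: once the receiver stops draining (because its scheduler is gridlocked), senders block indefinitely. This is a toy sketch of the flow-control mechanism, not gRPC's actual implementation; the quota size is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// window plays the role of gRPC's writeQuota: a bounded budget of
	// in-flight messages. The capacity of 4 is illustrative.
	window := make(chan struct{}, 4)

	sent := 0
	for i := 0; i < 10; i++ {
		select {
		case window <- struct{}{}: // quota available: message goes out
			sent++
		case <-time.After(10 * time.Millisecond):
			// Nobody drains the window, so the sender is stuck in a
			// select, like the [select, 17 minutes] goroutines above.
			fmt.Println("blocked after", sent, "messages")
			return
		}
	}
}
```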
To resolve the gridlock, we added the environment variable COCKROACH_SCHEDULER_CONCURRENCY=64 to all nodes in the cluster and restarted.
We verified the problem was solved by letting the cluster come together, watching the metrics for Raft leaders to be elected on all ranges, and then gradually adding load back to the cluster while continuing to monitor.
We need to set better defaults for the Raft scheduler worker pool. Additionally, we should understand whether the extent of the degradation was expected given the misconfiguration, or whether there are further resilience improvements to make. This will likely entail reproducing the problem locally.
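One possible direction for a better default (the concrete cap of 96 is an assumption for illustration, not a decided value) is to keep scaling with CPU count but bound the total:

```go
package main

import "fmt"

// cappedConcurrency sketches a hypothetical better default: scale with
// CPU count, but cap the total so large machines don't spawn hundreds
// of workers contending on a single mutex. The cap of 96 is
// illustrative, not a decided value.
func cappedConcurrency(numCPU int) int {
	n := 8 * numCPU
	if n > 96 {
		n = 96
	}
	return n
}

func main() {
	fmt.Println(cappedConcurrency(4))  // 32: small machines unchanged
	fmt.Println(cappedConcurrency(32)) // 96 instead of 256
}
```

A cap like this keeps small deployments at the current behavior while avoiding the pathological worker counts seen on beefy machines.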
gz#8824