kv: improve Raft scheduler behavior under CPU starvation#56860
craig[bot] merged 3 commits into cockroachdb:master
Conversation
967c9d2 to 0a9c991
I ran some experiments and found this to work very well – even better than I was expecting. Here are the steps I took, along with the results: Trial 1:
@tbg what do you think about backporting the second commit?
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
```go
// part, it implements a FIFO queueing policy with no prioritization of some
// ranges over others. However, the queue can be configured with up to one
// high-priority range, which will always be placed at the front when added.
```
Did you consider having a dedicated Raft scheduler goroutine for the node liveness range? This code has long since left my memory, so that might be a much larger lift. This simple priority mechanism seems relatively contained and your test results are quite compelling.
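For concreteness, the queueing policy under discussion can be sketched as follows. This is a toy model with hypothetical names, not the actual scheduler code: a FIFO queue of range IDs in which one configured priority ID always jumps to the front.

```go
package main

import "fmt"

// rangeIDQueue is a toy sketch of a FIFO queue that supports a single
// high-priority range, which is always placed at the front when enqueued.
type rangeIDQueue struct {
	ids        []int64 // FIFO order; index 0 is popped first
	priorityID int64   // 0 means no priority range is configured
}

func (q *rangeIDQueue) Push(id int64) {
	if q.priorityID != 0 && id == q.priorityID {
		// The priority range jumps to the front of the line.
		q.ids = append([]int64{id}, q.ids...)
		return
	}
	q.ids = append(q.ids, id)
}

func (q *rangeIDQueue) PopFront() (int64, bool) {
	if len(q.ids) == 0 {
		return 0, false
	}
	id := q.ids[0]
	q.ids = q.ids[1:]
	return id, true
}

func main() {
	q := rangeIDQueue{priorityID: 2}
	q.Push(10)
	q.Push(11)
	q.Push(2) // e.g. the node liveness range
	id, _ := q.PopFront()
	fmt.Println(id) // the priority range comes out first: 2
}
```

All other ranges retain their FIFO order, which is why a single priority ID cannot reorder (or starve) the rest of the queue.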
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
```go
// Ranges to ensure that liveness never sees high Raft scheduler latency.
if bytes.HasPrefix(desc.StartKey, keys.NodeLivenessPrefix) {
	r.store.scheduler.SetPriorityID(desc.RangeID)
}
```
Once SetPriorityID is called, it is never cleared. Could this be done in the store initialization code?
nvanbenschoten
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @petermattis, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Once SetPriorityID is called, it is never cleared. Could this be done in the store initialization code?
The challenge is that we don't have a constant ID for the node liveness Range, though I think in practice it always ends up being 2 today. I'm not sure how to cleanly determine this ID in store initialization code without doing a meta range lookup.
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Did you consider having a dedicated Raft scheduler goroutine for the node liveness range? This code has long since left my memory, so that might be a much larger lift. This simple priority mechanism seems relatively contained and your test results are quite compelling.
I didn't consider it when writing this, but it is a valid alternative. It probably would be a larger lift though and has the downside of requiring a goroutine per node even for nodes that don't have a replica for the liveness range. Are there upsides that you think outweigh this?
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The challenge is that we don't have a constant ID for the node liveness Range, though I think in practice it always ends up being 2 today. I'm not sure how to cleanly determine this ID in store initialization code without doing a meta range lookup.
Ack. I had been mistakenly thinking that the node liveness range had a fixed ID.
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I didn't consider it when writing this, but it is a valid alternative. It probably would be a larger lift though and has the downside of requiring a goroutine per node even for nodes that don't have a replica for the liveness range. Are there upsides that you think outweigh this?
The possible upside is that you'd be guaranteed to not queue processing of the node liveness range behind processing of any other range. With the priority approach here, if the node liveness range needs processing, it still has to wait for a worker to grab it. I suspect that will happen quickly, though.
ajwerner
left a comment
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 3 of 3 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Ack. I had been mistakenly thinking that the node liveness range had a fixed ID.
The node liveness range may move to a node after it starts up. This method will be called on every existing replica during store startup anyway.
pkg/kv/kvserver/store_raft.go, line 116 at r2 (raw file):
```go
if len(toEnqueue) > 0 {
	s.scheduler.EnqueueRaftRequests(toEnqueue...)
}
```
nit: push the 0 len optimization down into EnqueueRaftRequests
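The suggested change can be sketched as follows; the stub type and field names here are hypothetical, with only the pushed-down empty-batch check being the point:

```go
package main

import "fmt"

// raftScheduler is a stub standing in for the real type.
type raftScheduler struct{ enqueued int }

// EnqueueRaftRequests performs the zero-length check itself, so callers
// can pass a possibly-empty batch unconditionally.
func (s *raftScheduler) EnqueueRaftRequests(ids ...int64) {
	if len(ids) == 0 {
		return // avoid taking the scheduler mutex for an empty batch
	}
	s.enqueued += len(ids)
}

func main() {
	var s raftScheduler
	var toEnqueue []int64
	s.EnqueueRaftRequests(toEnqueue...) // no-op; no guard needed at the call site
	s.EnqueueRaftRequests(1, 2, 3)
	fmt.Println(s.enqueued) // 3
}
```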
0a9c991 to d48f597
nvanbenschoten
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @petermattis, and @tbg)
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The possible upside is that you'd be guaranteed to not queue processing of the node liveness range behind processing of any other range. With the priority approach here, if the node liveness range needs processing, it still has to wait for a worker to grab it. I suspect that will happen quickly, though.
That's a good point. The change here will ensure that the liveness range is quickly processed regardless of the queue size, but it doesn't ensure that the range will be quickly processed if the workers are all very slow in processing other ranges. But I tend to agree that such cases should be rare, and in those cases, we must be hitting serious slowness below this level. This would imply that getting node liveness through the Raft scheduler is probably still not enough to ensure it makes progress.
pkg/kv/kvserver/store_raft.go, line 116 at r2 (raw file):
Previously, ajwerner wrote…
```go
if len(toEnqueue) > 0 {
	s.scheduler.EnqueueRaftRequests(toEnqueue...)
}
```
nit: push the 0 len optimization down into EnqueueRaftRequests
Done.
tbg
left a comment
Please add a release note, probably even link to #56851 from it.
The experiments were run with the default raft scheduler concurrency, right (i.e. 32 workers per node)?
The quiescence test here (cockroach/pkg/cmd/roachtest/kv.go, line 475 in 44de8d8)
should work much better now, shouldn't it? We had neutered it quite a bit (dropped the range count lower and lower), but I think you may have just fixed the issues with it. See if you can set it up to roughly match your experiment and run it a few times, it will be very nice to have confidence that CRDB can support a vastly larger range count with your changes. (This doesn't need to hold up this PR, I imagine we want to run these tests a bunch after making them more aggressive, which takes time).
I'm on board with backporting the second commit. I would even entertain backporting the third after some bake time. The only way in which I could see this commit backfire (without a bug) is that the liveness range could starve out processing for other ranges, which seems improbable.
Reviewed 1 of 1 files at r1, 4 of 4 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/replica_init.go, line 350 at r5 (raw file):
```go
// Ranges to ensure that liveness never sees high Raft scheduler latency.
if bytes.HasPrefix(desc.StartKey, keys.NodeLivenessPrefix) {
	r.store.scheduler.SetPriorityID(desc.RangeID)
}
```
Looks like this isn't tested, isn't that relatively easy to add? Start a server, look up the liveness range, get the range id, pull the priority ID from the scheduler (via a helper in helpers_test.go) and that's it?
Of course this works, as indicated by your experiments, but a more direct proof would be nice.
pkg/kv/kvserver/scheduler.go, line 123 at r5 (raw file):
```go
if q.priorityID != 0 && q.priorityID != id {
	panic(fmt.Sprintf(
		"priority range ID already set: old=%d, new=%d",
```
This is a bit odd, there's nothing about SetPriorityID "externally" that indicates that this is not allowed. Perhaps a comment on the method is enough.
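A sketch of how the suggested doc comment might look on the method (hypothetical stub type, not the actual implementation):

```go
package main

import "fmt"

type raftScheduler struct{ priorityID int64 }

// SetPriorityID configures the single range that the scheduler will
// prioritize above all others. It may be set at most once to a non-zero
// value; calling it again with a different ID panics.
func (s *raftScheduler) SetPriorityID(id int64) {
	if s.priorityID != 0 && s.priorityID != id {
		panic(fmt.Sprintf(
			"priority range ID already set: old=%d, new=%d", s.priorityID, id))
	}
	s.priorityID = id
}

func main() {
	var s raftScheduler
	s.SetPriorityID(2)
	s.SetPriorityID(2) // setting the same ID again is allowed
	fmt.Println(s.priorityID)
}
```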
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
```go
var storeSchedulerConcurrency = envutil.EnvOrDefaultInt(
	"COCKROACH_SCHEDULER_CONCURRENCY", min(8*runtime.NumCPU(), 96))
```
A comment here wouldn't hurt - the 8*runtime.NumCPU() was determined to show peak performance on a particular benchmarking setup using some variant of a KV workload (@petermattis might remember particulars) and the cap arises from observations of severe mutex contention at high scheduler concurrency.
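Spelled out, the capped default amounts to the following (a sketch; the real code reads the COCKROACH_SCHEDULER_CONCURRENCY env var via envutil and derives the CPU count from runtime.NumCPU()):

```go
package main

import "fmt"

// schedulerConcurrency mirrors min(8*numCPU, 96): 8 workers per CPU was
// observed to give good performance on a KV workload, and the 96 cap
// bounds mutex contention on large machines.
func schedulerConcurrency(numCPU int) int {
	c := 8 * numCPU
	if c > 96 {
		c = 96
	}
	return c
}

func main() {
	fmt.Println(schedulerConcurrency(4))  // small machine: 32
	fmt.Println(schedulerConcurrency(32)) // large machine: capped at 96
}
```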
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
A comment here wouldn't hurt - the 8*runtime.NumCPU() was determined to show peak performance on a particular benchmarking setup using some variant of a KV workload (@petermattis might remember particulars) and the cap arises from observations of severe mutex contention at high scheduler concurrency.
My recollection is somewhat hazy. I recall doing some kv workload performance tests and 8*runtime.NumCPU() appeared to be a reasonable value. I don't recall this testing being particularly rigorous and it is possibly something that could be fruitfully revisited.
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Git history tells us that the testing wasn't particularly rigorous: 9a68241
The COCKROACH_SCHEDULER_CONCURRENCY env var controls the number of
scheduler goroutines used for Raft processing. In testing, the previous
default of 2**NumCPU proved to be too low to achieve good parallelization
of batch commits. Even with proposer-evaluated KV landing, we want lots of
concurrent batch commits so they can be grouped and written to RocksDB
more efficiently. The new default of 8*NumCPU was determined
experimentally on 8 CPU machines.
Notice that this was done way back in April, 2017. The stone ages.
Btw, nightly performance tests on higher CPU machines tend to flatline above 32 CPUs. I wonder if this PR will have an effect.
Relates to cockroachdb#56851. In investigations like cockroachdb#56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to drop the scheduler's goroutine pool size to bound the amount of contention on the mutex to ensure that the scheduler was able to schedule any goroutines. This commit caps this concurrency to 96, instead of letting it grow unbounded as a function of the number of cores on the system. Release note (performance improvement): The Raft processing goroutine pool's size is now capped at 96. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Relates to cockroachdb#56851. In cockroachdb#56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:
```
goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199
```
In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it. We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver. The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but in response, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of 128 (enqueueChunkSize). This should reduce the impact that a large number of Ranges have on contention in the Raft scheduler. Release note (performance improvement): Interactions between Raft heartbeats and the Raft goroutine pool scheduler are now more efficient and avoid excessive mutex contention. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Relates to cockroachdb#56851. In cockroachdb#56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We already split RPC traffic into two classes, ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations. This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into any issues with fairness or starvation of other ranges, as such starvation is not possible as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1). Release note (performance improvement): The Raft scheduler now prioritizes the node liveness Range. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).
d48f597 to 900729b
nvanbenschoten
left a comment
Please add a release note, probably even link to #56851 from it.
Done.
The experiments were run with the default raft scheduler concurrency, right (i.e. 32 workers per node)?
Right.
The quiescence test here should work much better now, shouldn't it?
Yes, it should be able to handle many more ranges without falling over. That doesn't mean that these ranges will be able to handle much traffic when none are quiesced, as Raft scheduler latency (see #56943) will still be quite large for all non-liveness ranges.
I'll experiment with updating that roachtest in a follow-up PR.
I'm on board with backporting the second commit. I would even entertain backporting the third after some bake time.
That plan sounds good to me.
The only way in which I could see this commit backfire (without a bug) is that the liveness range could starve out processing for other ranges, which seems improbable.
I'm not even sure that this is possible. I touched a bit on it in the commit message, but I think we're safe from any theoretical starvation as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1), which it always will be.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @petermattis, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Looks like this isn't tested, isn't that relatively easy to add? Start a server, look up the liveness range, get the range id, pull the priority ID from the scheduler (via a helper in helpers_test.go) and that's it?
Of course this works, as indicated by your experiments, but a more direct proof would be nice.
Done.
pkg/kv/kvserver/scheduler.go, line 123 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a bit odd, there's nothing about SetPriorityID "externally" that indicates that this is not allowed. Perhaps a comment on the method is enough.
Done.
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Git history tells us that the testing wasn't particularly rigorous: 9a68241
The COCKROACH_SCHEDULER_CONCURRENCY env var controls the number of
scheduler goroutines used for Raft processing. In testing, the previous
default of 2**NumCPU proved to be too low to achieve good parallelization
of batch commits. Even with proposer-evaluated KV landing, we want lots of
concurrent batch commits so they can be grouped and written to RocksDB
more efficiently. The new default of 8*NumCPU was determined
experimentally on 8 CPU machines.
Notice that this was done way back in April, 2017. The stone ages.
Btw, nightly performance tests on higher CPU machines tend to flatline above 32 CPUs. I wonder if this PR will have an effect.
Done.
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
bors r+
Build failed (retrying...):
Build failed:
bors r+
Build failed (retrying...):
Build failed:
Courtesy re-bors r+
bors r+
Build succeeded:
Addresses a TODO. Avoids allocations for small numbers of messages. This is common for inactive ranges that are still heartbeating. It should help with cases like cockroachdb#56860.
This was responsible for an allocation per processReady and processRequestQueue call. This means that it would cause two heap allocations per Range on each replica during a Raft heartbeat. As a result, it was the largest source of allocations by far in tests like cockroachdb#56860 with lots of idle Ranges heartbeating frequently. This change should reduce the cost of handling these heartbeats, reducing the impact the heartbeats have on other active Ranges.
57007: kv: eliminate heap allocations throughout Raft messaging and processing logic r=nvanbenschoten a=nvanbenschoten This PR contains a few improvements within the Raft messaging and processing code to avoid common heap allocations that stuck out during tests in #56860. This will improve the general efficiency of the Raft pipeline and will specifically reduce the cost of handling large coalesced heartbeats. This will reduce the impact that these heartbeats have on the [scheduler latency](#56943) of other active ranges. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>




Fixes #56851.
This PR contains 3 commits that should improve the health of a cluster under CPU starvation and with many Ranges. I ran a series of experiments (see #56860 (comment)) which demonstrate that the combination of these commits improves the health of a cluster with many ranges dramatically, ensuring that liveness never falls over and that liveness heartbeat latency stays constant even as all other ranges become overloaded.
kv: cap COCKROACH_SCHEDULER_CONCURRENCY at 96
In investigations like #56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to drop the scheduler's goroutine pool size to bound the amount of contention on the mutex to ensure that the scheduler was able to schedule any goroutines.
This commit caps this concurrency to 96, instead of letting it grow unbounded as a function of the number of cores on the system.
kv: batch enqueue Ranges in Raft scheduler for coalesced heartbeats
In #56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:
In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it.
We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver.
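The chunked enqueueing that the commit introduces can be sketched as a toy model (hypothetical type; the real scheduler takes its mutex, appends to its queue, and signals worker goroutines once per chunk):

```go
package main

import "fmt"

// enqueueChunkSize matches the chunking factor described in the commit.
const enqueueChunkSize = 128

// chunkedScheduler is a toy model: lockAcquisitions stands in for the
// number of times the scheduler mutex would be taken.
type chunkedScheduler struct{ lockAcquisitions int }

// enqueueN locks the scheduler once per chunk of up to enqueueChunkSize
// IDs rather than once per ID.
func (s *chunkedScheduler) enqueueN(ids []int64) {
	for len(ids) > 0 {
		chunk := ids
		if len(chunk) > enqueueChunkSize {
			chunk = chunk[:enqueueChunkSize]
		}
		s.lockAcquisitions++ // one lock covers the whole chunk
		ids = ids[len(chunk):]
	}
}

func main() {
	var s chunkedScheduler
	s.enqueueN(make([]int64, 10000)) // a coalesced heartbeat for 10k ranges
	fmt.Println(s.lockAcquisitions)  // ceil(10000/128) = 79 instead of 10000
}
```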
The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but in response, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of 128 (enqueueChunkSize). This should reduce the impact that a large number of Ranges have on contention in the Raft scheduler.
kv: prioritize NodeLiveness Range in Raft scheduler
In #56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We already split RPC traffic into two classes, ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations.
This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into any issues with fairness or starvation of other ranges, as such starvation is not possible as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1).
@ajwerner I'm adding you here specifically because we've talked about the need for something like the last commit a few times.