kv: improve Raft scheduler behavior under CPU starvation#56860
craig[bot] merged 3 commits into cockroachdb:master
Conversation
967c9d2 to 0a9c991
I ran some experiments and found this to work very well – even better than I was expecting. Here are the steps I took, along with the results: Trial 1:
@tbg what do you think about backporting the second commit?
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
```go
// part, it implements a FIFO queueing policy with no prioritization of some
// ranges over others. However, the queue can be configured with up to one
// high-priority range, which will always be placed at the front when added.
```
Did you consider having a dedicated Raft scheduler goroutine for the node liveness range? This code has long since left my memory, so that might be a much larger lift. This simple priority mechanism seems relatively contained and your test results are quite compelling.
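For concreteness, the queueing policy under discussion can be sketched as follows. This is a toy model with hypothetical names, not the actual scheduler code: a FIFO queue of range IDs in which one configured priority ID always jumps to the front.

```go
package main

import "fmt"

// rangeIDQueue is a toy sketch of a FIFO queue that supports a single
// high-priority range, which is always placed at the front when enqueued.
type rangeIDQueue struct {
	ids        []int64 // FIFO order; index 0 is popped first
	priorityID int64   // 0 means no priority range is configured
}

func (q *rangeIDQueue) Push(id int64) {
	if q.priorityID != 0 && id == q.priorityID {
		// The priority range jumps to the front of the line.
		q.ids = append([]int64{id}, q.ids...)
		return
	}
	q.ids = append(q.ids, id)
}

func (q *rangeIDQueue) PopFront() (int64, bool) {
	if len(q.ids) == 0 {
		return 0, false
	}
	id := q.ids[0]
	q.ids = q.ids[1:]
	return id, true
}

func main() {
	q := rangeIDQueue{priorityID: 2}
	q.Push(10)
	q.Push(11)
	q.Push(2) // e.g. the node liveness range
	id, _ := q.PopFront()
	fmt.Println(id) // the priority range comes out first: 2
}
```

All other ranges retain their FIFO order, which is why a single priority ID cannot reorder (or starve) the rest of the queue.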
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
```go
// Ranges to ensure that liveness never sees high Raft scheduler latency.
if bytes.HasPrefix(desc.StartKey, keys.NodeLivenessPrefix) {
	r.store.scheduler.SetPriorityID(desc.RangeID)
}
```
Once SetPriorityID is called, it is never cleared. Could this be done in the store initialization code?
nvanbenschoten
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @petermattis, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Once SetPriorityID is called, it is never cleared. Could this be done in the store initialization code?
The challenge is that we don't have a constant ID for the node liveness Range, though I think in practice it always ends up being 2 today. I'm not sure how to cleanly determine this ID in store initialization code without doing a meta range lookup.
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Did you consider having a dedicated Raft scheduler goroutine for the node liveness range? This code has long since left my memory, so that might be a much larger lift. This simple priority mechanism seems relatively contained and your test results are quite compelling.
I didn't consider it when writing this, but it is a valid alternative. It probably would be a larger lift though and has the downside of requiring a goroutine per node even for nodes that don't have a replica for the liveness range. Are there upsides that you think outweigh this?
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The challenge is that we don't have a constant ID for the node liveness Range, though I think in practice it always ends up being 2 today. I'm not sure how to cleanly determine this ID in store initialization code without doing a meta range lookup.
Ack. I had been mistakenly thinking that the node liveness range had a fixed ID.
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I didn't consider it when writing this, but it is a valid alternative. It probably would be a larger lift though and has the downside of requiring a goroutine per node even for nodes that don't have a replica for the liveness range. Are there upsides that you think outweigh this?
The possible upside is that you'd be guaranteed to not queue processing of the node liveness range behind processing of any other range. With the priority approach here, if the node liveness range needs processing, it still has to wait for a worker to grab it. I suspect that will happen quickly, though.
ajwerner
left a comment
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 3 of 3 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Ack. I had been mistakenly thinking that the node liveness range had a fixed ID.
The node liveness range may move to a node after it starts up. This method will be called on every existing replica during store startup anyway.
pkg/kv/kvserver/store_raft.go, line 116 at r2 (raw file):
```go
if len(toEnqueue) > 0 {
	s.scheduler.EnqueueRaftRequests(toEnqueue...)
}
```
nit: push the 0 len optimization down into EnqueueRaftRequests
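The suggested change can be sketched as follows; the stub type and field names here are hypothetical, with only the pushed-down empty-batch check being the point:

```go
package main

import "fmt"

// raftScheduler is a stub standing in for the real type.
type raftScheduler struct{ enqueued int }

// EnqueueRaftRequests performs the zero-length check itself, so callers
// can pass a possibly-empty batch unconditionally.
func (s *raftScheduler) EnqueueRaftRequests(ids ...int64) {
	if len(ids) == 0 {
		return // avoid taking the scheduler mutex for an empty batch
	}
	s.enqueued += len(ids)
}

func main() {
	var s raftScheduler
	var toEnqueue []int64
	s.EnqueueRaftRequests(toEnqueue...) // no-op; no guard needed at the call site
	s.EnqueueRaftRequests(1, 2, 3)
	fmt.Println(s.enqueued) // 3
}
```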
0a9c991 to d48f597
nvanbenschoten
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @petermattis, and @tbg)
pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The possible upside is that you'd be guaranteed to not queue processing of the node liveness range behind processing of any other range. With the priority approach here, if the node liveness range needs processing, it still has to wait for a worker to grab it. I suspect that will happen quickly, though.
That's a good point. The change here will ensure that the liveness range is quickly processed regardless of the queue size, but it doesn't ensure that the range will be quickly processed if the workers are all very slow in processing other ranges. But I tend to agree that such cases should be rare, and in those cases, we must be hitting serious slowness below this level. This would imply that getting node liveness through the Raft scheduler is probably still not enough to ensure it makes progress.
pkg/kv/kvserver/store_raft.go, line 116 at r2 (raw file):
Previously, ajwerner wrote…
```go
if len(toEnqueue) > 0 {
	s.scheduler.EnqueueRaftRequests(toEnqueue...)
}
```
nit: push the 0 len optimization down into EnqueueRaftRequests
Done.
tbg
left a comment
Please add a release note, probably even link to #56851 from it.
The experiments were run with the default raft scheduler concurrency, right (i.e. 32 workers per node)?
The quiescence test here (cockroach/pkg/cmd/roachtest/kv.go, line 475 in 44de8d8)
should work much better now, shouldn't it? We had neutered it quite a bit (dropped the range count lower and lower), but I think you may have just fixed the issues with it. See if you can set it up to roughly match your experiment and run it a few times, it will be very nice to have confidence that CRDB can support a vastly larger range count with your changes. (This doesn't need to hold up this PR, I imagine we want to run these tests a bunch after making them more aggressive, which takes time).
I'm on board with backporting the second commit. I would even entertain backporting the third after some bake time. The only way in which I could see this commit backfire (without a bug) is that the liveness range could starve out processing for other ranges, which seems improbable.
Reviewed 1 of 1 files at r1, 4 of 4 files at r4, 3 of 3 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/replica_init.go, line 350 at r5 (raw file):
```go
// Ranges to ensure that liveness never sees high Raft scheduler latency.
if bytes.HasPrefix(desc.StartKey, keys.NodeLivenessPrefix) {
	r.store.scheduler.SetPriorityID(desc.RangeID)
}
```
Looks like this isn't tested, isn't that relatively easy to add? Start a server, look up the liveness range, get the range id, pull the priority ID from the scheduler (via a helper in helpers_test.go) and that's it?
Of course this works, as indicated by your experiments, but a more direct proof would be nice.
pkg/kv/kvserver/scheduler.go, line 123 at r5 (raw file):
```go
if q.priorityID != 0 && q.priorityID != id {
	panic(fmt.Sprintf(
		"priority range ID already set: old=%d, new=%d",
```
This is a bit odd, there's nothing about SetPriorityID "externally" that indicates that this is not allowed. Perhaps a comment on the method is enough.
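A sketch of how the suggested doc comment might look on the method (hypothetical stub type, not the actual implementation):

```go
package main

import "fmt"

type raftScheduler struct{ priorityID int64 }

// SetPriorityID configures the single range that the scheduler will
// prioritize above all others. It may be set at most once to a non-zero
// value; calling it again with a different ID panics.
func (s *raftScheduler) SetPriorityID(id int64) {
	if s.priorityID != 0 && s.priorityID != id {
		panic(fmt.Sprintf(
			"priority range ID already set: old=%d, new=%d", s.priorityID, id))
	}
	s.priorityID = id
}

func main() {
	var s raftScheduler
	s.SetPriorityID(2)
	s.SetPriorityID(2) // setting the same ID again is allowed
	fmt.Println(s.priorityID)
}
```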
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
```go
var storeSchedulerConcurrency = envutil.EnvOrDefaultInt(
	"COCKROACH_SCHEDULER_CONCURRENCY", min(8*runtime.NumCPU(), 96))
```
A comment here wouldn't hurt - the 8*runtime.NumCPU() was determined to show peak performance on a particular benchmarking setup using some variant of a KV workload (@petermattis might remember particulars) and the cap arises from observations of severe mutex contention at high scheduler concurrency.
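Spelled out, the capped default amounts to the following (a sketch; the real code reads the COCKROACH_SCHEDULER_CONCURRENCY env var via envutil and derives the CPU count from runtime.NumCPU()):

```go
package main

import "fmt"

// schedulerConcurrency mirrors min(8*numCPU, 96): 8 workers per CPU was
// observed to give good performance on a KV workload, and the 96 cap
// bounds mutex contention on large machines.
func schedulerConcurrency(numCPU int) int {
	c := 8 * numCPU
	if c > 96 {
		c = 96
	}
	return c
}

func main() {
	fmt.Println(schedulerConcurrency(4))  // small machine: 32
	fmt.Println(schedulerConcurrency(32)) // large machine: capped at 96
}
```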
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
A comment here wouldn't hurt - the 8*runtime.NumCPU() was determined to show peak performance on a particular benchmarking setup using some variant of a KV workload (@petermattis might remember particulars) and the cap arises from observations of severe mutex contention at high scheduler concurrency.
My recollection is somewhat hazy. I recall doing some kv workload performance tests and 8*runtime.NumCPU() appeared to be a reasonable value. I don't recall this testing being particularly rigorous and it is possibly something that could be fruitfully revisited.
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Git history tells us that the testing wasn't particularly rigorous: 9a68241
The COCKROACH_SCHEDULER_CONCURRENCY env var controls the number of
scheduler goroutines used for Raft processing. In testing, the previous
default of 2**NumCPU proved to be too low to achieve good parallelization
of batch commits. Even with proposer-evaluated KV landing, we want lots of
concurrent batch commits so they can be grouped and written to RocksDB
more efficiently. The new default of 8*NumCPU was determined
experimentally on 8 CPU machines.
Notice that this was done way back in April, 2017. The stone ages.
Btw, nightly performance tests on higher CPU machines tend to flatline above 32 CPUs. I wonder if this PR will have an effect.
Relates to cockroachdb#56851. In investigations like cockroachdb#56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to drop the scheduler's goroutine pool size to bound the amount of contention on the mutex to ensure that the scheduler was able to schedule any goroutines. This commit caps this concurrency to 96, instead of letting it grow unbounded as a function of the number of cores on the system. Release note (performance improvement): The Raft processing goroutine pool's size is now capped at 96. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Relates to cockroachdb#56851. In cockroachdb#56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:
```
goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199
```
In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it. We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver. The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but in response, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of 128 (enqueueChunkSize). This should reduce the impact that a large number of Ranges have on contention in the Raft scheduler. Release note (performance improvement): Interactions between Raft heartbeats and the Raft goroutine pool scheduler are now more efficient and avoid excessive mutex contention. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Relates to cockroachdb#56851. In cockroachdb#56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We already split RPC traffic into two classes, ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations. This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into any issues with fairness or starvation of other ranges, as such starvation is not possible as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1). Release note (performance improvement): The Raft scheduler now prioritizes the node liveness Range. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).
d48f597 to 900729b
nvanbenschoten
left a comment
Please add a release note, probably even link to #56851 from it.
Done.
The experiments were run with the default raft scheduler concurrency, right (i.e. 32 workers per node)?
Right.
The quiescence test here should work much better now, shouldn't it?
Yes, it should be able to handle many more ranges without falling over. That doesn't mean that these ranges will be able to handle much traffic when none are quiesced, as Raft scheduler latency (see #56943) will still be quite large for all non-liveness ranges.
I'll experiment with updating that roachtest in a follow-up PR.
I'm on board with backporting the second commit. I would even entertain backporting the third after some bake time.
That plan sounds good to me.
The only way in which I could see this commit backfire (without a bug) is that the liveness range could starve out processing for other ranges, which seems improbable.
I'm not even sure that this is possible. I touched a bit on it in the commit message, but I think we're safe from any theoretical starvation as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1), which it always will be.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @petermattis, and @tbg)
pkg/kv/kvserver/replica_init.go, line 350 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Looks like this isn't tested, isn't that relatively easy to add? Start a server, look up the liveness range, get the range id, pull the priority ID from the scheduler (via a helper in helpers_test.go) and that's it?
Of course this works, as indicated by your experiments, but a more direct proof would be nice.
Done.
pkg/kv/kvserver/scheduler.go, line 123 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is a bit odd, there's nothing about SetPriorityID "externally" that indicates that this is not allowed. Perhaps a comment on the method is enough.
Done.
pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Git history tells us that the testing wasn't particularly rigorous: 9a68241
The COCKROACH_SCHEDULER_CONCURRENCY env var controls the number of
scheduler goroutines used for Raft processing. In testing, the previous
default of 2**NumCPU proved to be too low to achieve good parallelization
of batch commits. Even with proposer-evaluated KV landing, we want lots of
concurrent batch commits so they can be grouped and written to RocksDB
more efficiently. The new default of 8*NumCPU was determined
experimentally on 8 CPU machines.
Notice that this was done way back in April, 2017. The stone ages.
Btw, nightly performance tests on higher CPU machines tend to flatline above 32 CPUs. I wonder if this PR will have an effect.
Done.
petermattis
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
bors r+
Build failed (retrying...):
Build failed:
bors r+
Build failed (retrying...):
Build failed:
Courtesy re-bors r+
bors r+
Build succeeded:
Addresses a TODO. Avoids allocations for small numbers of messages. This is common for inactive ranges that are still heartbeating. It should help with cases like cockroachdb#56860.
This was responsible for an allocation per processReady and processRequestQueue call. This means that it would cause two heap allocations per Range on each replica during a Raft heartbeat. As a result, it was the largest source of allocations by far in tests like cockroachdb#56860 with lots of idle Ranges heartbeating frequently. This change should reduce the cost of handling these heartbeats, reducing the impact the heartbeats have on other active Ranges.
57007: kv: eliminate heap allocations throughout Raft messaging and processing logic r=nvanbenschoten a=nvanbenschoten This PR contains a few improvements within the Raft messaging and processing code to avoid common heap allocations that stuck out during tests in #56860. This will improve the general efficiency of the Raft pipeline and will specifically reduce the cost of handling large coalesced heartbeats. This will reduce the impact that these heartbeats have on the [scheduler latency](#56943) of other active ranges. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>




Fixes #56851.
This PR contains 3 commits that should improve the health of a cluster under CPU starvation and with many Ranges. I ran a series of experiments (see #56860 (comment)) which demonstrate that the combination of these commits improves the health of a cluster with many ranges dramatically, ensuring that liveness never falls over and that liveness heartbeat latency stays constant even as all other ranges become overloaded.
kv: cap COCKROACH_SCHEDULER_CONCURRENCY at 96
In investigations like #56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to drop the scheduler's goroutine pool size to bound the amount of contention on the mutex to ensure that the scheduler was able to schedule any goroutines.
This commit caps this concurrency to 96, instead of letting it grow unbounded as a function of the number of cores on the system.
kv: batch enqueue Ranges in Raft scheduler for coalesced heartbeats
In #56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:
In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it.
We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver.
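The chunked enqueueing that the commit introduces can be sketched as a toy model (hypothetical type; the real scheduler takes its mutex, appends to its queue, and signals worker goroutines once per chunk):

```go
package main

import "fmt"

// enqueueChunkSize matches the chunking factor described in the commit.
const enqueueChunkSize = 128

// chunkedScheduler is a toy model: lockAcquisitions stands in for the
// number of times the scheduler mutex would be taken.
type chunkedScheduler struct{ lockAcquisitions int }

// enqueueN locks the scheduler once per chunk of up to enqueueChunkSize
// IDs rather than once per ID.
func (s *chunkedScheduler) enqueueN(ids []int64) {
	for len(ids) > 0 {
		chunk := ids
		if len(chunk) > enqueueChunkSize {
			chunk = chunk[:enqueueChunkSize]
		}
		s.lockAcquisitions++ // one lock covers the whole chunk
		ids = ids[len(chunk):]
	}
}

func main() {
	var s chunkedScheduler
	s.enqueueN(make([]int64, 10000)) // a coalesced heartbeat for 10k ranges
	fmt.Println(s.lockAcquisitions)  // ceil(10000/128) = 79 instead of 10000
}
```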
The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but in response, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of 128 (enqueueChunkSize). This should reduce the impact that a large number of Ranges have on contention in the Raft scheduler.
kv: prioritize NodeLiveness Range in Raft scheduler
In #56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We already split RPC traffic into two classes, ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations.
This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into any issues with fairness or starvation of other ranges, as such starvation is not possible as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1).
@ajwerner I'm adding you here specifically because we've talked about the need for something like the last commit a few times.