kv: improve Raft scheduler behavior under CPU starvation#56860

Merged

craig[bot] merged 3 commits into cockroachdb:master from nvb:nvanbenschoten/raftSchedulerOpt on Nov 22, 2020

Conversation

@nvb
Contributor

@nvb nvb commented Nov 18, 2020

Fixes #56851.

This PR contains 3 commits that should improve the health of a cluster under CPU starvation and with many Ranges. I ran a series of experiments (see #56860 (comment)) which demonstrate that the combination of these commits improves the health of a cluster with many ranges dramatically, ensuring that liveness never falls over and that liveness heartbeat latency stays constant even as all other ranges become overloaded.

kv: cap COCKROACH_SCHEDULER_CONCURRENCY at 96

In investigations like #56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to reduce the scheduler's goroutine pool size, bounding contention on the mutex enough that the scheduler could still schedule goroutines at all.

This commit caps this concurrency at 96, instead of letting it grow unbounded as a function of the number of cores on the system.
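In sketch form, the new default could look like the following. This is a hypothetical stand-in: the real code uses CockroachDB's envutil.EnvOrDefaultInt helper, which the simplified lookup below only approximates.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// schedulerConcurrency mirrors the new default: 8 goroutines per CPU,
// capped at 96 to bound contention on the scheduler's shared mutex.
// The env var name is real; the lookup below is a simplified stand-in
// for envutil.EnvOrDefaultInt.
func schedulerConcurrency() int {
	def := 8 * runtime.NumCPU()
	if def > 96 {
		def = 96 // the cap introduced by this commit
	}
	if v, ok := os.LookupEnv("COCKROACH_SCHEDULER_CONCURRENCY"); ok {
		if n, err := strconv.Atoi(v); err == nil {
			return n // an explicit override still wins
		}
	}
	return def
}

func main() {
	fmt.Println(schedulerConcurrency())
}
```

On a 4-vCPU machine such as the n1-standard-4 instances used in the experiments below, this yields 32 workers; only machines with more than 12 vCPUs hit the cap.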

kv: batch enqueue Ranges in Raft scheduler for coalesced heartbeats

In #56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:

goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199

In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it.

We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver.

The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but, in exchange, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of up to 128 (enqueueChunkSize). This should reduce the impact that a large number of Ranges has on contention in the Raft scheduler.
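The locking pattern can be sketched as follows. Here scheduler, rangeID, and the lock counter are simplified stand-ins for the real raftScheduler state (which also tracks per-range flags), but the chunking mirrors the described enqueueChunkSize batching:

```go
package main

import (
	"fmt"
	"sync"
)

const enqueueChunkSize = 128 // the chunk size named in the commit

// rangeID stands in for roachpb.RangeID.
type rangeID int64

// scheduler is a minimal sketch of the Raft scheduler's shared queue;
// locks counts mutex acquisitions purely for illustration.
type scheduler struct {
	mu    sync.Mutex
	queue []rangeID
	locks int
}

// enqueueOne locks once per range: the pre-patch behavior, where a
// coalesced heartbeat for 10k ranges locked the scheduler 10k times.
func (s *scheduler) enqueueOne(id rangeID) {
	s.mu.Lock()
	s.locks++
	s.queue = append(s.queue, id)
	s.mu.Unlock()
}

// enqueueBatch locks once per chunk of up to enqueueChunkSize ranges,
// reducing mutex acquisitions by a factor of up to 128.
func (s *scheduler) enqueueBatch(ids []rangeID) {
	for len(ids) > 0 {
		n := len(ids)
		if n > enqueueChunkSize {
			n = enqueueChunkSize
		}
		s.mu.Lock()
		s.locks++
		s.queue = append(s.queue, ids[:n]...)
		s.mu.Unlock()
		ids = ids[n:]
	}
}

func main() {
	ids := make([]rangeID, 10000)
	for i := range ids {
		ids[i] = rangeID(i)
	}
	var a, b scheduler
	for _, id := range ids {
		a.enqueueOne(id)
	}
	b.enqueueBatch(ids)
	fmt.Println(a.locks, b.locks) // prints 10000 79
}
```

Chunking (rather than a single lock per heartbeat) also keeps any one enqueue from holding the mutex for too long while workers are trying to dequeue.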

kv: prioritize NodeLiveness Range in Raft scheduler

In #56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We already split RPC traffic into two classes, ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations.

This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. The prioritization mechanism is naive but should be effective. It should also not run into any issues with fairness or starvation of other ranges: such starvation is impossible as long as the scheduler concurrency (8*num_cpus) exceeds the number of high-priority ranges (1).
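A minimal sketch of the naive prioritization, assuming a simplified queue in place of the real scheduler state (rangeQueue and its methods are illustrative, not the actual API):

```go
package main

import "fmt"

// rangeID stands in for roachpb.RangeID.
type rangeID int64

// rangeQueue is a minimal sketch of the scheduler's FIFO queue with
// support for a single high-priority range, which is always placed at
// the front when enqueued.
type rangeQueue struct {
	priorityID rangeID
	ids        []rangeID
}

func (q *rangeQueue) setPriorityID(id rangeID) { q.priorityID = id }

func (q *rangeQueue) push(id rangeID) {
	if id == q.priorityID {
		// The priority range jumps the line; everyone else is FIFO.
		q.ids = append([]rangeID{id}, q.ids...)
		return
	}
	q.ids = append(q.ids, id)
}

func (q *rangeQueue) pop() (rangeID, bool) {
	if len(q.ids) == 0 {
		return 0, false
	}
	id := q.ids[0]
	q.ids = q.ids[1:]
	return id, true
}

func main() {
	var q rangeQueue
	q.setPriorityID(2) // the node liveness range, in practice
	q.push(10)
	q.push(11)
	q.push(2)
	id, _ := q.pop()
	fmt.Println(id) // prints 2: the priority range is processed first
}
```

Since at most one range is ever high-priority, it can occupy at most one of the pool's workers at a time, which is why starvation of other ranges isn't a concern.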

@ajwerner I'm adding you here specifically because we've talked about the need for something like the last commit a few times.

@nvb nvb requested review from ajwerner and tbg November 18, 2020 17:49

@nvb nvb force-pushed the nvanbenschoten/raftSchedulerOpt branch from 967c9d2 to 0a9c991 Compare November 18, 2020 18:27
@nvb
Contributor Author

nvb commented Nov 18, 2020

I ran some experiments and found this to work very well – even better than I was expecting. Here are the steps I took, along with the results:

Run 3 node roachprod cluster with n1-standard-4 instances.
Disable quiescence using `COCKROACH_DISABLE_QUIESCENCE=true`.
Disable range merges using `set cluster setting kv.range_merge.queue_enabled=false;`.
Create 300k ranges using `workload init kv --splits=300000`.

Restart cluster without anything in this PR.
Run: `SELECT count(*) from kv.kv;`.
Monitor: `liveness.heartbeatlatency.p90` and `liveness.livenodes`.

Restart cluster with heartbeat batching but no liveness prioritization.
Run: `SELECT count(*) from kv.kv;`.
Monitor: `liveness.heartbeatlatency.p90` and `liveness.livenodes`.

Restart cluster with heartbeat batching and liveness prioritization.
Run: `SELECT count(*) from kv.kv;`.
Monitor: `liveness.heartbeatlatency.p90` and `liveness.livenodes`.

Trial 1: master with nothing from this PR.

During this trial, liveness heartbeat latency grew and the cluster lost liveness and fell over in about 6 minutes. This is what we've become accustomed to seeing.

The following graphs show:

pane 1: liveness.heartbeatlatency-p99
pane 2: replicas + replicas.quiescent
pane 3: liveness.livenodes

Screen Shot 2020-11-18 at 2 22 33 PM

Trial 2: only commit 2 from this PR – batch enqueue Ranges in Raft scheduler for coalesced heartbeats.

During this trial, liveness heartbeat latency grew much more gradually and the cluster never completely fell over. However, we did see transient liveness failures as the liveness latency neared and surpassed 4.5 seconds. This improvement indicates that the scheduler contention was hurting things on master, but that even with reduced contention, the FIFO queueing policy in the Raft scheduler eventually leads to such long scheduler latency for the node liveness range that the cluster can't sustain liveness.

Screen Shot 2020-11-18 at 3 24 50 PM

Trial 3: everything from this PR.

During this trial, we saw things dramatically improve again. Node liveness latency remained steady the entire time, at well below 1s in the tail, and all 300k ranges were eventually able to unquiesce after about an hour. The other ranges in the cluster were still not super happy because scheduler latency (a metric I'm going to write an issue about adding) for all non-liveness ranges was very high. However, we never lost liveness and the cluster never fell over, which was the major focus of this PR.

Screen Shot 2020-11-18 at 2 21 42 PM

We can also see that CPU was completely saturated throughout the entire run.

Screen Shot 2020-11-18 at 2 23 59 PM

@nvb
Contributor Author

nvb commented Nov 18, 2020

@tbg what do you think about backporting the second commit (batch enqueue Ranges in Raft scheduler for coalesced heartbeats)? The other commits seem a little risky for a backport because they change behavior, but that one seems like a clear win.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):

// part, it implements a FIFO queueing policy with no prioritization of some
// ranges over others. However, the queue can be configured with up to one
// high-priority range, which will always be placed at the front when added.

Did you consider having a dedicated Raft scheduler goroutine for the node liveness range? This code has long since left my memory, so that might be a much larger lift. This simple priority mechanism seems relatively contained and your test results are quite compelling.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):

	// Ranges to ensure that liveness never sees high Raft scheduler latency.
	if bytes.HasPrefix(desc.StartKey, keys.NodeLivenessPrefix) {
		r.store.scheduler.SetPriorityID(desc.RangeID)

Once SetPriorityID is called, it is never cleared. Could this be done in the store initialization code?

Contributor Author

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @petermattis, and @tbg)


pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Once SetPriorityID is called, it is never cleared. Could this be done in the store initialization code?

The challenge is that we don't have a constant ID for the node liveness Range, though I think in practice it always ends up being 2 today. I'm not sure how to cleanly determine this ID in store initialization code without doing a meta range lookup.


pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you consider having a dedicated Raft scheduler goroutine for the node liveness range? This code has long since left my memory, so that might be a much larger lift. This simple priority mechanism seems relatively contained and your test results are quite compelling.

I didn't consider it when writing this, but it is a valid alternative. It probably would be a larger lift though and has the downside of requiring a goroutine per node even for nodes that don't have a replica for the liveness range. Are there upsides that you think outweigh this?

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The challenge is that we don't have a constant ID for the node liveness Range, though I think in practice it always ends up being 2 today. I'm not sure how to cleanly determine this ID in store initialization code without doing a meta range lookup.

Ack. I had been mistakenly thinking that the node liveness range had a fixed ID.


pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I didn't consider it when writing this, but it is a valid alternative. It probably would be a larger lift though and has the downside of requiring a goroutine per node even for nodes that don't have a replica for the liveness range. Are there upsides that you think outweigh this?

The possible upside is that you'd be guaranteed to not queue processing of the node liveness range behind processing of any other range. With the priority approach here, if the node liveness range needs processing, it still has to wait for a worker to grab it. I suspect that will happen quickly, though.

Contributor

@ajwerner ajwerner left a comment


:lgtm:

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 3 of 3 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)


pkg/kv/kvserver/replica_init.go, line 350 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ack. I had been mistakenly thinking that the node liveness range had a fixed ID.

The node liveness range may move to a node after it starts up. This method will be called on every existing replica during store startup anyway.


pkg/kv/kvserver/store_raft.go, line 116 at r2 (raw file):

	if len(toEnqueue) > 0 {
		s.scheduler.EnqueueRaftRequests(toEnqueue...)
	}

nit: push the 0 len optimization down into EnqueueRaftRequests
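In sketch form (types simplified; the real method also acquires the scheduler mutex and updates per-range state flags), the suggestion amounts to:

```go
package main

import "fmt"

// rangeID stands in for roachpb.RangeID.
type rangeID int64

// raftScheduler is a simplified stand-in for the real scheduler.
type raftScheduler struct {
	queued []rangeID
}

// EnqueueRaftRequests handles the zero-length case internally, so
// callers don't need to guard the call themselves.
func (s *raftScheduler) EnqueueRaftRequests(ids ...rangeID) {
	if len(ids) == 0 {
		return // the optimization, pushed down from the caller
	}
	s.queued = append(s.queued, ids...)
}

func main() {
	var s raftScheduler
	s.EnqueueRaftRequests()         // no-op: no lock needed, no work done
	s.EnqueueRaftRequests(1, 2, 3)
	fmt.Println(len(s.queued)) // prints 3
}
```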

@nvb nvb force-pushed the nvanbenschoten/raftSchedulerOpt branch from 0a9c991 to d48f597 Compare November 19, 2020 03:23
Contributor Author

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @petermattis, and @tbg)


pkg/kv/kvserver/scheduler.go, line 67 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The possible upside is that you'd be guaranteed to not queue processing of the node liveness range behind processing of any other range. With the priority approach here, if the node liveness range needs processing, it still has to wait for a worker to grab it. I suspect that will happen quickly, though.

That's a good point. The change here ensures that the liveness range is quickly processed regardless of the queue size, but it doesn't ensure that the range will be quickly processed if the workers are all very slow in processing other ranges. I tend to agree that such cases should be rare, though, and in those cases we must be hitting serious slowness below this level, which would imply that getting node liveness through the Raft scheduler is probably still not enough to ensure it makes progress.


pkg/kv/kvserver/store_raft.go, line 116 at r2 (raw file):

Previously, ajwerner wrote…
	if len(toEnqueue) > 0 {
		s.scheduler.EnqueueRaftRequests(toEnqueue...)
	}

nit: push the 0 len optimization down into EnqueueRaftRequests

Done.

@tbg tbg requested a review from petermattis November 19, 2020 08:43
Member

@tbg tbg left a comment


Please add a release note, probably even link to #56851 from it.

The experiments were run with the default raft scheduler concurrency, right (i.e. 32 workers per node)?

The quiescence test here:

Name: fmt.Sprintf("kv/splits/nodes=3/quiesce=%t", item.quiesce),

should work much better now, shouldn't it? We had neutered it quite a bit (dropped the range count lower and lower), but I think you may have just fixed the issues with it. See if you can set it up to roughly match your experiment and run it a few times, it will be very nice to have confidence that CRDB can support a vastly larger range count with your changes. (This doesn't need to hold up this PR, I imagine we want to run these tests a bunch after making them more aggressive, which takes time).

I'm on board with backporting the second commit. I would even entertain backporting the third after some bake time. The only way in which I could see this commit backfire (without a bug) is that the liveness range could starve out processing for other ranges, which seems improbable.

Reviewed 1 of 1 files at r1, 4 of 4 files at r4, 3 of 3 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @petermattis)


pkg/kv/kvserver/replica_init.go, line 350 at r5 (raw file):

	// Ranges to ensure that liveness never sees high Raft scheduler latency.
	if bytes.HasPrefix(desc.StartKey, keys.NodeLivenessPrefix) {
		r.store.scheduler.SetPriorityID(desc.RangeID)

Looks like this isn't tested, isn't that relatively easy to add? Start a server, look up the liveness range, get the range id, pull the priority ID from the scheduler (via a helper in helpers_test.go) and that's it?
Of course this works, as indicated by your experiments, but a more direct proof would be nice.


pkg/kv/kvserver/scheduler.go, line 123 at r5 (raw file):

	if q.priorityID != 0 && q.priorityID != id {
		panic(fmt.Sprintf(
			"priority range ID already set: old=%d, new=%d",

This is a bit odd, there's nothing about SetPriorityID "externally" that indicates that this is not allowed. Perhaps a comment on the method is enough.


pkg/kv/kvserver/store.go, line 103 at r1 (raw file):

var storeSchedulerConcurrency = envutil.EnvOrDefaultInt(
	"COCKROACH_SCHEDULER_CONCURRENCY", min(8*runtime.NumCPU(), 96))

A comment here wouldn't hurt - the 8*runtime.NumCPU() was determined to show peak performance on a particular benchmarking setup using some variant of a KV workload (@petermattis might remember particulars) and the cap arises from observations of severe mutex contention at high scheduler concurrency.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @petermattis, and @tbg)


pkg/kv/kvserver/store.go, line 103 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

A comment here wouldn't hurt - the 8*runtime.NumCPU() was determined to show peak performance on a particular benchmarking setup using some variant of a KV workload (@petermattis might remember particulars) and the cap arises from observations of severe mutex contention at high scheduler concurrency.

My recollection is somewhat hazy. I recall doing some kv workload performance tests and 8*runtime.NumCPU() appeared to be a reasonable value. I don't recall this testing being particularly rigorous and it is possibly something that could be fruitfully revisited.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @petermattis, and @tbg)


pkg/kv/kvserver/store.go, line 103 at r1 (raw file):
Git history tells us that the testing wasn't particularly rigorous: 9a68241

The COCKROACH_SCHEDULER_CONCURRENCY env var controls the number of
scheduler goroutines used for Raft processing. In testing, the previous
default of 2*NumCPU proved to be too low to achieve good parallelization
of batch commits. Even with proposer-evalued-KV landing, we want lots of
concurrent batch commits so they can be grouped and written to RocksDB
more efficiently. The new default of 8*NumCPU was determined
experimentally on 8 CPU machines.

Notice that this was done way back in April, 2017. The stone ages.

Btw, nightly performance tests on higher-CPU machines tend to flatline above 32 CPUs. I wonder if this PR will have an effect.

nvb added 3 commits November 19, 2020 23:27
The three commit messages restate the PR description above. Their release notes:

Release note (performance improvement): The Raft processing goroutine pool's size is now capped at 96. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Release note (performance improvement): Interactions between Raft heartbeats and the Raft goroutine pool scheduler are now more efficient and avoid excessive mutex contention. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Release note (performance improvement): The Raft scheduler now prioritizes the node liveness Range. This was observed to prevent instability on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).
@nvb nvb force-pushed the nvanbenschoten/raftSchedulerOpt branch from d48f597 to 900729b Compare November 20, 2020 04:50
Contributor Author

@nvb nvb left a comment


Please add a release note, probably even link to #56851 from it.

Done.

The experiments were run with the default raft scheduler concurrency, right (i.e. 32 workers per node)?

Right.

The quiescence test here should work much better now, shouldn't it?

Yes, it should be able to handle many more ranges without falling over. That doesn't mean that these ranges will be able to handle much traffic when none are quiesced, as Raft scheduler latency (see #56943) will still be quite large for all non-liveness ranges.

I'll experiment with updating that roachtest in a follow-up PR.

I'm on board with backporting the second commit. I would even entertain backporting the third after some bake time.

That plan sounds good to me.

The only way in which I could see this commit backfire (without a bug) is that the liveness range could starve out processing for other ranges, which seems improbable.

I'm not even sure that this is possible. I touched a bit on it in the commit message, but I think we're safe from any theoretical starvation as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1), which it always will be.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @petermattis, and @tbg)


pkg/kv/kvserver/replica_init.go, line 350 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Looks like this isn't tested, isn't that relatively easy to add? Start a server, look up the liveness range, get the range id, pull the priority ID from the scheduler (via a helper in helpers_test.go) and that's it?
Of course this works, as indicated by your experiments, but a more direct proof would be nice.

Done.


pkg/kv/kvserver/scheduler.go, line 123 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is a bit odd, there's nothing about SetPriorityID "externally" that indicates that this is not allowed. Perhaps a comment on the method is enough.

Done.


pkg/kv/kvserver/store.go, line 103 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Git history tells us that the testing wasn't particularly rigorous: 9a68241 […]

Done.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @tbg)

@nvb
Contributor Author

nvb commented Nov 20, 2020

bors r+

@craig
Contributor

craig bot commented Nov 20, 2020

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 20, 2020

Build failed:

@nvb
Contributor Author

nvb commented Nov 21, 2020

acceptance/bank/node-restart seems to be hitting pq: query execution canceled due to statement timeout across a number of PRs. I'm not sure when or why that became flaky, but it's not due to this so I'm going to merge.

bors r+

@craig
Contributor

craig bot commented Nov 21, 2020

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 21, 2020

Build failed:

@jordanlewis
Member

Courtesy rebors r+

@jordanlewis
Member

bors r+

@craig
Contributor

craig bot commented Nov 22, 2020

Build succeeded:

@craig craig bot merged commit be3ff28 into cockroachdb:master Nov 22, 2020
nvb added a commit to nvb/cockroach that referenced this pull request Nov 23, 2020
Addresses a TODO. Avoids allocations for small numbers of messages. This
is common for inactive ranges that are still heartbeating. It should
help with cases like cockroachdb#56860.
nvb added a commit to nvb/cockroach that referenced this pull request Nov 23, 2020
This was responsible for an allocation per processReady and
processRequestQueue call. This means that it would cause two heap
allocations per Range on each replica during a Raft heartbeat. As
a result, it was the largest source of allocations by far in tests
like cockroachdb#56860 with lots of idle Ranges heartbeating frequently.

This change should reduce the cost of handling these heartbeats,
reducing the impact the heartbeats have on other active Ranges.
@nvb nvb deleted the nvanbenschoten/raftSchedulerOpt branch November 23, 2020 02:49
craig bot pushed a commit that referenced this pull request Nov 24, 2020
57007: kv: eliminate heap allocations throughout Raft messaging and processing logic r=nvanbenschoten a=nvanbenschoten

This PR contains a few improvements within the Raft messaging and processing code to avoid common heap allocations that stuck out during tests in #56860. This will improve the general efficiency of the Raft pipeline and will specifically reduce the cost of handling large coalesced heartbeats. This will reduce the impact that these heartbeats have on the [scheduler latency](#56943) of other active ranges.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Development

Successfully merging this pull request may close these issues.

kvserver: default raft scheduler concurrency can cause cascading failures on beefy machines

6 participants