[dnm] kvserver: subject MsgApp to admission control#81289
tbg wants to merge 7 commits into cockroachdb:master
First experiment: naively backpressuring on the incoming raft stream via local admission control in https://github.com/cockroachdb/cockroach/blob/1f46dd08d8ad3fa82038abb6852a5920f0f93305/pkg/kv/kvserver/store_raft.go#L299-L301. This wreaked havoc immediately (including unavailable liveness, I think), since the liveness leaseholder was probably on n3, and n3 was backpressuring on each incoming MsgApp regardless of which range it was for. "Most" quota pools in the system were depleted. It would have been interesting to see the metrics in that state, but alas, the metrics timeseries ranges were also all unavailable.
Store attributes would previously be just `store1` since we mostly run single-store
deployments. It's often helpful to target a particular node, for
example for lease preferences.
Before:
```
select node_id, attrs from crdb_internal.kv_store_status;
node_id | attrs
----------+-------------------------------------
1 | ["store1"]
```
After:
```
node_id | attrs
----------+-------------------------------------
1 | ["node1", "node1store1", "store1"]
```
Release note: None
Custom charts were previously difficult to read because the graph didn't include any indication of what was being graphed. This is now rectified by printing the metrics name. This isn't the most user friendly thing (ideally we would also include the actual title, and make the help text available) but with my limited TSX knowledge this is solving 90% of the problem with a 61 character change. Related to cockroachdb#81035. Release note: None
Release note: None
Wait in handleRaftReady, which isn't great either, but at least allows us to target individual ranges (mod Go scheduler concurrency). Release note: None
Sloppy for now Release note: None
For a second experiment, I moved the backpressure to handleRaftReady. I also pre-split 1000 ranges for this workload. Unfortunately, n1 (which has all leases pinned to it) started to get high read-amp as well, but the graphs show that for quite some time before that, n3 struggles with high read-amp that nevertheless stays bounded, meaning admission control on the raft follower path is doing its job. This is most obvious in the bottom graph, which measures specifically the time spent in admission control in handleRaftReady (red is n3).
In the experiment, n1 ended up with high read-amp and started backpressuring below raft as well. This isn't the intention, since those writes will be doubly backpressured. We ended up in a regime where n1 is constantly overloaded and n3 is actually doing ok, perhaps metaunstable. Release note: None
There's an internal thread about whether this experiment is flawed in the sense that the cluster cannot possibly sustain 6 MB/s: I'm running it with the stock binary now (leases on n1 and n2), and n1 and n2 are the ones that have high read-amp, despite being provisioned at 250 MB/s (double the default).
This is a less ad-hoc version of the experiment in cockroachdb#81289, where I messed with the EBS configuration. This can't be done programmatically, and so here we use an IO nemesis on n3 instead. Release note: None
This is being productionized in #83851.
This is a less ad-hoc version of the experiment in cockroachdb#81289, where I messed with the EBS configuration. This can't be done programmatically, and so here we use an IO nemesis on n3 instead. Part of cockroachdb#79215. Closes cockroachdb#81834. Release note: None



Context: #79215 (comment)
Release note: None