[WIP] admission,kvserver: admission control for snapshot ingest #80914
sumeerbhola wants to merge 1 commit into cockroachdb:master
Conversation
tbg
left a comment
I like the basic approach. It seems simple enough and I suspect that it will work "well enough", certainly a lot better than what we have now (which is no throttling). A few questions/comments:
I've basically been trying to reproduce the snapshot-induced LSM overload in #80755 and it's been somewhat difficult. You need a relatively large cluster and a steady workload on it which runs the cluster relatively close to the edge (but smoothly enough so that the cluster doesn't dip into a bad LSM on its own), with lots of data (so that there are enough snapshots when a node is decommissioned) plus the LSM needs to be full at all levels on most of the keyspace, which I think means there need to be writes all over the keyspace over time. It's like #80589 with extra complications. Curious if you have other ideas for validating a change such as this one.
We need to think about how this integrates with the snapshot sender path. In your second example, you show how foreground writes can permanently starve out a snapshot. Naively, this snapshot will have some store's replicate queue waiting on it, and in some sense "all of the work" has already been done - the data has been streamed over the network, et cetera. A store that more or less permanently stalls snapshots at ingestion time is, for the purposes of the allocator, not a target for new replicas. This relates to the next point.
I wonder if there is a way to get some kind of weighted fair queueing instead of the clear prioritization of foreground traffic, where instead of stalling forever, the snapshot gets a fraction of the allocated tokens for exclusive use. This is probably a naive idea, but just to get it across in an example, we could say that at token replenish time (i.e. on each 15s interval), if the snapshot bucket is negative, we move (say) 20% of the newly issued tokens to the snapshot pool:
I0: snapshot-tokens=-475mb regular-tokens=0mb // end of I0
// 100mb replenishment
I1: snapshot-tokens=-375mb regular-tokens=100mb // this would be the regular way
I1: snapshot-tokens=-375mb+20mb=-355mb regular-tokens=(100-20)mb=80mb // this would be the suggested way
This gives the snapshot some guaranteed progress, at the expense of throttling the foreground workload a little more. So your second example would read as below. I may well have made arithmetic errors here, but I hope the idea is clear: essentially, when the snapshot bucket is in debt, it recruits (say) 20% of the tokens just for itself. At the end point, 0%, we get the behavior currently in this PR, so it's strictly a generalization. I do wonder if something similar should happen in general for low-prio / normal-prio / high-prio requests. It seems generally good to avoid starvation, unless it's explicitly requested (i.e. priority -inf).
// Example 2:
// Normal work is consuming all the 375MB of tokens.
// For simplicity all lateral transfers are 20mb, in reality they would be a
// percentage of tokens made available.
//
// I0:
// start: snapshot-tokens=375MB, regular-tokens=375MB
// One 500MB snapshot ingestion into L0, and regular work 375MB
// end: snapshot-tokens=375MB-(500MB+375MB)=-500MB, regular-tokens=375MB-375MB=0
// I1: (token replenishment of 375MB: lateral transfer of 20mb from regular to snaps)
// start: snapshot-tokens=-105MB, regular-tokens=355MB <--
// Regular work 355MB
// end: snapshot-tokens=-460MB, regular-tokens=0
// I2: (token replenishment of 375MB: lateral transfer of 20mb from regular to snaps)
// start: snapshot-tokens=-65MB, regular-tokens=355mb
// Regular work 355MB
// end: snapshot-tokens=-420MB, regular-tokens=0
// I3: (token replenishment of 375MB: lateral transfer of 20mb from regular to snaps)
// start: snapshot-tokens=-25MB, regular-tokens=355mb
// Regular work 355MB
// end: snapshot-tokens=-380MB, regular-tokens=0
// I4: (token replenishment of 375MB: lateral transfer of 20mb from regular to snaps)
// start: snapshot-tokens=15MB, regular-tokens=355mb
// Snapshot bucket is positive, so the pending 500mb snapshot is admitted
// Regular work 355MB, snapshot 500mb
// end: snapshot-tokens=15MB-355MB-500MB=-840MB, regular-tokens=0mb
// ...
Another question I have is about the "perfect world" end state. As long as admission control has to trigger before we throttle snapshots, they will essentially flow unchecked until the LSM is in a "concerning" state. At 1000 files and 20 sublevels, is the LSM still fit for a production workload (see this Slack thread)? I think our approach so far has been: when admission control is on, that's a bad state to be in, and the allocator is supposed to move load around. But with snapshots, the goal ultimately has to be to slow them down before admission control considers the LSM to be in "bad shape". So should we "always" apply admission control throttling to snapshots, i.e. admit only what can be sustained by the rate at which we compact out of L0? The "always" will need a caveat: if L0 is mostly empty, pebble may choose not to compact out of it for a while (?), so the L0 compaction rate would appear close to zero, meaning very aggressive throttling when no throttling would be appropriate.
Similar considerations might apply in general, and I think relate to this code: cockroach/pkg/util/admission/granter.go, lines 1711 to 1714 at 41aa4aa.
A much more aggressive threshold (say 10 sublevels and 300 SSTs; I just made those numbers up), combined with an admission factor that starts near 1 and gradually lowers to 0.5 as a function of the sublevel and SST counts, could be appropriate.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @tbg)
pkg/util/admission/granter.go line 503 at r1 (raw file):
// which is currently 512MB) in a single atomic operation.
// - If all these bytes go into L0, they will at most increase the sublevel
//   count by 1. Also, they will add at most 5 files.
also, each of the SSTs has a rangedel in it that spans the bounds, so hopefully they're "cheaper" for reads? You never have to "read below" this SST because it doesn't have any holes. So they shouldn't matter as much for the L0Filecount.
For my education, why does a snap only increase sublevel count by at most 1?
pkg/util/admission/granter.go line 506 at r1 (raw file):
//
// We make some assumptions:
// - Most of the stores in a cluster are healthy, ignoring snapshots, and
this sounds like the assumption is that stores are ignoring snapshots :-) I think you mean that they are healthy in the absence of snapshots.
pkg/util/admission/granter.go line 517 at r1 (raw file):
// count, when the threshold is first exceeded is enough for normal
// user-facing traffic.
// - If the token count is consumed by a 512MB snapshot, this could prevent
This isn't really an assumption, more an observation that already needs knowledge of what the proposed mechanism is.
pkg/util/admission/granter.go line 563 at r1 (raw file):
// end: snapshot-tokens=-350MB, regular-tokens=75MB
// I2: (token replenishment of 375MB)
// start: snapshot-tokens=25B, regular-tokens=375MB
MB
The approach here is to use a special set of tokens for snapshots,
kvStoreTokenGranter.availableRangeSnapshotIOTokens, that allows for
over-commitment, so that normal work is not directly affected. There is
a long code comment in kvStoreTokenGranter with justification.

Informs cockroachdb#80607

Release note: None
sumeerbhola force-pushed from 7a3da5a to ca065a0
sumeerbhola
left a comment
(There are a lot of good questions here, some of which we discussed in our chat. Apologies for only selectively answering the rest for now -- a bunch of the unanswered ones are probably design choices that happened to be "simple and effective", rather than perfect.)
At 1000 files and 20 sublevels, is the LSM still fit for a production workload
That's a very good question. Just repeating what I mentioned in our synchronous conversation:
This is something that has been in the back of my mind since the initial admission control work. We don't yet have a good understanding of the latency impact of getting "close to resource saturation" (high CPU utilization, somewhat elevated read amp). This is due to the lack of proper admission control roachtests that have a mix of high and low priority traffic, and latency measurement of the high priority traffic. If we find that the latency impact is significant, we can start computing limited tokens at a lower threshold, and use that token count to replenish tokens for less important work.
pkg/kv/kvserver/store_snapshot.go line 713 at r2 (raw file):
// throttleSnapshot (called by reserveSnapshot) that we probably want here
// too. Consider moving this code into reserveSnapshot.
snapshotGranter := s.cfg.KVAdmissionController.GetStoreSnapshotGranter(s.StoreID())
I moved this up to before the snapshot is streamed, as we were discussing today.
pkg/util/admission/granter.go line 503 at r1 (raw file):
For my education, why does a snap only increase sublevel count by at most 1?
Because the snapshot consists of N non-overlapping sstables, they don't stack on top of each other in the key space.
pkg/util/admission/granter.go line 506 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
this sounds like the assumption is that stores are ignoring snapshots :-) I think you mean that they are healthy in the absence of snapshots.
Fixed.
pkg/util/admission/granter.go line 517 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't really an assumption, more an observation that already needs knowledge of what the proposed mechanism is.
Fixed
pkg/util/admission/granter.go line 563 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
MB
Done
Let me know when you're picking this back up again / if you have time to talk about it again.

@tbg the next step is to see how this fares in a snapshot caused overload experiment, and then iterate. I was hoping for KV/replication participation in that.

Ack, thanks! KV/Repl will talk about it tomorrow.

For posterity, here's my setup with which I think I managed to repro snap-induced overload: #80589 (comment)