admission,kvserver: subject snapshot ingestion to admission control #80607
Description
We currently throttle writes on the receiving store based on store health (e.g. via admission control or via specialized AddSSTable throttling). However, this only takes the local store's health into account, not the associated write cost on followers during replication, which isn't always throttled. We've seen this lead to hotspots where follower stores get overwhelmed, since the follower writes bypass admission control. A similar problem exists with snapshot application.
This has been touched on in several other issues as well:
- admission,kv,bulk: unify (local) store overload protection via admission control #75066
- admission: graceful degradation #82114
- admission: roachtest to reproduce follower snapshot overload #82116
- [WIP] admission,kvserver: admission control for snapshot ingest #80914
- kvserver: store with high read amplification should not be a target of rebalancing #73714
- kvserver: snapshot ingestion rates are not bounded #74694
A snapshot can be 512MB (the maximum size of a range). Rapid ingestion of snapshots can cause an inverted LSM e.g. https://github.com/cockroachlabs/support/issues/1558 where we were ingesting ~1.6GB of snapshots every 1min, of which ~1GB were being ingested into L0.
- Unlike other activities that integrate with admission control, streaming the snapshot from the sender to the receiver can take 16+ seconds (512MB at a rate of 32MB/s, the default value of kv.snapshot_recovery.max_rate). Admission control currently calculates tokens at a 15s granularity, and expects admitted work to be short-lived compared to this 15s interval (say < 1s).
- Snapshots need to be ingested atomically, since they can be used to catch up a replica that has fallen behind. This atomicity is important for ensuring the replica state is consistent in case of a crash. This means the atomic ingest operation can add 512MB to L0 in the worst case.
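As a quick sanity check on the 16+s figure, the arithmetic can be spelled out (constants are the defaults cited above):

```go
package main

import "fmt"

const (
	maxRangeSize   = 512 << 20 // 512 MiB, maximum range (and thus snapshot) size
	defaultMaxRate = 32 << 20  // 32 MiB/s, default kv.snapshot_recovery.max_rate
	tokenInterval  = 15        // seconds; admission control token granularity
)

func main() {
	// Worst-case stream time comfortably exceeds the token interval,
	// which is why snapshots don't fit the "short-lived work" assumption.
	streamSecs := maxRangeSize / defaultMaxRate
	fmt.Printf("worst-case stream time: %ds (token interval: %ds)\n", streamSecs, tokenInterval)
}
```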
Solution sketch:
- We assume that:
  - Overloaded stores are rare, so we don't need to differentiate the priority of REBALANCE snapshots and RECOVERY snapshots from the perspective of store-write admission control.
  - The allocator will usually not try to add replicas to a store with a high L0 sublevel count or high file count. This is not an essential requirement.
- The kv.snapshot_recovery.max_rate setting continues to be driven by the rate of resource consumption on the source and on the network; that is, it has nothing to do with how fast the destination can ingest. This is reasonable since the ingestion is atomic and happens only after all the ssts for the snapshot have been written locally, so it doesn't matter how fast the data for a snapshot arrives. This ignores the fact that writing the ssts also consumes disk write bandwidth, which may be constrained -- that is acceptable since write amplification is the biggest consumer of resources.
- We keep Store.snapshotApplySem to limit the number of snapshots concurrently streaming their data. After the local ssts are ready to be ingested, we ask for admission tokens equal to the total size of the ssts. This assumes we have switched store-write admission control to use byte tokens (see admission: change store write admission control to use byte tokens #80480). Once the tokens are granted, the ingestion is performed, after which snapshotApplySem is released. This is where having a single priority for all snapshot ingestion is convenient: we don't have to worry about a high-priority snapshot waiting for snapshotApplySem while a low-priority snapshot is waiting for store-write admission.
- Potential issues:
  - Consuming 512MB of tokens in one shot can starve out other traffic. This is ok since the ingest got to the head of the queue, so there must not be more important work waiting. Also, the 512MB is spread across 5 non-overlapping sstables, so it will add at most 1 sublevel to L0.
  - Available tokens will become negative. This is ok, since the negative value is proper accounting, and the 1s refilling of tokens will eventually bring the count back to positive. If there was no higher-priority work waiting at the instant the snapshot consumed all the tokens, later-arriving higher-priority work will need to wait -- if this becomes a problem, we could start differentiating the tokens consumed by snapshot ingests from other work and allow some overcommitment (this may work because a snapshot ingest adds at most 1 sublevel).
  - The snapshot will exceed the deadline while waiting in the admission queue. This is wasted work, which we would ideally avoid, but we can mitigate it by (a) relying on the aforementioned allocator decision-making to make this rare, and (b) whenever we exceed the deadline, adding an extra delay to future acquisitions of snapshotApplySem (i.e., some sort of backoff scheme before the streaming of a snapshot starts).
- Misc things to figure out:
  - In a multi-tenant setting, we will need to decide which tenant these ingestions are attributed to, since we have rough fair sharing across tenants. Should they use the tenant of the range, or the system tenant?
Jira issue: CRDB-15330
Epic: CRDB-34248