admission,kv,bulk: unify (local) store overload protection via admission control #75066
Description
Consider a store with the capacity to accept writes at a rate of R bytes/s. This is a thought exercise in that R is not fixed: it is affected by various factors, like disk provisioning (which can change dynamically), whether a write is a batch written via the WAL or an ingested sstable (and how many bytes land in L0), and compaction concurrency adjustment. We have two categories of mechanisms that attempt to prevent store overload:
- Capacity unaware mechanisms: These include
  - `Engine.PreIngestDelay`: This applies a delay per "file" that is being ingested if we are over `rocksdb.ingest_backpressure.l0_file_count_threshold` (default of 20). The delay is proportional to how far above the threshold the store is, and is capped at `rocksdb.ingest_backpressure.max_delay`. It is unaware of the size of the file, i.e., the number of bytes being written. The bytes being written can vary significantly for bulk operations, based on how many ranges are being buffered in `BufferingAdder` before generating sstables. This delay is applied (a) above raft to `AddSSTableRequest` even if it is being written as a batch (`IngestAsWrites` is true), and (b) below raft in `addSSTablePreApply`, if the `AddSSTableRequest` was `!IngestAsWrites`.
  - Concurrency limit on `AddSSTableRequest`: Applied at proposal time, using `kv.bulk_io_write.concurrent_addsstable_requests` and `kv.bulk_io_write.concurrent_addsstable_as_writes_requests`.
  - Concurrency limit on snapshot application (which is done by ingesting sstables): `Store.snapshotApplySem`.
- Capacity aware mechanisms: Admission control uses two overload thresholds, `admission.l0_sub_level_count_overload_threshold` (also 20, like the bulk back-pressuring threshold) and `admission.l0_file_count_overload_threshold` (1000), to decide when to limit admission control tokens for writes. The code estimates the capacity R based on how fast compactions are removing bytes from L0. It is unaware of the exact bytes that will be added by individual requests and computes a per-request estimate based on past behavior. It is used only at proposal time (`Node.Batch` calls `KVAdmissionController.AdmitKVWork`).
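The capacity-aware idea can be sketched as follows. This is an illustrative simplification, not the actual `ioLoadListener` logic: when either overload threshold is exceeded, the byte tokens handed out for the next interval are limited to roughly what compactions removed from L0 in the last interval, so L0 stops growing.

```go
package main

import "fmt"

// tokensForInterval is a hypothetical sketch of the capacity-aware
// approach: estimate the store's write capacity R from how many bytes
// compactions removed from L0 in the previous interval, and hand out
// that many byte tokens for the next interval, but only once the
// store is over an overload threshold.
func tokensForInterval(
	l0SubLevels, l0Files int, // current L0 state
	compactedBytes int64, // bytes compacted out of L0 last interval
) (tokens int64, unlimited bool) {
	const (
		subLevelThreshold = 20   // admission.l0_sub_level_count_overload_threshold
		fileThreshold     = 1000 // admission.l0_file_count_overload_threshold
	)
	if l0SubLevels < subLevelThreshold && l0Files < fileThreshold {
		return 0, true // not overloaded: admit without token limits
	}
	// Overloaded: limit admitted bytes to roughly what compactions
	// can remove, so that L0 stops growing.
	return compactedBytes, false
}

func main() {
	tokens, unlimited := tokensForInterval(25, 400, 64<<20)
	fmt.Println(tokens, unlimited) // overloaded: tokens limited to compacted bytes
	_, unlimited = tokensForInterval(5, 100, 64<<20)
	fmt.Println(unlimited) // healthy store: no token limit
}
```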
This setup has multiple deficiencies:
- Tuning delay or concurrency knobs is not practical. The knobs will be either (a) too conservative, leaving throughput unused and prompting customer questions about why some bulk operation is running slowly while the nodes are underutilized, or (b) too aggressive, which we won't even know until a rare workload pushes against those knob settings and causes a node to overload.
- Different knobs control different kinds of work. Someone may validate that stressing the system along each kind of work does not cause the system to overload, but it may be that if all the work kinds spike at once (e.g. building a secondary index and backup/restore and snapshot application), a node becomes overloaded.
- Admission control tunes itself but isn't aware of the bytes that will be written by each work item. Estimates are fine for small writes but proper accounting for larger writes is preferable.
We have existing issues for
- Using store health of all replicas for bulk operations kvserver: remove below-raft throttling #57247 -- we treat that as orthogonal here and assume that we will continue to need a lower-level mechanism that serves as a final safeguard to protect an individual store. That is, we may continue to do below-raft throttling even though there are concerns with doing so, namely blocking a worker in the raft scheduler (which should be ok since the worker pool is per store) or holding latches for too long.
- kv: add throttling for background GC operations based on store health #57248 discusses throttling of GC operations, which may do many writes -- we already subject these to admission control at the proposer, but without knowledge of the bytes being written. If each `GCRequest` can do a large batch write, we should compute the bytes being written so that admission control can utilize that information.
We propose to unify all these overload protection mechanisms such that there is one source of byte tokens representing what can be admitted and one queue of requests waiting for admission.
- Admission control logic will be enhanced to compute byte-based tokens for store writes. Requests that provide their byte size (which should be all large requests) will consume these tokens. Estimates will be used for two purposes: (a) for requests that don't provide their byte size, to decide how many tokens to consume, and (b) to compute the fraction of an ingest request that will end up in L0 (to adjust the token consumption for an ingest request). Just like token estimation, these estimates are continually adjusted based on stats (at every 15s interval).
- All the paths that currently use capacity-unaware mechanisms will call into the admission control WorkQueue for admission (they will not call the subsequent WorkQueue that throttles KV work based on CPU). After the admitted work is done, each ingest request will also provide information on how many bytes were added to L0 (this will need a small Pebble API change), so that the token consumption can be fixed and we have data for future estimates.
- We will need to prevent double counting: requests that went through admission control above raft at the leaseholder should not be subject to admission control below raft at the same node.
- Priority or tenant assignment: An open question is how "background" work like snapshot application or index construction should be prioritized in this unified setup. One could give such work a lower priority if it is indeed lower priority than foreground traffic. There may be exceptions, e.g. we may not want a lower priority for below-raft throttling. Alternatively, we could use a made-up TenantID and rely on the proportional scheduling of tokens done in `admission.WorkQueue` across tenants. Admission control does not currently have support for different tenant weights, but that is easy to add if needed.
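The proposal above can be sketched as a single pool of byte tokens that all write paths draw from. The type and method names here are hypothetical illustrations, not the actual admission package API: a request consumes either its declared byte size or a running estimate, and an ingest reconciles its consumption once the actual bytes added to L0 are known.

```go
package main

import "fmt"

// byteTokenGranter is a hypothetical sketch of the unified scheme: one
// pool of byte tokens for the store, where requests consume either
// their declared size or a per-request estimate, and ingests reconcile
// token consumption once the actual bytes added to L0 are known.
type byteTokenGranter struct {
	tokens         int64 // available byte tokens this interval
	perReqEstimate int64 // estimate for requests that don't declare a size
}

// admit deducts tokens for a request; declaredBytes == 0 means the
// request did not provide its size, so the estimate is used. It
// returns the amount consumed so it can be reconciled later.
func (g *byteTokenGranter) admit(declaredBytes int64) int64 {
	consumed := declaredBytes
	if consumed == 0 {
		consumed = g.perReqEstimate
	}
	// May drive tokens negative; the next interval's grant repays the debt.
	g.tokens -= consumed
	return consumed
}

// adjust fixes up token consumption once an ingest reports how many
// bytes actually landed in L0 (enabled by the small Pebble API change
// mentioned above).
func (g *byteTokenGranter) adjust(consumed, actualL0Bytes int64) {
	g.tokens += consumed - actualL0Bytes
}

func main() {
	g := &byteTokenGranter{tokens: 100 << 20, perReqEstimate: 1 << 20}
	c := g.admit(32 << 20)      // a large ingest declares 32 MiB up front
	g.adjust(c, 8<<20)          // only 8 MiB actually landed in L0; refund the rest
	fmt.Println(g.tokens >> 20) // MiB of tokens remaining
}
```

The reconciliation step is what lets the queue charge large ingests accurately while only debiting them for the fraction that actually lands in L0.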
Deficiencies:
- The overload threshold at which we want to start constraining writes could be different for user-facing and background operations: by sharing a single queue and a single source of tokens for that queue, we also share the same overload thresholds. This is probably not an immediate problem since `rocksdb.ingest_backpressure.l0_file_count_threshold` and `admission.l0_sub_level_count_overload_threshold` both default to 20. There is a way to address this in the future via a hierarchical token bucket scheme: the `admission.ioLoadListener` would produce high_overload_tokens and low_overload_tokens, where the background operations have to consume both, while foreground operations only use the former.
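A minimal sketch of that hierarchical scheme, under the assumption (not from the issue) that both buckets are simple counters refilled each interval: background work must obtain tokens from both buckets, so it starts being throttled as soon as the smaller low_overload bucket runs dry, while foreground work only needs the high_overload bucket.

```go
package main

import "fmt"

// hierarchicalTokens is a hypothetical sketch of the two-bucket scheme:
// low_overload_tokens run out at a lower overload level than
// high_overload_tokens, so background work (which needs both) is
// constrained earlier than foreground work (which needs only one).
type hierarchicalTokens struct {
	highOverloadTokens int64 // consumed by all work
	lowOverloadTokens  int64 // additionally consumed by background work
}

func (h *hierarchicalTokens) admitForeground(bytes int64) bool {
	if h.highOverloadTokens < bytes {
		return false
	}
	h.highOverloadTokens -= bytes
	return true
}

func (h *hierarchicalTokens) admitBackground(bytes int64) bool {
	if h.highOverloadTokens < bytes || h.lowOverloadTokens < bytes {
		return false
	}
	h.highOverloadTokens -= bytes
	h.lowOverloadTokens -= bytes
	return true
}

func main() {
	h := &hierarchicalTokens{highOverloadTokens: 10 << 20, lowOverloadTokens: 2 << 20}
	fmt.Println(h.admitBackground(4 << 20)) // false: low bucket already exhausted
	fmt.Println(h.admitForeground(4 << 20)) // true: foreground only needs the high bucket
}
```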
cc: @erikgrinaker @dt @nvanbenschoten
Jira issue: CRDB-12450
Epic CRDB-14607