-
Notifications
You must be signed in to change notification settings - Fork 4.1k
rangefeed: memory accounting and budgeting #73616
Description
Rangefeed processors (<=1 per replica) send events via an internal buffered channel, which currently has a capacity of 4096 events:
| case p.eventC <- ev: |
If the channel fills up, the call will block and backpressure Raft command application until a 50ms timeout expires, at which point the rangefeed is closed and the caller must run a catchup scan to resume the rangefeed:
cockroach/pkg/kv/kvserver/rangefeed/processor.go
Lines 514 to 521 in 134de82
| case p.eventC <- ev: | |
| case <-p.stoppedC: | |
| // Already stopped. Do nothing. | |
| case <-time.After(timeout): | |
| // Sending on the eventC channel would have blocked. | |
| // Instead, tear down the processor and return immediately. | |
| p.sendStop(newErrBufferCapacityExceeded()) | |
| return false |
Currently, rangefeed events are generally small, i.e. single key/value pairs (although they can become quite large in some workloads). In #70434, we will begin propagating AddSSTable events including SST contents across rangefeeds, which will generally be much larger than key/value events. To avoid these piling up in the buffer, but also to better handle workloads with large key/value pairs, we need to implement memory budgeting for this buffer.
It is unclear whether the memory budget should be at the replica, store, or node level. A node-level limit is appealing because there can be a large variance in memory needs across replicas, so this avoids the need to oversubscribe memory budgets. However, a shared node budget means that a single table or tenant that's emitting a large amount of events can disrupt range feeds running across unrelated tables or tenants. A tiered structure with a shared node-level budget and a smaller per-replica budget is likely preferable. These budgets must be configurable, either using relative or absolute values.
Note that downstream consumers (e.g. changefeeds and streaming cluster replication) will generally implement their own buffering and backpressure mechanisms. As such, the internal rangefeed channel only needs to buffer events until they make it into the consumer's buffer, and the current limit can likely be reduced significantly. However, consumers may run on other nodes, so events may need to be transferred across the network, although in this case network buffers also come into play.
We should also keep in mind the need to do memory budgeting for catchup scans, see #69596.
Jira issue: CRDB-11667