
rangefeed: memory accounting and budgeting #73616

@erikgrinaker

Description

Rangefeed processors (<=1 per replica) send events via an internal buffered channel, which currently has a capacity of 4096 events.

If the channel fills up, the call will block and backpressure Raft command application until a 50ms timeout expires, at which point the rangefeed is closed and the caller must run a catchup scan to resume the rangefeed:

```go
select {
case p.eventC <- ev:
case <-p.stoppedC:
	// Already stopped. Do nothing.
case <-time.After(timeout):
	// Sending on the eventC channel would have blocked.
	// Instead, tear down the processor and return immediately.
	p.sendStop(newErrBufferCapacityExceeded())
	return false
}
```

Currently, rangefeed events are generally small, i.e. single key/value pairs (although they can become quite large in some workloads). In #70434, we will begin propagating AddSSTable events, including SST contents, across rangefeeds; these will generally be much larger than key/value events. To prevent them from piling up in the buffer, and also to better handle workloads with large key/value pairs, we need to implement memory budgeting for this buffer.

It is unclear whether the memory budget should be at the replica, store, or node level. A node-level limit is appealing because memory needs can vary widely across replicas, so it avoids having to oversubscribe memory budgets. However, a shared node budget means that a single table or tenant emitting a large volume of events can disrupt rangefeeds running across unrelated tables or tenants. A tiered structure with a shared node-level budget and a smaller per-replica budget is likely preferable. These budgets must be configurable, using either relative or absolute values.

Note that downstream consumers (e.g. changefeeds and streaming cluster replication) will generally implement their own buffering and backpressure mechanisms. As such, the internal rangefeed channel only needs to buffer events until they make it into the consumer's buffer, and the current limit can likely be reduced significantly. However, consumers may run on other nodes, so events may need to be transferred across the network, although in this case network buffers also come into play.

We should also keep in mind the need to do memory budgeting for catchup scans, see #69596.

Jira issue: CRDB-11667

Metadata

Labels

A-kv-replication (Relating to Raft, consensus, and coordination), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception)
