
rangefeed: memory accounting and budgeting #73616

@erikgrinaker

Description

Rangefeed processors (<=1 per replica) send events via an internal buffered channel, which currently has a capacity of 4096 events.

If the channel fills up, the call will block and backpressure Raft command application until a 50ms timeout expires, at which point the rangefeed is closed and the caller must run a catchup scan to resume the rangefeed:

```go
select {
case p.eventC <- ev:
case <-p.stoppedC:
	// Already stopped. Do nothing.
case <-time.After(timeout):
	// Sending on the eventC channel would have blocked.
	// Instead, tear down the processor and return immediately.
	p.sendStop(newErrBufferCapacityExceeded())
	return false
}
```

Currently, rangefeed events are generally small, i.e. single key/value pairs (although they can become quite large in some workloads). In #70434, we will begin propagating AddSSTable events, including SST contents, across rangefeeds; these will generally be much larger than key/value events. To prevent them from piling up in the buffer, and also to better handle workloads with large key/value pairs, we need to implement memory budgeting for this buffer.

It is unclear whether the memory budget should be at the replica, store, or node level. A node-level limit is appealing because memory needs can vary widely across replicas, so it avoids having to oversubscribe memory budgets. However, a shared node budget means that a single table or tenant emitting a large volume of events can disrupt rangefeeds running across unrelated tables or tenants. A tiered structure with a shared node-level budget and a smaller per-replica budget is likely preferable. These budgets must be configurable, using either relative or absolute values.

Note that downstream consumers (e.g. changefeeds and streaming cluster replication) will generally implement their own buffering and backpressure mechanisms. As such, the internal rangefeed channel only needs to buffer events until they make it into the consumer's buffer, and the current limit can likely be reduced significantly. However, consumers may run on other nodes, so events may need to be transferred across the network, although in this case network buffers also come into play.

We should also keep in mind the need to do memory budgeting for catchup scans, see #69596.

Jira issue: CRDB-11667

Metadata

Labels

A-kv-replication (Relating to Raft, consensus, and coordination), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception)
