
kvserver,cdc: high rangefeed count leads to scheduler overload #96395

@irfansharif

Description
Describe the problem

In recent incidents we've observed a high number of rangefeeds resulting in CPU scheduler overload, causing excessive scheduling latencies. These scheduling latencies in turn contribute to high foreground latencies. All of this can happen even with CPU utilization far from saturation (<25%). The underlying cause is the various O(ranges) goroutines we run on both the client and server side; even after switching to MuxRangefeed, we have:

Client side:
  a. 1 per range in singleRangeFeed
 
    kvcoord           dist_sender_rangefeed.go:516             (*DistSender).singleRangeFeed.func3()
    kvcoord           dist_sender_rangefeed_canceler.go:100    (*stuckRangeFeedCanceler).do(*, *)
    kvcoord           dist_sender_rangefeed.go:515             (*DistSender).singleRangeFeed(#1050, {#184, *}, {{*, *, *}, {*, *, *}}, {*, ...}, ...)
    kvcoord           dist_sender_rangefeed.go:336             (*DistSender).partialRangeFeed(#1050, {#184, *}, *, {{*, *, *}, {*, *, *}}, ...)
    kvcoord           dist_sender_rangefeed.go:120             (*DistSender).RangeFeed.func1.1({#184, *})
    ctxgroup          ctxgroup.go:169                          Group.GoCtx.func1()
    errgroup          errgroup.go:57                           (*Group).Go.func1()

Server side:
  b. 1 per range blocked in RangeFeed

    kvserver          replica_rangefeed.go:254                 (*Replica).rangeFeedWithRangeID(*, #505, *, {#43378, *})
    kvserver          replica_rangefeed.go:150                 (*Replica).RangeFeed(...)
    kvserver          store.go:3082                            (*Store).RangeFeed(#1543, *, {#505, *})
    kvserver          stores.go:216                            (*Stores).RangeFeed(*, *, {#505, *})
    server            node.go:1227                             (*Node).RangeFeed(#505, *, {#505, *})
    rpc               context.go:753                           internalClientAdapter.RangeFeed.func1()

  c. 1 per range in the processor
  
    rangefeed         processor.go:269                         (*Processor).run(*, {#5, *}, *, *, #1177)
    rangefeed         processor.go:222                         (*Processor).Start.func1({#5, *})
    stop              stopper.go:494                           (*Stopper).RunAsyncTaskEx.func2()

  d. 1 per range in the registration output loop

    rangefeed         registry.go:310                          (*registration).outputLoop(*, {#485, *})
    rangefeed         registry.go:335                          (*registration).runOutputLoop(*, {#5, *}, *)
    rangefeed         processor.go:297                         (*Processor).run.func1({#5, *})
    stop              stopper.go:494                           (*Stopper).RunAsyncTaskEx.func2()

Known incidents:

To understand the impact, consider these metrics from an actual incident:

[image]

[image]

In these incidents we also see AC kicking in, but helplessly: it's unable to keep the runnable goroutine count stable under its configured threshold. Consider, for example, that the KV queue on n113 is firing during this period and observing wild swings between used and total slots (with used >> total at some points), so there are a lot of spurious goroutine wakeups. The p75 AC delay is quite high (1s+) across all CPU queues. Dividing the delay rate of n113's KV queue by its admission rate gives a mean latency of 277.78ms per request, which is very high.

[image]

[image]

To Reproduce

Run a changefeed over a table with many underlying ranges. We should start by adding a roachtest that reproduces this goroutine-count-induced latency hit; it's tracked in #95236.

Describe alternatives you've considered

  • Reducing the frequency of these goroutine wakeups has helped in incidents, e.g. by increasing kv.rangefeed.closed_timestamp_refresh_interval (read from kv.closed_timestamp.side_transport_interval) from 200ms to 2s.
  • Reducing the number of ranges over which we need to establish rangefeeds. Depending on the use case, we could either leverage the predicate pushdown introduced in https://www.cockroachlabs.com/docs/stable/cdc-transformations.html for primary keys, or add smarts to also consider secondary indexes.
  • @miretskiy has a prototype that shuts down rangefeeds over quiescent ranges for some period of time, reducing not only client-side goroutines but also the server-side ones.

Jira issue: CRDB-24101

Epic CRDB-26372

Labels

A-admission-control · A-kv-rangefeed (Rangefeed infrastructure, server+client) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)
