
kvserver,cdc: high rangefeed count leads to scheduler overload #96395

@irfansharif

Description
Describe the problem

In recent incidents we've observed a high number of rangefeeds resulting in CPU scheduler overload, causing excessive scheduling latencies. These scheduling latencies in turn contribute to high foreground latencies. All of this can happen even with CPU utilization far from saturation (<25%). The underlying cause is the various O(ranges) goroutines we run on both the client and server side; even after switching to MuxRangefeed, we have:

Client side:
  a. 1 per range in singleRangeFeed
 
    kvcoord           dist_sender_rangefeed.go:516             (*DistSender).singleRangeFeed.func3()
    kvcoord           dist_sender_rangefeed_canceler.go:100    (*stuckRangeFeedCanceler).do(*, *)
    kvcoord           dist_sender_rangefeed.go:515             (*DistSender).singleRangeFeed(#1050, {#184, *}, {{*, *, *}, {*, *, *}}, {*, ...}, ...)
    kvcoord           dist_sender_rangefeed.go:336             (*DistSender).partialRangeFeed(#1050, {#184, *}, *, {{*, *, *}, {*, *, *}}, ...)
    kvcoord           dist_sender_rangefeed.go:120             (*DistSender).RangeFeed.func1.1({#184, *})
    ctxgroup          ctxgroup.go:169                          Group.GoCtx.func1()
    errgroup          errgroup.go:57                           (*Group).Go.func1()

Server side:
  b. 1 per range blocked in RangeFeed

    kvserver          replica_rangefeed.go:254                 (*Replica).rangeFeedWithRangeID(*, #505, *, {#43378, *})
    kvserver          replica_rangefeed.go:150                 (*Replica).RangeFeed(...)
    kvserver          store.go:3082                            (*Store).RangeFeed(#1543, *, {#505, *})
    kvserver          stores.go:216                            (*Stores).RangeFeed(*, *, {#505, *})
    server            node.go:1227                             (*Node).RangeFeed(#505, *, {#505, *})
    rpc               context.go:753                           internalClientAdapter.RangeFeed.func1()

  c. 1 per range in the processor
  
    rangefeed         processor.go:269                         (*Processor).run(*, {#5, *}, *, *, #1177)
    rangefeed         processor.go:222                         (*Processor).Start.func1({#5, *})
    stop              stopper.go:494                           (*Stopper).RunAsyncTaskEx.func2()

  d. 1 per range in the registration output loop

    rangefeed         registry.go:310                          (*registration).outputLoop(*, {#485, *})
    rangefeed         registry.go:335                          (*registration).runOutputLoop(*, {#5, *}, *)
    rangefeed         processor.go:297                         (*Processor).run.func1({#5, *})
    stop              stopper.go:494                           (*Stopper).RunAsyncTaskEx.func2()

Known incidents:

To understand the impact, consider these metrics from an actual incident:

[image]

[image]

In these incidents we also see AC kicking in, but helplessly: it's unable to keep the runnable goroutine count stable under its configured threshold. Consider, for example, that the KV queue on n113 is firing during this period and observing wild swings between used and total slots (with used >> total at some points), so there are a lot of spurious goroutine wakeups. The p75 AC delay is quite high (1s+) across all CPU queues. Dividing the delay rate of n113's KV queue by its admission rate gives a mean latency of 277.78ms per request, which is very high.

[image]

[image]

To Reproduce

Run a changefeed over a table with many underlying ranges. We should start by adding a roachtest that reproduces this goroutine-count-induced latency hit; it's tracked in #95236.

Describe alternatives you've considered

  • Reducing the frequency of these goroutine wakeups has helped in incidents, e.g. by increasing kv.rangefeed.closed_timestamp_refresh_interval (read from kv.closed_timestamp.side_transport_interval) from 200ms to 2s.
  • Reducing the number of ranges over which we need to establish rangefeeds. Depending on the use case, we could either leverage the predicate pushdown introduced in https://www.cockroachlabs.com/docs/stable/cdc-transformations.html for primary keys, or add smarts to also consider secondary indexes.
  • @miretskiy has a prototype that shuts down rangefeeds over quiescent ranges for some period of time, reducing not only client-side goroutines but also the server-side ones.

Jira issue: CRDB-24101

Epic CRDB-26372

Labels

A-admission-control · A-kv-rangefeed (Rangefeed infrastructure, server+client) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)
