kvcoord: stuckRangeFeedCanceler can fire during event processing #92570

@erikgrinaker

In #86820, we added stuckRangeFeedCanceler, which restarts stuck rangefeeds after a period of inactivity. This was meant as a mitigation for #86818. However, it can also fire wrongly when client-side event processing is slow (notably when the event sink is slow). In that case it does not return errRestartStuckRange, which the DistSender retries automatically, but a bare context cancellation error that propagates back up to the client and causes the entire changefeed to restart.

The watcher is meant to fire here:

    event, err := stream.Recv()
    if err == io.EOF {
        return args.Timestamp, nil
    }
    if err != nil {
        if stuckWatcher.stuck() {
            afterCatchUpScan := catchupRes == nil
            return args.Timestamp, ds.handleStuckEvent(&args, afterCatchUpScan, stuckWatcher.threshold())
        }
        return args.Timestamp, err
    }
    stuckWatcher.ping() // starts timer on first event only

However, the ping registers a time.AfterFunc hook which cancels the rangefeed context here:

    w.t = time.AfterFunc(3*threshold/2, func() {
        // NB: important to store _stuck before canceling, since we
        // want the caller to be able to detect stuck() after ctx
        // cancels.
        atomic.StoreInt32(&w._stuck, 1)
        w.cancel()
    })

This can fire during event processing here:

    onRangeEvent(args.Replica.NodeID, desc.RangeID, event)
    select {
    case eventCh <- msg:
    case <-ctx.Done():
        return args.Timestamp, ctx.Err()
    }

In that case, the function returns the bare ctx.Err() rather than errRestartStuckRange. The watcher should only cover the Recv() call and similar upstream activity, not downstream activity.

Jira issue: CRDB-21873

Metadata

Labels

A-kv-replication: Relating to Raft, consensus, and coordination.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
docs-known-limitation
