-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kvcoord: stuckRangeFeedCanceler can fire during event processing #92570
Description
In #86820, we added stuckRangeFeedCanceler which will restart stuck rangefeeds after a period of inactivity. This was meant as a mitigation for #86818. However, it can wrongfully fire if client-side event processing is slow as well (notably if the event sink is slow), and in this case it will not return errRestartStuckRange which is automatically retried by the DistSender, but instead a bare context cancellation error which is propagated back up to the client, causing the entire changefeed to restart.
The watcher is meant to fire here:
cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
Lines 643 to 654 in 98ab461
| event, err := stream.Recv() | |
| if err == io.EOF { | |
| return args.Timestamp, nil | |
| } | |
| if err != nil { | |
| if stuckWatcher.stuck() { | |
| afterCatchUpScan := catchupRes == nil | |
| return args.Timestamp, ds.handleStuckEvent(&args, afterCatchUpScan, stuckWatcher.threshold()) | |
| } | |
| return args.Timestamp, err | |
| } | |
| stuckWatcher.ping() // starts timer on first event only |
However, the ping registers a time.AfterFunc hook which cancels the rangefeed context here:
cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go
Lines 88 to 94 in 2675c7c
| w.t = time.AfterFunc(3*threshold/2, func() { | |
| // NB: important to store _stuck before canceling, since we | |
| // want the caller to be able to detect stuck() after ctx | |
| // cancels. | |
| atomic.StoreInt32(&w._stuck, 1) | |
| w.cancel() | |
| }) |
This can fire during event processing here:
cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
Lines 686 to 692 in 98ab461
| onRangeEvent(args.Replica.NodeID, desc.RangeID, event) | |
| select { | |
| case eventCh <- msg: | |
| case <-ctx.Done(): | |
| return args.Timestamp, ctx.Err() | |
| } |
In which case it returns the bare ctx.Err() rather than errRestartStuckRange. It should only apply to the Recv() call and similar upstream activity, not downstream activity.
Jira issue: CRDB-21873