Skip to content

kvcoord: Correctly handle stuck rangefeeds#92582

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
miretskiy:rf
Nov 29, 2022
Merged

kvcoord: Correctly handle stuck rangefeeds#92582
craig[bot] merged 1 commit intocockroachdb:masterfrom
miretskiy:rf

Conversation

@miretskiy
Copy link
Copy Markdown
Contributor

Fixes #92570

A watcher responsible for restarting stuck range feeds may incorrectly cancel rangefeed if the downstream event consumer is slow.

Release note (bug fix): Fix incorrect cancellation logic when attempting to detect stuck range feeds.

@miretskiy miretskiy requested a review from a team as a code owner November 28, 2022 15:47
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

// ping notifies the canceler that the rangefeed has received an
// event, i.e. is making progress.
func (w *stuckRangeFeedCanceler) ping() {
func (w *stuckRangeFeedCanceler) do(fn func() error) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is currently structured, we don't need to return error from either of these functions. The closure can close over err and we can check it outside of do().

There are a couple of other approaches we can consider:

  • Have do() also do error handling, returning errRestartStuckRange when the rangefeed is stuck.

  • Return true if the rangefeed is stuck when the function returns.

These would allow us to get rid of the stuck() method and simplify the caller logic. I think I have a preference for returning errRestartStuckRange and encapsulating as much of the logic as possible inside do(), but it's not a strongly held view. The current approach is also fine.

Copy link
Copy Markdown
Contributor Author

@miretskiy miretskiy Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack -- I was thinking of doing that. Still, I think returning an error is better -- I prefer not to close over errors.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I can get rid of stuck() -- we call it in few places , and I think it just reads better when error handling is explicit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm not sure why we bother with returning errStuckRangefeed from do() in that case, since none of the callers use it anyway. You call though, nbd.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted return of errStuckRange.

Copy link
Copy Markdown
Contributor Author

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I can get rid of stuck() -- but other comments addressed.

Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments, the core logic seems fine. Preemptively approving once addressed.

We should also have a test, either for the DistSender or as a CDC roachtest, which makes sure a stalled sink won't cause undue restarts.

// ping notifies the canceler that the rangefeed has received an
// event, i.e. is making progress.
func (w *stuckRangeFeedCanceler) ping() {
func (w *stuckRangeFeedCanceler) do(fn func() error) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm not sure why we bother with returning errStuckRangefeed from do() in that case, since none of the callers use it anyway. You call though, nbd.

@miretskiy
Copy link
Copy Markdown
Contributor Author

ely approving once addressed.

We should also have a test, either for the DistSender or as a CDC roachtest

Added canceller test.

Fixes cockroachdb#92570

A watcher responsible for restarting stuck range feeds
may incorrectly cancel rangefeed if the downstream event
consumer is slow.

Release note (bug fix): Fix incorrect cancellation logic
when attempting to detect stuck range feeds.
@miretskiy
Copy link
Copy Markdown
Contributor Author

bors r+

1 similar comment
@miretskiy
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 29, 2022

Already running a review

@craig craig bot merged commit edc6fda into cockroachdb:master Nov 29, 2022
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 29, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvcoord: stuckRangeFeedCanceler can fire during event processing

3 participants