kvcoord: DistSender rangefeed bookkeeping had an off-by-one #91116
Merged
craig[bot] merged 1 commit into cockroachdb:master on Nov 2, 2022
Member
It turns out that two commits, made about two months apart, each addressed an off-by-one error caused by disagreement over whether the bounds of a time interval are inclusive or exclusive. In cockroachdb#79525 we added a `Next()` call to compensate for the catch-up scan occurring at an inclusive time. In cockroachdb#82451 we made the catch-up scan exclusive, as the rest of the kvserver code had assumed. Together, the two changes meant that we performed the catch-up scan one tick later than intended.

This resulted in some flaky tests and, in cases where the closed timestamp pushed a writing transaction, may have resulted in missing rows. This was uncovered while deflaking cockroachdb#90764. With some added logging we see:

```
I221102 01:31:44.444557 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:667 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3882 RangeFeedEvent: span:<key:"\376\222\213" end_key:"\376\222\214" > resolved_ts:<wall_time:166735270430458388 >
E221102 01:31:44.445042 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:653 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3886 RangeFeedError: retry rangefeed (REASON_RANGE_SPLIT)
I221102 01:31:44.480676 2388 sql/internal.go:1321 [nsql1,job=810294652971450369,scExec,id=106,mutation=1] 3947 txn committed at 1667352704.380458388,1
I221102 01:31:44.485558 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:420 [nsql1,rangefeed=lease] 3965 RangeFeed /Tenant/10/Table/{3-4} disconnected with last checkpoint 105.097693ms ago: retry rangefeed (REASON_RANGE_SPLIT)
```

Notice that the commit for the schema change occurred at `1667352704.380458388,1` and the resolved event was at `1667352704.380458388`. With the code as it was before, we'd perform the catch-up scan at `1667352704.380458388,2` and miss the write we needed to see.

Fixes cockroachdb#90764.
Release note (bug fix): Fixed a bug that, in rare cases, could cause a changefeed to miss rows written around the time of a range split by transactions that take longer than the closed timestamp target duration (defaults to 3s).
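The interaction between the two earlier commits can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `Timestamp` here is a simplified, hypothetical stand-in for a hybrid logical clock timestamp (not CockroachDB's actual `hlc.Timestamp`), and `visibleToCatchUpScan` is a made-up helper that models an exclusive lower bound. The values mirror the resolved and commit timestamps quoted above.

```go
package main

import "fmt"

// Timestamp is a simplified, hypothetical hybrid logical clock timestamp:
// a wall time in nanoseconds plus a logical counter.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less reports whether t orders strictly before o.
func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Next returns the smallest timestamp after t (logical overflow is ignored
// in this sketch).
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

// visibleToCatchUpScan models a catch-up scan whose start bound is
// exclusive: only writes strictly after start are emitted.
func visibleToCatchUpScan(start, write Timestamp) bool {
	return start.Less(write)
}

func main() {
	resolved := Timestamp{WallTime: 1667352704380458388}
	commit := Timestamp{WallTime: 1667352704380458388, Logical: 1}

	// Old behavior: Next() on top of an already-exclusive bound skips the
	// write committed exactly one logical tick after the resolved timestamp.
	fmt.Println("with Next():   ", visibleToCatchUpScan(resolved.Next(), commit))

	// Fixed behavior: restart at the resolved timestamp itself.
	fmt.Println("without Next():", visibleToCatchUpScan(resolved, commit))
}
```

With an exclusive bound, restarting at the resolved timestamp already excludes everything at or below it, so the extra `Next()` only widens the gap by one tick.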
aede747 to 46bbd61
miretskiy
approved these changes
Nov 2, 2022
    // Timestamp field in the request is exclusive, meaning if we send
    // the request with exactly the ResolvedTS, we'll see only rows after
    // that timestamp.
    args.Timestamp.Forward(t.ResolvedTS)
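For readers unfamiliar with the ratcheting pattern in the line above, here is a minimal sketch of how a restart timestamp might be tracked across checkpoints. It assumes a simplified, hypothetical `Timestamp` type and a made-up `resumeTimestamp` helper; it is not the DistSender implementation, only an illustration of why `Forward` never moves the restart point backwards.

```go
package main

import "fmt"

// Timestamp is a simplified, hypothetical hybrid logical clock timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less reports whether t orders strictly before o.
func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Forward ratchets t up to o if o is larger and reports whether t changed.
func (t *Timestamp) Forward(o Timestamp) bool {
	if t.Less(o) {
		*t = o
		return true
	}
	return false
}

// resumeTimestamp (hypothetical helper) replays the resolved timestamps seen
// so far and returns the timestamp to restart from. Because the request
// timestamp is treated as exclusive, no Next() is applied to the result.
func resumeTimestamp(start Timestamp, checkpoints []Timestamp) Timestamp {
	ts := start
	for _, c := range checkpoints {
		ts.Forward(c)
	}
	return ts
}

func main() {
	start := Timestamp{WallTime: 100}
	// An older checkpoint (150) arriving after a newer one (200) is ignored.
	got := resumeTimestamp(start, []Timestamp{{WallTime: 200}, {WallTime: 150}})
	fmt.Println(got.WallTime, got.Logical)
}
```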
Contributor
This stuff is SO subtle.. Argh. Excellent catch, @ajwerner
Contributor
Author
TFTR bors r+
Contributor
Build failed:
Contributor
Author
bors r+ flaked on schemachange workload
Contributor
Build succeeded:
Contributor
Author
blathers backport 22.2
Contributor
Author
blathers backport 22.1
This was referenced Nov 11, 2022