
kvcoord: DistSender rangefeed bookkeeping had an off-by-one #91116

Merged
craig[bot] merged 1 commit into cockroachdb:master from ajwerner:ajwerner/fix-rangefeed-off-by-one on Nov 2, 2022

Conversation

@ajwerner (Contributor) commented Nov 2, 2022

It turns out that two commits, landed about two months apart, addressed off-by-one errors stemming from disagreements about whether the bounds of time intervals are inclusive or exclusive. In #79525 we added a `Next` call to compensate for the catch-up scan occurring at an inclusive time. In #82451 we made the catch-up scan act exclusively, as the rest of the kvserver code has assumed. The end result is that we now actually do the catch-up scan one tick later than we had intended.

This resulted in some flaky tests, and in cases where the closed timestamp pushed a writing transaction, it may have resulted in missing rows. This was uncovered while deflaking #90764. With some added logging we see:

```
I221102 01:31:44.444557 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:667  [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3882  RangeFeedEvent: span:<key:"\376\222\213" end_key:"\376\222\214" > resolved_ts:<wall_time:166735270430458388 >
E221102 01:31:44.445042 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:653  [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3886  RangeFeedError: retry rangefeed (REASON_RANGE_SPLIT)
I221102 01:31:44.480676 2388 sql/internal.go:1321  [nsql1,job=810294652971450369,scExec,id=106,mutation=1] 3947  txn committed at 1667352704.380458388,1
I221102 01:31:44.485558 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:420  [nsql1,rangefeed=lease] 3965  RangeFeed /Tenant/10/Table/{3-4} disconnected with last checkpoint 105.097693ms ago: retry rangefeed (REASON_RANGE_SPLIT)
```

Notice that the commit for the schema change occurred at 1667352704.380458388,1 and the resolved event was at 1667352704.380458388. As the code was before, we'd perform the catch-up scan at 1667352704.380458388,2 and miss the write we needed to see.
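The interaction above can be sketched with a reduced model of the timestamp logic. This is a hypothetical illustration, not the real kvcoord code: `Timestamp`, `Next`, `Less`, and `catchUpSees` below are simplified stand-ins for `hlc.Timestamp` and the catch-up scan's exclusive lower bound.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Next returns the next-smallest timestamp (one logical tick later).
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

// Less reports whether t orders strictly before u.
func (t Timestamp) Less(u Timestamp) bool {
	return t.WallTime < u.WallTime ||
		(t.WallTime == u.WallTime && t.Logical < u.Logical)
}

// catchUpSees reports whether a catch-up scan started at startTS
// (treated as an exclusive bound, per #82451) returns a write
// committed at writeTS.
func catchUpSees(startTS, writeTS Timestamp) bool {
	return startTS.Less(writeTS)
}

func main() {
	resolved := Timestamp{WallTime: 1667352704380458388}               // resolved event
	commit := Timestamp{WallTime: 1667352704380458388, Logical: 1}     // pushed schema-change txn

	// Buggy: bumping the resolved timestamp before an already-exclusive scan.
	fmt.Println(catchUpSees(resolved.Next(), commit)) // false: the write is missed

	// Fixed: start the exclusive scan at the resolved timestamp itself.
	fmt.Println(catchUpSees(resolved, commit)) // true
}
```

With an exclusive start bound, bumping the resolved timestamp with `Next` before the scan skips exactly the tick where the pushed transaction committed; starting at the resolved timestamp itself keeps that tick visible.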

Fixes #90764.

Release note (bug fix): Fixed a bug which, in rare cases, could result in a changefeed missing rows that occur around the time of a split, for writing transactions that take longer than the closed timestamp target duration (default 3s).

@ajwerner ajwerner requested a review from a team as a code owner November 2, 2022 02:01

@ajwerner ajwerner force-pushed the ajwerner/fix-rangefeed-off-by-one branch from aede747 to 46bbd61 on November 2, 2022 02:21
```go
// Timestamp field in the request is exclusive, meaning if we send
// the request with exactly the ResolvedTS, we'll see only rows after
// that timestamp.
args.Timestamp.Forward(t.ResolvedTS)
```
This stuff is SO subtle.. Argh. Excellent catch, @ajwerner
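The fix above replaces the extra `Next` bump with a `Forward`. As a rough sketch of the difference (again a simplified stand-in for `hlc.Timestamp`, not the real `hlc` package), `Forward` only ratchets the timestamp up to its argument rather than past it, so an exclusive catch-up scan started there still observes the very next logical tick:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp (illustration only).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less reports whether t orders strictly before u.
func (t Timestamp) Less(u Timestamp) bool {
	return t.WallTime < u.WallTime ||
		(t.WallTime == u.WallTime && t.Logical < u.Logical)
}

// Forward advances t to u only if u is later, mirroring the ratcheting
// behavior of hlc.Timestamp.Forward. Unlike Next, it never moves past u.
func (t *Timestamp) Forward(u Timestamp) {
	if t.Less(u) {
		*t = u
	}
}

func main() {
	args := Timestamp{WallTime: 100}    // request's current (exclusive) start
	resolved := Timestamp{WallTime: 200} // latest resolved timestamp

	args.Forward(resolved)
	fmt.Println(args.WallTime) // 200
}
```

Because the bound is already exclusive, ratcheting to the resolved timestamp is enough; any additional tick would skip writes committed at exactly the next logical timestamp.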

@ajwerner (Contributor, Author) commented Nov 2, 2022

TFTR

bors r+

@craig (bot) commented Nov 2, 2022

Build failed:

@ajwerner (Contributor, Author) commented Nov 2, 2022

bors r+

flaked on schemachange workload

@craig (bot) commented Nov 2, 2022

Build succeeded:

@craig craig bot merged commit 206fc07 into cockroachdb:master on Nov 2, 2022
@ajwerner (Contributor, Author) commented:

blathers backport 22.2

@ajwerner (Contributor, Author) commented:

blathers backport 22.1



Successfully merging this pull request may close these issues.

spanconfiglimiterccl: TestDataDriven/indexes is timing out

3 participants