scexec: treat BatchTimestampBeforeGCError as permanent by rafiss · Pull Request #139203 · cockroachdb/cockroach

rafiss · 2025-01-16T07:27:38Z

sql: fix ordering of arguments for IndexBackfillPlanner

The "now" argument was being passed as the "writeAsOf" parameter.

jobsprotectedts: fix calculation in TryToProtectBeforeGC

This function was attempting to compute 80% of the time until the GC
threshold is reached. However, it did not take into account that the
read was happening with a historical timestamp. Now this is part of the
calculation.

scexec: treat BatchTimestampBeforeGCError as permanent

This uses the same check and testing that was added in
a7cce24 for the materialized view
backfill.

fixes #126260

Release note (bug fix): Fixed a bug where the error
"batch timestamp T must be after replica GC threshold" could occur
during a schema change backfill operation, and cause the schema change
job to retry infinitely. Now this error is treated as permanent, and
will cause the job to enter the failed state.

The "now" argument was being passed as the "writeAsOf" parameter. Release note: None

This function was attempting to compute 80% of the time until the GC threshold is reached. However, it did not take into account that the read was happening with a historical timestamp. Now this is part of the calculation. Release note: None

cockroach-teamcity · 2025-01-16T07:27:51Z

This change is

fqazi

Nice find and good work!

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @kev-cao, and @rafiss)

pkg/sql/backfill_protected_timestamp_test.go line 416 at r3 (raw file):

			tableKey := ts.Codec().TablePrefix(tableID)

			// ctx, cancel = context.WithCancel(ctx)

Nit: delete

pkg/sql/index_backfiller.go line 124 at r1 (raw file):

		descriptor,
		now,
		progress.MinimumWriteTimestamp,

Nice find!

I wonder if that causes some of the index validations failures that we see randomly. Maybe worth adding a release note if we think it could cause the index counts to mismatch.

rafiss

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @fqazi, and @kev-cao)

pkg/sql/index_backfiller.go line 124 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Nice find!

I wonder if that causes some of the index validations failures that we see randomly. Maybe worth adding a release note if we think it could cause the index counts to mismatch.

i couldn't come up with a test case that would show a bug, but it does feel like something might be lurking. i'll skip the release note, since i'm not sure what to put in it without having a repro of a bug.

pkg/sql/backfill_protected_timestamp_test.go line 416 at r3 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Nit: delete

done

rafiss

tftr!

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @fqazi, and @kev-cao)

craig · 2025-01-16T15:24:00Z

Canceled.

rafiss · 2025-01-16T15:24:12Z

bors r+

rafiss · 2025-01-16T15:44:37Z

bors r-

craig · 2025-01-16T15:44:41Z

Canceled.

This uses the same check and testing that was added in a7cce24 for the materialized view backfill. Release note (bug fix): Fixed a bug where the error "batch timestamp T must be after replica GC threshold" could occur during a schema change backfill operation, and cause the schema change job to retry infinitely. Now this error is treated as permanent, and will cause the job to enter the failed state.

Release note: None

rafiss · 2025-01-16T19:01:27Z

bors r+

craig · 2025-01-16T20:00:08Z

Build succeeded:

blathers-crl · 2025-01-16T20:00:16Z

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.

Issue #126260: branch-release-24.1, branch-release-24.2, branch-release-24.3.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

When cockroachdb#73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of cockroachdb#63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in cockroachdb#139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in cockroachdb#92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error.

143270: kvserver: better obs in TestTxnReadWithinUncertaintyIntervalAfterRangeMerge r=tbg a=tbg Closes #143260. This is a complex test with a rare failure mode. Trace all of the relevant operations so that we can meaningfully engage with it. Epic: none Release note: none 143451: sql,backfill: avoid holding a protected timestamp during the backfill r=rafiss a=rafiss ### sql,rowexec: use WriteTS as timestamp for AddSSTable in backfill PR #64023 made it so we use the backfill's read TS as the batch TS for the AddSSTable operation. Simultaneously with that change, #63945 was being worked on, which made it so that we use a fixed _write_ timestamp for the keys that are backfilled. This write timestamp is the actual minimum lower bound timestamp for the backfill operation, and the read timstamp is allowed to change, so it makes more sense to use the write timestamp for the AddSSTable operation. Looking at the diffs in the PRs makes it seem pretty likely that this was just missed as part of a trivial branch skew. This commit does not cause any behavior change, since at the time of writing, the read timestamp and write timestamp are always identical for the backfill. Release note: None ### sql,backfill: choose new read timestamp when backfill job is resumed When #73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of #63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in #139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in #92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error. ### sql, backfill: use a current timestamp to scan each chunk of the source index There's no need for the backfill to use a fixed read timestamp for the entire job. Instead, each chunk of the original index that needs to be scanned can use a current timestamp. This allows us to remove all the logic for adding protected timestamps for the backfill. Release note (performance improvement): Schema changes that require data to be backfilled no longer hold a protected timestamp for the entire duration of the backfill, which means there is less overhead caused by MVCC garbage collection after the backfill completes. ### sql: stop setting readAsOf in index backfiller As of the parent commit, this is unused by the index backfiller. Release note: None --- fixes #140629 informs #142339 informs #142117 informs #141773 Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>

When cockroachdb#73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of cockroachdb#63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in cockroachdb#139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in cockroachdb#92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error.

rafiss added 2 commits January 16, 2025 02:26

sql: fix ordering of arguments for IndexBackfillPlanner

faad69d

The "now" argument was being passed as the "writeAsOf" parameter. Release note: None

rafiss force-pushed the index-backfill-gc-retry branch from 2bc00a3 to 3018b55 Compare January 16, 2025 08:05

rafiss added backport-24.1.x Flags PRs that need to be backported to 24.1. backport-24.2.x backport-24.3.x Flags PRs that need to be backported to 24.3 labels Jan 16, 2025

rafiss marked this pull request as ready for review January 16, 2025 14:49

rafiss requested review from a team as code owners January 16, 2025 14:49

rafiss requested review from annrpom, fqazi and kev-cao and removed request for a team January 16, 2025 14:49

fqazi approved these changes Jan 16, 2025

View reviewed changes

rafiss commented Jan 16, 2025

View reviewed changes

rafiss force-pushed the index-backfill-gc-retry branch from 3018b55 to f8ee046 Compare January 16, 2025 15:23

rafiss added 2 commits January 16, 2025 13:21

scerrors: make schemaChangerUserError implement SafeFormatError

3afe619

Release note: None

rafiss force-pushed the index-backfill-gc-retry branch from f8ee046 to 3afe619 Compare January 16, 2025 18:26

craig bot merged commit 0e13eaa into cockroachdb:master Jan 16, 2025

blathers-crl bot mentioned this pull request Jan 16, 2025

sql: schema change repeatedly retries with gcttl error #126260

Closed

rafiss deleted the index-backfill-gc-retry branch January 17, 2025 17:12

shailendra-patel mentioned this pull request Jan 25, 2025

sql: schema change job failure on drt-chaos #139800

Closed

rafiss mentioned this pull request Mar 25, 2025

sql,backfill: avoid holding a protected timestamp during the backfill #143451

Merged

rafiss mentioned this pull request Mar 26, 2025

release-25.1: sql,backfill: avoid holding a protected timestamp during the backfill #143520

Merged

rafiss mentioned this pull request Mar 27, 2025

release-24.3: sql,backfill: avoid holding a protected timestamp during the backfill #143594

Merged

rafiss mentioned this pull request Mar 27, 2025

release-24.1: sql,backfill: avoid holding a protected timestamp during the backfill #143595

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scexec: treat BatchTimestampBeforeGCError as permanent#139203

scexec: treat BatchTimestampBeforeGCError as permanent#139203
craig[bot] merged 4 commits intocockroachdb:masterfrom
rafiss:index-backfill-gc-retry

rafiss commented Jan 16, 2025

Uh oh!

cockroach-teamcity commented Jan 16, 2025

Uh oh!

fqazi left a comment

Uh oh!

rafiss left a comment

Uh oh!

rafiss left a comment

Uh oh!

craig bot commented Jan 16, 2025

Uh oh!

rafiss commented Jan 16, 2025

Uh oh!

rafiss commented Jan 16, 2025

Uh oh!

craig bot commented Jan 16, 2025

Uh oh!

rafiss commented Jan 16, 2025

Uh oh!

craig bot commented Jan 16, 2025

Uh oh!

blathers-crl bot commented Jan 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rafiss commented Jan 16, 2025

sql: fix ordering of arguments for IndexBackfillPlanner

jobsprotectedts: fix calculation in TryToProtectBeforeGC

scexec: treat BatchTimestampBeforeGCError as permanent

Uh oh!

cockroach-teamcity commented Jan 16, 2025

Uh oh!

fqazi left a comment

Choose a reason for hiding this comment

Uh oh!

rafiss left a comment

Choose a reason for hiding this comment

Uh oh!

rafiss left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Jan 16, 2025

Uh oh!

rafiss commented Jan 16, 2025

Uh oh!

rafiss commented Jan 16, 2025

Uh oh!

craig bot commented Jan 16, 2025

Uh oh!

rafiss commented Jan 16, 2025

Uh oh!

craig bot commented Jan 16, 2025

Uh oh!

blathers-crl bot commented Jan 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants