scexec: treat BatchTimestampBeforeGCError as permanent#139203
scexec: treat BatchTimestampBeforeGCError as permanent#139203craig[bot] merged 4 commits intocockroachdb:masterfrom
Conversation
The "now" argument was being passed as the "writeAsOf" parameter. Release note: None
This function was attempting to compute 80% of the time until the GC threshold is reached. However, it did not take into account that the read was happening with a historical timestamp. Now this is part of the calculation. Release note: None
2bc00a3 to
3018b55
Compare
fqazi
left a comment
There was a problem hiding this comment.
Nice find and good work!
Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @kev-cao, and @rafiss)
pkg/sql/backfill_protected_timestamp_test.go line 416 at r3 (raw file):
tableKey := ts.Codec().TablePrefix(tableID) // ctx, cancel = context.WithCancel(ctx)
Nit: delete
pkg/sql/index_backfiller.go line 124 at r1 (raw file):
descriptor, now, progress.MinimumWriteTimestamp,
Nice find!
I wonder if that causes some of the index validations failures that we see randomly. Maybe worth adding a release note if we think it could cause the index counts to mismatch.
rafiss
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @fqazi, and @kev-cao)
pkg/sql/index_backfiller.go line 124 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Nice find!
I wonder if that causes some of the index validations failures that we see randomly. Maybe worth adding a release note if we think it could cause the index counts to mismatch.
i couldn't come up with a test case that would show a bug, but it does feel like something might be lurking. i'll skip the release note, since i'm not sure what to put in it without having a repro of a bug.
pkg/sql/backfill_protected_timestamp_test.go line 416 at r3 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
Nit: delete
done
rafiss
left a comment
There was a problem hiding this comment.
tftr!
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @annrpom, @fqazi, and @kev-cao)
3018b55 to
f8ee046
Compare
|
Canceled. |
|
bors r+ |
|
bors r- |
|
Canceled. |
This uses the same check and testing that was added in a7cce24 for the materialized view backfill. Release note (bug fix): Fixed a bug where the error "batch timestamp T must be after replica GC threshold" could occur during a schema change backfill operation, and cause the schema change job to retry infinitely. Now this error is treated as permanent, and will cause the job to enter the failed state.
Release note: None
f8ee046 to
3afe619
Compare
|
bors r+ |
|
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #126260: branch-release-24.1, branch-release-24.2, branch-release-24.3. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
When cockroachdb#73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of cockroachdb#63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in cockroachdb#139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in cockroachdb#92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error.
143270: kvserver: better obs in TestTxnReadWithinUncertaintyIntervalAfterRangeMerge r=tbg a=tbg Closes #143260. This is a complex test with a rare failure mode. Trace all of the relevant operations so that we can meaningfully engage with it. Epic: none Release note: none 143451: sql,backfill: avoid holding a protected timestamp during the backfill r=rafiss a=rafiss ### sql,rowexec: use WriteTS as timestamp for AddSSTable in backfill PR #64023 made it so we use the backfill's read TS as the batch TS for the AddSSTable operation. Simultaneously with that change, #63945 was being worked on, which made it so that we use a fixed _write_ timestamp for the keys that are backfilled. This write timestamp is the actual minimum lower bound timestamp for the backfill operation, and the read timstamp is allowed to change, so it makes more sense to use the write timestamp for the AddSSTable operation. Looking at the diffs in the PRs makes it seem pretty likely that this was just missed as part of a trivial branch skew. This commit does not cause any behavior change, since at the time of writing, the read timestamp and write timestamp are always identical for the backfill. Release note: None ### sql,backfill: choose new read timestamp when backfill job is resumed When #73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of #63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in #139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in #92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error. ### sql, backfill: use a current timestamp to scan each chunk of the source index There's no need for the backfill to use a fixed read timestamp for the entire job. Instead, each chunk of the original index that needs to be scanned can use a current timestamp. This allows us to remove all the logic for adding protected timestamps for the backfill. Release note (performance improvement): Schema changes that require data to be backfilled no longer hold a protected timestamp for the entire duration of the backfill, which means there is less overhead caused by MVCC garbage collection after the backfill completes. ### sql: stop setting readAsOf in index backfiller As of the parent commit, this is unused by the index backfiller. Release note: None --- fixes #140629 informs #142339 informs #142117 informs #141773 Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
When cockroachdb#73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of cockroachdb#63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in cockroachdb#139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in cockroachdb#92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error.
When cockroachdb#73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of cockroachdb#63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in cockroachdb#139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in cockroachdb#92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error.
When cockroachdb#73861 got merged, we seem to have accidentally made it so the index backfiller will never select a new read timestamp. That is contrary to the goals of cockroachdb#63945, where we initially added the ability to advance the read timestamp forward. This mistake is why we were seeing behavior where a GC threshold error would cause the backfill to retry infinitely (which in turn led us to treat as a permanent error in cockroachdb#139203, which was something we never should have done). We never noticed the bug because when CREATE INDEX was added to the declarative schema changer in cockroachdb#92128, we turned off the declarative schema changer in the test that would have caught this (i.e. TestIndexBackfillAfterGC). This commit makes it so we choose a current timestamp as the read timestamp when planning the backfill, and fixes up that test to show that this works for index backfills in the declarative schema changer. Release note (bug fix): Fixed a bug where a GC threshold error (which appears as "batch timestamp must be after replica GC threshold ...") could cause a schema change that backfills data to fail. Now, the error will cause the backfill to be retried at a higher timestamp to avoid the error.
sql: fix ordering of arguments for IndexBackfillPlanner
The "now" argument was being passed as the "writeAsOf" parameter.
jobsprotectedts: fix calculation in TryToProtectBeforeGC
This function was attempting to compute 80% of the time until the GC
threshold is reached. However, it did not take into account that the
read was happening with a historical timestamp. Now this is part of the
calculation.
scexec: treat BatchTimestampBeforeGCError as permanent
This uses the same check and testing that was added in
a7cce24 for the materialized view
backfill.
fixes #126260
Release note (bug fix): Fixed a bug where the error
"batch timestamp T must be after replica GC threshold" could occur
during a schema change backfill operation, and cause the schema change
job to retry infinitely. Now this error is treated as permanent, and
will cause the job to enter the failed state.