jobs: batch writes when backfilling job_type column in system.jobs#105750
craig[bot] merged 1 commit into cockroachdb:master from
Conversation
Force-pushed from f2dce87 to 72e567f
Force-pushed from 72e567f to bdf5b68
andyyang890
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @jayshrivastava)
pkg/upgrade/helpers.go line 25 at r1 (raw file):
```go
// prevents large backfill migrations from failing continuously due to
// contention.
var JobsBackfillBatchSize = envutil.EnvOrDefaultInt("COCKROACH_UPGRADE_JOB_BACKFILL_BATCH", 100)
```
Just curious, why do we use an environment variable here? Is it to let clients change this setting themselves? If it's just for tests, wouldn't a regular variable suffice?
jayshrivastava
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dt)
pkg/upgrade/helpers.go line 25 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
Just curious, why do we use an environment variable here? Is it to let clients change this setting themselves? If it's just for tests, wouldn't a regular variable suffice?
@dt added that originally in #104545. It's used in prod, so having this knob may be useful in case things fail. I'm not sure why we use it instead of a cluster setting though.
andyyang890
left a comment
Reviewed 5 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @dt)
pkg/upgrade/helpers.go line 25 at r1 (raw file):
Previously, jayshrivastava (Jayant) wrote…
@dt added that originally in #104545. It's used in prod, so having this knob may be useful in case things fail. I'm not sure why we use it instead of a cluster setting though.
Ah I see, I guess a followup question is why are we moving it into this file/package?
It seems like it should stay in the file for the upgrade that it's specifically tied to. And also maybe another knob should be created for this upgrade instead of sharing it between the two?
andyyang890
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @jayshrivastava)
pkg/upgrade/upgrades/alter_jobs_add_job_type.go line 39 at r1 (raw file):
```go
const backfillTypeColumnStmt = `
```
Also, is there a reason to not use something like this, where `$1` will be the batch size?

```sql
UPDATE system.jobs
SET job_type = crdb_internal.job_payload_type(payload)
WHERE job_type IS NULL
LIMIT $1
```
jayshrivastava
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andyyang890, and @dt)
pkg/upgrade/helpers.go line 25 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
Ah I see, I guess a followup question is why are we moving it into this file/package?
It seems like it should stay in the file for the upgrade that it's specifically tied to. And also maybe another knob should be created for this upgrade instead of sharing it between the two?
Eh maybe. The job info table is bigger than system.jobs, but they are still O(num jobs). That's why I feel that they can use the same batch size. I felt it might be cleaner to put in the helpers.go file in the upgrade package and share it in both migrations (and future migrations that change the jobs tables).
pkg/upgrade/upgrades/alter_jobs_add_job_type.go line 39 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
Also, is there a reason to not use something like this, where `$1` will be the batch size?

```sql
UPDATE system.jobs
SET job_type = crdb_internal.job_payload_type(payload)
WHERE job_type IS NULL
LIMIT $1
```
Using the ID lets you create log messages containing the ID for more observability while the backfill is progressing. `pkg/upgrade/upgrades/backfill_job_info_table_migration.go` does something similar, presumably for that reason. I use a CTE and select at the end because `RETURNING` clauses don't support things like `max(id)`.

Using the ID method also doesn't miss any jobs because the set of jobs is finite: newly created jobs will populate the `job_type` column created by the migration immediately before this one.
andyyang890
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @jayshrivastava)
pkg/upgrade/helpers.go line 25 at r1 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Eh maybe. The job info table is bigger than system.jobs, but they are still O(num jobs). That's why I feel that they can use the same batch size. I felt it might be cleaner to put in the helpers.go file in the upgrade package and share it in both migrations (and future migrations that change the jobs tables).
I would still argue for leaving it in the upgrades package because it's ultimately an implementation detail of these specific upgrades. The upgrade package is for the actual upgrade infrastructure.
It still feels slightly icky to me to share the knob between two different migrations, especially as they touch different tables, but if you feel strongly about it, I guess it's fine.
pkg/upgrade/upgrades/alter_jobs_add_job_type.go line 39 at r1 (raw file):
Previously, jayshrivastava (Jayant) wrote…
Using the ID lets you create log messages containing the ID for more observability while the backfill is progressing. `pkg/upgrade/upgrades/backfill_job_info_table_migration.go` does something similar, presumably for that reason. I use a CTE and select at the end because `RETURNING` clauses don't support things like `max(id)`. Using the ID method also doesn't miss any jobs because the set of jobs is finite: newly created jobs will populate the `job_type` column created by the migration immediately before this one.
Ah got it.
Force-pushed from bdf5b68 to ede38ce
jayshrivastava
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @andyyang890, and @dt)
pkg/upgrade/helpers.go line 25 at r1 (raw file):
Previously, andyyang890 (Andy Yang) wrote…
I would still argue for leaving it in the `upgrades` package because it's ultimately an implementation detail of these specific upgrades. The `upgrade` package is for the actual upgrade infrastructure.

It still feels slightly icky to me to share the knob between two different migrations, especially as they touch different tables, but if you feel strongly about it, I guess it's fine.
Yeah that makes sense. Done.
andyyang890
left a comment
Reviewed 1 of 6 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @jayshrivastava)
Previously, the migration which backfills the `job_type` column in `system.jobs` could hang indefinitely due to contention on the jobs table. This change updates the migration to batch the backfill using multiple transactions.

Informs: cockroachdb#105663

Epic: None
Force-pushed from ede38ce to df30a45
bors r=andyyang890
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from df30a45 to blathers/backport-release-23.1-105750: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
In cockroachdb#105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column.

Here, we apply the same fix that we applied in cockroachdb#104752 for the same type of bug.

Fixes cockroachdb#106347
Fixes cockroachdb#106246

Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result in some job records with no job_type value.
106236: admission: avoid recursive grant chain r=irfansharif a=irfansharif

Fixes #105185. Fixes #105613.

In #97599 we introduced a non-blocking admission interface for below-raft, replication admission control. When doing so, we unintentionally violated the 'requester' interface -- when 'requester.granted()' is invoked, the granter expects to admit a single queued request. The code layering made it so that after granting one, when doing post-hoc token adjustments, if we observed the granter was exhausted but no longer was so, we'd try to grant again. This resulted in admitting work recursively, with a call stack as deep as the admit chain. Not only is that undesirable design-wise, it also triggered panics in the granter, which wasn't expecting more than one request to be admitted. Recursively we were incrementing the grant chain index, which overflowed (it was an int8, so this happened readily with long enough admit chains), after which we panicked when using a negative index to access an array. We add a test that fails without the changes.

The failure can also be triggered by running:

```
dev test pkg/kv/kvserver -f TestFlowControlGranterAdmitOneByOne -v --show-logs
```

with the diff below applied (which reverts to the older, buggy behavior):

```diff
diff --git i/pkg/util/admission/granter.go w/pkg/util/admission/granter.go
index ba42213..7c526fbb3d8 100644
--- i/pkg/util/admission/granter.go
+++ w/pkg/util/admission/granter.go
@@ -374,7 +374,7 @@ func (cg *kvStoreTokenChildGranter) storeWriteDone(
 func (cg *kvStoreTokenChildGranter) storeReplicatedWorkAdmittedLocked(
 	originalTokens int64, admittedInfo storeReplicatedWorkAdmittedInfo,
 ) (additionalTokens int64) {
-	return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, false /* canGrantAnother */)
+	return cg.parent.storeReplicatedWorkAdmittedLocked(cg.workClass, originalTokens, admittedInfo, true /* canGrantAnother */)
 }
```

Release note: None

106378: upgrades: fix txn retry bug in upgrade batching r=adityamaru a=stevendanna

In #105750 we split the backfill of the job_type column across multiple transactions. This introduced a bug in which we would modify the resumeAfter variable that controlled the batching before the transaction succeeded. In the face of a transaction retry, this would result in some rows not having their job_type column populated. This was caught in nightly tests that attempted to use the crdb_internal.system_jobs virtual index on the job_type column.

Here, we apply the same fix that we applied in #104752 for the same type of bug.

Fixes #106347
Fixes #106246

Release note (bug fix): Fixes a bug where a transaction retry during the backfill of the job_type column in the jobs table could result in some job records with no job_type value.

106408: ci: remove `lint` job from GitHub CI r=rail a=rickystewart

With `staticcheck` and `unused` working identically under `lint` in Bazel and `make` now, it's time! Delete this file so that GitHub CI lint stops running. This is the *last* GitHub CI job. :) Now only Bazel builds and tests will run on PRs.

Epic: CRDB-15060

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>