jobs: "system.job_info does not exist" during cluster upgrade #103239

@renatolabs

Description

While testing some changes to the backup-restore/mixed-version roachtest, I saw a restore fail with the following error:

pq: CreateAdoptableJobInTxn: write-job-info-delete: relation "system.job_info" does not exist

This seems to happen when a RESTORE is run while the cluster is upgrading (migrations running in the background). Since the error originates in the jobs layer, I believe the issue is unrelated to the restore logic itself.

Reproduction

#103228 contains the work-in-progress changes I was testing; the last commit in that PR is a series of changes to make the issue reproduce more quickly. Running the backup-restore/mixed-version test on that branch with a specific seed [1] (known to cause a restore to run during the upgrade) reproduces this bug roughly 10-20% of the time, in about 15 minutes per run.

For convenience, see TC run on the aforementioned PR [2], where we saw 2 failures out of 10 runs.

Let me know what else I can do to help debug this.

Update: an easier way to reproduce this bug seems to be to run the simpler acceptance/version-upgrade test with a seed that causes the schemachange workload to run concurrently with migrations. -8690666577594439584 is one such seed.

[1] 2167957990363226999
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=10059101&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildLog#_state=600

Jira issue: CRDB-27894

Labels: A-jobs, C-bug, T-jobs