jobs: "system.job_info does not exist" during cluster upgrade #103239
Description
While testing some changes to the backup-restore/mixed-version roachtest, I saw a restore fail with the following error:
pq: CreateAdoptableJobInTxn: write-job-info-delete: relation "system.job_info" does not exist
This seems to happen when a RESTORE is run while the cluster is upgrading (with migrations running in the background). Since the error originates in the job layer, I believe the issue is unrelated to the restore logic itself.
Reproduction
#103228 contains the work-in-progress changes I was testing; the last commit in that PR is a series of changes that make the issue reproduce more quickly. Running the backup-restore/mixed-version test on that branch with a specific seed [1] (known to cause a restore to run during the upgrade) reproduces this bug roughly 10-20% of the time, in about 15 minutes.
For convenience, see the TeamCity run on the aforementioned PR [2], where we saw 2 failures out of 10 runs.
Let me know what else I can do to help debug this.
Update: an easier way to reproduce this bug seems to be to run the simpler acceptance/version-upgrade test with a seed that causes the schemachange workload to run concurrently with migrations. -8690666577594439584 is one such seed.
[1] 2167957990363226999
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=10059101&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildLog#_state=600
Jira issue: CRDB-27894