While testing some changes to the backup-restore/mixed-version roachtest, I saw a restore fail with the following error:
pq: CreateAdoptableJobInTxn: write-job-info-delete: relation "system.job_info" does not exist
This seems to happen when a RESTORE is run while the cluster is upgrading (migrations running in the background). Since the error originates in the job layer, I believe the issue is unrelated to the restore logic itself.
Reproduction
#103228 contains the work-in-progress changes I was testing; the last commit in that PR is a series of changes that make the issue reproduce more quickly. Running the backup-restore/mixed-version test on that branch with a specific seed [1] (known to cause a restore to run during the upgrade) reproduces this bug roughly 10-20% of the time, in about 15 minutes per run.
For convenience, see the TC run on the aforementioned PR [2], where we saw 2 failures out of 10 runs.
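As a rough guide for anyone trying to reproduce this, the sketch below estimates how many runs are needed to see at least one failure. It assumes runs are independent and that the per-run failure rate is the 10-20% estimated above; the function names are illustrative and not part of roachtest.

```python
import math


def prob_at_least_one_failure(p: float, runs: int) -> float:
    """Probability of seeing one or more failures in `runs` independent runs,
    given a per-run failure probability `p` (assumed constant)."""
    return 1.0 - (1.0 - p) ** runs


def runs_needed(p: float, confidence: float = 0.95) -> int:
    """Smallest number of independent runs so that at least one failure
    is observed with probability >= `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    # 10% and 20% are the rough bounds from the report, not measured values.
    for p in (0.10, 0.20):
        print(f"p={p:.2f}: "
              f"P(>=1 failure in 10 runs)={prob_at_least_one_failure(p, 10):.2f}, "
              f"runs for 95% confidence={runs_needed(p)}")
```

Under these assumptions, a batch of 10 runs like the one in [2] may well come back clean at the lower end of the estimated rate, so a clean batch alone does not show the bug is gone.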
Let me know what else I can do to help debug this.
Update: an easier way to reproduce this bug seems to be running the simpler acceptance/version-upgrade test with a seed that causes the schemachange workload to run concurrently with migrations. -8690666577594439584 is one such seed.
[1] 2167957990363226999
[2] https://teamcity.cockroachdb.com/viewLog.html?buildId=10059101&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildLog#_state=600
Jira issue: CRDB-27894