bulkio, sql: mutations with job IDs are persisted in backups and cause problems after restoring #44019

@thoszhang

Description

When taking backups, we currently preserve the contents of mutationJobs and gc_mutations on the table descriptors, which contain job IDs for the schema change. This causes problems when restoring the backup on a different cluster, where jobs with those IDs don't exist.

For a simple reproduction (with gc_mutations), create a table with an index and drop the index, which leaves a job in the "waiting for GC TTL" state, and then take a backup of the table:

root@:26257/defaultdb> create table t (a int, index (a));
CREATE TABLE

Time: 10.054ms

root@:26257/defaultdb> insert into t values (1), (2), (3);
INSERT 3

Time: 3.78ms

root@:26257/defaultdb> drop index t@t_a_idx;
DROP INDEX

Time: 112.24ms

root@:26257/defaultdb> show jobs;
        job_id       |   job_type    |              description              | statement | user_name | status  |   running_status   |             created              |             started              | finished |             modified             | fraction_completed | error | coordinator_id
+--------------------+---------------+---------------------------------------+-----------+-----------+---------+--------------------+----------------------------------+----------------------------------+----------+----------------------------------+--------------------+-------+----------------+
  521128172017778689 | SCHEMA CHANGE | DROP INDEX defaultdb.public.t@t_a_idx |           | root      | running | waiting for GC TTL | 2020-01-15 16:34:57.023391+00:00 | 2020-01-15 16:34:57.048795+00:00 | NULL     | 2020-01-15 16:34:57.095392+00:00 |                  0 |       |           NULL
(1 row)

Time: 6.022ms

root@:26257/defaultdb> backup t to 'nodelocal:///backup';
        job_id       |  status   | fraction_completed | rows | index_entries | system_records | bytes
+--------------------+-----------+--------------------+------+---------------+----------------+-------+
  521128953384796161 | succeeded |                  1 |    3 |             0 |              0 |    57
(1 row)

Time: 50.286ms

Inspecting the backup with protoc --decode_raw shows that field 33, which is gc_mutations, is preserved:

    33 {
      1: 2
      2: 1579106097092344000
      3: 521128172017778689
    }

Then start a fresh cluster, restore the table, and do something to trigger the async schema changer on that table, such as updating the zone config to change gc.ttlseconds:

root@:26257/defaultdb> restore table t from 'nodelocal:///backup';
        job_id       |  status   | fraction_completed | rows | index_entries | system_records | bytes
+--------------------+-----------+--------------------+------+---------------+----------------+-------+
  521133970934759425 | succeeded |                  1 |    3 |             0 |              0 |    57
(1 row)

Time: 63.545ms

root@:26257/defaultdb> alter table t configure zone using gc.ttlseconds='10';
CONFIGURE ZONE 1

Time: 7.229ms

Eventually, log statements like these appear several times per second (I tested this with both a local 3-node cluster and a 1-node cluster):

I200115 17:39:13.936893 159 sql/sqlbase/structured.go:1516  [n1,scExec] publish: descID=52 (t) version=7 mtime=1970-01-01 00:00:00 +0000 UTC
W200115 17:39:13.938967 159 sql/schema_changer.go:1991  [n1] Error executing schema change: job with ID 521128172017778689 does not exist
I200115 17:39:13.943209 159 sql/sqlbase/structured.go:1516  [n1,scExec] publish: descID=52 (t) version=7 mtime=1970-01-01 00:00:00 +0000 UTC
W200115 17:39:13.945117 159 sql/schema_changer.go:1991  [n1] Error executing schema change: job with ID 521128172017778689 does not exist

I believe this is happening because SchemaChanger.maybeGCMutations() is getting called in a loop (from SchemaChanger.exec()), and it calls Publish() but never succeeds because the callback to update the job status on success can't find the job: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/schema_changer.go#L789

I'm not sure why the attempts occur multiple times per second, or how much of this is caused by retries inside Publish()/PublishMultiple() versus retries in the SchemaChangeManager. The retries seem to happen more frequently on a 3-node cluster.
