schemachange: attempting to update succeeded job over and over #38088
Description
A customer cluster got all gunked up because a schema change (or a table truncation?) fails a couple of times a second with the following amusing message:
W190603 13:33:12.043023 211 sql/schema_changer.go:1586 [n1] Error executing schema change: failed to update job 456021744522723331: cannot update progress on succeeded job (id 456021744522723331)
Why someone is trying to update the progress of a succeeded job, I do not know. Maybe two nodes are racing to finish the schema change?
The schema change in question is:
456021744522723331 | SCHEMA CHANGE | TRUNCATE TABLE <redacted> CASCADE
The table has id: 4191 and state: DROP and drop_job_id: 456021744522723331
These schema change retries kill us because each one seems to acquire and release the "schema change lease" for this table (I can see this by diffing consecutive versions of the descriptor). That eventually leaves the system config range unable to accept writes, because it has grown too big and it can't be split.
Debug.zip here (internal only)
@dt you want this one?