
Online DDL: identify VReplication retrying failure, terminate migration#9958

Closed
shlomi-noach wants to merge 2 commits into vitessio:main from planetscale:vitess-migration-liveness


@shlomi-noach
Contributor

Description

There is a set of errors that VReplication encounters but does not fail on; it just keeps retrying forever. Recently, #9538 introduced time_heartbeat. By looking at both time_heartbeat and time_updated, we can determine whether a VReplication stream is making progress: we take the greater of the two values, and if that value has not advanced for X minutes, then VReplication is effectively not doing its work.

The Online DDL executor now reads that value (the greater of the two), persists it in schema_migrations, and only updates liveness_timestamp of the (vitess strategy) migration if it sees an increase in the value.

In effect, if VReplication is stuck or retrying, liveness_timestamp does not increase. The executor then sees that the migration has not indicated liveness for a while, and eventually (the threshold is configured at 10min) terminates the migration and marks it as failed.

So this gives us a 10min reaction time to a hanging VReplication stream.

Related Issue(s)

Tracking: #6926

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Contributor Author

The only failing test is Upgrade Downgrade Testing Query Serving, which we are aware of and is unrelated to this PR.

@shlomi-noach
Contributor Author

On top of the fix in this PR, we also have #9973, which identifies specific error codes and completely fails the VReplication workflow on those errors. This PR then serves as a sort of "plan B" for when VReplication still insists on retrying and never makes progress.

@shlomi-noach
Contributor Author

Implicitly merged by #9973.

@shlomi-noach shlomi-noach deleted the vitess-migration-liveness branch March 24, 2022 18:25

Labels

Component: Query Serving, Type: Enhancement (logical improvement, somewhere between a bug and feature)
