Online DDL: identify VReplication retrying failure, terminate migration #9958
Closed
shlomi-noach wants to merge 2 commits into vitessio:main from
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
The only failing test is
On top of the fix in this PR, we also have #9973, which identifies specific error codes and completely fails the VReplication workflow on those errors. This PR then serves as a sort of "plan B" for when VReplication still insists on retrying and never makes progress.
Implicitly merged by #9973.
Description
There is a set of errors that VReplication encounters but does not fail on; it just keeps retrying forever. Recently, #9538 introduced `time_heartbeat`. By looking at both `time_heartbeat` and `time_updated`, we can determine whether a VReplication stream is making no progress. We look at the greater of the two values. If that value has not advanced for X minutes, then VReplication is effectively not doing its work.
The Online DDL executor now looks at that value (the greater of the two), persists it in `schema_migrations`, and only updates `liveness_timestamp` of the (`vitess` strategy) migration if it sees an increase in the value.

In effect, if VReplication is stuck/retrying, then `liveness_timestamp` does not increase. In turn, the executor will see that the migration has not indicated liveness in quite a while, and will eventually (configured at 10min) terminate the migration and mark it as `failed`.

So this gives us a `10min` reaction time to a hanging VReplication stream.

Related Issue(s)
Tracking: #6926
Checklist