release-14.0 backport: Fail VReplication workflows on errors that persist and unrecoverable errors #10573
Merged
shlomi-noach merged 3 commits into vitessio:release-14.0 on Jun 23, 2022
…errors (vitessio#10429)
* Fail workflow if same error persists too long. Fail for unrecoverable errors also in non-online ddl workflows
  Signed-off-by: Rohit Nayak <rohit@planetscale.com>
* Update max time default to 15m, was 1m for testing purposes
  Signed-off-by: Rohit Nayak <rohit@planetscale.com>
* Leverage vterrors for Equals; attempt to address my own nits
  Signed-off-by: Matt Lord <mattalord@gmail.com>
* sanity: validate range of vreplication_retry_delay and of vreplication_max_time_to_retry_on_error
  Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
* Fix flags test
  Signed-off-by: Rohit Nayak <rohit@planetscale.com>
* Remove leftover log.Flush()
  Signed-off-by: Rohit Nayak <rohit@planetscale.com>
* Revert validations min/max settings on retry delay since it is breaking unit tests that set the value to a very small value
  Signed-off-by: Rohit Nayak <rohit@planetscale.com>
* captilize per request
  Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
frouioui approved these changes Jun 23, 2022
GuptaManan100 approved these changes Jun 23, 2022
Back port of #10429
Description
As part of an initial design decision, VReplication workflows always retry after sleeping for 5 seconds when they encounter an error. The reasoning was that, for large reshards/migrations and perpetual Materialize workflows, we could often encounter recoverable errors such as a PRS (PlannedReparentShard), restarts of vttablets/MySQL servers, network partitions, etc. So rather than error out and wait for an operator to manually restart workflows, we decided to keep retrying.
Since we only retried every five seconds, any resource wastage from continuously retrying unrecoverable workflows would be small, and in most cases we would transparently recover and make forward progress with minimum downtime. This is especially important for Materialize workflows, where the user expects near-realtime performance.
Usually VReplication workflows would be set up manually, so the possibility of errors due to schema issues was minimal and this approach worked well. However, with the introduction of VReplication-based Online DDL workflows, we see a lot of automated use where user-specified DDLs are directly used to configure VReplication workflows. Incorrect DDLs can thus cause unrecoverable errors that are nevertheless retried for a prolonged time.
Error reporting in VReplication is also not great: we update the message column in the _vt.vreplication table, but that can get overwritten when we retry. We also log errors in the _vt.vreplication_log table.
A change was introduced recently in Online DDL workflows to mitigate this: we look up the error against a set of MySQL errors that we know are not recoverable, and on a match we put the workflow in an error state. There are then no more automated retries, and a manual restart after fixing the error is expected.
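The lookup described above can be sketched roughly as follows. The `sqlError` type, the `isUnrecoverable` function, and the particular set of codes are illustrative assumptions; they are not the list or the types Vitess actually uses. The numeric codes themselves are standard MySQL server error codes.

```go
package main

import "errors"

// sqlError is a hypothetical stand-in for a MySQL error that carries its
// numeric error code.
type sqlError struct {
	Code int
	Msg  string
}

func (e *sqlError) Error() string { return e.Msg }

// unrecoverableCodes is an illustrative set only; the real mapping may differ.
var unrecoverableCodes = map[int]bool{
	1264: true, // ER_WARN_DATA_OUT_OF_RANGE
	1364: true, // ER_NO_DEFAULT_FOR_FIELD
	1406: true, // ER_DATA_TOO_LONG
}

// isUnrecoverable reports whether err wraps a sqlError whose code is in the
// set above. Anything else, including nil, is treated as retryable.
func isUnrecoverable(err error) bool {
	var se *sqlError
	if !errors.As(err, &se) {
		return false
	}
	return unrecoverableCodes[se.Code]
}
```

A caller would check `isUnrecoverable(err)` before scheduling the next retry and move the workflow to an error state on a match. Errors absent from the set, such as a lost connection (code 2013), keep being retried.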
However, there are still unrecoverable schema-related errors that are not yet mapped, or do not map cleanly, to MySQL errors. There could also be misconfigured workflows (for example, no replicas in a keyspace when the tablet type is set to only replicas, or incorrect cell settings). Continuously retrying in such cases delays detection of the underlying problem.
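This PR:
* Fails workflows on unrecoverable errors in non-Online DDL workflows as well.
* Fails a workflow if the same error persists for longer than --vreplication_max_time_to_retry_on_error (default: 15 minutes).
For the above cases it directly moves the workflow to the Error state, which is then reported in Workflow Show.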