
release-14.0 backport: Fail VReplication workflows on errors that persist and unrecoverable errors #10573

Merged
shlomi-noach merged 3 commits into vitessio:release-14.0 from planetscale:v14-rn-vr-fail-on-repeated-errors
Jun 23, 2022

@shlomi-noach
Contributor

Back port of #10429


Description

As part of an initial design decision, VReplication workflows always retry when they encounter an error, after sleeping for 5 seconds. The reasoning was that, for large reshards/migrations and perpetual Materialize workflows, we could often encounter recoverable errors such as a PlannedReparentShard (PRS), restarts of vttablets or MySQL servers, network partitions, etc. So rather than erroring out and waiting for an operator to manually restart workflows, we decided to keep retrying.

Since we only retried every five seconds, any resource wastage due to continuously retrying unrecoverable workflows would be small, and in most cases we would transparently recover and make forward progress with minimal downtime. This is especially important for Materialize workflows, where the user expects near-real-time performance.

Usually VReplication workflows were set up manually, so the possibility of errors due to schema issues was minimal and this approach worked well. However, with the introduction of VReplication-based Online DDL workflows, we see a lot of automated use where user-specified DDLs are used directly to configure VReplication workflows. Incorrect DDLs can thus cause unrecoverable errors that lead to prolonged, futile retries.

Error reporting in VReplication is also not great: we update the message column in the _vt.vreplication table, but that can get overwritten when we retry. We also log errors in the _vt.vreplication_log table.

A change was introduced recently in Online DDL workflows to mitigate this: we look up the error against a set of MySQL errors that we know are not recoverable, and in that case we put the workflow in an error state. There are then no more automated retries, and a manual restart after fixing the error is expected.
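
The lookup described above could look something like the following sketch. It is illustrative only: `mysqlError`, `unrecoverableCodes`, and `isUnrecoverable` are hypothetical names, and the three MySQL error codes are examples chosen for this sketch rather than the actual list Vitess checks.

```go
package main

import "errors"

// mysqlError is a hypothetical stand-in for a MySQL error carrying a
// numeric server error code.
type mysqlError struct {
	Code    int
	Message string
}

func (e *mysqlError) Error() string { return e.Message }

// unrecoverableCodes is an illustrative set, NOT the real Vitess list.
var unrecoverableCodes = map[int]bool{
	1048: true, // ER_BAD_NULL_ERROR
	1264: true, // ER_WARN_DATA_OUT_OF_RANGE
	1406: true, // ER_DATA_TOO_LONG
}

// isUnrecoverable reports whether err is (or wraps) a MySQL error whose
// code is in the known-unrecoverable set. Anything else stays retryable.
func isUnrecoverable(err error) bool {
	var me *mysqlError
	if errors.As(err, &me) {
		return unrecoverableCodes[me.Code]
	}
	return false
}
```

With such a check, a "data too long" failure would stop the workflow immediately, while a lost connection (for example client error 2013) would still be retried.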

However, there are still unrecoverable schema-related errors that are not yet mapped or do not map cleanly to MySQL errors. There could also be misconfigured workflows (for example, no replicas in a keyspace when the tablet type is set to only replicas, or incorrect cell settings). Continuously retrying workflows in such cases can delay detecting the problem.

This PR:

  • extends the check for unrecoverable errors to all workflow types, not just Online DDLs
  • for all workflows, detects errors that persist for longer than the duration set by the 🚩 new vttablet flag
    --vreplication_max_time_to_retry_on_error (default: 15 minutes).

In both cases, the workflow is moved directly to the Error state, which is then reported in Workflow Show.

Checklist

  • "Backport me!" label has been added if this change should be backported
    • We should backport this to 14.0.0-rc, but no further
  • Tests were added or are not required
  • Documentation was added or is not required

…errors (vitessio#10429)

* Fail workflow if same error persists too long. Fail for unrecoverable errors also in non-online ddl workflows

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Update max time default to 15m, was 1m for testing purposes

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Leverage vterrors for Equals; attempt to address my own nits

Signed-off-by: Matt Lord <mattalord@gmail.com>

* sanity: validate range of vreplication_retry_delay and of vreplication_max_time_to_retry_on_error

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* Fix flags test

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Remove leftover log.Flush()

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* Revert validations min/max settings on retry delay since it is breaking unit tests that set the value to a very small value

Signed-off-by: Rohit Nayak <rohit@planetscale.com>

* captilize per request

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Co-authored-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@github-actions
Contributor

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@shlomi-noach shlomi-noach mentioned this pull request Jun 23, 2022
43 tasks
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach merged commit d60ee05 into vitessio:release-14.0 Jun 23, 2022
@shlomi-noach shlomi-noach deleted the v14-rn-vr-fail-on-repeated-errors branch June 23, 2022 15:57

Labels

  • Backport: This is a backport
  • Component: VReplication
  • Type: Enhancement: Logical improvement (somewhere between a bug and feature)

4 participants