VReplication: fail on unrecoverable errors by shlomi-noach · Pull Request #9973 · vitessio/vitess

shlomi-noach · 2022-03-24T08:27:44Z

Description

In this PR we identify a specific set of MySQL errors for which VReplication now fails.
Normally, when vreplication/controller sees an error, it backs off and attempts a new vreplication stream, again and again and again.

Now, we identify some errors for which it just doesn't make sense to continue. There errors are caught by vplayer and vcopier, and propagated all the way up to the controller.

If the controller sees these specific errors, it:

sets state to Error (binlogplayer.BlpError)
ensures the message indicates the error
quits

The specific errors are found in vreplication/utils.go, and generally include:

Attempt to assign an invalid value (value too long, invalid character set, invalid enum value, NULL to non-nullable column)
Unique key violation
No default value for column

These failures can only originate with incompatible data types or schemas in source & target tables.

In addition, OnlineDDL executor now intercepts binlogplayer.BlpError state in _vt.vreplication and proceeds to fail the migration. Previously (#9958) we would just time out. This reduces migration failure response from up to 10min to up to 1min.

Related Issue(s)

Initial commit does not include tests yet. I'll add the tests in onlineddl/vrepl_suite so that Online DDL tests the VReplication changes. It makes most sense, because OnlineDDL is the only flow where we even expect these type of failures.

Tests TBD

Checklist

Should this PR be backported?
Tests were added or are not required
Documentation was added or is not required

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

…vrepl-unrecoverable-error Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

rohit-nayak-ps · 2022-03-24T09:21:35Z

The initial Error state related changes look good.

I am a little wary about adding the new state since we might have state-conditional code that might confuse the Error state with another one. But you should be able to surface any issues while manually testing any of the error conditions. I will run some workflows manually on my end, injecting artificial errors, as well once you have the PR done.

For now I identified that we need to handle the new state at these points, by checking additionally for the Error state:

vreplication/controller.go:100

	// Nothing to do if replication is stopped.
	if params["state"] == binlogplayer.BlpStopped {
		ct.cancel = func() {}
		close(ct.done)
		return ct, nil
	}

vreplication/engine.go:758

		if qr.Rows[0][1].ToString() == binlogplayer.BlpStopped {
			return fmt.Errorf("replication has stopped at %v before reaching position %v, message: %s", current, mPos, qr.Rows[0][2].ToString())
		}

vreplicator.go:210

		// If any of the operations below changed state to Stopped, we should return.
		if settings.State == binlogplayer.BlpStopped {
			return nil
		}

Also the (VExec) Workflow commands:

workflow/server.go:401

		switch {
		case strings.Contains(strings.ToLower(stream.Message), "error"):
			stream.State = "Error"
		case stream.State == "Running" && len(stream.CopyStates) > 0:
			stream.State = "Copying"
		case stream.State == "Running" && int64(time.Now().Second())-timeUpdatedSeconds > 10:
			stream.State = "Lagging"
		}

vexec.go:646

func updateState(message, state string, cs []copyState, timeUpdated int64) string {
	if strings.Contains(strings.ToLower(message), "error") {
		state = "Error"
	} else if state == "Running" && len(cs) > 0 {
		state = "Copying"
	} else if state == "Running" && int64(time.Now().Second())-timeUpdated > 10 /* seconds */ {
		state = "Lagging"
	}
	return state
}

rohit-nayak-ps · 2022-03-24T09:21:54Z

You also need to cherry-pick the fix to the unit tests ...

shlomi-noach · 2022-03-24T09:22:48Z

You also need to cherry-pick the fix to the unit tests ...

Ah, yes, thank you!

shlomi-noach · 2022-03-24T09:24:17Z

@rohit-nayak-ps I'm thinking, we can do this in smaller steps, and only break out of the controller when these errors are encountered AND this is an OnlineDDL workflow type. What do you think?

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

rohit-nayak-ps · 2022-03-24T09:35:42Z

@rohit-nayak-ps I'm thinking, we can do this in smaller steps, and only break out of the controller when these errors are encountered AND this is an OnlineDDL workflow type. What do you think?

Are you saying you will only check and go into the Error state for OnlineDDLs?

shlomi-noach · 2022-03-24T09:37:34Z

Are you saying you will only check and go into the Error state for OnlineDDLs?

Yes, I propose we will only go into the error state, as well as completely quit the workflow -- only in online ddl. Does this increase confidence in the change?

rohit-nayak-ps · 2022-03-24T09:39:26Z

Yes, I propose we will only go into the error state, as well as completely quit the workflow -- only in online ddl. Does this increase confidence in the change?

Good idea. Let's do that.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach · 2022-03-24T10:00:39Z

Good idea. Let's do that.

@rohit-nayak-ps done. Do I still need to worry about the new state in all aforementioned locations?

I'm also thinking, if this is too complex for now, we can add a column called unrecoverable_error (bool) in _vt.vreplication and update that column, instead of updating the state.

What do you think?

rohit-nayak-ps · 2022-03-24T10:08:33Z

@rohit-nayak-ps done. Do I still need to worry about the new state in all aforementioned locations?

I'm also thinking, if this is too complex for now, we can add a column called unrecoverable_error (bool) in _vt.vreplication and update that column, instead of updating the state.

If you have tested one or two error situations in OnlineDDL then this approach should be fine. It is something we have been planning to add to VReplication workflows anyway and this will be a great start towards it.

We can update the other locations when we expand error handling to other workflows.

shlomi-noach · 2022-03-24T10:23:30Z

sounds good! Yes, tested locally many situations and shortly formally adding tests.

shlomi-noach · 2022-03-24T10:30:44Z

Oh I should note this PR extends #9958 ; so it looks a bit bigger than it would when #9958 is merged.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach · 2022-03-24T11:54:23Z

Tests added!

shlomi-noach · 2022-03-24T11:54:38Z

Ready to review.

rohit-nayak-ps

lgtm

shlomi-noach · 2022-03-24T18:22:48Z

OK then!

mattlord

I'm late the party but this LGTM! Thanks!

* Online DDL: identify VReplication retrying failure, terminate migration Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * add VReplicationWorkflowType Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * add workflow_type column to _vt.vreplication Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * online DDL executor sets workflow_type value in _vt.vreplication Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * setting strict sql_mode for OnlineDDL type workflows Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * VReplication: fail on unrecoverable errors Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * mistakenly hid variable Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * only bail out if workflow type is onlineddl Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * expect, then actual, not the other way around Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * added tests for failure scenarios Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach added 7 commits March 23, 2022 13:53

Online DDL: identify VReplication retrying failure, terminate migration

7c191ae

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

add VReplicationWorkflowType

2b2a5da

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

add workflow_type column to _vt.vreplication

fad06aa

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

online DDL executor sets workflow_type value in _vt.vreplication

0740baa

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

setting strict sql_mode for OnlineDDL type workflows

6679ad6

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

Merge remote-tracking branch 'origin/vitess-migration-liveness' into …

5b0a148

…vrepl-unrecoverable-error Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

VReplication: fail on unrecoverable errors

4127ce6

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach requested a review from rohit-nayak-ps March 24, 2022 08:27

shlomi-noach added Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication release notes none labels Mar 24, 2022

mistakenly hid variable

3cff5c7

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach mentioned this pull request Mar 24, 2022

Online DDL: identify VReplication retrying failure, terminate migration #9958

Closed

3 tasks

Merge branch 'main' into vrepl-unrecoverable-error

4ad4837

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

only bail out if workflow type is onlineddl

de77dc8

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach added 2 commits March 24, 2022 13:53

expect, then actual, not the other way around

c1b0210

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

added tests for failure scenarios

06ed1e1

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

shlomi-noach marked this pull request as ready for review March 24, 2022 11:54

shlomi-noach requested review from harshit-gangal, mattlord and systay as code owners March 24, 2022 11:54

rohit-nayak-ps approved these changes Mar 24, 2022

View reviewed changes

shlomi-noach merged commit 8d4e8c3 into vitessio:main Mar 24, 2022

shlomi-noach deleted the vrepl-unrecoverable-error branch March 24, 2022 18:23

mattlord approved these changes Mar 24, 2022

View reviewed changes

shlomi-noach added release notes and removed release notes none labels May 24, 2022

shlomi-noach mentioned this pull request May 31, 2022

OnlineDDL/vrepl: more error codes, spatial types #10394

Merged

3 tasks

This was referenced Jun 21, 2022

VReplication: more unrecoverable error codes #10559

Merged

Backport to v14: VReplication: more unrecoverable error codes (#10559) #10562

Merged

Conversation

shlomi-noach commented Mar 24, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue(s)

Checklist

Uh oh!

rohit-nayak-ps commented Mar 24, 2022

Uh oh!

rohit-nayak-ps commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

rohit-nayak-ps commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

rohit-nayak-ps commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

rohit-nayak-ps commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

rohit-nayak-ps left a comment

Choose a reason for hiding this comment

Uh oh!

shlomi-noach commented Mar 24, 2022

Uh oh!

mattlord left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

shlomi-noach commented Mar 24, 2022 •

edited

Loading