Skip to content

VReplication: fail on unrecoverable errors#9973

Merged
shlomi-noach merged 12 commits intovitessio:mainfrom
planetscale:vrepl-unrecoverable-error
Mar 24, 2022
Merged

VReplication: fail on unrecoverable errors#9973
shlomi-noach merged 12 commits intovitessio:mainfrom
planetscale:vrepl-unrecoverable-error

Conversation

@shlomi-noach
Copy link
Copy Markdown
Contributor

@shlomi-noach shlomi-noach commented Mar 24, 2022

Description

In this PR we identify a specific set of MySQL errors for which VReplication now fails.
Normally, when vreplication/controller sees an error, it backs off and attempts a new vreplication stream, again and again and again.

Now, we identify some errors for which it just doesn't make sense to continue. There errors are caught by vplayer and vcopier, and propagated all the way up to the controller.

If the controller sees these specific errors, it:

  • sets state to Error (binlogplayer.BlpError)
  • ensures the message indicates the error
  • quits

The specific errors are found in vreplication/utils.go, and generally include:

  • Attempt to assign an invalid value (value too long, invalid character set, invalid enum value, NULL to non-nullable column)
  • Unique key violation
  • No default value for column

These failures can only originate with incompatible data types or schemas in source & target tables.

In addition, OnlineDDL executor now intercepts binlogplayer.BlpError state in _vt.vreplication and proceeds to fail the migration. Previously (#9958) we would just time out. This reduces migration failure response from up to 10min to up to 1min.

Related Issue(s)

Initial commit does not include tests yet. I'll add the tests in onlineddl/vrepl_suite so that Online DDL tests the VReplication changes. It makes most sense, because OnlineDDL is the only flow where we even expect these type of failures.

Tests TBD

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…vrepl-unrecoverable-error

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach added Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication release notes none labels Mar 24, 2022
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@rohit-nayak-ps
Copy link
Copy Markdown
Member

The initial Error state related changes look good.

I am a little wary about adding the new state since we might have state-conditional code that might confuse the Error state with another one. But you should be able to surface any issues while manually testing any of the error conditions. I will run some workflows manually on my end, injecting artificial errors, as well once you have the PR done.

For now I identified that we need to handle the new state at these points, by checking additionally for the Error state:

vreplication/controller.go:100

	// Nothing to do if replication is stopped.
	if params["state"] == binlogplayer.BlpStopped {
		ct.cancel = func() {}
		close(ct.done)
		return ct, nil
	}

vreplication/engine.go:758

		if qr.Rows[0][1].ToString() == binlogplayer.BlpStopped {
			return fmt.Errorf("replication has stopped at %v before reaching position %v, message: %s", current, mPos, qr.Rows[0][2].ToString())
		}

vreplicator.go:210

		// If any of the operations below changed state to Stopped, we should return.
		if settings.State == binlogplayer.BlpStopped {
			return nil
		}

Also the (VExec) Workflow commands:

workflow/server.go:401

		switch {
		case strings.Contains(strings.ToLower(stream.Message), "error"):
			stream.State = "Error"
		case stream.State == "Running" && len(stream.CopyStates) > 0:
			stream.State = "Copying"
		case stream.State == "Running" && int64(time.Now().Second())-timeUpdatedSeconds > 10:
			stream.State = "Lagging"
		}

vexec.go:646

func updateState(message, state string, cs []copyState, timeUpdated int64) string {
	if strings.Contains(strings.ToLower(message), "error") {
		state = "Error"
	} else if state == "Running" && len(cs) > 0 {
		state = "Copying"
	} else if state == "Running" && int64(time.Now().Second())-timeUpdated > 10 /* seconds */ {
		state = "Lagging"
	}
	return state
}

@rohit-nayak-ps
Copy link
Copy Markdown
Member

You also need to cherry-pick the fix to the unit tests ...

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

You also need to cherry-pick the fix to the unit tests ...

Ah, yes, thank you!

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

@rohit-nayak-ps I'm thinking, we can do this in smaller steps, and only break out of the controller when these errors are encountered AND this is an OnlineDDL workflow type. What do you think?

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@rohit-nayak-ps
Copy link
Copy Markdown
Member

@rohit-nayak-ps I'm thinking, we can do this in smaller steps, and only break out of the controller when these errors are encountered AND this is an OnlineDDL workflow type. What do you think?

Are you saying you will only check and go into the Error state for OnlineDDLs?

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Are you saying you will only check and go into the Error state for OnlineDDLs?

Yes, I propose we will only go into the error state, as well as completely quit the workflow -- only in online ddl. Does this increase confidence in the change?

@rohit-nayak-ps
Copy link
Copy Markdown
Member

Yes, I propose we will only go into the error state, as well as completely quit the workflow -- only in online ddl. Does this increase confidence in the change?

Good idea. Let's do that.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Good idea. Let's do that.

@rohit-nayak-ps done. Do I still need to worry about the new state in all aforementioned locations?

I'm also thinking, if this is too complex for now, we can add a column called unrecoverable_error (bool) in _vt.vreplication and update that column, instead of updating the state.

What do you think?

@rohit-nayak-ps
Copy link
Copy Markdown
Member

@rohit-nayak-ps done. Do I still need to worry about the new state in all aforementioned locations?

I'm also thinking, if this is too complex for now, we can add a column called unrecoverable_error (bool) in _vt.vreplication and update that column, instead of updating the state.

If you have tested one or two error situations in OnlineDDL then this approach should be fine. It is something we have been planning to add to VReplication workflows anyway and this will be a great start towards it.

We can update the other locations when we expand error handling to other workflows.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

sounds good! Yes, tested locally many situations and shortly formally adding tests.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Oh I should note this PR extends #9958 ; so it looks a bit bigger than it would when #9958 is merged.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Tests added!

@shlomi-noach shlomi-noach marked this pull request as ready for review March 24, 2022 11:54
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Ready to review.

Copy link
Copy Markdown
Member

@rohit-nayak-ps rohit-nayak-ps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

OK then!

@shlomi-noach shlomi-noach merged commit 8d4e8c3 into vitessio:main Mar 24, 2022
@shlomi-noach shlomi-noach deleted the vrepl-unrecoverable-error branch March 24, 2022 18:23
Copy link
Copy Markdown
Member

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm late the party but this LGTM! Thanks!

DAlperin pushed a commit to DAlperin/vitess that referenced this pull request Mar 25, 2022
* Online DDL: identify VReplication retrying failure, terminate migration

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* add VReplicationWorkflowType

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* add workflow_type column to _vt.vreplication

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* online DDL executor sets workflow_type value in _vt.vreplication

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* setting strict sql_mode for OnlineDDL type workflows

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* VReplication: fail on unrecoverable errors

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* mistakenly hid variable

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* only bail out if workflow type is onlineddl

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* expect, then actual, not the other way around

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>

* added tests for failure scenarios

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VReplication Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants