VReplication: fail on unrecoverable errors#9973
Conversation
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…vrepl-unrecoverable-error Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
|
The initial Error state related changes look good. I am a little wary about adding the new state since we might have state-conditional code that might confuse the Error state with another one. But you should be able to surface any issues while manually testing any of the error conditions. I will run some workflows manually on my end, injecting artificial errors, as well once you have the PR done. For now I identified that we need to handle the new state at these points, by checking additionally for the Error state: vreplication/controller.go:100 vreplication/engine.go:758 vreplicator.go:210 Also the (VExec) Workflow commands: workflow/server.go:401 vexec.go:646 |
|
You also need to cherry-pick the fix to the unit tests ... |
Ah, yes, thank you! |
|
@rohit-nayak-ps I'm thinking, we can do this in smaller steps, and only break out of the |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Are you saying you will only check and go into the Error state for OnlineDDLs? |
Yes, I propose we will only go into the error state, as well as completely quit the workflow -- only in online ddl. Does this increase confidence in the change? |
Good idea. Let's do that. |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@rohit-nayak-ps done. Do I still need to worry about the new state in all aforementioned locations? I'm also thinking, if this is too complex for now, we can add a column called What do you think? |
If you have tested one or two error situations in OnlineDDL then this approach should be fine. It is something we have been planning to add to VReplication workflows anyway and this will be a great start towards it. We can update the other locations when we expand error handling to other workflows. |
|
sounds good! Yes, tested locally many situations and shortly formally adding tests. |
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
|
Tests added! |
|
Ready to review. |
|
OK then! |
mattlord
left a comment
There was a problem hiding this comment.
I'm late the party but this LGTM! Thanks!
* Online DDL: identify VReplication retrying failure, terminate migration Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * add VReplicationWorkflowType Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * add workflow_type column to _vt.vreplication Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * online DDL executor sets workflow_type value in _vt.vreplication Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * setting strict sql_mode for OnlineDDL type workflows Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * VReplication: fail on unrecoverable errors Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * mistakenly hid variable Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * only bail out if workflow type is onlineddl Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * expect, then actual, not the other way around Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * added tests for failure scenarios Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Description
In this PR we identify a specific set of MySQL errors for which VReplication now fails.
Normally, when vreplication/controller sees an error, it backs off and attempts a new vreplication stream, again and again and again.
Now, we identify some errors for which it just doesn't make sense to continue. There errors are caught by
vplayerandvcopier, and propagated all the way up to the controller.If the controller sees these specific errors, it:
Error(binlogplayer.BlpError)The specific errors are found in
vreplication/utils.go, and generally include:NULLto non-nullable column)These failures can only originate with incompatible data types or schemas in source & target tables.
In addition, OnlineDDL executor now intercepts
binlogplayer.BlpErrorstate in_vt.vreplicationand proceeds to fail the migration. Previously (#9958) we would just time out. This reduces migration failure response from up to10minto up to1min.Related Issue(s)
Initial commit does not include tests yet. I'll add the tests in
onlineddl/vrepl_suiteso that Online DDL tests the VReplication changes. It makes most sense, because OnlineDDL is the only flow where we even expect these type of failures.Tests TBD
Checklist