Skip to content

OnlineDDL/Vreplication stress test: investigating failures#8390

Merged
shlomi-noach merged 4 commits intovitessio:release-11.0from
planetscale:online-ddl-vrepl-stress-intense-failure
Jul 6, 2021
Merged

OnlineDDL/Vreplication stress test: investigating failures#8390
shlomi-noach merged 4 commits intovitessio:release-11.0from
planetscale:online-ddl-vrepl-stress-intense-failure

Conversation

@shlomi-noach
Copy link
Copy Markdown
Contributor

@shlomi-noach shlomi-noach commented Jun 27, 2021

Description update

Fixes a bug in OnlineDDL/VReplication where we would not wait for the correct position before cutting over. Now waiting for the correct position.

Description

In #8364 (comment) I found data inconsistency that didn't seem to be related to the PR.

In this PR we intensify the stress in the onlineddl_vrepl_stress test so as to surface the problem.

Related Issue(s)

#6926

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

…hidden issue; write op_order for better tracking

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach changed the title OnlineDDL/Vreplicatoin stress test: investigating failures OnlineDDL/Vreplication stress test: investigating failures Jun 27, 2021
…after tables

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Failures reproduce.

Importing test logic from #8364, we now utilize -vreplication-test-suite to get hold of both _before and _after tables, and to be able to compare them.

Sample diff error:

Jun 27 12:05:51         	            	
Jun 27 12:05:51         	            	Diff:
Jun 27 12:05:51         	            	--- Expected
Jun 27 12:05:51         	            	+++ Actual
Jun 27 12:05:51         	            	@@ -1520,3 +1520,3 @@
Jun 27 12:05:51         	            	 1848	c74b2355	23188		2021-06-27 12:05:39	0
Jun 27 12:05:51         	            	-1849	fec85a12	26177		2021-06-27 12:05:23	5
Jun 27 12:05:51         	            	+1849	fec85a12	22642		2021-06-27 12:05:23	4
Jun 27 12:05:51         	            	 1851	8a3372f7	25610		2021-06-27 12:05:40	0
Jun 27 12:05:51         	Test:       	TestSchemaChange/ALTER_TABLE_with_workload_4/5
Jun 27 12:05:51         	Messages:   	results mismatch: (
Jun 27 12:05:51         	            			SELECT * FROM stress_test_before order by id
Jun 27 12:05:51         	            		) and (
Jun 27 12:05:51         	            			SELECT * FROM stress_test_after order by id
Jun 27 12:05:51         	            		)

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

I tried to add an artificial sleep of 15 seconds right after tmClient.VReplicationWaitForPos(), in suspicion that the problem is premature cut-over. Test still fails.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

shlomi-noach commented Jun 27, 2021

Another example of failure. By values of max op_order we can see that the diffs are in the most recent changes. So it still seems to me like the cut-over is incorrect and premature.

Jun 27 12:23:45 # count rows in table (before): 3381
Jun 27 12:23:45 # count rows in table (after): 3380
Jun 27 12:23:45 # max op_order in table (before): 26040
Jun 27 12:23:45 # max op_order in table (after): 26023
...
Jun 27 12:23:45         	            	
Jun 27 12:23:45         	            	Diff:
Jun 27 12:23:45         	            	--- Expected
Jun 27 12:23:45         	            	+++ Actual
Jun 27 12:23:45         	            	@@ -998,3 +998,3 @@
Jun 27 12:23:45         	            	 1224	0a49779e	12529		2021-06-27 12:23:29	0
Jun 27 12:23:45         	            	-1225	e33e0fc0	26032		2021-06-27 12:23:21	4
Jun 27 12:23:45         	            	+1225	e33e0fc0	14044		2021-06-27 12:23:21	3
Jun 27 12:23:45         	            	 1226	1a67b96c	24218		2021-06-27 12:23:30	2
Jun 27 12:23:45         	            	@@ -1440,3 +1440,3 @@
Jun 27 12:23:45         	            	 1768	a3cdcce6	24559		2021-06-27 12:23:35	0
Jun 27 12:23:45         	            	-1769	7a87d98b	26037		2021-06-27 12:23:19	6
Jun 27 12:23:45         	            	+1769	7a87d98b	24557		2021-06-27 12:23:19	5
Jun 27 12:23:45         	            	 1771	dbe2ac43	25505		2021-06-27 12:23:35	0
Jun 27 12:23:45         	            	@@ -1696,3 +1696,3 @@
Jun 27 12:23:45         	            	 2070	b7a30ede	24098		2021-06-27 12:23:29	2
Jun 27 12:23:45         	            	-2071	0d0d47d3	26028		2021-06-27 12:23:19	5
Jun 27 12:23:45         	            	+2071	0d0d47d3	25586		2021-06-27 12:23:19	4
Jun 27 12:23:45         	            	 2072	d0acfc4b	17668		2021-06-27 12:23:31	0
Jun 27 12:23:45         	            	@@ -2064,3 +2064,3 @@
Jun 27 12:23:45         	            	 2506	e0a70cac	26023		2021-06-27 12:23:30	2
Jun 27 12:23:45         	            	-2507	1297a7e9	26033		2021-06-27 12:23:34	2
Jun 27 12:23:45         	            	+2507	1297a7e9	24553		2021-06-27 12:23:34	1
Jun 27 12:23:45         	            	 2508	6e8fd112	8913		2021-06-27 12:23:28	0
Jun 27 12:23:45         	            	@@ -2145,3 +2145,3 @@
Jun 27 12:23:45         	            	 2598	cc517af8	21892		2021-06-27 12:23:33	0
Jun 27 12:23:45         	            	-2599	3c4fa360	26035		2021-06-27 12:23:32	2
Jun 27 12:23:45         	            	+2599	3c4fa360	25830		2021-06-27 12:23:32	1
Jun 27 12:23:45         	            	 2600	b1e41f4e	23133		2021-06-27 12:23:19	3
Jun 27 12:23:45         	            	@@ -2191,3 +2191,3 @@
Jun 27 12:23:45         	            	 2650	4c10162c	25877		2021-06-27 12:23:29	2
Jun 27 12:23:45         	            	-2652	10c43e00	26027		2021-06-27 12:23:15	3
Jun 27 12:23:45         	            	+2652	10c43e00	20807		2021-06-27 12:23:15	2
Jun 27 12:23:45         	            	 2653	e36e3de6	22989		2021-06-27 12:23:34	0
Jun 27 12:23:45         	            	@@ -2330,3 +2330,3 @@
Jun 27 12:23:45         	            	 2826	8d91baec	24034		2021-06-27 12:23:30	1
Jun 27 12:23:45         	            	-2827	1d7ab285	26026		2021-06-27 12:23:34	2
Jun 27 12:23:45         	            	+2827	1d7ab285	25263		2021-06-27 12:23:34	1
Jun 27 12:23:45         	            	 2828	e889510a	24659		2021-06-27 12:23:27	5
Jun 27 12:23:45         	            	@@ -2361,3 +2361,3 @@
Jun 27 12:23:45         	            	 2859	49fc4bfa	9497		2021-06-27 12:23:26	3
Jun 27 12:23:45         	            	-2860	ba99b1cb	26040		2021-06-27 12:23:28	3
Jun 27 12:23:45         	            	+2860	ba99b1cb	24534		2021-06-27 12:23:28	2
Jun 27 12:23:45         	            	 2861	b8f96caa	24714		2021-06-27 12:23:17	3
Jun 27 12:23:45         	            	@@ -2455,3 +2455,2 @@
Jun 27 12:23:45         	            	 2983	7bd6a23b	23320		2021-06-27 12:23:34	0
Jun 27 12:23:45         	            	-2984	f512f195	26030		2021-06-27 12:23:35	0
Jun 27 12:23:45         	            	 2985	1b327c82	20772		2021-06-27 12:23:33	0
Jun 27 12:23:45         	            	@@ -2947,3 +2946,3 @@
Jun 27 12:23:45         	            	 3582	161b0866	21013		2021-06-27 12:23:30	1
Jun 27 12:23:45         	            	-3583	7725ed7b	26031		2021-06-27 12:23:28	1
Jun 27 12:23:45         	            	+3583	7725ed7b	9727		2021-06-27 12:23:28	0
Jun 27 12:23:45         	            	 3584	201060bf	24188		2021-06-27 12:23:17	5

each diff is in the op_order range of 26023..26040. Remember some of those events are deletes, some inserts, some updates. But the diffs seem to all be at the last few events.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

This diff, for example:

Jun 27 12:23:45         	            	@@ -2064,3 +2064,3 @@
Jun 27 12:23:45         	            	 2506	e0a70cac	26023		2021-06-27 12:23:30	2
Jun 27 12:23:45         	            	-2507	1297a7e9	26033		2021-06-27 12:23:34	2
Jun 27 12:23:45         	            	+2507	1297a7e9	24553		2021-06-27 12:23:34	1
Jun 27 12:23:45         	            	 2508	6e8fd112	8913		2021-06-27 12:23:28	0

shows a missing UPDATE on target table on row id=2507. Last column is updates and can only increment.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

shlomi-noach commented Jun 27, 2021

This diff:

Jun 27 12:23:45         	            	@@ -2455,3 +2455,2 @@
Jun 27 12:23:45         	            	 2983	7bd6a23b	23320		2021-06-27 12:23:34	0
Jun 27 12:23:45         	            	-2984	f512f195	26030		2021-06-27 12:23:35	0
Jun 27 12:23:45         	            	 2985	1b327c82	20772		2021-06-27 12:23:33	0

shows a missing INSERT, that took place on original table but never made it to target table.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

I believe I'm reading the incorrect pos for which vreplication needs to wait when cutting-over. To be continued.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

OK looks good to me now. Turns out I was indeed waiting for the wrong Pos. Now waiting for the conn.PrimaryPosition() (ie current GTID on the primary tablet) rather than relying on vreplicaiton.pos (completely wrong, not sure how I ended up using that).

Tests are now consistently passing.

The only material change in this PR is onlineddl/executor.go. All the rest is fancy test changes which give more visibility into test outputs.

@shlomi-noach shlomi-noach marked this pull request as ready for review June 28, 2021 06:12
shlomi-noach added a commit to planetscale/vitess that referenced this pull request Jun 28, 2021
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

This is an important fix that should make it into v11

@systay systay changed the base branch from main to release-11.0 July 6, 2021 10:01
Copy link
Copy Markdown
Collaborator

@systay systay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@shlomi-noach shlomi-noach merged commit b1629c4 into vitessio:release-11.0 Jul 6, 2021
@shlomi-noach shlomi-noach deleted the online-ddl-vrepl-stress-intense-failure branch July 6, 2021 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants