Skip to content

VReplication based online DDL: mini stress test CI#7492

Merged
shlomi-noach merged 65 commits intovitessio:masterfrom
planetscale:vreplication-online-ddl-mini-stress-test
Mar 1, 2021
Merged

VReplication based online DDL: mini stress test CI#7492
shlomi-noach merged 65 commits intovitessio:masterfrom
planetscale:vreplication-online-ddl-mini-stress-test

Conversation

@shlomi-noach
Copy link
Copy Markdown
Contributor

Description

#7419 introduces VReplication via Online DDL, but the tests in #7419 only validate that the schema change applies; they do not test migration with traffic.

This PR addsa CI job which introduces a mini stress workload.

An "app" applies insert/update/delete statements onto a table while altering that table. The app keeps count of all successful statements. When the app stop writes, it compares its own tracking with the table data. This validates that VReplication does no corrupt data (eg by losing rows).

There's actually an initial failure, which I'm looking into.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…parser

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…t on this branch's radar

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…ap query

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
…writes to table, waiting for updated pos, renaming tables, releasing table, releasing locks

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach mentioned this pull request Feb 15, 2021
8 tasks
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach shlomi-noach marked this pull request as ready for review February 15, 2021 10:12
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

The way this test works:

  • creates a table
  • starts an "app" that writes to the table
  • app uses a concurrency of 5 connections
  • each connection randomly inserts or updates or deletes from the table
  • the query result (error, rows affected) tells the app everything it needs about the changes metrics, of which it keeps track
  • we run a schema migration that modifies some column
  • we wait for migration to complete
  • we stop the "app"
  • we expect app's metrics to be consistent with modified table. We conclude that no data was lost/corrupted in the migration.

We run a cycle of 5 such tests. They all must pass 100%. No fluctuations allowed. No retries.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

This test works most of the time, but here's one local failure I got. Like 1 in 50?

Feb 15 12:14:05 I0215 12:14:05.497705  891374 onlineddl_vrepl_mini_stress_test.go:420] initTable begin
Feb 15 12:14:15 I0215 12:14:15.989486  891374 onlineddl_vrepl_mini_stress_test.go:441] initTable complete
Feb 15 12:14:15 I0215 12:14:15.989729  891374 onlineddl_vrepl_mini_stress_test.go:403] Running multiple connections
Feb 15 12:14:15 I0215 12:14:15.989793  891374 onlineddl_vrepl_mini_stress_test.go:369] Running single connection
Feb 15 12:14:15 I0215 12:14:15.989859  891374 onlineddl_vrepl_mini_stress_test.go:369] Running single connection
Feb 15 12:14:15 I0215 12:14:15.989905  891374 onlineddl_vrepl_mini_stress_test.go:369] Running single connection
Feb 15 12:14:15 I0215 12:14:15.990049  891374 onlineddl_vrepl_mini_stress_test.go:369] Running single connection
Feb 15 12:14:15 I0215 12:14:15.990119  891374 onlineddl_vrepl_mini_stress_test.go:369] Running single connection
Feb 15 12:14:15 # Generated UUID (for debug purposes):
Feb 15 12:14:15 <8ef160eb_6f76_11eb_a128_f875a4d24e90>
Feb 15 12:14:35 I0215 12:14:35.999625  891374 vtctlclient_process.go:158] Executing vtctlclient with command: vtctlclient -server localhost:20406 OnlineDDL ks show recent
Feb 15 12:14:36 # 'vtctlclient OnlineDDL show recent' output (for debug purposes):
Feb 15 12:14:36 +------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
Feb 15 12:14:36 |      Tablet      | shard | mysql_schema | mysql_table | ddl_action |            migration_uuid            | strategy |  started_timestamp  | completed_timestamp | migration_status |
Feb 15 12:14:36 +------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
Feb 15 12:14:36 | zone1-0000003268 |     0 | vt_ks        | stress_test | alter      | 2b75c838_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:11:33 | 2021-02-15 12:11:36 | complete         |
Feb 15 12:14:36 | zone1-0000003268 |     0 | vt_ks        | stress_test | alter      | 3d1e10f9_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:12:01 | 2021-02-15 12:12:04 | complete         |
Feb 15 12:14:36 | zone1-0000003268 |     0 | vt_ks        | stress_test | alter      | 54570d97_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:12:44 | 2021-02-15 12:12:49 | complete         |
Feb 15 12:14:36 | zone1-0000003268 |     0 | vt_ks        | stress_test | alter      | 677f0501_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:13:14 | 2021-02-15 12:13:16 | complete         |
Feb 15 12:14:36 | zone1-0000003268 |     0 | vt_ks        | stress_test | alter      | 7b841b50_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:13:46 | 2021-02-15 12:13:48 | complete         |
Feb 15 12:14:36 | zone1-0000003268 |     0 | vt_ks        | stress_test | alter      | 8ef160eb_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:14:18 | 2021-02-15 12:14:22 | complete         |
Feb 15 12:14:36 | zone1-0000003272 |     1 | vt_ks        | stress_test | alter      | 2b75c838_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:11:34 | 2021-02-15 12:11:36 | complete         |
Feb 15 12:14:36 | zone1-0000003272 |     1 | vt_ks        | stress_test | alter      | 3d1e10f9_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:12:01 | 2021-02-15 12:12:04 | complete         |
Feb 15 12:14:36 | zone1-0000003272 |     1 | vt_ks        | stress_test | alter      | 54570d97_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:12:44 | 2021-02-15 12:12:49 | complete         |
Feb 15 12:14:36 | zone1-0000003272 |     1 | vt_ks        | stress_test | alter      | 677f0501_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:13:14 | 2021-02-15 12:13:16 | complete         |
Feb 15 12:14:36 | zone1-0000003272 |     1 | vt_ks        | stress_test | alter      | 7b841b50_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:13:46 | 2021-02-15 12:13:48 | complete         |
Feb 15 12:14:36 | zone1-0000003272 |     1 | vt_ks        | stress_test | alter      | 8ef160eb_6f76_11eb_a128_f875a4d24e90 | online   | 2021-02-15 12:14:18 | 2021-02-15 12:14:22 | complete         |
Feb 15 12:14:36 +------------------+-------+--------------+-------------+------------+--------------------------------------+----------+---------------------+---------------------+------------------+
Feb 15 12:14:36 
Feb 15 12:14:36 I0215 12:14:36.030609  891374 vtctlclient_process.go:158] Executing vtctlclient with command: vtctlclient -server localhost:20406 VExec ks.8ef160eb_6f76_11eb_a128_f875a4d24e90 select migration_status, message from _vt.schema_migrations
Feb 15 12:14:36 # 'vtctlclient VExec' output (for debug purposes):
Feb 15 12:14:36 +------------------+------------------+---------+
Feb 15 12:14:36 |      Tablet      | migration_status | message |
Feb 15 12:14:36 +------------------+------------------+---------+
Feb 15 12:14:36 | zone1-0000003268 | complete         |         |
Feb 15 12:14:36 | zone1-0000003272 | complete         |         |
Feb 15 12:14:36 +------------------+------------------+---------+
Feb 15 12:14:36 
Feb 15 12:14:36 I0215 12:14:36.050251  891374 onlineddl_vrepl_mini_stress_test.go:411] Running multiple connections: done
Feb 15 12:14:36 I0215 12:14:36.050265  891374 onlineddl_vrepl_mini_stress_test.go:413] Cancelling single connection
Feb 15 12:14:36 I0215 12:14:36.070869  891374 onlineddl_vrepl_mini_stress_test.go:380] Terminating single connection
Feb 15 12:14:36 I0215 12:14:36.070982  891374 onlineddl_vrepl_mini_stress_test.go:413] Cancelling single connection
Feb 15 12:14:36 I0215 12:14:36.080158  891374 onlineddl_vrepl_mini_stress_test.go:380] Terminating single connection
Feb 15 12:14:36 I0215 12:14:36.080266  891374 onlineddl_vrepl_mini_stress_test.go:413] Cancelling single connection
Feb 15 12:14:36 I0215 12:14:36.082813  891374 onlineddl_vrepl_mini_stress_test.go:380] Terminating single connection
Feb 15 12:14:36 I0215 12:14:36.082834  891374 onlineddl_vrepl_mini_stress_test.go:413] Cancelling single connection
Feb 15 12:14:36 I0215 12:14:36.106955  891374 onlineddl_vrepl_mini_stress_test.go:380] Terminating single connection
Feb 15 12:14:36 I0215 12:14:36.107029  891374 onlineddl_vrepl_mini_stress_test.go:413] Cancelling single connection
Feb 15 12:14:36 I0215 12:14:36.114428  891374 onlineddl_vrepl_mini_stress_test.go:380] Terminating single connection
Feb 15 12:14:36 I0215 12:14:36.114490  891374 onlineddl_vrepl_mini_stress_test.go:416] All connections cancelled
Feb 15 12:14:38 I0215 12:14:38.050508  891374 onlineddl_vrepl_mini_stress_test.go:447] writeMetrics: WriteMetrics: inserts=2389, updates=1068, deletes=195, inserts-deletes=2194, updates-deletes=873
Feb 15 12:14:38 I0215 12:14:38.058688  891374 onlineddl_vrepl_mini_stress_test.go:459] testSelectTableMetrics, row: map[num_rows:INT64(2193) sum_updates:INT64(873)]
Feb 15 12:14:38     onlineddl_vrepl_mini_stress_test.go:468: 
Feb 15 12:14:38         	Error Trace:	onlineddl_vrepl_mini_stress_test.go:468
Feb 15 12:14:38         	            				onlineddl_vrepl_mini_stress_test.go:218
Feb 15 12:14:38         	Error:      	Not equal: 
Feb 15 12:14:38         	            	expected: 2193
Feb 15 12:14:38         	            	actual  : 2194
Feb 15 12:14:38         	Test:       	TestSchemaChange/ALTER_TABLE_with_workload_5/5
Feb 15 12:14:38 --- FAIL: TestSchemaChange (208.65s)
Feb 15 12:14:38     --- PASS: TestSchemaChange/create_schema (0.21s)
Feb 15 12:14:38     --- PASS: TestSchemaChange/workload_without_ALTER_TABLE (9.56s)
Feb 15 12:14:38     --- PASS: TestSchemaChange/ALTER_TABLE_without_workload (29.98s)
Feb 15 12:14:38     --- PASS: TestSchemaChange/ALTER_TABLE_with_workload_1/5 (31.67s)
Feb 15 12:14:38     --- FAIL: TestSchemaChange/ALTER_TABLE_with_workload_2/5 (38.95s)
Feb 15 12:14:38     --- PASS: TestSchemaChange/ALTER_TABLE_with_workload_3/5 (32.10s)
Feb 15 12:14:38     --- PASS: TestSchemaChange/ALTER_TABLE_with_workload_4/5 (33.62s)
Feb 15 12:14:38     --- FAIL: TestSchemaChange/ALTER_TABLE_with_workload_5/5 (32.56s)
Feb 15 12:14:38 FAIL
Feb 15 12:14:48 FAIL	vitess.io/vitess/go/test/endtoend/onlineddl_vrepl_stress	253.832s
Feb 15 12:14:48 FAIL

I'm not sure why the "updates" comparison failed: whether this is a bug in vreplication DDL or a bug in how the mockup app computes what should be the "updates" expected value, given some concurrency. Looking into.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

what's interesting this is the only kind of failure I saw, and I saw it three times, and in all three times the problem is that the test met 1 more update than expected.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

shlomi-noach commented Feb 15, 2021

Got another failure, this time counting inserts, so it's not specific to update's. On the other hand, deletes affects both, so it might be the culprit.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

The good news: got this same error in https://github.com/vitessio/vitess/pull/7492/checks?check_run_id=1903552159. Quote:

== RUN   TestSchemaChange/workload_without_ALTER_TABLE
I0215 13:15:21.165637   16689 onlineddl_vrepl_mini_stress_test.go:468] initTable begin
I0215 13:15:26.464866   16689 onlineddl_vrepl_mini_stress_test.go:489] initTable complete
I0215 13:15:26.464897   16689 onlineddl_vrepl_mini_stress_test.go:495] WriteMetrics: inserts=1615, updates=408, deletes=74, inserts-deletes=1541, updates-deletes=334,
insertsAttempts=2048, insertsFailures=0, insertsNoops=433, inserts=1615,
updatesAttempts=1024, updatesFailures=0, updatesNoops=616, updates=408,
deletesAttempts=1023, deletesFailures=0, deletesNoops=949, deletes=74,
I0215 13:15:26.471534   16689 onlineddl_vrepl_mini_stress_test.go:507] testSelectTableMetrics, row: map[num_rows:INT64(1540) sum_updates:INT64(333)]
    onlineddl_vrepl_mini_stress_test.go:516: 
        	Error Trace:	onlineddl_vrepl_mini_stress_test.go:516
        	            				onlineddl_vrepl_mini_stress_test.go:221
        	Error:      	Not equal: 
        	            	expected: 1541
        	            	actual  : 1540
        	Test:       	TestSchemaChange/workload_without_ALTER_TABLE
    onlineddl_vrepl_mini_stress_test.go:517: 
        	Error Trace:	onlineddl_vrepl_mini_stress_test.go:517
        	            				onlineddl_vrepl_mini_stress_test.go:221
        	Error:      	Not equal: 
        	            	expected: 334
        	            	actual  : 333
        	Test:       	TestSchemaChange/workload_without_ALTER_TABLE

I'm very happy I introduced this test: it tests the workload without actually running an ALTER TABLE. Good! That means the test's "app" logic is flawed. While it does not prove it, it seems like ALTER TABLE is sound. But we'll know better when we fix the app logic.

@shlomi-noach
Copy link
Copy Markdown
Contributor Author

BTW I switched "expected" and "actual". "actual" is what's found in the database. "expected" is from the app's notes.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Exhausting my ideas. Not sure where this deviation of data comes from. It's unrelated to online DDL because it happens even without running a migration.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Exhausting my ideas.

Meh. It was a synchronization issue. Nothing to see here.

Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

Am happy with these tests now. Am going to merge this PR into #7419 to make #7419 more complete.

@rohit-nayak-ps
Copy link
Copy Markdown
Member

The related commits are part of #7419 and this PR can be closed, right?

@shlomi-noach shlomi-noach merged commit 61454be into vitessio:master Mar 1, 2021
@shlomi-noach shlomi-noach deleted the vreplication-online-ddl-mini-stress-test branch March 1, 2021 05:38
@shlomi-noach
Copy link
Copy Markdown
Contributor Author

The related commits are part of #7419 and this PR can be closed, right?

Yes, that's true! This PR is incorporated into #7419 and now implicitly merged since #7419 is merged.

@askdba askdba added this to the v10.0 milestone Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants