replication: fix race in accessing vclock by applier and tx threads#7092
Conversation
(I like to read your investigations: you transform mysterious effects into easy-to-see sequences of conditions and actions.) We have …

@Totktonada, thanks, glad to hear that! Yep, I'll try to put together a more comprehensive test and commit it together with the fix.
Force-pushed from 350ce26 to a23353b

@Totktonada, thanks for the suggestion!
Gerold103 left a comment:
Why is this NO_CHANGELOG=internal fix? Didn't we already release beta2 with this bug? I mean, the unreleased/ folder of changelogs doesn't have this feature, which means it was already flushed into a released .md file, wasn't it?
As for the fix: I don't like that we now copy a rather expensive vclock in yet another place, but I can't propose anything better. Good work anyway; I suppose debugging this was pretty intricate.
Force-pushed from a23353b to f00b911
@Gerold103, @Totktonada, thanks for your reviews! Here are the incremental changes:

diff --git a/changelogs/unreleased/gh-7089-applier-vclock-race.md b/changelogs/unreleased/gh-7089-applier-vclock-race.md
new file mode 100644
index 000000000..48fa18385
--- /dev/null
+++ b/changelogs/unreleased/gh-7089-applier-vclock-race.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a case when replication was broken on master side with
+ `ER_INVALID_MSGPACK` (gh-7089).
diff --git a/perf/gh-7089-vclock-copy/Makefile b/perf/gh-7089-vclock-copy/Makefile
index 0d86e50e9..0d2dda5fd 100644
--- a/perf/gh-7089-vclock-copy/Makefile
+++ b/perf/gh-7089-vclock-copy/Makefile
@@ -1,3 +1,5 @@
+SHELL = /bin/bash
+
test: test_none stop
test_wal: test_write stop
@@ -9,10 +11,10 @@ start_swarm:
for i in {2..11}; do ./swarm.lua $${i}; done
test_none: clean dirs start_swarm
- ./speedtest.lua none
+ ./speedtest.lua none 1
test_write: clean dirs start_swarm
- ./speedtest.lua write
+ ./speedtest.lua write 1
stop stop1:
for i in {2..11}; do pkill -F $${i}/$${i}.pid || true; done
diff --git a/perf/gh-7089-vclock-copy/speedtest.lua b/perf/gh-7089-vclock-copy/speedtest.lua
index 5cdee61ab..4c4b5e275 100755
--- a/perf/gh-7089-vclock-copy/speedtest.lua
+++ b/perf/gh-7089-vclock-copy/speedtest.lua
@@ -9,10 +9,15 @@ local mode = arg[1] or 'none'
assert(mode == 'write' or mode == 'none',
"mode should be either 'write' or 'none'")
+local id = tonumber(arg[2]) or 1
+
+assert(id < 2 or id > 11,
+ 'The id should be outside of the occupied range [2, 11]')
+
local fiber = require('fiber')
box.cfg{
- listen = 3301,
+ listen = 3300 + id,
replication = {
3302,
3303,
@@ -30,8 +35,8 @@ box.cfg{
-- performance. It's okay to replicate TO a node with disabled WAL. You only
-- can't replicate FROM it.
wal_mode = mode,
- work_dir = '1',
- log = '1.log',
+ work_dir = tostring(id),
+ log = id..'.log',
}
box.schema.space.create('test', {if_not_exists = true})
I just thought it wasn't worth adding a changelog. But I don't mind, so I added one.

Me neither.
Totktonada left a comment:
I guess I was added to the reviewers to look at the benchmark.
TBH, I didn't play around with it by hand, but visually it looks quite obvious how to run it and obtain results, which is the main thing. I don't want to make you wait for a decent review from me, so I'll approve as 'looks visually OK'.
When applier ack writer was moved to applier thread, it was overlooked
that it would start sharing replicaset.vclock between two threads.
This could lead to the following replication errors on master:
relay//102/main:reader V> Got a corrupted row:
relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01
Such a row has an incorrectly-encoded vclock: `81 01 09 02 01`.
When the writer fiber encoded the vclock length (`81`), there was only
one vclock component: {1: 9}, but by the moment of iterating over the
components, another WAL write had been reported to the TX thread, which
bumped the second vclock component: {1: 9, 2: 1}.
Let's fix the race by delivering a copy of current replicaset vclock to
the applier thread.
Also add a perf test to the perf/ directory.
Closes tarantool#7089
Part-of tarantool/tarantool-qa#166
NO_DOC=internal fix
NO_TEST=hard to test
Force-pushed from f00b911 to 16a83e1
When applier ack writer was moved to applier thread, it was overlooked
that it would start sharing replicaset.vclock between two threads.
This could lead to the following replication errors on master:
relay//102/main:reader V> Got a corrupted row:
relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01
Such a row has an incorrectly-encoded vclock: `81 01 09 02 01`.
When the writer fiber encoded the vclock length (`81`), there was only
one vclock component: {1: 9}, but by the moment of iterating over the
components, another WAL write had been reported to the TX thread, which
bumped the second vclock component: {1: 9, 2: 1}.
Let's fix the race by delivering a copy of current replicaset vclock to
the applier thread.
Closes #7089
Part-of tarantool/tarantool-qa#166
NO_DOC=internal fix
NO_CHANGELOG=internal fix
NO_TEST=hard to test
A note on why such a seemingly slow fix was chosen:
I was afraid that adding another `vclock_copy` on each WAL write would slow Tarantool down considerably. That's why I conducted some tests:
I ran Tarantool with wal_mode = 'none' and wal_mode = 'write'; the server was connected to 10 remote masters and had 10 applier threads. So each WAL write (even with wal_mode = 'none') would result in copying 10 extra vclocks, which should be quite slow.
But I couldn't notice any difference in RPS.
Here's the test:
Start 10 nodes in a full mesh and insert something on each of them so that their vclocks become huge:
`tarantool swarm.lua 2`, `tarantool swarm.lua 3`, ..., `tarantool swarm.lua 11` (see swarm.lua).
Now start the testing Tarantool: `tarantool -i speedtest.lua` and issue `test(num_fibers_to_test)` as many times as you want (see speedtest.lua).
On my machine both patched and canonical Tarantool have around 1,300,000 RPS ± 1% when wal_mode is 'none'.
When wal_mode is 'write', both Tarantools have around 920,000 RPS ± 1%.