replication: fix race in accessing vclock by applier and tx threads#7092

Merged
kyukhin merged 1 commit into tarantool:master from sergepetrenko:tx-applier-vclock-race
May 18, 2022

Conversation

@sergepetrenko
Collaborator

When the applier ack writer was moved to the applier thread, it was overlooked
that it would start sharing `replicaset.vclock` between the two threads.

This could lead to the following replication errors on master:

relay//102/main:reader V> Got a corrupted row:
relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01

Such a row has an incorrectly encoded vclock: `81 01 09 02 01`.
When the writer fiber encoded the vclock length (`81`), there was only one
vclock component, `{1: 9}`, but by the time it iterated over the
components, another WAL write had been reported to the TX thread, which
added a second vclock component: `{1: 9, 2: 1}`.

Let's fix the race by delivering a copy of the current replicaset vclock to
the applier thread.

Closes #7089
Part-of tarantool/tarantool-qa#166

NO_DOC=internal fix
NO_CHANGELOG=internal fix
NO_TEST=hard to test
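The encoding inconsistency described above can be reproduced in a deterministic, single-threaded sketch. This is Python for brevity (Tarantool's actual code is C), and the real race is between the tx and applier threads; the `bump_after_header` hook below merely stands in for the concurrently reported WAL write:

```python
def encode_vclock_racy(vclock, bump_after_header=None):
    # msgpack fixmap header: 0x80 | number of key/value pairs.
    # The pair count is read BEFORE the components are iterated.
    out = bytearray([0x80 | len(vclock)])
    if bump_after_header:
        vclock.update(bump_after_header)  # simulated concurrent WAL write
    for replica_id, lsn in sorted(vclock.items()):
        out += bytes([replica_id, lsn])   # positive fixints for small values
    return bytes(out)

def encode_vclock_fixed(vclock, bump_after_header=None):
    snapshot = dict(vclock)               # the fix: encode a private copy
    out = bytearray([0x80 | len(snapshot)])
    if bump_after_header:
        vclock.update(bump_after_header)  # the bump hits the original only
    for replica_id, lsn in sorted(snapshot.items()):
        out += bytes([replica_id, lsn])
    return bytes(out)

# Reproduces the corrupted tail from the log above: the 0x81 header
# promises one pair, but two pairs follow.
print(encode_vclock_racy({1: 9}, {2: 1}).hex())   # → 8101090201
print(encode_vclock_fixed({1: 9}, {2: 1}).hex())  # → 810109
```

The racy encoder emits exactly the `81 01 09 02 01` tail from the corrupted row; the snapshot version stays consistent no matter when the bump lands.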

A note on why such a seemingly slow fix was chosen:

I was afraid that adding another vclock_copy on each WAL write would slow Tarantool down considerably.
That's why I conducted some tests:

I ran Tarantool with wal_mode='none' and wal_mode='write'; the server was connected to 10 remote masters and had 10 applier threads. So each WAL write (even with wal_mode='none') resulted in 10 extra vclock copies, which should be quite slow.
But I couldn't notice any difference in RPS.
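For a rough sense of why the extra copy turned out to be negligible: a vclock is a small fixed-size structure (at most 32 components), so one more copy per WAL write is a tiny memcpy next to the cost of the write itself. A hypothetical Python sketch of that back-of-envelope check (sizes and names are illustrative stand-ins, not Tarantool's C code):

```python
import time

VCLOCK_MAX = 32  # a Tarantool vclock holds at most 32 components

def bench_vclock_copies(n):
    """Time n copies of a full vclock; return seconds elapsed and the last copy."""
    vclock = list(range(VCLOCK_MAX))  # stand-in for struct vclock
    start = time.perf_counter()
    copy = None
    for _ in range(n):
        copy = vclock[:]              # analogous to vclock_copy(): a small memcpy
    return time.perf_counter() - start, copy

elapsed, copy = bench_vclock_copies(1_000_000)
print(f"{elapsed:.3f}s for 1e6 copies")
assert copy == list(range(VCLOCK_MAX))  # the copy is a faithful snapshot
```

On typical hardware a million such copies complete well under a second, which is consistent with the full benchmark below showing no measurable RPS difference.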

Here's the test:
Start 10 nodes in a fullmesh, insert something on each of them so that their vclock becomes huge:
tarantool swarm.lua 2, tarantool swarm.lua 3, ... tarantool swarm.lua 11

swarm.lua
local id = tonumber(arg[1])
assert(id >= 2 and id <= 31)

box.cfg{
    listen = 3300 + id,
    replication = {
        3302,
        3303,
        3304,
        3305,
        3306,
        3307,
        3308,
        3309,
        3310,
        3311,
    },
    wal_dir = './'..id,
    memtx_dir = './'..id,
    background = true,
    log = './'..id..'.log',
    pid_file = './'..id..'.pid',
}

box.once('bootstrap', function()
    box.schema.user.grant('guest', 'replication')
end)

box.space._schema:replace{"smth to bump vclock "..id}

Now start the testing Tarantool:
tarantool -i speedtest.lua
and issue test(num_fibers_to_test) as many times as you want.

speedtest.lua
local fiber = require('fiber')

box.cfg{
    listen = 3301,
    replication = {
        3302,
        3303,
        3304,
        3305,
        3306,
        3307,
        3308,
        3309,
        3310,
        3311,
    },
    replication_threads = 10,
    -- Disable WAL on a node to notice slightest differences in TX thread
    -- performance. It's okay to replicate TO a node with disabled WAL. You only
    -- can't replicate FROM it.
    --wal_mode = 'None',
    log='./1.log',
}

box.schema.space.create('test', {if_not_exists = true})
box.space.test:create_index('pk', {if_not_exists = true})
box.snapshot()

local function replace_func(num_iters)
    for i = 1, num_iters do
        box.space.test:replace{i, i}
    end
end

function test(num_fibers)
    local fibers = {}
    local num_replaces = 1e6
    local num_iters = num_replaces / num_fibers
    local start = fiber.time()
    for i = 1, num_fibers do
        local fib = fiber.new(replace_func, num_iters)
        fib:set_joinable(true)
        table.insert(fibers, fib)
    end
    assert(#fibers == num_fibers, "Fibers created successfully")
    for _, fib in pairs(fibers) do
        fib:join()
    end
    fiber.yield()
    local dt = fiber.time() - start
    return dt, num_replaces / dt
end

On my machine both the patched and the canonical Tarantool do around 1,300,000 RPS ±1% when wal_mode is 'none'.
When wal_mode is 'write', both do around 920,000 RPS ±1%.

@Totktonada
Contributor

(I like to read your investigations: you transform mysterious effects into easy-to-see sequences of conditions and actions.)

We have a perf/ directory in the project. How about placing the performance test there, with a Makefile to run it? Later we can include the test in regular performance tracking.

@sergepetrenko
Collaborator Author

@Totktonada, thanks, glad to hear that!

Yep, I'll try to put together a more comprehensive test and commit it together with the fix.

@sergepetrenko sergepetrenko force-pushed the tx-applier-vclock-race branch 2 times, most recently from 350ce26 to a23353b Compare May 11, 2022 14:23
@sergepetrenko
Collaborator Author

@Totktonada, thanks for the suggestion!
I've put the perf test into the perf/gh-7089-vclock-copy folder.
Please check whether such a perf test is OK for you.

@Gerold103 Gerold103 (Collaborator) left a comment

Why is it that NO_CHANGELOG=internal fix? Didn't we already release beta2 with this bug? I mean, the unreleased/ folder of changelogs doesn't have this feature. It means it was already flushed into a released .md file, wasn't it?

As for the fix - I don't like that we now copy the quite expensive vclock in yet another place, but I can't propose anything better. Good work anyway; I suppose debugging this was pretty intricate.

@sergepetrenko sergepetrenko force-pushed the tx-applier-vclock-race branch from a23353b to f00b911 Compare May 16, 2022 14:24
@sergepetrenko
Collaborator Author

@Gerold103, @Totktonada, thanks for your reviews!

Here are the incremental changes:

diff
diff --git a/changelogs/unreleased/gh-7089-applier-vclock-race.md b/changelogs/unreleased/gh-7089-applier-vclock-race.md
new file mode 100644
index 000000000..48fa18385
--- /dev/null
+++ b/changelogs/unreleased/gh-7089-applier-vclock-race.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a case when replication was broken on master side with
+  `ER_INVALID_MSGPACK` (gh-7089).
diff --git a/perf/gh-7089-vclock-copy/Makefile b/perf/gh-7089-vclock-copy/Makefile
index 0d86e50e9..0d2dda5fd 100644
--- a/perf/gh-7089-vclock-copy/Makefile
+++ b/perf/gh-7089-vclock-copy/Makefile
@@ -1,3 +1,5 @@
+SHELL = /bin/bash
+
 test: test_none stop
 
 test_wal: test_write stop
@@ -9,10 +11,10 @@ start_swarm:
 	for i in {2..11}; do ./swarm.lua $${i}; done
 
 test_none: clean dirs start_swarm
-	./speedtest.lua none
+	./speedtest.lua none 1
 
 test_write: clean dirs start_swarm
-	./speedtest.lua write
+	./speedtest.lua write 1
 
 stop stop1:
 	for i in {2..11}; do pkill -F $${i}/$${i}.pid || true; done
diff --git a/perf/gh-7089-vclock-copy/speedtest.lua b/perf/gh-7089-vclock-copy/speedtest.lua
index 5cdee61ab..4c4b5e275 100755
--- a/perf/gh-7089-vclock-copy/speedtest.lua
+++ b/perf/gh-7089-vclock-copy/speedtest.lua
@@ -9,10 +9,15 @@ local mode = arg[1] or 'none'
 assert(mode == 'write' or mode == 'none',
        "mode should be either 'write' or 'none'")
 
+local id = tonumber(arg[2]) or 1
+
+assert(id < 2 or id > 11,
+       'The id should be outside of the occupied range [2, 11]')
+
 local fiber = require('fiber')
 
 box.cfg{
-    listen = 3301,
+    listen = 3300 + id,
     replication = {
         3302,
         3303,
@@ -30,8 +35,8 @@ box.cfg{
     -- performance. It's okay to replicate TO a node with disabled WAL. You only
     -- can't replicate FROM it.
     wal_mode = mode,
-    work_dir = '1',
-    log = '1.log',
+    work_dir = tostring(id),
+    log = id..'.log',
 }
 
 box.schema.space.create('test', {if_not_exists = true})

@sergepetrenko
Collaborator Author

@Gerold103

Why is it that NO_CHANGELOG=internal fix? Didn't we already release beta2 with this bug? I mean, the unreleased/ folder of changelogs doesn't have this feature. It means it was already flushed into a released .md file, wasn't it?

I just thought it wasn't worth adding a changelog. But I don't mind, so I added one.

As for the fix - I don't like that we now copy the quite expensive vclock in yet another place, but I can't propose anything better.

Me neither.

@Totktonada Totktonada (Contributor) left a comment

I guess I was added as a reviewer to look at the benchmark.

TBH, I didn't try it by hand, but visually it looks quite obvious how to run it and obtain results, which is the main thing. I don't want to make you wait for a thorough review from me, so I'll approve as 'looks visually OK'.

When the applier ack writer was moved to the applier thread, it was overlooked
that it would start sharing replicaset.vclock between the two threads.

This could lead to the following replication errors on master:

 relay//102/main:reader V> Got a corrupted row:
 relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01

Such a row has an incorrectly-encoded vclock: `81 01 09 02 01`.
When writer fiber encoded the vclock length (`81`), there was only one
vclock component, `{1: 9}`, but by the time it iterated over the
components, another WAL write had been reported to the TX thread, which
added a second vclock component: `{1: 9, 2: 1}`.

Let's fix the race by delivering a copy of current replicaset vclock to
the applier thread.

Also add a perf test to the perf/ directory.

Closes tarantool#7089
Part-of tarantool/tarantool-qa#166

NO_DOC=internal fix
NO_TEST=hard to test
@sergepetrenko sergepetrenko force-pushed the tx-applier-vclock-race branch from f00b911 to 16a83e1 Compare May 17, 2022 11:03
@kyukhin kyukhin merged commit ddec704 into tarantool:master May 18, 2022
@sergepetrenko sergepetrenko deleted the tx-applier-vclock-race branch May 31, 2022 11:35
Development

Successfully merging this pull request may close these issues.

Race when accessing replicaset.vclock from applier thread
