replication: fix race in accessing vclock by applier and tx threads#7092

Merged
kyukhin merged 1 commit into tarantool:master from sergepetrenko:tx-applier-vclock-race
May 18, 2022

Conversation

@sergepetrenko
Collaborator

When the applier ack writer was moved to the applier thread, it was overlooked
that it would start sharing `replicaset.vclock` between the two threads.

This could lead to the following replication errors on master:

relay//102/main:reader V> Got a corrupted row:
relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01

Such a row has an incorrectly encoded vclock: `81 01 09 02 01`.
When the writer fiber encoded the vclock length (`81`), there was only one
vclock component, `{1: 9}`, but by the time it iterated over the
components, another WAL write had been reported to the TX thread, which
added a second vclock component: `{1: 9, 2: 1}`.

Let's fix the race by delivering a copy of the current replicaset vclock to
the applier thread.

Closes #7089
Part-of tarantool/tarantool-qa#166

NO_DOC=internal fix
NO_CHANGELOG=internal fix
NO_TEST=hard to test
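The encoding inconsistency described above can be reproduced in a deterministic, single-threaded sketch. This is Python for brevity (Tarantool's actual code is C), and the real race is between the tx and applier threads; the `bump_after_header` hook below merely stands in for the concurrently reported WAL write:

```python
def encode_vclock_racy(vclock, bump_after_header=None):
    # msgpack fixmap header: 0x80 | number of key/value pairs.
    # The pair count is read BEFORE the components are iterated.
    out = bytearray([0x80 | len(vclock)])
    if bump_after_header:
        vclock.update(bump_after_header)  # simulated concurrent WAL write
    for replica_id, lsn in sorted(vclock.items()):
        out += bytes([replica_id, lsn])   # positive fixints for small values
    return bytes(out)

def encode_vclock_fixed(vclock, bump_after_header=None):
    snapshot = dict(vclock)               # the fix: encode a private copy
    out = bytearray([0x80 | len(snapshot)])
    if bump_after_header:
        vclock.update(bump_after_header)  # the bump hits the original only
    for replica_id, lsn in sorted(snapshot.items()):
        out += bytes([replica_id, lsn])
    return bytes(out)

# Reproduces the corrupted tail from the log above: the 0x81 header
# promises one pair, but two pairs follow.
print(encode_vclock_racy({1: 9}, {2: 1}).hex())   # → 8101090201
print(encode_vclock_fixed({1: 9}, {2: 1}).hex())  # → 810109
```

The racy encoder emits exactly the `81 01 09 02 01` tail from the corrupted row; the snapshot version stays consistent no matter when the bump lands.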

A note on why such a seemingly slow fix was chosen:

I was afraid that adding another vclock_copy on each WAL write would slow Tarantool down considerably.
That's why I conducted some tests:

I ran Tarantool with wal_mode='none' and wal_mode='write'; the server was connected to 10 remote masters and had 10 applier threads. So each WAL write (even with wal_mode='none') resulted in 10 extra vclock copies, which should be quite slow.
But I couldn't notice any difference in RPS.
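For a rough sense of why the extra copy turned out to be negligible: a vclock is a small fixed-size structure (at most 32 components), so one more copy per WAL write is a tiny memcpy next to the cost of the write itself. A hypothetical Python sketch of that back-of-envelope check (sizes and names are illustrative stand-ins, not Tarantool's C code):

```python
import time

VCLOCK_MAX = 32  # a Tarantool vclock holds at most 32 components

def bench_vclock_copies(n):
    """Time n copies of a full vclock; return seconds elapsed and the last copy."""
    vclock = list(range(VCLOCK_MAX))  # stand-in for struct vclock
    start = time.perf_counter()
    copy = None
    for _ in range(n):
        copy = vclock[:]              # analogous to vclock_copy(): a small memcpy
    return time.perf_counter() - start, copy

elapsed, copy = bench_vclock_copies(1_000_000)
print(f"{elapsed:.3f}s for 1e6 copies")
assert copy == list(range(VCLOCK_MAX))  # the copy is a faithful snapshot
```

On typical hardware a million such copies complete well under a second, which is consistent with the full benchmark below showing no measurable RPS difference.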

Here's the test:
Start 10 nodes in a fullmesh, insert something on each of them so that their vclock becomes huge:
tarantool swarm.lua 2, tarantool swarm.lua 3, ... tarantool swarm.lua 11

swarm.lua
local id = tonumber(arg[1])
assert(id >= 2 and id <= 31)

box.cfg{
    listen = 3300 + id,
    replication = {
        3302,
        3303,
        3304,
        3305,
        3306,
        3307,
        3308,
        3309,
        3310,
        3311,
    },
    wal_dir = './'..id,
    memtx_dir = './'..id,
    background = true,
    log = './'..id..'.log',
    pid_file = './'..id..'.pid',
}

box.once('bootstrap', function()
    box.schema.user.grant('guest', 'replication')
end)

box.space._schema:replace{"smth to bump vclock "..id}

Now start the testing Tarantool:
tarantool -i speedtest.lua
and issue test(num_fibers_to_test) as many times as you want.

speedtest.lua
local fiber = require('fiber')

box.cfg{
    listen = 3301,
    replication = {
        3302,
        3303,
        3304,
        3305,
        3306,
        3307,
        3308,
        3309,
        3310,
        3311,
    },
    replication_threads = 10,
    -- Disable WAL on a node to notice slightest differences in TX thread
    -- performance. It's okay to replicate TO a node with disabled WAL. You only
    -- can't replicate FROM it.
    --wal_mode = 'None',
    log='./1.log',
}

box.schema.space.create('test', {if_not_exists = true})
box.space.test:create_index('pk', {if_not_exists = true})
box.snapshot()

local function replace_func(num_iters)
    for i = 1, num_iters do
        box.space.test:replace{i, i}
    end
end

function test(num_fibers)
    local fibers = {}
    local num_replaces = 1e6
    local num_iters = num_replaces / num_fibers
    local start = fiber.time()
    for i = 1, num_fibers do
        local fib = fiber.new(replace_func, num_iters)
        fib:set_joinable(true)
        table.insert(fibers, fib)
    end
    assert(#fibers == num_fibers, "Fibers created successfully")
    for _, fib in pairs(fibers) do
        fib:join()
    end
    fiber.yield()
    local dt = fiber.time() - start
    return dt, num_replaces / dt
end

On my machine both the patched and the canonical Tarantool do around 1,300,000 RPS ±1% when wal_mode is 'none'.
When wal_mode is 'write', both do around 920,000 RPS ±1%.

@Totktonada
Contributor

(I like to read your investigations: you transform mysterious effects into easy-to-see sequences of conditions and actions.)

We have a perf/ directory in the project. How about placing the performance test there, with a Makefile to run it? Later we can include the test in regular performance tracking.

@sergepetrenko
Collaborator Author

@Totktonada, thanks, glad to hear that!

Yep, I'll try to put together a more comprehensive test and commit it together with the fix.

@sergepetrenko sergepetrenko force-pushed the tx-applier-vclock-race branch 2 times, most recently from 350ce26 to a23353b Compare May 11, 2022 14:23
@sergepetrenko
Collaborator Author

@Totktonada, thanks for the suggestion!
I've put the perf test into the perf/gh-7089-vclock-copy folder.
Please check whether such a perf test is OK for you.

@Gerold103 Gerold103 (Collaborator) left a comment

Why is it that NO_CHANGELOG=internal fix? Didn't we already release beta2 with this bug? I mean, the unreleased/ folder of changelogs doesn't have this feature. It means it was already flushed into a released .md file, wasn't it?

As for the fix - I don't like that we now copy the quite expensive vclock in yet another place, but I can't propose anything better. Good work anyway; I suppose debugging this was pretty intricate.

@sergepetrenko sergepetrenko force-pushed the tx-applier-vclock-race branch from a23353b to f00b911 Compare May 16, 2022 14:24
@sergepetrenko
Collaborator Author

@Gerold103, @Totktonada, thanks for your reviews!

Here are the incremental changes:

diff
diff --git a/changelogs/unreleased/gh-7089-applier-vclock-race.md b/changelogs/unreleased/gh-7089-applier-vclock-race.md
new file mode 100644
index 000000000..48fa18385
--- /dev/null
+++ b/changelogs/unreleased/gh-7089-applier-vclock-race.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a case when replication was broken on master side with
+  `ER_INVALID_MSGPACK` (gh-7089).
diff --git a/perf/gh-7089-vclock-copy/Makefile b/perf/gh-7089-vclock-copy/Makefile
index 0d86e50e9..0d2dda5fd 100644
--- a/perf/gh-7089-vclock-copy/Makefile
+++ b/perf/gh-7089-vclock-copy/Makefile
@@ -1,3 +1,5 @@
+SHELL = /bin/bash
+
 test: test_none stop
 
 test_wal: test_write stop
@@ -9,10 +11,10 @@ start_swarm:
 	for i in {2..11}; do ./swarm.lua $${i}; done
 
 test_none: clean dirs start_swarm
-	./speedtest.lua none
+	./speedtest.lua none 1
 
 test_write: clean dirs start_swarm
-	./speedtest.lua write
+	./speedtest.lua write 1
 
 stop stop1:
 	for i in {2..11}; do pkill -F $${i}/$${i}.pid || true; done
diff --git a/perf/gh-7089-vclock-copy/speedtest.lua b/perf/gh-7089-vclock-copy/speedtest.lua
index 5cdee61ab..4c4b5e275 100755
--- a/perf/gh-7089-vclock-copy/speedtest.lua
+++ b/perf/gh-7089-vclock-copy/speedtest.lua
@@ -9,10 +9,15 @@ local mode = arg[1] or 'none'
 assert(mode == 'write' or mode == 'none',
        "mode should be either 'write' or 'none'")
 
+local id = tonumber(arg[2]) or 1
+
+assert(id < 2 or id > 11,
+       'The id should be outside of the occupied range [2, 11]')
+
 local fiber = require('fiber')
 
 box.cfg{
-    listen = 3301,
+    listen = 3300 + id,
     replication = {
         3302,
         3303,
@@ -30,8 +35,8 @@ box.cfg{
     -- performance. It's okay to replicate TO a node with disabled WAL. You only
     -- can't replicate FROM it.
     wal_mode = mode,
-    work_dir = '1',
-    log = '1.log',
+    work_dir = tostring(id),
+    log = id..'.log',
 }
 
 box.schema.space.create('test', {if_not_exists = true})

@sergepetrenko
Collaborator Author

@Gerold103

Why is it that NO_CHANGELOG=internal fix? Didn't we already release beta2 with this bug? I mean, the unreleased/ folder of changelogs doesn't have this feature. It means it was already flushed into a released .md file, wasn't it?

I just thought it wasn't worth adding a changelog. But I don't mind, so I added one.

As for the fix - I don't like that we now copy the quite expensive vclock in yet another place, but I can't propose anything better.

Me neither.

@Totktonada Totktonada (Contributor) left a comment

I guess I was added as a reviewer to look at the benchmark.

TBH, I didn't try it by hand, but visually it looks quite obvious how to run it and obtain results, which is the main thing. I don't want to make you wait for a thorough review from me, so I'll approve as 'looks visually OK'.

When the applier ack writer was moved to the applier thread, it was overlooked
that it would start sharing replicaset.vclock between the two threads.

This could lead to the following replication errors on master:

 relay//102/main:reader V> Got a corrupted row:
 relay//102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01

Such a row has an incorrectly-encoded vclock: `81 01 09 02 01`.
When writer fiber encoded the vclock length (`81`), there was only one
vclock component, `{1: 9}`, but by the time it iterated over the
components, another WAL write had been reported to the TX thread, which
added a second vclock component: `{1: 9, 2: 1}`.

Let's fix the race by delivering a copy of current replicaset vclock to
the applier thread.

Also add a perf test to the perf/ directory.

Closes tarantool#7089
Part-of tarantool/tarantool-qa#166

NO_DOC=internal fix
NO_TEST=hard to test
@sergepetrenko sergepetrenko force-pushed the tx-applier-vclock-race branch from f00b911 to 16a83e1 Compare May 17, 2022 11:03
@kyukhin kyukhin merged commit ddec704 into tarantool:master May 18, 2022
@sergepetrenko sergepetrenko deleted the tx-applier-vclock-race branch May 31, 2022 11:35
Development

Successfully merging this pull request may close these issues.

Race when accessing replicaset.vclock from applier thread
