-
Notifications
You must be signed in to change notification settings - Fork 403
Closed
Labels
3.2Target is 3.2 and all newer release/master branchesTarget is 3.2 and all newer release/master branchesbugSomething isn't workingSomething isn't workingcrashraftRAFT protocolRAFT protocol
Description
I'm running a testing scenario with 5 instances RAFT replicaset and two box.ctl.promote() calls. In attempts to stabilize it I run it in 16 parallel threads 100 times and I found that box.ctl.promote() sometimes yields infinitely.
test/config-luatest/reproduce_infinite_promotion_test.lua
local fiber = require('fiber')
local t = require('luatest')
local cbuilder = require('luatest.cbuilder')
local cluster = require('test.config-luatest.cluster')
local g = t.group()
g.before_all(cluster.init)
g.after_each(cluster.drop)
g.after_all(cluster.clean)
-- {{{ Helpers
local function reload(server)
server:exec(function()
local config = require('config')
config:reload()
end)
end
local function promote(server)
server:exec(function()
local log = require('log')
t.helpers.retrying({timeout = 60}, function()
-- XXX: The goal is to hang the box.ctl.promote() call
-- below.
log.info('DEBUG: before box.ctl.promote()')
box.ctl.promote()
log.info('DEBUG: after box.ctl.promote()')
t.helpers.retrying({timeout = 1}, function()
assert(box.info.ro == false)
end)
end)
end)
end
-- }}} Helpers
g.test_try_to_reproduce = function(g)
local config = cbuilder:new()
:set_replicaset_option('replication.failover', 'election')
:add_instance('i-001', {})
:add_instance('i-002', {})
:add_instance('i-003', {})
:add_instance('i-004', {})
:add_instance('i-005', {})
:config()
local cluster = cluster.new(g, config)
cluster:start()
-- Promote i-002 to leaders.
promote(cluster['i-002'])
-- Load config without i-004 and i-005 on i-001.
local config_2 = cbuilder:new(config)
:set_replicaset_option('instances.i-004', nil)
:set_replicaset_option('instances.i-005', nil)
:config()
cluster:sync(config_2)
reload(cluster['i-001'])
-- Make i-001 RW. Wait in a fiber to catch if it hangs.
local latch = fiber.channel(0)
fiber.new(function()
fiber.self():name('promote i-001')
promote(cluster['i-001'])
latch:put(true)
end)
local res = latch:get(60)
t.assert_equals(res, true, 'timed out')
endRun it as follows:
$ ./test/test-run.py $(yes config-luatest/reproduce_infinite_promotion_test.lua | head -n 100)Logs from a failed attempt to run:
i-001 | 2024-11-20 00:53:03.538 [19199] main/141/main/test.config-luatest.reproduce_infinite_promotion_test reproduce_infinite_promotion_test.lua:29 I> DEBUG: before box.ctl.promote()
i-001 | 2024-11-20 00:53:03.539 [19199] main/141/main raft.c:891 I> RAFT: bump term to 6, follow
i-001 | 2024-11-20 00:53:03.540 [19199] main/141/main raft.c:929 I> RAFT: vote for 1, follow
i-001 | 2024-11-20 00:53:03.542 [19199] main/123/raft_worker raft.c:785 I> RAFT: persisted state {term: 6, vote: 1}
i-001 | 2024-11-20 00:53:03.543 [19199] main/123/raft_worker raft.c:874 I> RAFT: enter candidate state with 1 self vote
i-001 | 2024-11-20 00:53:03.551 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:507 I> RAFT: message {term: 6, state: follower} from 3
i-001 | 2024-11-20 00:53:03.552 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 6, state: follower} from 5
i-001 | 2024-11-20 00:53:03.553 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 6, vote: 4, state: follower} from 5
i-001 | 2024-11-20 00:53:03.553 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:583 I> RAFT: vote request is skipped - competing candidate
i-001 | 2024-11-20 00:53:03.554 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:507 I> RAFT: message {term: 6, vote: 1, state: follower} from 3
i-001 | 2024-11-20 00:53:03.554 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:594 I> RAFT: accepted vote for self, vote count is 2/3
i-001 | 2024-11-20 00:53:03.555 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 6, vote: 4, leader: 4, leader is seen: true, state: follower} from 5
i-001 | 2024-11-20 00:53:03.556 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:583 I> RAFT: vote request is skipped - competing candidate
i-001 | 2024-11-20 00:53:03.557 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:507 I> RAFT: message {term: 6, vote: 1, leader: 4, leader is seen: true, state: follower} from 3
i-001 | 2024-11-20 00:53:03.557 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:594 I> RAFT: accepted vote for self, vote count is 2/3
i-001 | 2024-11-20 00:54:03.565 [19199] main main.cc:201 C> got signal 15 - TerminatedSeveral times I even got an assertion failure:
i-001 | tarantool: ./src/box/raft.c:527: box_raft_try_promote: Assertion `!raft->is_candidate' failed.(gdb) bt
#0 0x00007f3deb939e6c in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007f3deb8eb612 in raise () from /lib64/libc.so.6
#2 0x00007f3deb8d44ed in abort () from /lib64/libc.so.6
#3 0x00007f3deb8d4415 in __assert_fail_base.cold () from /lib64/libc.so.6
#4 0x00007f3deb8e4192 in __assert_fail () from /lib64/libc.so.6
#5 0x00005592fadf3344 in box_raft_try_promote () at /home/alex/p/tarantool-meta/tarantool/src/box/raft.c:527
#6 0x00005592fae038ce in box_promote () at /home/alex/p/tarantool-meta/tarantool/src/box/box.cc:3095
#7 0x00005592faf22afb in lbox_ctl_promote (L=0x408141d8) at /home/alex/p/tarantool-meta/tarantool/src/box/lua/ctl.c:104
<...>Coredump: tarantool-core-N-202411200205-rade.tar.xz.tar.gz (sorry for the double compression: I need .tar.zx to fit to the 25MiB file size limit and .tar.gz to pass the file name filter).
Logs (from the run that produced the coredump): 004_config-luatest.tar.gz.
The part of the log that looks relevant:
$ grep ^i-001 004_config-luatest/run.log
<...>
i-001 | 2024-11-20 01:37:59.875 [5153] main/128/main raft.c:891 I> RAFT: bump term to 6, follow
i-001 | 2024-11-20 01:37:59.875 [5153] main/128/main raft.c:929 I> RAFT: vote for 3, follow
i-001 | 2024-11-20 01:37:59.895 [5153] main/115/applier/replicator@unix/:./i-003.iproto box.cc:592 I> leaving orphan mode
i-001 | 2024-11-20 01:37:59.895 [5153] main/134/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 5, vote: 4, leader: 4, state: leader} from 4
i-001 | 2024-11-20 01:37:59.895 [5153] main/134/applier/replicator@unix/:./i-002.iproto raft.c:522 I> RAFT: the message is ignored due to outdated term - current term is 6
i-001 | 2024-11-20 01:37:59.895 [5153] main/134/applier/replicator@unix/:./i-002.iproto box.cc:592 I> leaving orphan mode
i-001 | tarantool: ./src/box/raft.c:527: box_raft_try_promote: Assertion `!raft->is_candidate' failed.
<...>Snaps/xlogs (from the run that produced the coredump): var.tar.gz.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
3.2Target is 3.2 and all newer release/master branchesTarget is 3.2 and all newer release/master branchesbugSomething isn't workingSomething isn't workingcrashraftRAFT protocolRAFT protocol