Skip to content

Hang/crash in box.ctl.promote() #10836

@Totktonada

Description

@Totktonada

I'm running a testing scenario with 5 instances RAFT replicaset and two box.ctl.promote() calls. In attempts to stabilize it I run it in 16 parallel threads 100 times and I found that box.ctl.promote() sometimes yields infinitely.

test/config-luatest/reproduce_infinite_promotion_test.lua
local fiber = require('fiber')
local t = require('luatest')
local cbuilder = require('luatest.cbuilder')
local cluster = require('test.config-luatest.cluster')

local g = t.group()

g.before_all(cluster.init)
g.after_each(cluster.drop)
g.after_all(cluster.clean)

-- {{{ Helpers

local function reload(server)
    server:exec(function()
        local config = require('config')

        config:reload()
    end)
end

local function promote(server)
    server:exec(function()
        local log = require('log')

        t.helpers.retrying({timeout = 60}, function()
            -- XXX: The goal is to hang the box.ctl.promote() call
            -- below.
            log.info('DEBUG: before box.ctl.promote()')
            box.ctl.promote()
            log.info('DEBUG: after box.ctl.promote()')
            t.helpers.retrying({timeout = 1}, function()
                assert(box.info.ro == false)
            end)
        end)
    end)
end

-- }}} Helpers

g.test_try_to_reproduce = function(g)
    local config = cbuilder:new()
        :set_replicaset_option('replication.failover', 'election')
        :add_instance('i-001', {})
        :add_instance('i-002', {})
        :add_instance('i-003', {})
        :add_instance('i-004', {})
        :add_instance('i-005', {})
        :config()

    local cluster = cluster.new(g, config)
    cluster:start()

    -- Promote i-002 to leaders.
    promote(cluster['i-002'])

    -- Load config without i-004 and i-005 on i-001.
    local config_2 = cbuilder:new(config)
        :set_replicaset_option('instances.i-004', nil)
        :set_replicaset_option('instances.i-005', nil)
        :config()
    cluster:sync(config_2)
    reload(cluster['i-001'])

    -- Make i-001 RW. Wait in a fiber to catch if it hangs.
    local latch = fiber.channel(0)
    fiber.new(function()
        fiber.self():name('promote i-001')
        promote(cluster['i-001'])
        latch:put(true)
    end)
    local res = latch:get(60)
    t.assert_equals(res, true, 'timed out')
end

Run it as follows:

$ ./test/test-run.py $(yes config-luatest/reproduce_infinite_promotion_test.lua | head -n 100)

Logs from a failed attempt to run:

i-001 | 2024-11-20 00:53:03.538 [19199] main/141/main/test.config-luatest.reproduce_infinite_promotion_test reproduce_infinite_promotion_test.lua:29 I> DEBUG: before box.ctl.promote()
i-001 | 2024-11-20 00:53:03.539 [19199] main/141/main raft.c:891 I> RAFT: bump term to 6, follow
i-001 | 2024-11-20 00:53:03.540 [19199] main/141/main raft.c:929 I> RAFT: vote for 1, follow
i-001 | 2024-11-20 00:53:03.542 [19199] main/123/raft_worker raft.c:785 I> RAFT: persisted state {term: 6, vote: 1}
i-001 | 2024-11-20 00:53:03.543 [19199] main/123/raft_worker raft.c:874 I> RAFT: enter candidate state with 1 self vote
i-001 | 2024-11-20 00:53:03.551 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:507 I> RAFT: message {term: 6, state: follower} from 3
i-001 | 2024-11-20 00:53:03.552 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 6, state: follower} from 5
i-001 | 2024-11-20 00:53:03.553 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 6, vote: 4, state: follower} from 5
i-001 | 2024-11-20 00:53:03.553 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:583 I> RAFT: vote request is skipped - competing candidate
i-001 | 2024-11-20 00:53:03.554 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:507 I> RAFT: message {term: 6, vote: 1, state: follower} from 3
i-001 | 2024-11-20 00:53:03.554 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:594 I> RAFT: accepted vote for self, vote count is 2/3
i-001 | 2024-11-20 00:53:03.555 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 6, vote: 4, leader: 4, leader is seen: true, state: follower} from 5
i-001 | 2024-11-20 00:53:03.556 [19199] main/114/applier/replicator@unix/:./i-002.iproto raft.c:583 I> RAFT: vote request is skipped - competing candidate
i-001 | 2024-11-20 00:53:03.557 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:507 I> RAFT: message {term: 6, vote: 1, leader: 4, leader is seen: true, state: follower} from 3
i-001 | 2024-11-20 00:53:03.557 [19199] main/115/applier/replicator@unix/:./i-003.iproto raft.c:594 I> RAFT: accepted vote for self, vote count is 2/3
i-001 | 2024-11-20 00:54:03.565 [19199] main main.cc:201 C> got signal 15 - Terminated

Several times I even got an assertion failure:

i-001 | tarantool: ./src/box/raft.c:527: box_raft_try_promote: Assertion `!raft->is_candidate' failed.
(gdb) bt
#0  0x00007f3deb939e6c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f3deb8eb612 in raise () from /lib64/libc.so.6
#2  0x00007f3deb8d44ed in abort () from /lib64/libc.so.6
#3  0x00007f3deb8d4415 in __assert_fail_base.cold () from /lib64/libc.so.6
#4  0x00007f3deb8e4192 in __assert_fail () from /lib64/libc.so.6
#5  0x00005592fadf3344 in box_raft_try_promote () at /home/alex/p/tarantool-meta/tarantool/src/box/raft.c:527
#6  0x00005592fae038ce in box_promote () at /home/alex/p/tarantool-meta/tarantool/src/box/box.cc:3095
#7  0x00005592faf22afb in lbox_ctl_promote (L=0x408141d8) at /home/alex/p/tarantool-meta/tarantool/src/box/lua/ctl.c:104
<...>

Coredump: tarantool-core-N-202411200205-rade.tar.xz.tar.gz (sorry for the double compression: I need .tar.zx to fit to the 25MiB file size limit and .tar.gz to pass the file name filter).

Logs (from the run that produced the coredump): 004_config-luatest.tar.gz.

The part of the log that looks relevant:

$ grep ^i-001 004_config-luatest/run.log
<...>
i-001 | 2024-11-20 01:37:59.875 [5153] main/128/main raft.c:891 I> RAFT: bump term to 6, follow
i-001 | 2024-11-20 01:37:59.875 [5153] main/128/main raft.c:929 I> RAFT: vote for 3, follow
i-001 | 2024-11-20 01:37:59.895 [5153] main/115/applier/replicator@unix/:./i-003.iproto box.cc:592 I> leaving orphan mode
i-001 | 2024-11-20 01:37:59.895 [5153] main/134/applier/replicator@unix/:./i-002.iproto raft.c:507 I> RAFT: message {term: 5, vote: 4, leader: 4, state: leader} from 4
i-001 | 2024-11-20 01:37:59.895 [5153] main/134/applier/replicator@unix/:./i-002.iproto raft.c:522 I> RAFT: the message is ignored due to outdated term - current term is 6
i-001 | 2024-11-20 01:37:59.895 [5153] main/134/applier/replicator@unix/:./i-002.iproto box.cc:592 I> leaving orphan mode
i-001 | tarantool: ./src/box/raft.c:527: box_raft_try_promote: Assertion `!raft->is_candidate' failed.
<...>

Snaps/xlogs (from the run that produced the coredump): var.tar.gz.

Metadata

Metadata

Assignees

Labels

3.2Target is 3.2 and all newer release/master branchesbugSomething isn't workingcrashraftRAFT protocol

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions