Promote can get stuck in transition to election leader #11574

@CuriousGeorgiy

Description

Steps to reproduce

  1. Set up two replicas, with replica 1 owning the synchronous queue and a synchronous quorum of 3. Enable manual elections on replica 2.
-- replica 1
console = require('console')

box.cfg{
    listen = 3301,
    replication = {3301, 3302},
    replication_timeout = 0.1,
    replication_synchro_timeout = 600,
    replication_synchro_quorum = 3,
}

box.once('bootstrap', function()
    box.schema.user.grant('guest', 'super')

    box.schema.create_space('s', {is_sync = true}):create_index('p')

    box.ctl.promote()
end)

console.start()

os.exit()

-- replica 2
console = require('console')

box.cfg{
    listen = 3302,
    read_only = true,
    replication = {3301, 3302},
    replication_synchro_quorum = 3,
    replication_timeout = 0.1,
    election_mode = 'manual',
}

console.start()

os.exit()
  2. Add a request to the synchronous replication queue owned by replica 1. After that, enable manual elections on replica 1 and promote it to leader.
-- replica 1

box.atomic({wait = 'submit'}, function() box.space.s:replace{0} end)
box.cfg{election_mode = 'manual'}
box.ctl.promote()

Actual behavior

The raft worker on replica 1 waits forever for a quorum of 3, which can never be gathered in a replica set of two instances:

tarantool/src/box/box.cc

Lines 2688 to 2705 in 5921a93

static int
box_quorum_on_ack_f(struct trigger *trigger, void *event)
{
	struct replication_ack *ack = (struct replication_ack *)event;
	struct box_quorum_trigger *t = (struct box_quorum_trigger *)trigger;
	int64_t new_lsn = vclock_get(ack->vclock, t->replica_id);
	int64_t old_lsn = vclock_get(&t->vclock, ack->source);
	if (new_lsn < t->target_lsn || old_lsn >= t->target_lsn)
		return 0;
	vclock_follow(&t->vclock, ack->source, new_lsn);
	++t->ack_count;
	if (t->ack_count >= t->quorum) {
		fiber_wakeup(t->waiter);
		trigger_clear(trigger);
	}
	return 0;
}

  • It is impossible to demote replica 1, since it is stuck in the promote.
  • Changing the quorum on replica 1 does not have any effect.
  • Cancelling the raft_worker on replica 1 does not have any effect.

As a result, the promote on replica 1 is stuck.
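
To make the hang concrete, here is a minimal standalone model of the ack-counting condition quoted above. It is a sketch, not Tarantool code: `model_quorum_trigger`, `model_on_ack`, and `model_run` are hypothetical names, the vclock is reduced to a plain array, and it assumes that each instance (including the waiting one) can contribute at most one ack toward the target LSN.

```c
#include <stdbool.h>
#include <stdint.h>

enum { MODEL_SOURCE_MAX = 32 };

/* Hypothetical, simplified stand-in for struct box_quorum_trigger. */
struct model_quorum_trigger {
	/* Last counted LSN per ack source (plays the role of t->vclock). */
	int64_t acked_lsn[MODEL_SOURCE_MAX];
	/* LSN that must be confirmed by a quorum of instances. */
	int64_t target_lsn;
	/* Number of distinct sources that confirmed target_lsn. */
	int ack_count;
	/* Quorum value stored in the trigger. */
	int quorum;
	/* Set where the real code calls fiber_wakeup() and trigger_clear(). */
	bool woken;
};

/* Applies the same filtering and counting as the quoted callback. */
static void
model_on_ack(struct model_quorum_trigger *t, uint32_t source, int64_t new_lsn)
{
	if (t->woken)
		return; /* The trigger has been cleared. */
	int64_t old_lsn = t->acked_lsn[source];
	/* Skip acks below the target and sources that were already counted. */
	if (new_lsn < t->target_lsn || old_lsn >= t->target_lsn)
		return;
	t->acked_lsn[source] = new_lsn;
	++t->ack_count;
	if (t->ack_count >= t->quorum)
		t->woken = true;
}

/*
 * Feeds several rounds of acks from `instance_count` instances into the
 * trigger and reports whether the waiter would ever be woken up.
 */
bool
model_run(uint32_t instance_count, int quorum)
{
	struct model_quorum_trigger t = {
		.target_lsn = 10,
		.quorum = quorum,
	};
	for (int round = 0; round < 3; round++) {
		for (uint32_t source = 1; source <= instance_count; source++)
			model_on_ack(&t, source, t.target_lsn + round);
	}
	return t.woken;
}
```

Under these assumptions, `model_run(2, 3)` returns `false`: repeated acks from the same two instances are filtered out by the `old_lsn >= t->target_lsn` check, so the counter stops at 2 and never reaches 3. `model_run(3, 3)` returns `true`, which matches the expectation that the wait can only finish once a third instance acknowledges the target LSN.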

Expected behavior

The promote on replica 1 is not stuck.

Metadata

Labels: 3.2, bug, raft
