
qsync: replication can get stuck when replication_synchro_queue_max_size is reached #11836

@Gerold103

Description

Here is a reproducer:

--
-- Instance 1
--
-- Step 1
--
fiber = require('fiber')
log = require('log')
data = string.rep('a', 1000)
box.cfg{
    listen = 3313,
    replication = {3313, 3314},
    replication_synchro_timeout = 1000,
    election_mode = 'candidate',
    replication_synchro_quorum = 3,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
box.ctl.promote()
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')
a = box.schema.create_space('test2')
_ = a:create_index('pk')
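Step 3 below calls make_txn_fiber, which the reproducer never defines. A plausible definition (assumed, not taken from the original report) that starts a fiber doing one insert into the given space:

```lua
-- Assumed helper: start a fiber that performs one (possibly blocking) insert.
-- With replication_synchro_quorum = 3 and only 2 instances, the fiber blocks
-- inside replace() until Step 5 lowers the quorum to 2.
function make_txn_fiber(space, key)
    return fiber.create(function()
        log.info('txn %d: inserting into %s', key, space.name)
        space:replace{key, data}
        log.info('txn %d: committed in %s', key, space.name)
    end)
end
```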
--
-- Step 3
--
f1 = make_txn_fiber(s, 1)
f2 = make_txn_fiber(a, 2)
f3 = make_txn_fiber(a, 3)
--
-- Step 5
--
box.cfg{replication_synchro_quorum = 2}
-- Observe in the logs that the txns from f1, f2, f3 are committed.
-- Also see that box.info.synchro.queue.len is zero.


--
-- Instance 2
--
-- Step 2
--
fiber = require('fiber')
log = require('log')
json = require('json')
box.cfg{
    listen = 3314,
    replication = {3313, 3314},
    election_mode = 'voter',
    replication_synchro_queue_max_size = 1000,
    read_only = true,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
function make_on_replace(space_name)
    return function(old, new)
        log.info(('%s: %s -> %s'):format(space_name, json.encode(old), json.encode(new)))
    end
end
s = box.space.test
_ = s:on_replace(make_on_replace(s.name))
a = box.space.test2
_ = a:on_replace(make_on_replace(a.name))
--
-- Step 4
--
-- Observe the logs. Only the txns from the f1 and f2 fibers are logged, as if f3's txn never arrived.
--
-- Step 6
--
-- Observe that nothing has happened in the logs and box.info.synchro.queue.len
-- is still 1. That means f1's txn is in the queue, f2's is volatile in the limbo, and f3's was never received.
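To make the observations in Steps 4 and 6 concrete, the stuck state can be inspected from the Instance 2 console (the box.info fields below are real; the replica ID 1 for the master's upstream is an assumption that depends on the setup):

```lua
-- Run on Instance 2.
-- Synchro queue length: stays at 1 even after Step 5 commits everything
-- on the master.
box.info.synchro.queue.len
-- Applier state towards the master: reported as 'follow' even though
-- nothing new gets applied.
box.info.replication[1].upstream.status
```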

If I remove the replication_timeout setting or set it to 1 second, then shortly after Step 3 Instance 2 gets disconnected on timeout.

It seems that when replication_synchro_queue_max_size is reached, the applier fiber on the replica somehow gets blocked and stops sending heartbeats and acks. That doesn't look right.

Note that this is not just a temporary state until the WAL entries get written. The replica gets genuinely stuck. Even after everything is committed on the master, the applier on the replica still can't read or apply anything: neither new txns nor CONFIRMs for the older ones.

This looks tricky to fix. On one hand, the applier can't apply new synchro txns, because they can't be submitted into the limbo while the limbo's max size is reached. The fiber executing those txns simply blocks waiting for limbo space. On the other hand, the applier will never free the limbo space, because it isn't even reading CONFIRMs. Which means whatever is occupying its limbo right now will never go away.

Perhaps the applier should somehow keep reading the socket even when it can't submit new txns to the TX thread. And/or it might have to forcefully submit the limbo entries, ignoring the limbo queue max size. The latter makes sense: if a limbo entry has reached the replica, it already went to the WAL on the master, which in turn means there is little point in keeping it out of the replica's WAL.
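A rough sketch of the second option, as Lua pseudocode (the real logic lives in the server's C limbo/applier code; limbo_append and all its fields are made up for illustration): entries arriving from the master bypass the size check, while local txns still wait for space.

```lua
-- Pseudocode sketch, not the actual implementation.
function limbo_append(limbo, entry, is_remote)
    if not is_remote then
        -- Local txns still respect replication_synchro_queue_max_size.
        while limbo.size + entry.approx_len > limbo.max_size do
            limbo.has_space:wait()
        end
    end
    -- A remote entry already made it into the master's WAL, so throttling
    -- it here cannot save any space in the long run. Let it through even
    -- if the queue is over the limit.
    limbo.size = limbo.size + entry.approx_len
    table.insert(limbo.queue, entry)
end
```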

Might be related to #11837.

Labels

- 3.3 (Target is 3.3 and all newer release/master branches)
- bug (Something isn't working)
- qsync replication
- replication
