
qsync: replication can get stuck when replication_synchro_queue_max_size is reached #11836

@Gerold103

Description

Here is a reproducer:

--
-- Instance 1
--
-- Step 1
--
fiber = require('fiber')
log = require('log')
data = string.rep('a', 1000)
box.cfg{
    listen = 3313,
    replication = {3313, 3314},
    replication_synchro_timeout = 1000,
    election_mode = 'candidate',
    replication_synchro_quorum = 3,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
box.ctl.promote()
box.ctl.wait_rw()
box.schema.user.grant('guest', 'super')
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')
a = box.schema.create_space('test2')
_ = a:create_index('pk')
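Step 3 below calls make_txn_fiber, which the reproducer never defines. A plausible definition (assumed, not taken from the original report) that starts a fiber doing one insert into the given space:

```lua
-- Assumed helper: start a fiber that performs one (possibly blocking) insert.
-- With replication_synchro_quorum = 3 and only 2 instances, the fiber blocks
-- inside replace() until Step 5 lowers the quorum to 2.
function make_txn_fiber(space, key)
    return fiber.create(function()
        log.info('txn %d: inserting into %s', key, space.name)
        space:replace{key, data}
        log.info('txn %d: committed in %s', key, space.name)
    end)
end
```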
--
-- Step 3
--
f1 = make_txn_fiber(s, 1)
f2 = make_txn_fiber(a, 2)
f3 = make_txn_fiber(a, 3)
--
-- Step 5
--
box.cfg{replication_synchro_quorum = 2}
-- Observe in the logs that the txns from f1, f2, f3 are committed.
-- Also see that box.info.synchro.queue.len is zero.


--
-- Instance 2
--
-- Step 2
--
fiber = require('fiber')
log = require('log')
json = require('json')
box.cfg{
    listen = 3314,
    replication = {3313, 3314},
    election_mode = 'voter',
    replication_synchro_queue_max_size = 1000,
    read_only = true,
    replication_timeout = 1000,
    replication_reconnect_timeout = 1,
}
function make_on_replace(space_name)
    return function(old, new)
        log.info(('%s: %s -> %s'):format(space_name, json.encode(old), json.encode(new)))
    end
end
s = box.space.test
_ = s:on_replace(make_on_replace(s.name))
a = box.space.test2
_ = a:on_replace(make_on_replace(a.name))
--
-- Step 4
--
-- Observe the logs. Only the txns from the f1 and f2 fibers are logged, as if f3's txn never arrived.
--
-- Step 6
--
-- Observe that nothing has happened in the logs and box.info.synchro.queue.len
-- is still 1. That means f1's txn is in the queue, f2's is volatile in the limbo, and f3's was never received.
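To make the observations in Steps 4 and 6 concrete, the stuck state can be inspected from the Instance 2 console (the box.info fields below are real; the replica ID 1 for the master's upstream is an assumption that depends on the setup):

```lua
-- Run on Instance 2.
-- Synchro queue length: stays at 1 even after Step 5 commits everything
-- on the master.
box.info.synchro.queue.len
-- Applier state towards the master: reported as 'follow' even though
-- nothing new gets applied.
box.info.replication[1].upstream.status
```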

If I remove the replication_timeout setting or set it to 1 second, then shortly after Step 3 Instance 2 gets disconnected on timeout.

It seems that when replication_synchro_queue_max_size is reached, the applier fiber on the replica somehow gets blocked and stops sending heartbeats and acks. That doesn't look right.

Note that this is not just a temporary state until the WAL entries get written. The replica gets genuinely stuck. Even after everything is committed on the master, the applier on the replica still can't read or apply anything: neither new txns nor CONFIRMs for the older ones.

This looks tricky to fix. On one hand, the applier can't apply new synchro txns, because they can't be submitted into the limbo while the limbo's max size is reached. The fiber executing those txns simply blocks waiting for limbo space. On the other hand, the applier will never free the limbo space, because it isn't even reading CONFIRMs. Which means whatever is occupying its limbo right now will never go away.

Perhaps the applier should somehow keep reading the socket even when it can't submit new txns to the TX thread. And/or it might have to forcefully submit the limbo entries, ignoring the limbo queue max size. The latter makes sense: if a limbo entry has reached the replica, it already went to the WAL on the master, which in turn means there is little point in keeping it out of the replica's WAL.
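A rough sketch of the second option, as Lua pseudocode (the real logic lives in the server's C limbo/applier code; limbo_append and all its fields are made up for illustration): entries arriving from the master bypass the size check, while local txns still wait for space.

```lua
-- Pseudocode sketch, not the actual implementation.
function limbo_append(limbo, entry, is_remote)
    if not is_remote then
        -- Local txns still respect replication_synchro_queue_max_size.
        while limbo.size + entry.approx_len > limbo.max_size do
            limbo.has_space:wait()
        end
    end
    -- A remote entry already made it into the master's WAL, so throttling
    -- it here cannot save any space in the long run. Let it through even
    -- if the queue is over the limit.
    limbo.size = limbo.size + entry.approx_len
    table.insert(limbo.queue, entry)
end
```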

Might be related to #11837.

Labels

- 3.3 (Target is 3.3 and all newer release/master branches)
- bug (Something isn't working)
- qsync replication
- replication
