Scheduler is not unthrottled by box.snapshot() #3519

@locker

The following script fails to make a snapshot even after the error injection that prevents vinyl from writing a run file is turned off (run on a debug build):

box.cfg{log_level = 4}

function snapshot()
    local status, err = pcall(box.snapshot)
    if not status then
        require('log').error(string.format('failed to make snapshot: %s', err))
    end
end

box.schema.space.create('test', {engine = 'vinyl'})
box.space.test:create_index('pk')
box.space.test:replace{1}

box.error.injection.set('ERRINJ_VY_RUN_WRITE', true)
snapshot()
box.error.injection.set('ERRINJ_VY_RUN_WRITE', false)
snapshot()

os.exit(0)

Output:

2018-07-11 11:21:17.831 [19068] main/101/test.lua C> Tarantool 1.10.1-179-g5b1757be
2018-07-11 11:21:17.831 [19068] main/101/test.lua C> log level 4
2018-07-11 11:21:17.868 [19068] main/103/vinyl.scheduler vy_scheduler.c:670 E> ER_INJECTION: Error injection 'vinyl dump'
2018-07-11 11:21:17.868 [19068] main/103/vinyl.scheduler vy_scheduler.c:908 E> 512/0: dump failed
2018-07-11 11:21:17.868 [19068] main/103/vinyl.scheduler vy_scheduler.c:1609 W> throttling scheduler for 1 second(s)
2018-07-11 11:21:17.872 [19068] main/101/test.lua vy_scheduler.c:585 E> vinyl checkpoint failed: Error injection 'vinyl dump'
2018-07-11 11:21:17.872 [19068] main/101/test.lua test.lua:6 E> failed to make snapshot: Error injection 'vinyl dump'
2018-07-11 11:21:17.872 [19068] main/101/test.lua vy_scheduler.c:549 E> cannot checkpoint vinyl, scheduler is throttled with: Error injection 'vinyl dump'
2018-07-11 11:21:17.872 [19068] main/101/test.lua test.lua:6 E> failed to make snapshot: Error injection 'vinyl dump'

This happens because box.snapshot() bails out immediately if it sees that the scheduler is throttled due to errors. So after a disk error, the user has to wait for up to a minute before a snapshot can be made, which is rather annoying. Throttling was introduced in the first place to avoid flooding the log in case of repeated disk errors. Throttling box.snapshot() is pointless, as it is called manually or on a schedule, so we should probably unthrottle the scheduler when it is called. Another reason to do that is tests: we had to introduce a new error injection (ERRINJ_VY_SCHED_TIMEOUT) to reduce the time during which the scheduler remains throttled, which is ugly and race-prone.

The issue is relevant to all currently maintained versions of Tarantool (1.9, 1.10, 2.0).
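Until the scheduler is unthrottled by box.snapshot() itself, a caller can only work around the behavior described above by retrying. The following is a minimal sketch of such a retry loop, not the fix proposed in this issue. The helper name snapshot_with_retry and its parameters are hypothetical; the snapshot and sleep functions are passed in so the helper does not depend on any particular runtime.

```lua
-- Hypothetical workaround sketch: retry a snapshot while the vinyl scheduler
-- may still be throttled after an earlier dump failure.
--   snapshot_fn  - function that raises on failure (e.g. box.snapshot)
--   sleep_fn     - function that suspends the caller for `delay` seconds
--   max_attempts - how many times to try before giving up
--   delay        - pause between attempts, in seconds
local function snapshot_with_retry(snapshot_fn, sleep_fn, max_attempts, delay)
    local err
    for attempt = 1, max_attempts do
        local ok
        ok, err = pcall(snapshot_fn)
        if ok then
            -- Report success together with the number of attempts used.
            return true, attempt
        end
        -- Do not sleep after the final failed attempt.
        if attempt < max_attempts then
            sleep_fn(delay)
        end
    end
    -- All attempts failed; return the last error.
    return false, err
end

-- Inside Tarantool this could be called as (assumed usage, not run here):
-- snapshot_with_retry(box.snapshot, require('fiber').sleep, 10, 1)
```

The one-second delay in the commented usage matches the initial throttling period shown in the log above. Since the issue says the wait can reach a minute, a caller may need more attempts or a longer delay after repeated failures.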
