I've made some measurements of the Tarantool recovery process.
I made a snapshot with 16M tuples in one space with two simple indexes. Each tuple consisted of two numbers and a random string of random length (1-100).
Tarantool load time:
- hash indexes - 18 sec
- trees with hints - 10 sec
- trees w/o hints - 14 sec
See the attached flame graphs and logs for these cases.
I also attached the test itself (the first script creates the snapshot, the second loads it). Beware: it can delete local snaps/xlogs.
data-5787.zip
It seems that the recovery process can be significantly improved. I'd bet that it's possible to load that snapshot in 2 seconds.
I'm thinking about these steps, sorted by my perception of the effectiveness / implementation cost ratio:
- Increase `XLOG_READ_AHEAD` (#8108).
- Always use hints for building a tree index.
- Make index building more parallel - for example, parallel hint calculation and even parallel btree construction.
- Invent a bulk build procedure for hash indexes: now it's just sequential insertion.
- `ZSTD_decompressStream`: run it in another thread (pipeline) - we can afford that during recovery (box: read snapshot in a distinct thread to improve recovery performance #11861).
- Improve msgpack decoding. At least we should use `xrow_decode_dml` right inside `xrow_header_decode` (instead of `mp_check`).
- Challenge! Do something with field map building. I see that `memtx_tuple_new` costs about 18% of total time, mostly because of the field map, and I don't understand what's going on there for such a simple case. UPD: probably fixed by the `tuple_field_map_create_plain()` optimization in "box: speed up `tuple_new()` for large sparse tuples by 3.5x" #9793.
- Challenge! Don't use transactions during recovery. I mean all those `build_next` calls must be done directly. That needs some refactoring of the space/engine/index API.
- Make one smart multi-thread pass of radix sort by hints before using multi-thread quick sort. **
- Get rid of all `*_journal_*`, `add_redo`, etc. - all this stuff is not needed while reading a snapshot.
** The problem with multi-thread quick sort is that the first pass can be made by only one thread, the second by two, etc. I also analyzed the idea of using multi-thread radix sort instead of qsort entirely, but after some tests I came to the conclusion that radix sort is very cache-unfriendly and is usually limited by memory throughput. However, one (multi-thread? needs a test) pass of radix sort that splits the array into several (4-16) relatively sorted parts would be useful.
What else should be tested:
- more complex indexes - by strings, by several fields.
- more fields in tuple (e.g. 100).
- more complex tuple structure.
- recovery from WAL.
Measurements of WAL recovery would most likely lead to a similar issue.