Refactor and speed up recovery #5787

@alyapunov

Description

I've made some measurements of the tarantool recovery process.
I made a snapshot with 16M tuples in one space with two simple indexes. Each tuple consisted of two numbers and a random string of random length (1-100).
Tarantool load time:

  • hash indexes - 18 sec
  • trees with hints - 10 sec
  • trees w/o hints - 14 sec

See the attached flame graphs and logs for those cases.
I also attached the test itself (the first script creates the snapshot, the second loads it). Beware - it can delete local snaps/xlogs.
data-5787.zip

It seems that the recovery process can be significantly improved. I'd bet that it's possible to load that snapshot in 2 seconds.

I'm thinking about these steps, sorted by my perception of the effectiveness / implementation cost ratio:

  • Increase XLOG_READ_AHEAD #8108
  • Always use hints for building a tree index.
  • Parallelize index building further - for example, parallel hint calculation and even parallel btree construction.
  • Invent a real build procedure for hash indexes: currently it's just sequential insertion.
  • ZSTD_decompressStream: do in another thread (pipeline) - we can afford it during recovery (box: read snapshot in a distinct thread to improve recovery performance #11861).
  • Improve msgpack decoding. At least we should use xrow_decode_dml right inside xrow_header_decode (instead of mp_check).
  • Challenge! Do something with field map building. I see that memtx_tuple_new costs about 18% of total time, mostly because of the field map, and I don't understand what's going on there for such a simple case. UPD: Probably fixed by tuple_field_map_create_plain() optimization in box: speed up tuple_new() for large sparse tuples by 3.5x #9793.
  • Challenge! Don't use transactions during recovery. I mean all those build_next calls must be done directly. That needs some refactoring of the space/engine/index API.
  • Make one smart multi-thread pass of radix sort by hints before using multi-thread quick sort. **
  • Get rid of all *_journal_*, add_redo etc - all this stuff is not needed while reading snapshot.
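Several of the steps above amount to parallelizing per-tuple work. A minimal sketch of the "parallel hint calculation" idea, assuming the hints live in a plain array indexed alongside the tuples; tuple_hint(), the uint64_t tuple representation, and the thread count are stand-ins for illustration, not Tarantool's actual code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { N_THREADS = 4 };

struct hint_task {
	const uint64_t *tuples;	/* stand-in for real tuple pointers */
	uint64_t *hints;	/* output slice shared with other workers */
	size_t begin, end;	/* half-open range this worker fills */
};

static uint64_t
tuple_hint(uint64_t tuple)
{
	/* Toy hash; real code derives the hint from the indexed fields. */
	return tuple * 2654435761u;
}

static void *
hint_worker(void *arg)
{
	struct hint_task *t = arg;
	for (size_t i = t->begin; i < t->end; i++)
		t->hints[i] = tuple_hint(t->tuples[i]);
	return NULL;
}

/* Split the tuple array into equal chunks, one worker thread per chunk. */
static void
calc_hints_parallel(const uint64_t *tuples, uint64_t *hints, size_t count)
{
	pthread_t threads[N_THREADS];
	struct hint_task tasks[N_THREADS];
	size_t chunk = (count + N_THREADS - 1) / N_THREADS;
	for (int i = 0; i < N_THREADS; i++) {
		size_t begin = (size_t)i * chunk;
		size_t end = begin + chunk;
		tasks[i].tuples = tuples;
		tasks[i].hints = hints;
		tasks[i].begin = begin < count ? begin : count;
		tasks[i].end = end < count ? end : count;
		pthread_create(&threads[i], NULL, hint_worker, &tasks[i]);
	}
	for (int i = 0; i < N_THREADS; i++)
		pthread_join(threads[i], NULL);
}
```

Hint calculation is embarrassingly parallel (each slot is written by exactly one thread), so this should scale close to linearly until memory bandwidth saturates.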

** The problem with multi-thread quick sort is that the first pass can be made by only one thread, the second by two, etc. I also analyzed the idea of using multi-thread radix sort instead of qsort altogether, but after some tests I came to the conclusion that radix sort is very cache-unfriendly and is usually limited by memory throughput. But one (multi-thread? needs a test) pass of radix sort that splits the array into several (4-16) relatively sorted parts would be useful.
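A single-threaded sketch of that pre-pass, assuming hints are plain uint64_t values (a simplification of the real index): one counting-sort pass on the top 4 bits splits the array into 16 buckets that are already ordered relative to each other, so each bucket can then be qsort'ed independently - and, in the real plan, on separate threads:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { RADIX_BITS = 4, N_BUCKETS = 1 << RADIX_BITS };

static int
cmp_hint(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return x < y ? -1 : x > y;
}

static void
radix_prepass_sort(uint64_t *hints, size_t count)
{
	size_t counts[N_BUCKETS] = {0};
	size_t offsets[N_BUCKETS], fill[N_BUCKETS];
	uint64_t *tmp = malloc(count * sizeof(*tmp));
	assert(tmp != NULL);
	/* Pass 1: histogram of the top RADIX_BITS bits of each hint. */
	for (size_t i = 0; i < count; i++)
		counts[hints[i] >> (64 - RADIX_BITS)]++;
	size_t pos = 0;
	for (int b = 0; b < N_BUCKETS; b++) {
		offsets[b] = pos;
		pos += counts[b];
	}
	memcpy(fill, offsets, sizeof(fill));
	/* Pass 2: scatter each hint into its bucket. */
	for (size_t i = 0; i < count; i++)
		tmp[fill[hints[i] >> (64 - RADIX_BITS)]++] = hints[i];
	memcpy(hints, tmp, count * sizeof(*tmp));
	free(tmp);
	/*
	 * Buckets are independent and already globally ordered by their
	 * top bits, so this loop could run on N_BUCKETS threads.
	 */
	for (int b = 0; b < N_BUCKETS; b++)
		qsort(hints + offsets[b], counts[b], sizeof(*hints), cmp_hint);
}
```

Unlike full radix sort, this makes only one extra streaming pass over the data, so it avoids the cache-unfriendly behavior mentioned above while removing the serial first steps of a parallel quick sort.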

What else should be tested:

  • more complex indexes - by strings, by several fields.
  • more fields in tuple (e.g. 100).
  • more complex tuple structure.
  • recovery from WAL.

Actually, measurements of WAL recovery would likely lead to a similar issue.
