Refactor and speed up recovery #5787

@alyapunov

Description

I've made some measurements of the tarantool recovery process.
I made a snapshot with 16M tuples in one space with two simple indexes. Each tuple consisted of two numbers and a random string of random length (1-100).
Tarantool load time:

  • hash indexes - 18 sec
  • trees with hints - 10 sec
  • trees w/o hints - 14 sec

See the attached flame graphs and logs for those cases.
I also attached the test itself (the first script creates the snapshot, the second loads it). Beware - it can delete local snaps/xlogs.
data-5787.zip

It seems that the recovery process can be significantly improved. I'd bet that it's possible to load that snapshot in 2 seconds.

I'm thinking about these steps, sorted by my perception of the effectiveness / implementation cost ratio:

  • Increase XLOG_READ_AHEAD #8108
  • Always use hints for building a tree index.
  • Parallelize index building further - for example, parallel hint calculation and even parallel btree construction.
  • Invent a real build procedure for hash indexes: currently it's just sequential insertion.
  • ZSTD_decompressStream: do in another thread (pipeline) - we can afford it during recovery (box: read snapshot in a distinct thread to improve recovery performance #11861).
  • Improve msgpack decoding. At least we should use xrow_decode_dml right inside xrow_header_decode (instead of mp_check).
  • Challenge! Do something with field map building. I see that memtx_tuple_new costs about 18% of total time, mostly because of the field map, and I don't understand what's going on there for such a simple case. UPD: Probably fixed by tuple_field_map_create_plain() optimization in box: speed up tuple_new() for large sparse tuples by 3.5x #9793.
  • Challenge! Don't use transactions during recovery. I mean all those build_next calls must be done directly. That needs some refactoring of the space/engine/index API.
  • Make one smart multi-thread pass of radix sort by hints before using multi-thread quick sort. **
  • Get rid of all *_journal_*, add_redo etc - all this stuff is not needed while reading snapshot.
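Several of the steps above amount to parallelizing per-tuple work. A minimal sketch of the "parallel hint calculation" idea, assuming the hints live in a plain array indexed alongside the tuples; tuple_hint(), the uint64_t tuple representation, and the thread count are stand-ins for illustration, not Tarantool's actual code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { N_THREADS = 4 };

struct hint_task {
	const uint64_t *tuples;	/* stand-in for real tuple pointers */
	uint64_t *hints;	/* output slice shared with other workers */
	size_t begin, end;	/* half-open range this worker fills */
};

static uint64_t
tuple_hint(uint64_t tuple)
{
	/* Toy hash; real code derives the hint from the indexed fields. */
	return tuple * 2654435761u;
}

static void *
hint_worker(void *arg)
{
	struct hint_task *t = arg;
	for (size_t i = t->begin; i < t->end; i++)
		t->hints[i] = tuple_hint(t->tuples[i]);
	return NULL;
}

/* Split the tuple array into equal chunks, one worker thread per chunk. */
static void
calc_hints_parallel(const uint64_t *tuples, uint64_t *hints, size_t count)
{
	pthread_t threads[N_THREADS];
	struct hint_task tasks[N_THREADS];
	size_t chunk = (count + N_THREADS - 1) / N_THREADS;
	for (int i = 0; i < N_THREADS; i++) {
		size_t begin = (size_t)i * chunk;
		size_t end = begin + chunk;
		tasks[i].tuples = tuples;
		tasks[i].hints = hints;
		tasks[i].begin = begin < count ? begin : count;
		tasks[i].end = end < count ? end : count;
		pthread_create(&threads[i], NULL, hint_worker, &tasks[i]);
	}
	for (int i = 0; i < N_THREADS; i++)
		pthread_join(threads[i], NULL);
}
```

Hint calculation is embarrassingly parallel (each slot is written by exactly one thread), so this should scale close to linearly until memory bandwidth saturates.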

** The problem with multi-thread quick sort is that the first pass can be made by only one thread, the second by two, etc. I also analyzed the idea of using multi-thread radix sort instead of qsort altogether, but after some tests I came to the conclusion that radix sort is very cache-unfriendly and is usually limited by memory throughput. But one (multi-thread? needs a test) pass of radix sort that splits the array into several (4-16) relatively sorted parts would be useful.
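A single-threaded sketch of that pre-pass, assuming hints are plain uint64_t values (a simplification of the real index): one counting-sort pass on the top 4 bits splits the array into 16 buckets that are already ordered relative to each other, so each bucket can then be qsort'ed independently - and, in the real plan, on separate threads:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { RADIX_BITS = 4, N_BUCKETS = 1 << RADIX_BITS };

static int
cmp_hint(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return x < y ? -1 : x > y;
}

static void
radix_prepass_sort(uint64_t *hints, size_t count)
{
	size_t counts[N_BUCKETS] = {0};
	size_t offsets[N_BUCKETS], fill[N_BUCKETS];
	uint64_t *tmp = malloc(count * sizeof(*tmp));
	assert(tmp != NULL);
	/* Pass 1: histogram of the top RADIX_BITS bits of each hint. */
	for (size_t i = 0; i < count; i++)
		counts[hints[i] >> (64 - RADIX_BITS)]++;
	size_t pos = 0;
	for (int b = 0; b < N_BUCKETS; b++) {
		offsets[b] = pos;
		pos += counts[b];
	}
	memcpy(fill, offsets, sizeof(fill));
	/* Pass 2: scatter each hint into its bucket. */
	for (size_t i = 0; i < count; i++)
		tmp[fill[hints[i] >> (64 - RADIX_BITS)]++] = hints[i];
	memcpy(hints, tmp, count * sizeof(*tmp));
	free(tmp);
	/*
	 * Buckets are independent and already globally ordered by their
	 * top bits, so this loop could run on N_BUCKETS threads.
	 */
	for (int b = 0; b < N_BUCKETS; b++)
		qsort(hints + offsets[b], counts[b], sizeof(*hints), cmp_hint);
}
```

Unlike full radix sort, this makes only one extra streaming pass over the data, so it avoids the cache-unfriendly behavior mentioned above while removing the serial first steps of a parallel quick sort.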

What else should be tested:

  • more complex indexes - by strings, by several fields.
  • more fields in tuple (e.g. 100).
  • more complex tuple structure.
  • recovery from WAL.

Actually, measurements of WAL recovery would likely lead to a similar issue.
