Bps tree simplification by mkostoevr · Pull Request #10183 · tarantool/tarantool

mkostoevr · 2024-06-29T02:29:11Z

Closes #9982
Closes #10045
Part of #10322

coveralls · 2024-06-29T02:45:00Z

coverage: 87.081% (+0.02%) from 87.066%
when pulling 3fd60d1 on mkostoevr:m.kostoev/bps-tree-simplification
into 319357d
on tarantool:master.

coveralls · 2024-06-29T04:13:37Z

coverage: 87.102% (+0.04%) from 87.066%
when pulling cad485b on mkostoevr:m.kostoev/bps-tree-simplification
into 319357d
on tarantool:master.

The `matras_touch` function is split into two parts: `matras_need_touch` and `matras_touch_no_check`. The first one checks if a matras block ID needs to be touched to be modified, another one performs the touch of the given block. Needed for tarantool/tarantool#10183

coveralls · 2024-07-07T19:46:47Z

coverage: 87.127% (-0.03%) from 87.155%
when pulling 2c93f72 on mkostoevr:m.kostoev/bps-tree-simplification
into fbea986
on tarantool:master.

src/lib/salad/bps_tree.h

src/lib/small

src/lib/salad/bps_tree.h

The `matras_touch` function is split into two parts: `matras_needs_touch` and `matras_touch_no_check`. The first one checks if a matras block ID needs to be touched to be modified, another one performs the touch of the given block. Needed for tarantool/tarantool#10183

Prior to this the count request had been performed in the memtx in the traightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which may lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in tarantool#10183. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. NO_DOC=TBD when fully implemented NO_CHANGELOG=TBD when fully implemented

Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in tarantool#10183. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. NO_DOC=TBD when fully implemented NO_CHANGELOG=TBD when fully implemented

Currently the BPS tree allows to enable both `BPS_INNER_CHILD_CARDS` and `BPS_INNER_CARD` configs. This possibility had been introduced in order to test the overall slowdown of both solutions, but it's not required in the real code. Moreover, it's not working properly because of a typo (see tarantool#10045 for details). Let's only allow to have one of the two configs. Closes tarantool#10045 NO_DOC=dead code elimination NO_TEST=dead code elimination NO_CHANGELOG=dead code elimination

Prior to this patch there was three separated implementations of insertion of a value in the tree: `insert`, `insert_get_iterator` and `insert_get_offset`. This is error-prone and makes the further development more difficult. This patch introduces a new `bps_tree_insert_impl` function, which deduplicates the existed insertion implementations and makes them use it. The function is made `ALWAYS_INLINE` in order to prevent regressions in the future. Now the value of `offset` parameter is undefined on error. This does not affect the only user of the function, which is the memcs engine. Performance impact: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 961295.51 | +0.79% | 0.33% | | select | RPS | 805361.94 | 807647.13 | +0.28% | 0.32% | | replace | RPS | 953764.82 | 960956.13 | +0.75% | 0.33% | | delete | RPS | 833315.10 | 836079.60 | +0.33% | 0.35% | Part of tarantool#9982 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring

Prior to this patch there was three separated implementations of deletion of an element from the tree: `delete`, `delete_get_offset` and `delete_value`. This is error-prone and makes the development more difficult. This patch introduces a new `bps_tree_delete_impl` function, which deduplicates the existed implementations, and makes them use it. The function is made `ALWAYS_INLINE` in order to prevent regressions. By the way fixed the misleading comment for the `bps_tree_delete_value` function and updated the `bps_tree_delete_get_offset` documentation: now the value of the `offset` parameter is undefined on failure. This does not affect the function behavior though. Performance impact against the version prior to all refactorings: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 964738.50 | +1.15% | 0.63% | | select | RPS | 805361.94 | 807050.13 | +0.21% | 0.66% | | replace | RPS | 953764.82 | 964439.47 | +1.12% | 0.58% | | delete | RPS | 833315.10 | 834984.89 | +0.20% | 0.68% | Part of tarantool#9982 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring

Prior to this patch there was two separated implementations of the key lookup sharing almost identical code: `bps_tree_find_impl` and the `bps_tree_find_get_offset_impl`. The only difference between the two functions is computing the offset to the found element. This patch removes the `bps_tree_find_get_offset_impl` and adds its functionality to the `bps_tree_find_impl`. The function is made `ALWAYS_INLINE` in order to prevent regressions. By the way changed the interface so that the value of the `offset` argument is undefined if the element is not found. This breaks the BPS tree offset API tests, but this function is only used in the memcs engine now, and it does not rely on the offset if no element found. Performance impact against the version prior to all refactorings: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 977308.37 | +2.46% | 0.35% | | select | RPS | 805361.94 | 822407.71 | +2.12% | 0.25% | | replace | RPS | 953764.82 | 977324.31 | +2.47% | 0.31% | | delete | RPS | 833315.10 | 843095.19 | +1.17% | 0.26% | Part of tarantool#9982 NO_DOC=refactoring NO_CHANGELOG=refactoring

Prior to this patch there were additional implementations of each bound lookup function sharing almost identical code: - bps_tree_lower_bound_get_offset_impl - bps_tree_lower_bound_elem_get_offset_impl - bps_tree_upper_bound_get_offset_impl - bps_tree_upper_bound_elem_get_offset_impl The only difference between the functions and their countersparts is computing the offset to the found element. This patch removes the `_get_offset` functions and introduces `_impl` versions optionally including the removed functionality. The functions are made `ALWAYS_INLINE` in order to prevent regressions. By the way fixed the misleading comments to bound lookup functions returning offset. Performance impact against the version prior to all refactorings: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 957071.12 | +0.34% | 0.33% | | select | RPS | 805361.94 | 807065.98 | +0.21% | 0.31% | | replace | RPS | 953764.82 | 955549.16 | +0.19% | 0.83% | | delete | RPS | 833315.10 | 833423.01 | +0.01% | 0.30% | Closes tarantool#9982 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring

Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in the tarantool#10183 and in the scope of tarantool#10322. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. Part of tarantool#8204 NO_DOC=perf improvement NO_CHANGELOG=TBD when fully implemented

We use the `matras_touch` function in order to notify the matras that a block is going to be modified. In case the block has to be copied in order to preserve read views, it allocates a new block and returns its address, otherwise it just returns the current address of the given block. The thing is, if we have this address acquired already, we don't have to look it up from the matras again sometimes. Let's use the new matras API to check if an allocation of a new block might be required and only update the pointer using the `matras_touch` if it can be changed. Performance impact against the version prior to all commits: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 958990.66 | +0.54% | 0.47% | | select | RPS | 805361.94 | 808628.46 | +0.41% | 0.23% | | replace | RPS | 953764.82 | 958759.03 | +0.52% | 0.43% | | delete | RPS | 833315.10 | 834287.96 | +0.12% | 0.23% | Part of tarantool#10322 NO_DOC=perf improvement NO_TEST=perf improvement NO_CHANGELOG=perf improvement

On each insertion and deletion we go over all levels of the tree and dereference block IDs using `bps_tree_restore_block`. The thing is, in case of insertion, we may touch the target leaf unconditionally, because there's no no-op flow in the `bps_tree_insert` after it: we will either replace or insert an element into the leaf. This patch makes the `bps_tree_collect_path` always touch the leaf block in the `bps_tree_insert`. The function is now too complicated for GCC 13 to inline it, so it's made `ALWAYS_INLINE` to prevent the possible regressions. Performance impact against the version prior to all commits: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 957184.34 | +0.35% | 0.28% | | select | RPS | 805361.94 | 808730.63 | +0.42% | 0.23% | | replace | RPS | 953764.82 | 957252.78 | +0.37% | 0.26% | | delete | RPS | 833315.10 | 835194.55 | +0.23% | 0.23% | Part of tarantool#10322 NO_DOC=perf improvement NO_TEST=perf improvement NO_CHANGELOG=perf improvement

In case if the insertion or deletion from the block requires the reballancing or if the tree is used with a cardinality config, we use the `bps_tree_touch_path` function in order to inform matras about incoming block modifications. The problem is that the function not only touches the leaf parents, but also touches owners of `max_elem` data. But these owners are included into the parents touched, so it's not required to touch them again, we can simply use their new pointers after we have touched all the path. This decreases the amount of matras ID lookups in general (which is a pretty expensive operation). Performance impact against the version prior to all commits: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 978592.11 | +2.60% | 0.32% | | select | RPS | 805361.94 | 821532.64 | +2.01% | 0.23% | | replace | RPS | 953764.82 | 978612.78 | +2.61% | 0.31% | | delete | RPS | 833315.10 | 842437.50 | +1.09% | 0.24% | Part of tarantool#10322 NO_DOC=perf improvement NO_TEST=perf improvement NO_CHANGELOG=perf improvement

mkostoevr self-assigned this Jun 29, 2024

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 3fd60d1 to cad485b Compare June 29, 2024 03:56

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch 2 times, most recently from 9ea3f7b to 4b010d3 Compare July 7, 2024 19:22

mkostoevr mentioned this pull request Jul 7, 2024

matras: introduce matras_need_touch and matras_touch_no_check tarantool/small#88

Merged

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 4b010d3 to 27c7a06 Compare July 7, 2024 19:33

mkostoevr added the do not merge Not ready to be merged label Jul 7, 2024

mkostoevr marked this pull request as ready for review July 7, 2024 19:41

mkostoevr requested a review from alyapunov July 7, 2024 19:42

mkostoevr assigned alyapunov and unassigned mkostoevr Jul 7, 2024

mkostoevr requested a review from locker July 8, 2024 07:46

mkostoevr assigned locker Jul 8, 2024

locker approved these changes Jul 8, 2024

View reviewed changes

src/lib/salad/bps_tree.h Outdated Show resolved Hide resolved

src/lib/small Show resolved Hide resolved

src/lib/salad/bps_tree.h Outdated Show resolved Hide resolved

locker assigned mkostoevr and unassigned locker Jul 8, 2024

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 27c7a06 to d3b5556 Compare July 16, 2024 12:11

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch 4 times, most recently from 6d24b2e to 23f5b7f Compare July 18, 2024 09:58

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch 2 times, most recently from b46ddc1 to 13931e6 Compare July 21, 2024 18:54

mkostoevr unassigned alyapunov Jul 29, 2024

mkostoevr mentioned this pull request Jul 29, 2024

Optimize the cardinality config of the BPS tree #10322

Open

mkostoevr added 5 commits July 29, 2024 14:56

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 13931e6 to 4915206 Compare July 29, 2024 11:57

mkostoevr assigned locker and mkostoevr and unassigned mkostoevr and locker Jul 29, 2024

mkostoevr added full-ci Enables all tests for a pull request and removed do not merge Not ready to be merged labels Jul 29, 2024

mkostoevr added 3 commits July 29, 2024 15:53

mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 4915206 to 2c93f72 Compare July 29, 2024 12:54

mkostoevr assigned locker and unassigned mkostoevr Jul 29, 2024

locker merged commit 731618f into tarantool:master Jul 29, 2024

mkostoevr deleted the m.kostoev/bps-tree-simplification branch August 23, 2024 16:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bps tree simplification#10183

Bps tree simplification#10183
locker merged 8 commits intotarantool:masterfrom
mkostoevr:m.kostoev/bps-tree-simplification

mkostoevr commented Jun 29, 2024 •

edited

Loading

Uh oh!

coveralls commented Jun 29, 2024

Uh oh!

coveralls commented Jun 29, 2024

Uh oh!

coveralls commented Jul 7, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

mkostoevr commented Jun 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coveralls commented Jun 29, 2024

Uh oh!

coveralls commented Jun 29, 2024

Uh oh!

coveralls commented Jul 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mkostoevr commented Jun 29, 2024 •

edited

Loading

coveralls commented Jul 7, 2024 •

edited

Loading