Merged
Conversation
3fd60d1 to
cad485b
Compare
9ea3f7b to
4b010d3
Compare
mkostoevr
added a commit
to tarantool/small
that referenced
this pull request
Jul 7, 2024
The `matras_touch` function is split into two parts: `matras_need_touch` and `matras_touch_no_check`. The first one checks if a matras block ID needs to be touched to be modified, another one performs the touch of the given block. Needed for tarantool/tarantool#10183
4b010d3 to
27c7a06
Compare
locker
approved these changes
Jul 8, 2024
27c7a06 to
d3b5556
Compare
mkostoevr
added a commit
to tarantool/small
that referenced
this pull request
Jul 18, 2024
The `matras_touch` function is split into two parts: `matras_needs_touch` and `matras_touch_no_check`. The first one checks if a matras block ID needs to be touched to be modified, another one performs the touch of the given block. Needed for tarantool/tarantool#10183
6d24b2e to
23f5b7f
Compare
alyapunov
pushed a commit
to tarantool/small
that referenced
this pull request
Jul 18, 2024
The `matras_touch` function is split into two parts: `matras_needs_touch` and `matras_touch_no_check`. The first one checks if a matras block ID needs to be touched to be modified, another one performs the touch of the given block. Needed for tarantool/tarantool#10183
b46ddc1 to
13931e6
Compare
mkostoevr
added a commit
to mkostoevr/tarantool
that referenced
this pull request
Jul 26, 2024
Prior to this the count request had been performed in the memtx in the traightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which may lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in tarantool#10183. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. NO_DOC=TBD when fully implemented NO_CHANGELOG=TBD when fully implemented
mkostoevr
added a commit
to mkostoevr/tarantool
that referenced
this pull request
Jul 28, 2024
Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in tarantool#10183. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. NO_DOC=TBD when fully implemented NO_CHANGELOG=TBD when fully implemented
mkostoevr
added a commit
to mkostoevr/tarantool
that referenced
this pull request
Jul 29, 2024
Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in tarantool#10183. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. NO_DOC=TBD when fully implemented NO_CHANGELOG=TBD when fully implemented
mkostoevr
added a commit
to mkostoevr/tarantool
that referenced
this pull request
Jul 29, 2024
Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in tarantool#10183. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. NO_DOC=TBD when fully implemented NO_CHANGELOG=TBD when fully implemented
Currently the BPS tree allows to enable both `BPS_INNER_CHILD_CARDS` and `BPS_INNER_CARD` configs. This possibility had been introduced in order to test the overall slowdown of both solutions, but it's not required in the real code. Moreover, it's not working properly because of a typo (see tarantool#10045 for details). Let's only allow to have one of the two configs. Closes tarantool#10045 NO_DOC=dead code elimination NO_TEST=dead code elimination NO_CHANGELOG=dead code elimination
Prior to this patch there was three separated implementations of insertion of a value in the tree: `insert`, `insert_get_iterator` and `insert_get_offset`. This is error-prone and makes the further development more difficult. This patch introduces a new `bps_tree_insert_impl` function, which deduplicates the existed insertion implementations and makes them use it. The function is made `ALWAYS_INLINE` in order to prevent regressions in the future. Now the value of `offset` parameter is undefined on error. This does not affect the only user of the function, which is the memcs engine. Performance impact: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 961295.51 | +0.79% | 0.33% | | select | RPS | 805361.94 | 807647.13 | +0.28% | 0.32% | | replace | RPS | 953764.82 | 960956.13 | +0.75% | 0.33% | | delete | RPS | 833315.10 | 836079.60 | +0.33% | 0.35% | Part of tarantool#9982 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
Prior to this patch there was three separated implementations of deletion of an element from the tree: `delete`, `delete_get_offset` and `delete_value`. This is error-prone and makes the development more difficult. This patch introduces a new `bps_tree_delete_impl` function, which deduplicates the existed implementations, and makes them use it. The function is made `ALWAYS_INLINE` in order to prevent regressions. By the way fixed the misleading comment for the `bps_tree_delete_value` function and updated the `bps_tree_delete_get_offset` documentation: now the value of the `offset` parameter is undefined on failure. This does not affect the function behavior though. Performance impact against the version prior to all refactorings: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 964738.50 | +1.15% | 0.63% | | select | RPS | 805361.94 | 807050.13 | +0.21% | 0.66% | | replace | RPS | 953764.82 | 964439.47 | +1.12% | 0.58% | | delete | RPS | 833315.10 | 834984.89 | +0.20% | 0.68% | Part of tarantool#9982 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
Prior to this patch there was two separated implementations of the key lookup sharing almost identical code: `bps_tree_find_impl` and the `bps_tree_find_get_offset_impl`. The only difference between the two functions is computing the offset to the found element. This patch removes the `bps_tree_find_get_offset_impl` and adds its functionality to the `bps_tree_find_impl`. The function is made `ALWAYS_INLINE` in order to prevent regressions. By the way changed the interface so that the value of the `offset` argument is undefined if the element is not found. This breaks the BPS tree offset API tests, but this function is only used in the memcs engine now, and it does not rely on the offset if no element found. Performance impact against the version prior to all refactorings: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 977308.37 | +2.46% | 0.35% | | select | RPS | 805361.94 | 822407.71 | +2.12% | 0.25% | | replace | RPS | 953764.82 | 977324.31 | +2.47% | 0.31% | | delete | RPS | 833315.10 | 843095.19 | +1.17% | 0.26% | Part of tarantool#9982 NO_DOC=refactoring NO_CHANGELOG=refactoring
Prior to this patch there were additional implementations of each bound lookup function sharing almost identical code: - bps_tree_lower_bound_get_offset_impl - bps_tree_lower_bound_elem_get_offset_impl - bps_tree_upper_bound_get_offset_impl - bps_tree_upper_bound_elem_get_offset_impl The only difference between the functions and their countersparts is computing the offset to the found element. This patch removes the `_get_offset` functions and introduces `_impl` versions optionally including the removed functionality. The functions are made `ALWAYS_INLINE` in order to prevent regressions. By the way fixed the misleading comments to bound lookup functions returning offset. Performance impact against the version prior to all refactorings: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 957071.12 | +0.34% | 0.33% | | select | RPS | 805361.94 | 807065.98 | +0.21% | 0.31% | | replace | RPS | 953764.82 | 955549.16 | +0.19% | 0.83% | | delete | RPS | 833315.10 | 833423.01 | +0.01% | 0.30% | Closes tarantool#9982 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
13931e6 to
4915206
Compare
mkostoevr
added a commit
to mkostoevr/tarantool
that referenced
this pull request
Jul 29, 2024
Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in the tarantool#10183 and in the scope of tarantool#10322. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. Part of tarantool#8204 NO_DOC=perf improvement NO_CHANGELOG=TBD when fully implemented
mkostoevr
added a commit
to mkostoevr/tarantool
that referenced
this pull request
Jul 29, 2024
Prior to this the count request had been performed in the memtx in the straightforward way: we created an iterator by a type and key and simply called its `next` method until it's exhausted. That means the operation had a linear complexity, which could lead to DoS situations. This patch makes the memtx tree index use a version of BPS tree with the cardinality information enabled and takes advantage of its offset based API to implement the count operation using tree lookups. Since the method does not read each counted tuple now, MVCC subsystem would be unaware of it. In order to fix this, this patch introduces a new entity in the memtx transaction manager to track such operations: GAP_COUNT, and the corresponding `memtx_tx_track_count` function. The entity (gap item) is a record that a concurrent transaction has counted tuples matching some key and iterator type in an index. If a transaction creates such an entity, any insertion or deletion of a matching tuple in the index will be conflicted with it. This works differently for inserts and deletes: 1. If a concurrent transaction inserts a new matching tuple, then its read_gaps list is modified like the counted transaction had read the exact key of the tuple and found nothing. This creates a conflict. 2. If a concurrent transaction deletes a matching tuple, then the transaction that counted the tuple is inserted into the tuple reader list: it pretends to have read the tuple prior to the deletion. This creates a conflict. One thing to be noted is that using the cardinality config introduces a performance regression against the current version of memtx, this is mitigated in the tarantool#10183 and in the scope of tarantool#10322. Another thinig is about the memtx_tx_handle_gap_write update. Prior to this patch, any failed lookup of a full key using EQ or REQ iterators had been recorded using memtx_tx_track_point instead of the memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item with full key and EQ/REQ iterator type in the read_gaps list. But in the fast count implementation only memtx_tx_track_gap is used (for the sake of readability), so the memtx_tx_handle_gap_write is modified to take it into account. Part of tarantool#8204 NO_DOC=perf improvement NO_CHANGELOG=TBD when fully implemented
We use the `matras_touch` function in order to notify the matras that a block is going to be modified. In case the block has to be copied in order to preserve read views, it allocates a new block and returns its address, otherwise it just returns the current address of the given block. The thing is, if we have this address acquired already, we don't have to look it up from the matras again sometimes. Let's use the new matras API to check if an allocation of a new block might be required and only update the pointer using the `matras_touch` if it can be changed. Performance impact against the version prior to all commits: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 958990.66 | +0.54% | 0.47% | | select | RPS | 805361.94 | 808628.46 | +0.41% | 0.23% | | replace | RPS | 953764.82 | 958759.03 | +0.52% | 0.43% | | delete | RPS | 833315.10 | 834287.96 | +0.12% | 0.23% | Part of tarantool#10322 NO_DOC=perf improvement NO_TEST=perf improvement NO_CHANGELOG=perf improvement
On each insertion and deletion we go over all levels of the tree and dereference block IDs using `bps_tree_restore_block`. The thing is, in case of insertion, we may touch the target leaf unconditionally, because there's no no-op flow in the `bps_tree_insert` after it: we will either replace or insert an element into the leaf. This patch makes the `bps_tree_collect_path` always touch the leaf block in the `bps_tree_insert`. The function is now too complicated for GCC 13 to inline it, so it's made `ALWAYS_INLINE` to prevent the possible regressions. Performance impact against the version prior to all commits: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 957184.34 | +0.35% | 0.28% | | select | RPS | 805361.94 | 808730.63 | +0.42% | 0.23% | | replace | RPS | 953764.82 | 957252.78 | +0.37% | 0.26% | | delete | RPS | 833315.10 | 835194.55 | +0.23% | 0.23% | Part of tarantool#10322 NO_DOC=perf improvement NO_TEST=perf improvement NO_CHANGELOG=perf improvement
In case if the insertion or deletion from the block requires the reballancing or if the tree is used with a cardinality config, we use the `bps_tree_touch_path` function in order to inform matras about incoming block modifications. The problem is that the function not only touches the leaf parents, but also touches owners of `max_elem` data. But these owners are included into the parents touched, so it's not required to touch them again, we can simply use their new pointers after we have touched all the path. This decreases the amount of matras ID lookups in general (which is a pretty expensive operation). Performance impact against the version prior to all commits: | request | stat | old | new | diff | stdev | | ------- | ---- | --------- | --------- | ------ | ----- | | insert | RPS | 953798.71 | 978592.11 | +2.60% | 0.32% | | select | RPS | 805361.94 | 821532.64 | +2.01% | 0.23% | | replace | RPS | 953764.82 | 978612.78 | +2.61% | 0.31% | | delete | RPS | 833315.10 | 842437.50 | +1.09% | 0.24% | Part of tarantool#10322 NO_DOC=perf improvement NO_TEST=perf improvement NO_CHANGELOG=perf improvement
4915206 to
2c93f72
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #9982
Closes #10045
Part of #10322