Skip to content

Bps tree simplification#10183

Merged
locker merged 8 commits intotarantool:masterfrom
mkostoevr:m.kostoev/bps-tree-simplification
Jul 29, 2024
Merged

Bps tree simplification#10183
locker merged 8 commits intotarantool:masterfrom
mkostoevr:m.kostoev/bps-tree-simplification

Conversation

@mkostoevr
Copy link
Contributor

@mkostoevr mkostoevr commented Jun 29, 2024

Closes #9982
Closes #10045
Part of #10322

@mkostoevr mkostoevr self-assigned this Jun 29, 2024
@coveralls
Copy link

Coverage Status

coverage: 87.081% (+0.02%) from 87.066%
when pulling 3fd60d1 on mkostoevr:m.kostoev/bps-tree-simplification
into 319357d
on tarantool:master
.

@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 3fd60d1 to cad485b Compare June 29, 2024 03:56
@coveralls
Copy link

Coverage Status

coverage: 87.102% (+0.04%) from 87.066%
when pulling cad485b on mkostoevr:m.kostoev/bps-tree-simplification
into 319357d
on tarantool:master
.

@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch 2 times, most recently from 9ea3f7b to 4b010d3 Compare July 7, 2024 19:22
mkostoevr added a commit to tarantool/small that referenced this pull request Jul 7, 2024
The `matras_touch` function is split into two parts: `matras_need_touch`
and `matras_touch_no_check`. The first one checks if a matras block ID
needs to be touched to be modified, another one performs the touch of
the given block.

Needed for tarantool/tarantool#10183
@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 4b010d3 to 27c7a06 Compare July 7, 2024 19:33
@mkostoevr mkostoevr added the do not merge Not ready to be merged label Jul 7, 2024
@mkostoevr mkostoevr marked this pull request as ready for review July 7, 2024 19:41
@mkostoevr mkostoevr requested a review from alyapunov July 7, 2024 19:42
@mkostoevr mkostoevr assigned alyapunov and unassigned mkostoevr Jul 7, 2024
@coveralls
Copy link

coveralls commented Jul 7, 2024

Coverage Status

coverage: 87.127% (-0.03%) from 87.155%
when pulling 2c93f72 on mkostoevr:m.kostoev/bps-tree-simplification
into fbea986
on tarantool:master
.

@mkostoevr mkostoevr requested a review from locker July 8, 2024 07:46
@locker locker assigned mkostoevr and unassigned locker Jul 8, 2024
@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 27c7a06 to d3b5556 Compare July 16, 2024 12:11
mkostoevr added a commit to tarantool/small that referenced this pull request Jul 18, 2024
The `matras_touch` function is split into two parts: `matras_needs_touch`
and `matras_touch_no_check`. The first one checks if a matras block ID
needs to be touched to be modified, another one performs the touch of
the given block.

Needed for tarantool/tarantool#10183
@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch 4 times, most recently from 6d24b2e to 23f5b7f Compare July 18, 2024 09:58
alyapunov pushed a commit to tarantool/small that referenced this pull request Jul 18, 2024
The `matras_touch` function is split into two parts: `matras_needs_touch`
and `matras_touch_no_check`. The first one checks if a matras block ID
needs to be touched to be modified, another one performs the touch of
the given block.

Needed for tarantool/tarantool#10183
@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch 2 times, most recently from b46ddc1 to 13931e6 Compare July 21, 2024 18:54
mkostoevr added a commit to mkostoevr/tarantool that referenced this pull request Jul 26, 2024
Prior to this the count request had been performed in the memtx in the
traightforward way: we created an iterator by a type and key and simply
called its `next` method until it's exhausted. That means the operation
had a linear complexity, which may lead to DoS situations.

This patch makes the memtx tree index use a version of BPS tree with
the cardinality information enabled and takes advantage of its offset
based API to implement the count operation using tree lookups.

One thing to be noted is that using the cardinality config introduces
a performance regression against the current version of memtx, this is
mitigated in tarantool#10183.

Another thinig is about the memtx_tx_handle_gap_write update.

Prior to this patch, any failed lookup of a full key using EQ or REQ
iterators had been recorded using memtx_tx_track_point instead of the
memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item
with full key and EQ/REQ iterator type in the read_gaps list.

But in the fast count implementation only memtx_tx_track_gap is used
(for the sake of readability), so the memtx_tx_handle_gap_write is
modified to take it into account.

NO_DOC=TBD when fully implemented
NO_CHANGELOG=TBD when fully implemented
mkostoevr added a commit to mkostoevr/tarantool that referenced this pull request Jul 28, 2024
Prior to this the count request had been performed in the memtx in the
straightforward way: we created an iterator by a type and key and simply
called its `next` method until it's exhausted. That means the operation
had a linear complexity, which could lead to DoS situations.

This patch makes the memtx tree index use a version of BPS tree with
the cardinality information enabled and takes advantage of its offset
based API to implement the count operation using tree lookups.

Since the method does not read each counted tuple now, MVCC subsystem
would be unaware of it. In order to fix this, this patch introduces a
new entity in the memtx transaction manager to track such operations:
GAP_COUNT, and the corresponding `memtx_tx_track_count` function.

The entity (gap item) is a record that a concurrent transaction has
counted tuples matching some key and iterator type in an index.

If a transaction creates such an entity, any insertion or deletion
of a matching tuple in the index will be conflicted with it. This
works differently for inserts and deletes:

1. If a concurrent transaction inserts a new matching tuple, then
   its read_gaps list is modified like the counted transaction had
   read the exact key of the tuple and found nothing. This creates
   a conflict.

2. If a concurrent transaction deletes a matching tuple, then the
   transaction that counted the tuple is inserted into the tuple
   reader list: it pretends to have read the tuple prior to the
   deletion. This creates a conflict.

One thing to be noted is that using the cardinality config introduces
a performance regression against the current version of memtx, this is
mitigated in tarantool#10183.

Another thinig is about the memtx_tx_handle_gap_write update.

Prior to this patch, any failed lookup of a full key using EQ or REQ
iterators had been recorded using memtx_tx_track_point instead of the
memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item
with full key and EQ/REQ iterator type in the read_gaps list.

But in the fast count implementation only memtx_tx_track_gap is used
(for the sake of readability), so the memtx_tx_handle_gap_write is
modified to take it into account.

NO_DOC=TBD when fully implemented
NO_CHANGELOG=TBD when fully implemented
mkostoevr added a commit to mkostoevr/tarantool that referenced this pull request Jul 29, 2024
Prior to this the count request had been performed in the memtx in the
straightforward way: we created an iterator by a type and key and simply
called its `next` method until it's exhausted. That means the operation
had a linear complexity, which could lead to DoS situations.

This patch makes the memtx tree index use a version of BPS tree with
the cardinality information enabled and takes advantage of its offset
based API to implement the count operation using tree lookups.

Since the method does not read each counted tuple now, MVCC subsystem
would be unaware of it. In order to fix this, this patch introduces a
new entity in the memtx transaction manager to track such operations:
GAP_COUNT, and the corresponding `memtx_tx_track_count` function.

The entity (gap item) is a record that a concurrent transaction has
counted tuples matching some key and iterator type in an index.

If a transaction creates such an entity, any insertion or deletion
of a matching tuple in the index will be conflicted with it. This
works differently for inserts and deletes:

1. If a concurrent transaction inserts a new matching tuple, then
   its read_gaps list is modified like the counted transaction had
   read the exact key of the tuple and found nothing. This creates
   a conflict.

2. If a concurrent transaction deletes a matching tuple, then the
   transaction that counted the tuple is inserted into the tuple
   reader list: it pretends to have read the tuple prior to the
   deletion. This creates a conflict.

One thing to be noted is that using the cardinality config introduces
a performance regression against the current version of memtx, this is
mitigated in tarantool#10183.

Another thinig is about the memtx_tx_handle_gap_write update.

Prior to this patch, any failed lookup of a full key using EQ or REQ
iterators had been recorded using memtx_tx_track_point instead of the
memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item
with full key and EQ/REQ iterator type in the read_gaps list.

But in the fast count implementation only memtx_tx_track_gap is used
(for the sake of readability), so the memtx_tx_handle_gap_write is
modified to take it into account.

NO_DOC=TBD when fully implemented
NO_CHANGELOG=TBD when fully implemented
mkostoevr added a commit to mkostoevr/tarantool that referenced this pull request Jul 29, 2024
Prior to this the count request had been performed in the memtx in the
straightforward way: we created an iterator by a type and key and simply
called its `next` method until it's exhausted. That means the operation
had a linear complexity, which could lead to DoS situations.

This patch makes the memtx tree index use a version of BPS tree with
the cardinality information enabled and takes advantage of its offset
based API to implement the count operation using tree lookups.

Since the method does not read each counted tuple now, MVCC subsystem
would be unaware of it. In order to fix this, this patch introduces a
new entity in the memtx transaction manager to track such operations:
GAP_COUNT, and the corresponding `memtx_tx_track_count` function.

The entity (gap item) is a record that a concurrent transaction has
counted tuples matching some key and iterator type in an index.

If a transaction creates such an entity, any insertion or deletion
of a matching tuple in the index will be conflicted with it. This
works differently for inserts and deletes:

1. If a concurrent transaction inserts a new matching tuple, then
   its read_gaps list is modified like the counted transaction had
   read the exact key of the tuple and found nothing. This creates
   a conflict.

2. If a concurrent transaction deletes a matching tuple, then the
   transaction that counted the tuple is inserted into the tuple
   reader list: it pretends to have read the tuple prior to the
   deletion. This creates a conflict.

One thing to be noted is that using the cardinality config introduces
a performance regression against the current version of memtx, this is
mitigated in tarantool#10183.

Another thinig is about the memtx_tx_handle_gap_write update.

Prior to this patch, any failed lookup of a full key using EQ or REQ
iterators had been recorded using memtx_tx_track_point instead of the
memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item
with full key and EQ/REQ iterator type in the read_gaps list.

But in the fast count implementation only memtx_tx_track_gap is used
(for the sake of readability), so the memtx_tx_handle_gap_write is
modified to take it into account.

NO_DOC=TBD when fully implemented
NO_CHANGELOG=TBD when fully implemented
Currently the BPS tree allows to enable both `BPS_INNER_CHILD_CARDS`
and `BPS_INNER_CARD` configs. This possibility had been introduced
in order to test the overall slowdown of both solutions, but it's
not required in the real code. Moreover, it's not working properly
because of a typo (see tarantool#10045 for details).

Let's only allow to have one of the two configs.

Closes tarantool#10045

NO_DOC=dead code elimination
NO_TEST=dead code elimination
NO_CHANGELOG=dead code elimination
Prior to this patch there was three separated implementations of
insertion of a value in the tree: `insert`, `insert_get_iterator`
and `insert_get_offset`. This is error-prone and makes the further
development more difficult.

This patch introduces a new `bps_tree_insert_impl` function, which
deduplicates the existed insertion implementations and makes them
use it. The function is made `ALWAYS_INLINE` in order to prevent
regressions in the future.

Now the value of `offset` parameter is undefined on error. This does
not affect the only user of the function, which is the memcs engine.

Performance impact:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 961295.51 | +0.79% | 0.33% |
  | select  | RPS  | 805361.94 | 807647.13 | +0.28% | 0.32% |
  | replace | RPS  | 953764.82 | 960956.13 | +0.75% | 0.33% |
  | delete  | RPS  | 833315.10 | 836079.60 | +0.33% | 0.35% |

Part of tarantool#9982

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
Prior to this patch there was three separated implementations of
deletion of an element from the tree: `delete`, `delete_get_offset`
and `delete_value`. This is error-prone and makes the development
more difficult.

This patch introduces a new `bps_tree_delete_impl` function, which
deduplicates the existed implementations, and makes them use it. The
function is made `ALWAYS_INLINE` in order to prevent regressions.

By the way fixed the misleading comment for the `bps_tree_delete_value`
function and updated the `bps_tree_delete_get_offset` documentation: now
the value of the `offset` parameter is undefined on failure. This does
not affect the function behavior though.

Performance impact against the version prior to all refactorings:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 964738.50 | +1.15% | 0.63% |
  | select  | RPS  | 805361.94 | 807050.13 | +0.21% | 0.66% |
  | replace | RPS  | 953764.82 | 964439.47 | +1.12% | 0.58% |
  | delete  | RPS  | 833315.10 | 834984.89 | +0.20% | 0.68% |

Part of tarantool#9982

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
Prior to this patch there was two separated implementations of the
key lookup sharing almost identical code: `bps_tree_find_impl` and
the `bps_tree_find_get_offset_impl`.

The only difference between the two functions is computing the offset
to the found element.

This patch removes the `bps_tree_find_get_offset_impl` and adds its
functionality to the `bps_tree_find_impl`. The function is made
`ALWAYS_INLINE` in order to prevent regressions.

By the way changed the interface so that the value of the `offset`
argument is undefined if the element is not found. This breaks the
BPS tree offset API tests, but this function is only used in the
memcs engine now, and it does not rely on the offset if no element
found.

Performance impact against the version prior to all refactorings:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 977308.37 | +2.46% | 0.35% |
  | select  | RPS  | 805361.94 | 822407.71 | +2.12% | 0.25% |
  | replace | RPS  | 953764.82 | 977324.31 | +2.47% | 0.31% |
  | delete  | RPS  | 833315.10 | 843095.19 | +1.17% | 0.26% |

Part of tarantool#9982

NO_DOC=refactoring
NO_CHANGELOG=refactoring
Prior to this patch there were additional implementations of each bound
lookup function sharing almost identical code:
- bps_tree_lower_bound_get_offset_impl
- bps_tree_lower_bound_elem_get_offset_impl
- bps_tree_upper_bound_get_offset_impl
- bps_tree_upper_bound_elem_get_offset_impl

The only difference between the functions and their countersparts is
computing the offset to the found element.

This patch removes the `_get_offset` functions and introduces `_impl`
versions optionally including the removed functionality. The functions
are made `ALWAYS_INLINE` in order to prevent regressions.

By the way fixed the misleading comments to bound lookup functions
returning offset.

Performance impact against the version prior to all refactorings:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 957071.12 | +0.34% | 0.33% |
  | select  | RPS  | 805361.94 | 807065.98 | +0.21% | 0.31% |
  | replace | RPS  | 953764.82 | 955549.16 | +0.19% | 0.83% |
  | delete  | RPS  | 833315.10 | 833423.01 | +0.01% | 0.30% |

Closes tarantool#9982

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 13931e6 to 4915206 Compare July 29, 2024 11:57
mkostoevr added a commit to mkostoevr/tarantool that referenced this pull request Jul 29, 2024
Prior to this the count request had been performed in the memtx in the
straightforward way: we created an iterator by a type and key and simply
called its `next` method until it's exhausted. That means the operation
had a linear complexity, which could lead to DoS situations.

This patch makes the memtx tree index use a version of BPS tree with
the cardinality information enabled and takes advantage of its offset
based API to implement the count operation using tree lookups.

Since the method does not read each counted tuple now, MVCC subsystem
would be unaware of it. In order to fix this, this patch introduces a
new entity in the memtx transaction manager to track such operations:
GAP_COUNT, and the corresponding `memtx_tx_track_count` function.

The entity (gap item) is a record that a concurrent transaction has
counted tuples matching some key and iterator type in an index.

If a transaction creates such an entity, any insertion or deletion
of a matching tuple in the index will be conflicted with it. This
works differently for inserts and deletes:

1. If a concurrent transaction inserts a new matching tuple, then
   its read_gaps list is modified like the counted transaction had
   read the exact key of the tuple and found nothing. This creates
   a conflict.

2. If a concurrent transaction deletes a matching tuple, then the
   transaction that counted the tuple is inserted into the tuple
   reader list: it pretends to have read the tuple prior to the
   deletion. This creates a conflict.

One thing to be noted is that using the cardinality config introduces
a performance regression against the current version of memtx, this is
mitigated in the tarantool#10183 and in the scope of tarantool#10322.

Another thinig is about the memtx_tx_handle_gap_write update.

Prior to this patch, any failed lookup of a full key using EQ or REQ
iterators had been recorded using memtx_tx_track_point instead of the
memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item
with full key and EQ/REQ iterator type in the read_gaps list.

But in the fast count implementation only memtx_tx_track_gap is used
(for the sake of readability), so the memtx_tx_handle_gap_write is
modified to take it into account.

Part of tarantool#8204

NO_DOC=perf improvement
NO_CHANGELOG=TBD when fully implemented
mkostoevr added a commit to mkostoevr/tarantool that referenced this pull request Jul 29, 2024
Prior to this the count request had been performed in the memtx in the
straightforward way: we created an iterator by a type and key and simply
called its `next` method until it's exhausted. That means the operation
had a linear complexity, which could lead to DoS situations.

This patch makes the memtx tree index use a version of BPS tree with
the cardinality information enabled and takes advantage of its offset
based API to implement the count operation using tree lookups.

Since the method does not read each counted tuple now, MVCC subsystem
would be unaware of it. In order to fix this, this patch introduces a
new entity in the memtx transaction manager to track such operations:
GAP_COUNT, and the corresponding `memtx_tx_track_count` function.

The entity (gap item) is a record that a concurrent transaction has
counted tuples matching some key and iterator type in an index.

If a transaction creates such an entity, any insertion or deletion
of a matching tuple in the index will be conflicted with it. This
works differently for inserts and deletes:

1. If a concurrent transaction inserts a new matching tuple, then
   its read_gaps list is modified like the counted transaction had
   read the exact key of the tuple and found nothing. This creates
   a conflict.

2. If a concurrent transaction deletes a matching tuple, then the
   transaction that counted the tuple is inserted into the tuple
   reader list: it pretends to have read the tuple prior to the
   deletion. This creates a conflict.

One thing to be noted is that using the cardinality config introduces
a performance regression against the current version of memtx, this is
mitigated in the tarantool#10183 and in the scope of tarantool#10322.

Another thinig is about the memtx_tx_handle_gap_write update.

Prior to this patch, any failed lookup of a full key using EQ or REQ
iterators had been recorded using memtx_tx_track_point instead of the
memtx_tx_track_gap, so it was impossible to find a GAP_NEARBY item
with full key and EQ/REQ iterator type in the read_gaps list.

But in the fast count implementation only memtx_tx_track_gap is used
(for the sake of readability), so the memtx_tx_handle_gap_write is
modified to take it into account.

Part of tarantool#8204

NO_DOC=perf improvement
NO_CHANGELOG=TBD when fully implemented
@mkostoevr mkostoevr assigned locker and mkostoevr and unassigned mkostoevr and locker Jul 29, 2024
@mkostoevr mkostoevr added full-ci Enables all tests for a pull request and removed do not merge Not ready to be merged labels Jul 29, 2024
We use the `matras_touch` function in order to notify the matras that
a block is going to be modified. In case the block has to be copied
in order to preserve read views, it allocates a new block and returns
its address, otherwise it just returns the current address of the
given block.

The thing is, if we have this address acquired already, we don't have
to look it up from the matras again sometimes.

Let's use the new matras API to check if an allocation of a new block
might be required and only update the pointer using the `matras_touch`
if it can be changed.

Performance impact against the version prior to all commits:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 958990.66 | +0.54% | 0.47% |
  | select  | RPS  | 805361.94 | 808628.46 | +0.41% | 0.23% |
  | replace | RPS  | 953764.82 | 958759.03 | +0.52% | 0.43% |
  | delete  | RPS  | 833315.10 | 834287.96 | +0.12% | 0.23% |

Part of tarantool#10322

NO_DOC=perf improvement
NO_TEST=perf improvement
NO_CHANGELOG=perf improvement
On each insertion and deletion we go over all levels of the tree and
dereference block IDs using `bps_tree_restore_block`. The thing is,
in case of insertion, we may touch the target leaf unconditionally,
because there's no no-op flow in the `bps_tree_insert` after it:
we will either replace or insert an element into the leaf.

This patch makes the `bps_tree_collect_path` always touch the leaf
block in the `bps_tree_insert`. The function is now too complicated
for GCC 13 to inline it, so it's made `ALWAYS_INLINE` to prevent the
possible regressions.

Performance impact against the version prior to all commits:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 957184.34 | +0.35% | 0.28% |
  | select  | RPS  | 805361.94 | 808730.63 | +0.42% | 0.23% |
  | replace | RPS  | 953764.82 | 957252.78 | +0.37% | 0.26% |
  | delete  | RPS  | 833315.10 | 835194.55 | +0.23% | 0.23% |

Part of tarantool#10322

NO_DOC=perf improvement
NO_TEST=perf improvement
NO_CHANGELOG=perf improvement
In case if the insertion or deletion from the block requires the
reballancing or if the tree is used with a cardinality config, we
use the `bps_tree_touch_path` function in order to inform matras
about incoming block modifications.

The problem is that the function not only touches the leaf parents,
but also touches owners of `max_elem` data. But these owners are
included into the parents touched, so it's not required to touch
them again, we can simply use their new pointers after we have
touched all the path. This decreases the amount of matras ID
lookups in general (which is a pretty expensive operation).

Performance impact against the version prior to all commits:

  | request | stat | old       | new       | diff   | stdev |
  | ------- | ---- | --------- | --------- | ------ | ----- |
  | insert  | RPS  | 953798.71 | 978592.11 | +2.60% | 0.32% |
  | select  | RPS  | 805361.94 | 821532.64 | +2.01% | 0.23% |
  | replace | RPS  | 953764.82 | 978612.78 | +2.61% | 0.31% |
  | delete  | RPS  | 833315.10 | 842437.50 | +1.09% | 0.24% |

Part of tarantool#10322

NO_DOC=perf improvement
NO_TEST=perf improvement
NO_CHANGELOG=perf improvement
@mkostoevr mkostoevr force-pushed the m.kostoev/bps-tree-simplification branch from 4915206 to 2c93f72 Compare July 29, 2024 12:54
@mkostoevr mkostoevr assigned locker and unassigned mkostoevr Jul 29, 2024
@locker locker merged commit 731618f into tarantool:master Jul 29, 2024
@mkostoevr mkostoevr deleted the m.kostoev/bps-tree-simplification branch August 23, 2024 16:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full-ci Enables all tests for a pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tree with both inner block cardinality and children cardinalities enabled is not covered by tests #9478 follow-up: BPS tree simplification

4 participants