perf: Introduce sublevels in L0 #609

@itsbilal

Description

This issue falls under the general umbrella of compaction improvements suggested in #552.

Motivation

The LSM tree is made up of multiple levels, and SSTables within a level are not allowed to overlap one another. This invariant is relaxed for the topmost level (L0), where SSTables are allowed to overlap in key ranges (and as of #591, seqnum ranges as well). The side effect of this is that all SSTables in L0 must be merged during read time, as each SSTable contributes one unit to read amplification.

If L0 accumulates many SSTables, and those SSTables span wide key ranges, they end up overlapping many other L0 SSTables as well as many SSTables in LBase (the next non-empty level "below" L0). This results in very wide, slow compactions (as all overlapping L0 and LBase SSTables must be compacted together) and reduces the opportunity for concurrent L0 -> LBase compactions.

These situations are observed during large IMPORT operations, where L0 SSTables are created more quickly than L0 -> LBase compactions (and subsequent compactions) can drain them. This increases read amplification and latency across the board until the LSM tree returns to its more "natural" shape of having ~90% of its bytes in the bottommost level.

This would be a simple example of an LSM tree with multiple overlapping L0 SSTables.

L0  a..............r
      b..............y
              m........z
    a......f

L2    b.....g m..q r...z

L3  ...
...

Assume that newer L0 SSTables are shown above older ones.

This issue introduces a design that has already been prototyped and experimented with in #563. If L0 is organized into sublevels, where the regular level invariant holds within a sublevel (no key range overlaps), and the sublevels are ordered such that "newer" sublevels (those with higher indexes) shadow keys in older sublevels, we can reduce read amplification by using a levelIter for each sublevel. The above example could then be written as:

level sublevel  sstables
L0    2         a..............r
      1           b..............y
      0         a......f  m........z

L2                b.....g m..q r...z

L3  
...

This has reduced read amplification by 1, as a..f and m..z can occupy the same sublevel. L0's contribution to read amplification can therefore be defined as the highest height of the stack of overlapping L0 SSTables (arranged in sublevels) at any given key. This will always be less than or equal to the number of sublevels.
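That definition can be sketched in Go with a simple sweep over file boundaries. The `sstable` struct below is a made-up simplification, not Pebble's actual file metadata:

```go
package main

import (
	"fmt"
	"sort"
)

// sstable is a hypothetical, simplified stand-in for Pebble's file
// metadata: just an inclusive user-key range.
type sstable struct {
	smallest, largest string
}

// l0ReadAmp returns the height of the tallest stack of overlapping L0
// sstables at any user key, i.e. L0's contribution to read amplification
// once files are arranged into sublevels.
func l0ReadAmp(files []sstable) int {
	type event struct {
		key   string
		delta int // +1 at a file's start, -1 just after its end
	}
	var events []event
	for _, f := range files {
		events = append(events, event{f.smallest, +1})
		// Append a sentinel byte so the close sorts after opens at the
		// same key (bounds are inclusive, so files touching at a key
		// overlap there).
		events = append(events, event{f.largest + "\x00", -1})
	}
	sort.Slice(events, func(i, j int) bool {
		if events[i].key != events[j].key {
			return events[i].key < events[j].key
		}
		return events[i].delta > events[j].delta
	})
	depth, maxDepth := 0, 0
	for _, e := range events {
		depth += e.delta
		if depth > maxDepth {
			maxDepth = depth
		}
	}
	return maxDepth
}

func main() {
	// The example from above: a..r, b..y, m..z, a..f.
	files := []sstable{
		{"a", "r"}, {"b", "y"}, {"m", "z"}, {"a", "f"},
	}
	fmt.Println(l0ReadAmp(files)) // 3: a..f shares a sublevel with m..z
}
```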

However, we still have an issue with wide SSTables and compactions. If the SSTable in sublevel 1 is chosen for compaction, we inevitably end up pulling all of L0 and LBase into the same compaction. To keep compactions narrower and improve their concurrency, we introduce flush split keys. At flush time, each SSTable would be partitioned into multiple SSTables at some defined user keys known as flush split keys. More on how these flush split keys can be picked is under Implementation design decisions.

In the above case, let's assume m and r are flush split keys. This results in this LSM tree:


level sublevel  sstables
L0    2         a........l m..q rr
      1           b......l m..q r...y
      0         a......f   m..q r....z

L2                b.....g  m..q r....z

L3  
...

Such an LSM tree would have the same amount of read amplification as we saw earlier, while also allowing multiple, smaller L0 -> LBase compactions to proceed in parallel. In particular, 3 parallel L0 -> LBase compactions can proceed in this example: all the L0 and L2 SSTables in the key range a..l, all the ones in m..q, and all the ones in r..z.

This vertical rectangular shape of compactions is what we want to find, since it maximizes concurrency and effectiveness. Effectiveness (or compaction score) can be defined as the number of files reduced; therefore, the key interval that is the "tallest" would be the most effective one to compact. Consider this example:

level sublevel  sstables
L0    4                    m.o
      3                     n.q
      2         a........l m..q rr
      1           b......l m..q r...y
      0         a......f   m..q r....z

L2                b.....g  m..q r....z

L3  
...

In this case, compacting all SSTs in m..q, starting from the m..o SST in sublevel 4 and proceeding downwards, would be the most effective choice while remaining narrow, since it reduces stack depth. This compaction could single-handedly eliminate sublevels 3 and 4, assuming it's the only operation taking place.

Implementation design decisions

Sublevels can be deduced from a set of L0 SSTs by iterating through them from oldest to newest and, for each file, finding the highest sublevel already assigned to any overlapping L0 SST (if any) and assigning the file that sublevel plus one.
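A minimal sketch of that assignment, again using a hypothetical `sstable` struct rather than Pebble's real file metadata (and quadratic overlap checks rather than the interval-based approach described below):

```go
package main

import "fmt"

// sstable is a hypothetical, simplified file: an inclusive user-key
// range plus its computed sublevel.
type sstable struct {
	smallest, largest string
	sublevel          int
}

// overlaps reports whether two inclusive key ranges intersect.
func overlaps(a, b *sstable) bool {
	return a.smallest <= b.largest && b.smallest <= a.largest
}

// assignSublevels walks L0 files from oldest to newest and places each
// file one sublevel above the highest overlapping file seen so far.
func assignSublevels(filesOldestFirst []*sstable) {
	for i, f := range filesOldestFirst {
		f.sublevel = 0
		for _, older := range filesOldestFirst[:i] {
			if overlaps(f, older) && older.sublevel+1 > f.sublevel {
				f.sublevel = older.sublevel + 1
			}
		}
	}
}

func main() {
	// Oldest first: a..f, m..z, b..y, a..r (the first example above).
	files := []*sstable{
		{smallest: "a", largest: "f"},
		{smallest: "m", largest: "z"},
		{smallest: "b", largest: "y"},
		{smallest: "a", largest: "r"},
	}
	assignSublevels(files)
	for _, f := range files {
		fmt.Printf("%s..%s -> sublevel %d\n", f.smallest, f.largest, f.sublevel)
	}
}
```

This reproduces the earlier diagram: a..f and m..z land in sublevel 0, b..y in sublevel 1, and a..r in sublevel 2.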

To reduce the number of key comparisons, an ordered list of interval keys could be produced, as done in #563. These would be the union set of all SSTable boundary keys. Once this list is produced, all subsequent operations involving SSTable boundary comparisons could deal purely in index comparisons with respect to the ordered list, instead of repeated, more expensive key comparisons.
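A sketch of building that ordered key list and translating a boundary key to its index (the function names here are illustrative, not Pebble's):

```go
package main

import (
	"fmt"
	"sort"
)

// buildIntervalKeys returns the sorted, deduplicated union of all
// sstable boundary keys. Boundaries can then be compared by index into
// this slice instead of by repeated key comparison.
func buildIntervalKeys(bounds [][2]string) []string {
	seen := map[string]bool{}
	var keys []string
	for _, b := range bounds {
		for _, k := range []string{b[0], b[1]} {
			if !seen[k] {
				seen[k] = true
				keys = append(keys, k)
			}
		}
	}
	sort.Strings(keys)
	return keys
}

// intervalIndex maps a boundary key to its index in the ordered list
// with a single binary search; afterwards, comparisons are on ints.
func intervalIndex(keys []string, k string) int {
	return sort.SearchStrings(keys, k)
}

func main() {
	bounds := [][2]string{{"a", "r"}, {"b", "y"}, {"m", "z"}, {"a", "f"}}
	keys := buildIntervalKeys(bounds)
	fmt.Println(keys)                     // [a b f m r y z]
	fmt.Println(intervalIndex(keys, "m")) // 3
}
```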

Compactions can be chosen by picking seed files out of tall intervals (intervals in which many SSTables overlap). For base compactions, we'd iterate downward (towards the 0 sublevel), picking all overlapping SSTables along the way, and for intra-L0 compactions, we'd proceed in the other direction (towards the highest indexed sublevel). These compactions can optionally be expanded in the other direction (higher indexed sublevels for base compactions, lower indexed ones for intra-L0) by progressively adding SSTables in that interval until we can't add one due to concurrent compactions.
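The seed-picking step for a base compaction might be sketched as follows. This simplified version finds the candidate key with the tallest stack of non-compacting files and collects every file overlapping it; the struct and function names are made up, and the real picker would work over interval indexes and handle expansion as described above:

```go
package main

import "fmt"

// sstable is a hypothetical, simplified file with an inclusive key
// range, a sublevel, and a flag for an ongoing compaction.
type sstable struct {
	smallest, largest string
	sublevel          int
	compacting        bool
}

func overlapsKey(f sstable, k string) bool {
	return f.smallest <= k && k <= f.largest
}

// pickBaseCompaction seeds at the key with the tallest stack of
// non-compacting L0 files, then collects every file overlapping the
// seed key — the "vertical rectangle" from the seed's sublevel down.
func pickBaseCompaction(files []sstable, candidateKeys []string) []sstable {
	var seedKey string
	maxDepth := 0
	for _, k := range candidateKeys {
		depth := 0
		for _, f := range files {
			if !f.compacting && overlapsKey(f, k) {
				depth++
			}
		}
		if depth > maxDepth {
			maxDepth, seedKey = depth, k
		}
	}
	var picked []sstable
	for _, f := range files {
		if !f.compacting && overlapsKey(f, seedKey) {
			picked = append(picked, f)
		}
	}
	return picked
}

func main() {
	// The five-sublevel example above.
	files := []sstable{
		{"m", "o", 4, false}, {"n", "q", 3, false},
		{"a", "l", 2, false}, {"m", "q", 2, false}, {"r", "r", 2, false},
		{"b", "l", 1, false}, {"m", "q", 1, false}, {"r", "y", 1, false},
		{"a", "f", 0, false}, {"m", "q", 0, false}, {"r", "z", 0, false},
	}
	keys := []string{"a", "b", "f", "l", "m", "n", "o", "q", "r", "y", "z"}
	picked := pickBaseCompaction(files, keys)
	fmt.Println(len(picked)) // 5: the m..q stack, sublevels 4 down to 0
}
```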

Flush split keys could be chosen by iterating through interval keys in order, from lowest to highest, keeping track of all bytes in files observed so far (with an estimated byte count for any files that we're partially through) since the last flush split key, and publishing a split key when that byte count exceeds a specified threshold.
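That selection could be sketched like this, assuming per-interval byte estimates have already been computed (the function name, keys, and byte counts are all illustrative):

```go
package main

import "fmt"

// pickFlushSplitKeys walks interval keys in order, accumulates the
// (estimated) bytes of file data seen since the last split key, and
// emits a split key whenever the running total crosses the threshold.
// bytesInInterval[i] estimates the file bytes lying between
// intervalKeys[i-1] and intervalKeys[i].
func pickFlushSplitKeys(intervalKeys []string, bytesInInterval []int64, threshold int64) []string {
	var splits []string
	var acc int64
	for i, k := range intervalKeys {
		acc += bytesInInterval[i]
		if acc >= threshold {
			splits = append(splits, k)
			acc = 0 // reset the running count at each split key
		}
	}
	return splits
}

func main() {
	keys := []string{"b", "f", "m", "q", "r", "z"}
	bytes := []int64{4 << 20, 3 << 20, 6 << 20, 9 << 20, 2 << 20, 8 << 20}
	// With an 8 MiB threshold, splits land at m, q, and z.
	fmt.Println(pickFlushSplitKeys(keys, bytes, 8<<20)) // [m q z]
}
```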

Currently the prototype in #563 recreates the entire sublevel data structure after any addition or deletion to the L0 tables. There's room for performance improvement by leveraging the map of added/deleted files in a BulkVersionEdit to modify the sublevel data structure instead of recreating it from scratch. These updates would be required:

  1. Adding to/deleting from the set of ordered interval keys, then updating all indexes stored in the data structure to match the new ordering.

  2. Updating all booleans denoting ongoing compactions, based on any new info about ongoing compactions.

  3. Recalculating all flush split keys. It would be more efficient to add and remove flush split keys on a one-off basis instead of calculating them all from scratch; keeping existing split keys stable also means newly flushed SSTables stay aligned with older ones, resulting in fewer overlapping SSTables.

Remaining challenges

  • Integration with the rest of Pebble
  • Demonstrating a performance improvement under a practical heavy import workload
  • Manual testing
  • Unit testing
