[WIP] db: add support for user-defined sstable partitioning #536
petermattis wants to merge 1 commit into master
This was a Friday experiment to see if there were any unexpected complexities. For CRDB, we could set

Turns out there is an unexpected complexity: naively partitioning L0 sstables during flushing creates sstables which overlap in their smallest/largest seqnum, violating the invariants checked by

In order to make this "real", additional plumbing of partitioning would be required. See the PR message above.

Cc @dt, @miretskiy
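To make the seqnum overlap concrete, here is a toy sketch of a partitioned flush. This is not Pebble code: the `entry` and `table` types and the `flush`, `seqOverlap`, and `byFirstByte` helpers are all made up for illustration. The memtable is sorted by user key, but writes to the two key ranges were interleaved in time, so splitting at the partition boundary yields two L0 tables whose seqnum ranges overlap.

```go
package main

import (
	"bytes"
	"fmt"
)

// entry is a hypothetical stand-in for an internal key: user key plus seqnum.
type entry struct {
	key    string
	seqNum uint64
}

// table records only the bounds that matter for an L0 ordering check.
type table struct {
	smallestKey, largestKey string
	smallestSeq, largestSeq uint64
}

// flush writes entries (already sorted by user key) into tables, starting a
// new table whenever partition(prev, next) reports true.
func flush(entries []entry, partition func(prev, next []byte) bool) []table {
	var out []table
	for i, e := range entries {
		if i == 0 || partition([]byte(entries[i-1].key), []byte(e.key)) {
			out = append(out, table{
				smallestKey: e.key, largestKey: e.key,
				smallestSeq: e.seqNum, largestSeq: e.seqNum,
			})
			continue
		}
		t := &out[len(out)-1]
		t.largestKey = e.key
		if e.seqNum < t.smallestSeq {
			t.smallestSeq = e.seqNum
		}
		if e.seqNum > t.largestSeq {
			t.largestSeq = e.seqNum
		}
	}
	return out
}

// seqOverlap reports whether two tables overlap in seqnum space.
func seqOverlap(a, b table) bool {
	return a.smallestSeq <= b.largestSeq && b.smallestSeq <= a.largestSeq
}

// byFirstByte requires a partition whenever the first byte of the key changes.
// It assumes non-empty keys.
func byFirstByte(prev, next []byte) bool {
	return !bytes.Equal(prev[:1], next[:1])
}

func main() {
	// Writes to the "a" and "b" ranges were interleaved: seqnums 1,3 and 2,4.
	mem := []entry{{"a1", 1}, {"a2", 3}, {"b1", 2}, {"b2", 4}}
	tables := flush(mem, byFirstByte)
	for _, t := range tables {
		fmt.Printf("[%s,%s] seqnums [%d,%d]\n",
			t.smallestKey, t.largestKey, t.smallestSeq, t.largestSeq)
	}
	fmt.Println("seqnum overlap:", seqOverlap(tables[0], tables[1]))
}
```

Without the partitioner, the same flush would produce a single table covering seqnums 1 through 4, so no two tables from one flush could overlap in seqnum space.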
I took another look at RocksDB's invariants this morning. It simply checks that the L0 files are "sorted properly". The sorted properly check uses the same comparator that it uses for sorting itself (i.e. sort by largest seqnum). The result is that all RocksDB is checking for in L0 is that
Add an `Options.TablePartitioner` hook which allows the user to specify a required partition between 2 user keys. `TablePartitioner` is called during flush and compaction before outputting a new key to an sstable which already contains at least 1 key.

TODO

One complication from table partitioning is that it can create L0 tables which overlap in seqnum space. In order to support partitioned L0, we'd have to relax the invariant checks in `manifest.CheckOrdering`. Doing so will make Pebble incompatible with RocksDB.

In order for partitioning to not naively increase read amplification, we'll want to provide some sort of partitioned view of the sstables in `Version.Files[0]`. `mergingIter` will then need to be made aware of the partitioning.

We may want to adjust the compaction picking heuristics to not expand compaction inputs across the partition boundary.

See #517
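A rough sketch of how the hook might be used. The function signature, the stripped-down `Options` struct, and the `writeTables`, `prefix`, and `prefixPartitioner` helpers are assumptions for illustration only; the description above does not spell out the actual signature. The sketch only shows the stated contract: the partitioner is consulted before adding a key to an output table that already holds at least one key.

```go
package main

import (
	"bytes"
	"fmt"
)

// Options mirrors only the field discussed here. The signature is an
// assumption, not taken from the PR.
type Options struct {
	// TablePartitioner returns true if lastKey and nextKey must not be
	// placed in the same sstable.
	TablePartitioner func(lastKey, nextKey []byte) bool
}

// prefix returns the bytes before the first '/', or the whole key.
func prefix(k []byte) []byte {
	if i := bytes.IndexByte(k, '/'); i >= 0 {
		return k[:i]
	}
	return k
}

// prefixPartitioner requires a partition whenever the key prefix changes.
func prefixPartitioner(lastKey, nextKey []byte) bool {
	return !bytes.Equal(prefix(lastKey), prefix(nextKey))
}

// writeTables groups sorted user keys into output tables. The partitioner is
// only consulted when the current table already contains at least one key.
func writeTables(opts Options, keys [][]byte) [][]string {
	var tables [][]string
	var cur []string
	var last []byte
	for _, k := range keys {
		if len(cur) > 0 && opts.TablePartitioner != nil &&
			opts.TablePartitioner(last, k) {
			tables = append(tables, cur)
			cur = nil
		}
		cur = append(cur, string(k))
		last = k
	}
	if len(cur) > 0 {
		tables = append(tables, cur)
	}
	return tables
}

func main() {
	opts := Options{TablePartitioner: prefixPartitioner}
	keys := [][]byte{
		[]byte("t1/a"), []byte("t1/b"),
		[]byte("t2/a"),
		[]byte("t3/a"), []byte("t3/b"),
	}
	for i, t := range writeTables(opts, keys) {
		fmt.Printf("sstable %d: %v\n", i, t)
	}
}
```

A real implementation would also have to honor the usual size-based table splitting; the sketch leaves that out to keep the partitioning decision visible.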