
[WIP] db: add support for user-defined sstable partitioning#536

Open
petermattis wants to merge 1 commit into master from pmattis/table-partitioner

Conversation

@petermattis
Collaborator

Add `Options.TablePartitioner` hook which allows the user to specify a
required partition between 2 user-keys. `TablePartitioner` is called
during flush and compaction before outputting a new key to an sstable
which already contains at least 1 key.

TODO

One complication from table partitioning is that it can create L0 tables
which overlap in seqnum space. In order to support partitioned L0, we'd
have to relax the invariant checks in manifest.CheckOrdering. Doing so
will make Pebble incompatible with RocksDB.

In order for partitioning to not naively increase read amplification,
we'll want to provide some sort of partitioned view of the sstables in
Version.Files[0]. mergingIter will then need to be made aware of the
partitioning.

We may want to adjust the compaction picking heuristics to not expand
compaction inputs across the partition boundary.

See #517
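For illustration, the hook might be wired into the flush/compaction output loop along these lines. This is a sketch, not the PR's actual code; `writeOutputs` and its types are hypothetical, and only the `func(lastKey, curKey []byte) bool` hook signature is taken from this PR:

```go
package main

import "fmt"

// TablePartitioner reports whether a required partition exists between
// two user keys. Signature as in this PR.
type TablePartitioner func(lastKey, curKey []byte) bool

// writeOutputs is an illustrative sketch of how an output loop might
// consult the hook: the partitioner is only asked once the current
// sstable already contains at least one key.
func writeOutputs(keys [][]byte, partition TablePartitioner) [][][]byte {
	var tables [][][]byte
	var cur [][]byte
	for _, k := range keys {
		if len(cur) > 0 && partition(cur[len(cur)-1], k) {
			// Finish the current sstable and start a new one.
			tables = append(tables, cur)
			cur = nil
		}
		cur = append(cur, k)
	}
	if len(cur) > 0 {
		tables = append(tables, cur)
	}
	return tables
}

func main() {
	// Hypothetical partitioner: split whenever the first byte changes.
	p := func(last, cur []byte) bool { return last[0] != cur[0] }
	out := writeOutputs([][]byte{[]byte("a1"), []byte("a2"), []byte("b1")}, p)
	fmt.Println(len(out)) // 2: {a1, a2} and {b1}
}
```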


@petermattis
Collaborator Author

This was a Friday experiment to see if there were any unexpected complexities. For CRDB, we could set TablePartitioner to something like:

```go
opts.TablePartitioner = func(lastKey, curKey []byte) bool {
	return keys.IsLocal(roachpb.Key(lastKey)) != keys.IsLocal(roachpb.Key(curKey))
}
```

Turns out there is an unexpected complexity: naively partitioning L0 sstables during flushing creates sstables which overlap in their smallest/largest seqnum, violating the invariants checked by manifest.CheckOrdering. This can be fixed, but doing so will make Pebble incompatible with RocksDB.
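To make the overlap concrete, here is a toy illustration (hypothetical keys and seqnums, not Pebble code): a flush whose entries alternate between two partitions produces two L0 sstables whose [smallest, largest] seqnum ranges overlap.

```go
package main

import "fmt"

type entry struct {
	key    string
	seqnum uint64
}

// seqnumRange returns the smallest and largest seqnum over a set of entries.
func seqnumRange(entries []entry) (uint64, uint64) {
	lo, hi := entries[0].seqnum, entries[0].seqnum
	for _, e := range entries[1:] {
		if e.seqnum < lo {
			lo = e.seqnum
		}
		if e.seqnum > hi {
			hi = e.seqnum
		}
	}
	return lo, hi
}

func main() {
	// A memtable whose keys alternate between a "local" and a "global"
	// partition. A partitioner splitting on the key prefix flushes this
	// into two L0 sstables.
	local := []entry{{"local/a", 1}, {"local/b", 3}}
	global := []entry{{"global/a", 2}, {"global/b", 4}}

	lo1, hi1 := seqnumRange(local)  // [1, 3]
	lo2, hi2 := seqnumRange(global) // [2, 4]

	// The two seqnum ranges overlap, which the current invariant
	// checks reject for L0.
	fmt.Println(lo1 <= hi2 && lo2 <= hi1) // true
}
```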

In order to make this "real", additional plumbing of the partitioning would be required; hence the TODO list in the PR message above.

Cc @dt, @miretskiy

@petermattis
Collaborator Author

> Turns out there is an unexpected complexity: naively partitioning L0 sstables during flushing creates sstables which overlap in their smallest/largest seqnum, violating the invariants checked by manifest.CheckOrdering. This can be fixed, but doing so will make Pebble incompatible with RocksDB.

I took another look at RocksDB's invariants this morning. It simply checks that the L0 files are "sorted properly". The sorted properly check uses the same comparator that it uses for sorting itself (i.e. sort by largest seqnum). The result is that all RocksDB is checking for in L0 is that std::sort works and that the L0 comparator is stable. This is good news. I think we can partition L0 sstables in Pebble and still remain compatible with RocksDB.
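Roughly, the "sorted properly" check described above has this shape: L0 files are ordered newest-first by largest seqnum, and nothing requires their seqnum ranges to be disjoint. The names below are illustrative, not RocksDB's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// fileMeta holds the seqnum bounds of an L0 sstable (illustrative).
type fileMeta struct {
	smallestSeq, largestSeq uint64
}

// l0SortedProperly mirrors the check described above: L0 files must be
// ordered by largest seqnum (newest first). Overlapping seqnum ranges
// alone do not fail this check.
func l0SortedProperly(files []fileMeta) bool {
	for i := 1; i < len(files); i++ {
		if files[i-1].largestSeq < files[i].largestSeq {
			return false
		}
	}
	return true
}

func main() {
	// Two files with overlapping seqnum ranges, as a partitioned flush
	// would produce.
	files := []fileMeta{{2, 4}, {1, 3}}
	sort.Slice(files, func(i, j int) bool {
		return files[i].largestSeq > files[j].largestSeq
	})
	fmt.Println(l0SortedProperly(files)) // true: overlap alone passes the check
}
```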

@petermattis force-pushed the pmattis/table-partitioner branch from cde25a0 to 3331106 on February 26, 2020 17:47