Added support for differential snapshots #2999
mikhail-antonov wants to merge 9 commits into facebook:master
Conversation
I think to get the most efficient filtering, we need to make sure we disable setting seqnums to 0 in the compaction iterator for the bottom level; otherwise, if the entire DB was recently compacted, we have no way to determine which keys are old. cc @ajkr
sagar0
left a comment
The approach looks good to me, as it aligns with what we have been discussing.
The PR is, however, introducing 4 new options (1 dboption + 4 read options), which goes against our long-term effort of reducing the total number of options :( . Is there any way we could reduce them?
db/db_impl.h
Outdated
include/rocksdb/options.h
Outdated
Isn't this doing the same thing as what CockroachDB does with the SST file filter? If so, why not use that approach?
We can use the callback added by CockroachDB; it's just that that PR isn't merged yet (btw, it can land now, right?). I can pick it up and rebase on top, yes.
Speaking of the reduction of options..
- We don't need iter_start_ts once the PR from the CockroachDB guys lands.
- We don't really need to have an internal_keys flag; we can say that "if a start sequence number is passed and is not zero, then return internal keys". A bit messy, but documentation would help.
- We do need the SeqNum filter, and it can't be covered by the callback mentioned above, since it works at a different level, above the individual file (table) iterators.
include/rocksdb/options.h
Outdated
Wondering if both sequence number and timestamp are needed to solve your use case? Or would having one of them be enough?
timestamp can be handled by the callback added to table_cache; this one is needed since for some cases, like universal compaction, the selectivity of a timestamp-based filter isn't enough (in the extreme case, 95% of the DB size is a single giant SST file, so we need to be able to slice and dice it to extract a small portion of it).
mikhail-antonov
left a comment
Thanks for review!
Replied to some of the comments and will update/rebase the diff.
Let me also remove timestamp-based filtering for now; once the CockroachDB diff lands we can start passing a std::function in.
@mikhail-antonov has updated the pull request.
ajkr
left a comment
The approach looks good to me; a few comments/questions, mostly on the DBIter.
db/db_iter.cc
Outdated
how do you plan for user to deserialize internal key? Traditionally we've resisted exposing the serialization format in public headers, although I don't see any other path forward now.
In my tests I'm using the following:
for (db_iter->SeekToFirst(); db_iter->Valid(); db_iter->Next()) {
  ParsedInternalKey ikey;
  ParseInternalKey(db_iter->key(), &ikey);
  // ikey.user_key.ToString(), ikey.type, ikey.sequence
}
But that's inside RocksDB; I suppose the client would need to include dbformat.h, right?
(on the other hand, I think it doesn't change the db_iter public API, since it returns a Slice, basically a pointer to a chunk in the arena with metadata?)
db/db_iter.cc
Outdated
do you mean ikey_.sequence >= start_seqnum_ ?
db/db_iter.cc
Outdated
I guess setting skipping isn't meaningful since it returns on the next line
@ajkr I'm unsure whether we need to set num_skipped = 0; before we return;. I think we don't?
db/db_iter.cc
Outdated
also not supported with blob values, right?
I don't know enough about blob values to say; I was going by the fact that the remaining logic for those 2 value types is the same. Will it not work for the kTypeBlobIndex type?
db/db_iter.cc
Outdated
should you call saved_key_.SetUserKey() here?
I don't think so; the outer
if (start_seqnum_ > 0) {
checks that the user requested a lower-bounded iterator, and then if we don't get to the branch
if (ikey_.sequence >= start_seqnum_) {
it means that this KV isn't visible at all and should just be skipped. Am I missing anything?
I think setting skipping alone isn't enough to take the fast path on L427, which is also necessary to trigger a Seek when there are too many internal key skips. It checks both for skipping and that saved_user_key_ hasn't changed yet.
Ah, ok, I think I got you - we can update the user key without making it visible to the iterator consumer, as long as we don't call return;, right?
Also, in my tests I didn't catch that effect; I didn't test how too_many_skipped_rows works with that.
@ajkr btw, in the SetInternalKey here I don't bother with pinnable slices etc., unlike what we do for user keys, e.g. see
saved_key_.SetUserKey( ikey_.user_key, !pin_thru_lifetime_ || !iter_->IsKeyPinned() /* copy */);
Do you think we need that support for pinnable slices here for internal keys as well?
mikhail-antonov
left a comment
Added replies to the comments; I also found a bug in db_iter using the new tests, going to update the PR.
@mikhail-antonov has updated the pull request.
Force-pushed 8ccd6f2 to 92238e7
@mikhail-antonov has updated the pull request.
facebook-github-bot
left a comment
@mikhail-antonov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mikhail-antonov has updated the pull request.
ajkr
left a comment
LGTM. Let's focus on the suggestions under include/ before I cut the release branch; the other stuff can be changed later if needed.
input_->Next();
} else if (compaction_ != nullptr && ikey_.type == kTypeDeletion &&
           ikey_.sequence <= earliest_snapshot_ &&
           ikeyNotNeededForIncrementalSnapshot() &&
how about preserving SingleDeletes?
For the first version, at least, I didn't plan to support it, so I was going to just say in HISTORY.md that they aren't supported (yet). If the need arises they could be added later; no API change or anything like that.
Documenting that we don't support them yet sounds good too.
// is set to true NO deletes will ever be processed.
// DEFAULT: false
// Immutable (TODO: make it dynamically changeable)
bool preserve_deletes = false;
can we omit this option by implicitly enabling it as soon as SetPreserveDeletesSequenceNumber is called?
and maybe dynamically disabling it could just be calling SetPreserveDeletesSequenceNumber(kMaxSequenceNumber). You can set preserve_deletes_seqnum_'s value in DB::Open to kMaxSequenceNumber, then it'll default to off.
I think it would open up the possibility of undesirable scenarios like this:
- DB.Open() is called.
- SetPreserveDeletesSequenceNumber(1) is called
- some data is added, some data is removed, but all deletes are preserved
- DB.Close()
- DB.Open()
Now, do we expect people to call SetPreserveDeletesSequenceNumber(1) again after DB.Open() returns? (Also, since we're going to store the actual cutoff value in RocksDB itself as a special key, we'll need to read it back first.) Does it leave a gap where some deletes could be dropped by an eager automatically scheduled compaction?
Having an option seems like a safer way to me and makes it easier to reason about the DB state, though I totally understand that I've just added a new DB option :)
Right now the way it works is that preserve_deletes_seqnum_ in the DB defaults to 0, which means that if the preserve_deletes DB option was set to true, all deletes are preserved until the user calls SetPreserveDeletesSequenceNumber(some_seqnum). That seems safer to me.
Yes, the behavior you described sounds more convenient to me too, compared to the user having to put a line immediately after DB::Open to start preserving deletes. Let's leave this new option as you have it :).
include/rocksdb/db.h
Outdated
Range(const Slice& s, const Slice& l) : start(s), limit(l) { }
};

// <user key, sequence number and entry type> tuple.
Can we move it to types.h? Our db.h is widely read and too big already, so it'd be nice to separate out things most users don't need to know about. Also, moving the EntryType enum to types.h would be beneficial, since the dependency on table properties is non-intuitive.
Since FullKey references EntryType, which is defined in table_properties.h (which includes types.h), we'd need to move EntryType to types.h as well. Would that be fine?
yes, I prefer moving EntryType into types.h regardless since now it's used for things unrelated to table properties.
include/rocksdb/db.h
Outdated
}
};

// Parse slice representing internal key to FullKey
maybe note that FullKey is only valid while the memory pointed to by internal_key is alive/unchanged.
Will do (I didn't dig into the internals of pinnable slices, though; assuming we don't care about those here?)
db/compaction_iterator.cc
Outdated
inline bool CompactionIterator::ikeyNotNeededForIncrementalSnapshot() {
  return (!compaction_->preserve_deletes()) ||
         (preserve_deletes_seqnum_ == nullptr) ||
         (ikey_.sequence < preserve_deletes_seqnum_->load());
if preserve_deletes_seqnum_ is increased after deciding to skip a tombstone, but before writing it out, it could decide to zero the tombstone's seqnum, which would then trigger this assert:
assert(ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion);
Not sure I followed.
Assuming the preserve_deletes values form a monotonically increasing sequence, let's say that preserve_deletes_seqnum_ was s1, and ikey k1 had seqnum s2 < s1 (so we decided to skip it). Then if the preserved seqnum was increased, the invariant s2 < s1 still holds, right? I.e., if before the preserve seqnum was increased the key was eligible for skipping (so we drop it), it would still be fine to drop after we bump the seqnum?
Sorry, my explanation above was unclear -- I meant skip dropping a tombstone, so actually preserving it :p. Following your example, the problem would be when:
- s2 >= s1, so we preserve a tombstone
- Then the user increases s1 such that s1 > s2
- Now PrepareOutput decides to zero the tombstone's seqnum
Oh I see. Hm yeah. Thinking about options...
Why do we need this assert here?
Well, previously writing out a tombstone with seqnum zero was considered a bug, because it couldn't possibly cover any data. This assertion has been fairly useful; it caught a bug for me within the past two weeks.
One way to fix this is to store a local copy of the current preserve_deletes_seqnum_ in a CompactionIterator instance variable when you decide whether to drop a delete. Then use that variable when you decide whether to zero its seqnum.
You could consider simplifying further and make the whole CompactionIterator use a constant value for preserve_deletes_seqnum_ that is retrieved when the iterator is constructed.
ajkr
left a comment
Do you mind also mentioning the feature in the HISTORY.md file?
db/db_impl.cc
Outdated
}

void DBImpl::SetPreserveDeletesSequenceNumber(SequenceNumber seqnum) {
  preserve_deletes_seqnum_.store(seqnum);
can we assert that it's monotonically increasing?
Actually, I'd rather check and return false if preserve_deletes_seqnum_ wasn't updated as a result of the call. Failing an assert in response to incorrect user input seems wrong; the user is responsible for checking that the call returns true, as it should if the input is sane.
Sure, let's fail the call in that case.
  allow_blob_(allow_blob),
- is_blob_(false) {
+ is_blob_(false),
+ start_seqnum_(read_options.iter_start_seqnum) {
should we assert it's no smaller than the delete-preserving seqnum?
valid_ = false;
} else {
is_blob_ = true;
if (start_seqnum_ > 0) {
Sorry, I lost track of our discussion on whether blobs are supported. Anyway, since the incremental iterator doesn't have any code like in the else if (ikey_.type == kTypeBlobIndex) block below, I think we should mark it unsupported. Maybe just return an error when ReadOptions has iter_start_seqnum > 0 while at the same time allow_blob_ = true.
you could also consider returning an error if range_del_agg_ becomes non-empty and iter_start_seqnum > 0.
@mikhail-antonov has updated the pull request.
ajkr
left a comment
Looks great, let's ship it!
BTW, the behavior where we don't drop deletes during memtable flush might not be there forever. Maybe we can think of a way to harden this in case that assumption changes (like pass delete-preserving seqnum to the CompactionIterator constructed in builder.cc).
HISTORY.md
Outdated
## Unreleased
### Public API Change
* `BackupableDBOptions::max_valid_backups_to_open == 0` now means no backups will be opened during BackupEngine initialization. Previously this condition disabled limiting backups opened.
* `DBOptions::preserve_deletes == false` is a new option that allows one to specify that DB should not drop tombstones for regular deletes if they have sequence number larger than what was set by the new API call `DB::SetPreserveDeletesSequenceNumber(SequenceNumber seqnum)`.
Sorry to be pedantic. We usually don't break lines in this file, since this text is copy/pasted into the wiki/elsewhere, where each item needs to be one long line to show up as a single bullet point.
Also, I think you're describing the behavior for preserve_deletes = true. Or maybe you want to mention that the default value is false? It might be least confusing to just say we're introducing preserve_deletes, without mentioning any value.
No problem! Good points. Updated the HISTORY.md file with that.
// always open the DB with 0 here, which means if preserve_deletes_==true
// we won't drop any deletion markers until SetPreserveDeletesSequenceNumber()
// is called by client and this seqnum is advanced.
preserve_deletes_seqnum_.store(0);
should we initialize it to the current seqnum? since in case of DB reopen, we can't know whether deletes before the current seqnum were preserved. Maybe a future feature is persist this across restarts.
That's an option to consider, but it would change some semantics. Right now we don't have a good place to keep that data persisted inside DB internal structures like the manifest, so we rely on the user to keep track of those seqnums. That's why we set it to 0 here, saying we won't process any deletes (if preserve_deletes == true) until the user calls SetPreserve...() and informs us of the current cutoff seqnum.
@mikhail-antonov has updated the pull request.
facebook-github-bot
left a comment
@mikhail-antonov is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, i.e., a snapshot of the DB's changes between two points in time (one can think of it as the diff between two sequence numbers, or as a diff D, which can be thought of as an SST file or just a set of KVs that can be applied at sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.
From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".
This is a draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
Iterator changes:
TableCache changes:
What's left: