Added support for differential snapshots #2999
mikhail-antonov wants to merge 9 commits into facebook:master
Conversation
I think to get the most efficient filtering, we need to make sure we disable setting seqnums to 0 in the compaction iterator for the bottom level; otherwise, if the entire DB was recently compacted, we have no way to determine which keys are old. cc @ajkr
sagar0
left a comment
The approach looks good to me, as it aligns with what we have been discussing.
The PR is, however, introducing 4 new options (1 dboption + 4 read options), which goes against our long-term effort of reducing the total number of options :( . Is there any way we could reduce them?
db/db_impl.h
Outdated
include/rocksdb/options.h
Outdated
Isn't this doing the same thing as what CockroachDB does with the SST file filter? If so, why not use that approach?
We can use the callback added by CockroachDB; it's just that that PR isn't merged yet (btw, it can land now, right?). I can pick it up and rebase on top, yes.
Speaking of the reduction of options..
- We don't need iter_start_ts once the PR from the CockroachDB guys lands.
- We don't really need to have an internal_keys flag; we can say that "if a start sequence number is passed and is not zero, then return internal keys". A bit messy, but documentation would help.
- We do need the SeqNum filter, and it can't be covered by the callback mentioned above, since it works at a different level, above the individual file (table) iterators.
include/rocksdb/options.h
Outdated
Wondering if both sequence number and timestamp are needed to solve your use case? Or would having one of them be enough?
timestamp can be handled by the callback added to table_cache; this one is needed since for some cases, like universal compaction, the selectivity of a timestamp-based filter isn't enough (in the extreme case, 95% of the DB size is a single giant SST file, so we need to be able to slice and dice it to extract a small portion of it).
mikhail-antonov
left a comment
Thanks for review!
Replied to some of the comments and will update/rebase the diff.
Let me also remove timestamp-based filtering for now; once the CockroachDB diff lands we can start passing a std::function in.
@mikhail-antonov has updated the pull request.
ajkr
left a comment
The approach looks good to me; a few comments/questions, mostly on the DBIter.
db/db_iter.cc
Outdated
how do you plan for user to deserialize internal key? Traditionally we've resisted exposing the serialization format in public headers, although I don't see any other path forward now.
In my tests I'm using the following:
for (db_iter->SeekToFirst(); db_iter->Valid(); db_iter->Next()) {
  ParsedInternalKey ikey;
  ParseInternalKey(db_iter->key(), &ikey);
  // ikey.user_key.ToString(), ikey.type, ikey.sequence
}
But that's inside RocksDB; I suppose the client would need to include dbformat.h, right?
(on the other hand, I think it doesn't change the db_iter public API, since it returns a Slice, basically a pointer to a chunk in the arena with metadata?)
db/db_iter.cc
Outdated
do you mean ikey_.sequence >= start_seqnum_ ?
db/db_iter.cc
Outdated
I guess setting skipping isn't meaningful since it returns on the next line
@ajkr I'm unsure whether we need to set num_skipped = 0; before we return;. I think we don't?
db/db_iter.cc
Outdated
also not supported with blob values, right?
I don't know enough about blob values to say; I was going by the fact that the remaining logic for those 2 value types is the same. Will it not work for the kTypeBlobIndex type?
db/db_iter.cc
Outdated
should you call saved_key_.SetUserKey() here?
I don't think so; the outer
if (start_seqnum_ > 0) {
checks that the user requested a lower-bounded iterator, and then if we don't get to the branch
if (ikey_.sequence >= start_seqnum_) {
it means that this KV isn't visible at all and should just be skipped. Am I missing anything?
I think setting skipping alone isn't enough to take the fast path on L427, which is also necessary to trigger a Seek when there are too many internal key skips. It checks both for skipping and that saved_user_key_ hasn't changed yet.
Ah, ok, I think I got you - we can update the user key without making it visible to the iterator consumer, as long as we don't call return;, right?
Also, in my tests I didn't catch that effect; I didn't test how too_many_skipped_rows works with that.
@ajkr btw, in the SetInternalKey here I don't bother with pinnable slices etc., unlike what we do for user keys, e.g. see
saved_key_.SetUserKey( ikey_.user_key, !pin_thru_lifetime_ || !iter_->IsKeyPinned() /* copy */);
Do you think we need that support for pinnable slices here for internal keys as well?
mikhail-antonov
left a comment
Added replies to the comments; I also found a bug in db_iter using the new tests, going to update the PR.
@mikhail-antonov has updated the pull request.
Force-pushed 8ccd6f2 to 92238e7
@mikhail-antonov has updated the pull request.
facebook-github-bot
left a comment
@mikhail-antonov has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mikhail-antonov has updated the pull request.
ajkr
left a comment
LGTM. Let's focus on the suggestions under include/ before I cut the release branch; the other stuff can be changed later if needed.
input_->Next();
} else if (compaction_ != nullptr && ikey_.type == kTypeDeletion &&
           ikey_.sequence <= earliest_snapshot_ &&
           ikeyNotNeededForIncrementalSnapshot() &&
how about preserving SingleDeletes?
For the first version, at least, I didn't plan to support it, so I was going to just say in HISTORY.md that they aren't supported (yet). If the need arises they could be added later; no API change or anything like that.
Documenting that we don't support them yet sounds good too.
// is set to true NO deletes will ever be processed.
// DEFAULT: false
// Immutable (TODO: make it dynamically changeable)
bool preserve_deletes = false;
can we omit this option by implicitly enabling it as soon as SetPreserveDeletesSequenceNumber is called?
and maybe dynamically disabling it could just be calling SetPreserveDeletesSequenceNumber(kMaxSequenceNumber). You can set preserve_deletes_seqnum_'s value in DB::Open to kMaxSequenceNumber, then it'll default to off.
I think it would open up the possibility of undesirable scenarios like this:
- DB.Open() is called.
- SetPreserveDeletesSequenceNumber(1) is called
- some data is added, some data is removed, but all deletes are preserved
- DB.Close()
- DB.Open()
Now, do we expect people to call SetPreserveDeletesSequenceNumber(1) again after DB.Open() returns? (Also, since we're going to store the actual cutoff value in RocksDB itself as a special key, we'll need to read it back first.) Does it leave a gap where some deletes could be dropped by an eager automatically scheduled compaction?
Having an option seems like a safer way to me and makes it easier to reason about the DB state, though I totally understand that I've just added a new DB option :)
Right now the way it works is that preserve_deletes_seqnum_ in the DB defaults to 0, which means that if the preserve_deletes DB option was set to true, all deletes are preserved until the user calls SetPreserveDeletesSequenceNumber(some_seqnum). That seems safer to me.
Yes, the behavior you described sounds more convenient to me too, compared to the user having to put a line immediately after DB::Open to start preserving deletes. Let's leave this new option as you have it :).
include/rocksdb/db.h
Outdated
Range(const Slice& s, const Slice& l) : start(s), limit(l) { }
};

// <user key, sequence number and entry type> tuple.
Can we move it to types.h? Our db.h is widely read and too big already, so it'd be nice to separate out things most users don't need to know about. Also, moving the EntryType enum to types.h would be beneficial, since the dependency on table properties is non-intuitive.
Since FullKey references EntryType, which is defined in table_properties.h (which includes types.h), we'd need to move EntryType to types.h as well. Would that be fine?
yes, I prefer moving EntryType into types.h regardless since now it's used for things unrelated to table properties.
include/rocksdb/db.h
Outdated
}
};

// Parse slice representing internal key to FullKey
maybe note that FullKey is only valid while the memory pointed to by internal_key is alive/unchanged.
Will do (I didn't dig into the internals of pinnable slices, though; assuming we don't care about those here?)
db/compaction_iterator.cc
Outdated
inline bool CompactionIterator::ikeyNotNeededForIncrementalSnapshot() {
  return (!compaction_->preserve_deletes()) ||
         (preserve_deletes_seqnum_ == nullptr) ||
         (ikey_.sequence < preserve_deletes_seqnum_->load());
if preserve_deletes_seqnum_ is increased after deciding to skip a tombstone, but before writing it out, it could decide to zero the tombstone's seqnum, which would then trigger this assert:
assert(ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion);
Not sure I followed.
Assuming the preserve_deletes values form a monotonically increasing sequence, let's say that preserve_deletes_seqnum_ was s1, and ikey k1 had seqnum s2 < s1 (so we decided to skip it). Then if the preserved seqnum was increased, the invariant s2 < s1 still holds, right? I.e., if before the preserve seqnum was increased the key was eligible for skipping (so we drop it), it would still be fine to drop after we bump the seqnum?
Sorry, my explanation above was unclear -- I meant skip dropping a tombstone, so actually preserving it :p. Following your example, the problem would be when:
- s2 >= s1, so we preserve a tombstone
- Then the user increases s1 such that s1 > s2
- Now PrepareOutput decides to zero the tombstone's seqnum
Oh I see. Hm yeah. Thinking about options...
Why do we need this assert here?
Well, previously writing out a tombstone with seqnum zero was considered a bug, because it couldn't possibly cover any data. This assertion has been fairly useful; it caught a bug for me within the past two weeks.
One way to fix this is to store a local copy of the current preserve_deletes_seqnum_ in a CompactionIterator instance variable when you decide whether to drop a delete. Then use that variable when you decide whether to zero its seqnum.
You could consider simplifying further and make the whole CompactionIterator use a constant value for preserve_deletes_seqnum_ that is retrieved when the iterator is constructed.
ajkr
left a comment
Do you mind also mentioning the feature in the HISTORY.md file?
db/db_impl.cc
Outdated
}

void DBImpl::SetPreserveDeletesSequenceNumber(SequenceNumber seqnum) {
  preserve_deletes_seqnum_.store(seqnum);
can we assert that it's monotonically increasing?
Actually, I'd rather check and return false if preserve_deletes_seqnum_ wasn't updated as a result of the call. Failing an assert in response to incorrect user input seems wrong; the user is responsible for checking that the call returns true, as it should if the input is sane.
Sure, let's fail the call in that case.
  allow_blob_(allow_blob),
- is_blob_(false) {
+ is_blob_(false),
+ start_seqnum_(read_options.iter_start_seqnum) {
should we assert it's no smaller than the delete-preserving seqnum?
valid_ = false;
} else {
is_blob_ = true;
if (start_seqnum_ > 0) {
Sorry, I lost track of our discussion on whether blobs are supported. Anyway, since the incremental iterator doesn't have any code like in the else if (ikey_.type == kTypeBlobIndex) block below, I think we should mark it unsupported. Maybe just return an error when ReadOptions has iter_start_seqnum > 0 while at the same time allow_blob_ = true.
you could also consider returning an error if range_del_agg_ becomes non-empty and iter_start_seqnum > 0.
@mikhail-antonov has updated the pull request.
ajkr
left a comment
Looks great, let's ship it!
BTW, the behavior where we don't drop deletes during memtable flush might not be there forever. Maybe we can think of a way to harden this in case that assumption changes (like pass delete-preserving seqnum to the CompactionIterator constructed in builder.cc).
HISTORY.md
Outdated
## Unreleased
### Public API Change
* `BackupableDBOptions::max_valid_backups_to_open == 0` now means no backups will be opened during BackupEngine initialization. Previously this condition disabled limiting backups opened.
* `DBOptions::preserve_deletes == false` is a new option that allows one to specify that DB should not drop tombstones for regular deletes if they have sequence number larger than what was set by the new API call `DB::SetPreserveDeletesSequenceNumber(SequenceNumber seqnum)`.
Sorry to be pedantic. We usually don't break lines in this file, since this text is copy/pasted into the wiki/elsewhere, where each item needs to be one long line to show up as a single bullet point.
Also, I think you're describing the behavior for preserve_deletes = true. Or maybe you want to mention that the default value is false? It might be least confusing to just say we're introducing preserve_deletes, without mentioning any value.
No problem! Good points. Updated the HISTORY.md file with that.
// always open the DB with 0 here, which means if preserve_deletes_==true
// we won't drop any deletion markers until SetPreserveDeletesSequenceNumber()
// is called by client and this seqnum is advanced.
preserve_deletes_seqnum_.store(0);
should we initialize it to the current seqnum? since in case of DB reopen, we can't know whether deletes before the current seqnum were preserved. Maybe a future feature is persist this across restarts.
That's an option to consider, but it would change some semantics. Right now we don't have a good place to keep that data persisted inside DB internal structures like the manifest, so we rely on the user to keep track of those seqnums. That's why we set it to 0 here, saying we won't process any deletes (if preserve_deletes == true) until the user calls SetPreserve...() and informs us of the current cutoff seqnum.
@mikhail-antonov has updated the pull request.
facebook-github-bot
left a comment
@mikhail-antonov is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, i.e., a snapshot of the DB's changes between two points in time (one can think of it as the diff between two sequence numbers, or as a diff D, which can be thought of as an SST file or just a set of KVs that can be applied at sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storage layers built on top of RocksDB, as it should help reduce the resources (time and network bandwidth) needed to recover and rebuild DB instances as replicas in the context of distributed storage.
From the API standpoint, that would look like a client app requesting an iterator between (start seqnum) and the current DB state, and reading the "diff".
This is a draft PR for initial review and discussion of the approach; I'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
Iterator changes:
TableCache changes:
What's left: