db: blob storage / value separation #112
Description
Background
The WiscKey paper showed the value of segregating storage for large values in order to reduce write amplification. Badger is a Go KV store implementing that design. RocksDB has support for storing blobs (binary large objects) outside of the main LSM. Unfortunately, this blob-db is incompatible with the normal RocksDB API, making it unusable for us as-is. This issue outlines how I think blob storage should be added to Pebble.
Overview
The base WiscKey idea is that instead of storing large values inline in the LSM, they are stored in a separate value-log. The LSM merely contains a value-pointer. Storing large values outside of the LSM significantly reduces the size of the LSM which improves caching and reduces the overhead of compactions.
The preliminary blob support in RocksDB has outlined the basic structure for how this could work in Pebble. A new InternalKeyKindBlobIndex is added indicating that the value for a key is stored elsewhere. The value looks like:
+-------------+----------+----------+
| file number | offset   | size     |
+-------------+----------+----------+
| varint64    | varint64 | varint64 |
+-------------+----------+----------+
(Note that RocksDB also provides support for a TTL on blobs, but I don't think this is usable for CockroachDB, and therefore it is not needed in Pebble.)
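The blob index value described above can be sketched in Go. This is a hypothetical encoding (the struct and function names are mine, not Pebble's), assuming the three fields are simply concatenated varints as in the diagram:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blobIndex is a hypothetical decoded form of an InternalKeyKindBlobIndex
// value: the location and size of a blob stored outside the LSM.
type blobIndex struct {
	fileNum uint64
	offset  uint64
	size    uint64
}

// encode concatenates the three fields as varint64s, matching the
// layout in the diagram above.
func (b blobIndex) encode() []byte {
	buf := make([]byte, 0, 3*binary.MaxVarintLen64)
	buf = binary.AppendUvarint(buf, b.fileNum)
	buf = binary.AppendUvarint(buf, b.offset)
	buf = binary.AppendUvarint(buf, b.size)
	return buf
}

// decodeBlobIndex parses an encoded blob index value.
func decodeBlobIndex(buf []byte) (blobIndex, error) {
	var b blobIndex
	var n int
	if b.fileNum, n = binary.Uvarint(buf); n <= 0 {
		return b, fmt.Errorf("corrupt blob index: file number")
	}
	buf = buf[n:]
	if b.offset, n = binary.Uvarint(buf); n <= 0 {
		return b, fmt.Errorf("corrupt blob index: offset")
	}
	buf = buf[n:]
	if b.size, n = binary.Uvarint(buf); n <= 0 {
		return b, fmt.Errorf("corrupt blob index: size")
	}
	return b, nil
}

func main() {
	v := blobIndex{fileNum: 7, offset: 1000, size: 4096}
	dec, err := decodeBlobIndex(v.encode())
	fmt.Println(dec, err)
}
```

The varint encoding keeps the LSM-resident pointer small (a few bytes for typical file numbers and offsets) regardless of how large the blob itself is.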
Blobs are stored in a series of blob logs. A background process periodically walks over the blob logs, reading the entries and checking to see if the value is still "live" in the LSM. This process is called value GC. If the value is still "live", it is written to the active blob log, and the LSM is updated.
Critique
An LSM does not allow mutating existing entries. In order to change the value for a key, a new value needs to be written which masks the old value. In Pebble/RocksDB/Badger, each version of a key/value pair has an associated sequence number (timestamp in Badger). RocksDB/Pebble take advantage of these sequence numbers to provide immutable snapshots. But the immutability of existing values presents a problem for blob storage. How do we update the LSM when a value is moved from one blob log to another? AFAICT, RocksDB's blob-db punts on this issue and just updates the current version. Badger at least attempts to deal with the problem:
// IMPORTANT: We should never write an entry with an older timestamp
// for the same key, We need to maintain this invariant to search for
// the latest value of a key, or else we need to search in all tables
// and find the max version among them. To maintain this invariant,
// we also need to ensure that all versions of a key are always
// present in the same table from level 1, because compaction can push
// any table down.
//
// Update (Sep 22, 2018): To maintain the above invariant, and to
// allow keys to be moved from one value log to another (while
// reclaiming space during value log GC), we have logically moved this
// need to write "old versions after new versions" to the badgerMove
// keyspace. Thus, for normal gets, we can stop going down the LSM
// tree once we find any version of the key (note however that we will
// ALWAYS skip versions with ts greater than the key version).
// However, if that key has been moved, then for the corresponding
// movekey, we'll look through all the levels of the tree to ensure
// that we pick the highest version of the movekey present.
The way this logical move is implemented in Badger is to rewrite the key as !badger!move<key> (the literal string !badger!move is prepended to the key). I am not fond of this design, as it requires special logic on the read side in order to find the location of a value.
The Blob Level
Blob pointers uniquely identify a blob. Blob pointers need to be immutable after they are created: because of LSM immutability, we can't update a blob pointer in place without violating a core tenet of an LSM. A key observation: we don't have to update the blob pointer when a blob moves; we just need to be able to find the blob given the original blob pointer, even if the blob's actual location changes. A blob pointer is composed of <file-num,offset>. File numbers are monotonically increasing and never reused, so we don't have to worry about a blob pointer ever being reused. We just need to provide an index that allows locating a blob by its blob pointer. In an LSM, the readily available indexing structure is an sstable.
Putting this together, blobs are stored in a series of logs and sstables. When first created, there is a single log that is being appended to (blog is short for blob-log):
000007.blog
As blobs get written to this log, they are given blob pointers with file-num=7 and the offset where they are written in the log. As the log fills up, additional ones are added:
000007.blog 000010.blog 000015.blog
A background process periodically loops over these files and rewrites them, only retaining blobs that are still "live" in the LSM. The notable difference from Badger/blob-db is that rewritten blobs are stored in an sstable. For example, if 000007.blog contains 2 "live" blobs with blob pointers <7,0> and <7,1000>, GC will rewrite these blobs into an sstable where the keys in the sstable are <7,0> and <7,1000>, and the values are the blob data. It is likely that multiple adjacent logs / sstables would be processed into a single rewritten sstable. In this example, 000007.blog and 000010.blog could be rewritten into 000031.sst:
000031.sst 000015.blog
This sstable+log structure can be viewed as a single level in an LSM. Just as for a normal level, the metadata for this blob level is []fileMetadata which would be persisted to the MANIFEST along with all other LSM metadata. Locating a blob would involve a binary search for the file containing the blob pointer, just as retrieving a key in the main LSM involves a binary search for the sstable containing the key. If the file containing the blob is a log, we can seek to the desired offset to read the blob. Otherwise we retrieve the blob from the sstable using a normal sstable lookup.
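The lookup described above can be sketched as a binary search over the blob level's sorted file metadata. This is a simplified sketch: blobFileMeta and findBlobFile are hypothetical names, and each file is represented only by the smallest blob pointer it can contain (for a raw .blog that is <fileNum,0>; for a GC'd sstable, its smallest key):

```go
package main

import (
	"fmt"
	"sort"
)

// blobFileMeta is a simplified stand-in for the per-file metadata of
// the blob level. Files are sorted by their smallest blob pointer.
type blobFileMeta struct {
	smallestFileNum, smallestOffset uint64
	isSSTable                       bool // sstable lookup vs. seek-in-log
	name                            string
}

// findBlobFile binary searches the sorted blob-level metadata for the
// file that contains the blob pointer <fileNum, offset>, mirroring how
// the main LSM locates the sstable containing a key.
func findBlobFile(files []blobFileMeta, fileNum, offset uint64) *blobFileMeta {
	// Find the first file whose smallest pointer is strictly greater
	// than the target, then step back one file.
	i := sort.Search(len(files), func(i int) bool {
		f := files[i]
		return f.smallestFileNum > fileNum ||
			(f.smallestFileNum == fileNum && f.smallestOffset > offset)
	})
	if i == 0 {
		return nil // pointer predates every file in the level
	}
	return &files[i-1]
}

func main() {
	// The blob level from the example: 000031.sst is the GC'd sstable
	// covering blobs from logs 7 and 10; 000015.blog is still a raw log.
	files := []blobFileMeta{
		{7, 0, true, "000031.sst"},
		{15, 0, false, "000015.blog"},
	}
	fmt.Println(findBlobFile(files, 7, 1000).name)
	fmt.Println(findBlobFile(files, 15, 500).name)
}
```

Whether the located file is a log or an sstable only changes the final read step: a seek to the stored offset in the former case, a normal sstable lookup in the latter.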
Triggering GC
When should GC be triggered? Simply walking through the files in the blob level in order is one approach, but we can do better. We should GC a blob file when it has a significant fraction of "dead" data. We can keep an exact count of this dead data by tracking when blob pointers are dropped during compactions (because a tombstone covering the blob has reached the bottommost level). A compaction would keep a []blobPointer slice that it populates with dropped blob pointers. When the compaction is complete, the dropped blob pointers are sorted and the sorted blob pointers are walked in parallel with the in-memory blob file metadata, updating a fileMetadata.deadBytes field. Recall that this blob file metadata will be persisted to the MANIFEST as well, so the deadBytes field will be precise.
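The dead-byte accounting could look roughly like the following sketch. All names are hypothetical, and for simplicity each metadata entry is assumed to cover a single original log (a GC'd sstable covering several logs would need a range check instead of an equality check):

```go
package main

import (
	"fmt"
	"sort"
)

// blobPointer identifies a blob dropped during a compaction; size is
// the number of blob-file bytes the blob occupied.
type blobPointer struct {
	fileNum, offset, size uint64
}

// blobFileMeta tracks, per blob file, how many of its bytes are dead.
// In the real design this would be persisted to the MANIFEST, keeping
// deadBytes precise across restarts.
type blobFileMeta struct {
	fileNum   uint64
	deadBytes uint64
}

// applyDroppedPointers sorts the blob pointers dropped by a completed
// compaction and walks them in parallel with the (fileNum-sorted)
// blob-level metadata, charging each pointer's size to its file.
func applyDroppedPointers(files []blobFileMeta, dropped []blobPointer) {
	sort.Slice(dropped, func(i, j int) bool {
		if dropped[i].fileNum != dropped[j].fileNum {
			return dropped[i].fileNum < dropped[j].fileNum
		}
		return dropped[i].offset < dropped[j].offset
	})
	i := 0
	for _, p := range dropped {
		for i < len(files) && files[i].fileNum < p.fileNum {
			i++
		}
		if i < len(files) && files[i].fileNum == p.fileNum {
			files[i].deadBytes += p.size
		}
	}
}

func main() {
	files := []blobFileMeta{{fileNum: 7}, {fileNum: 10}}
	dropped := []blobPointer{
		{10, 0, 512}, {7, 1000, 2048}, {7, 0, 1024},
	}
	applyDroppedPointers(files, dropped)
	fmt.Println(files[0].deadBytes, files[1].deadBytes)
}
```

Because both slices are sorted, the merge is a single linear pass regardless of how many pointers a compaction drops.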
A background process would periodically wake up and examine the blob file metadata, looking for contiguous runs of files that will "compact" down to a much smaller size. Note that we'll be able to accurately predict both the space savings and the size of the new file.
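One possible trigger heuristic, sketched under the assumption that a run qualifies when every file in it exceeds a dead-fraction threshold (the names and the specific policy are hypothetical):

```go
package main

import "fmt"

// blobFileMeta carries the per-file sizes needed by the GC trigger.
type blobFileMeta struct {
	name       string
	totalBytes uint64
	deadBytes  uint64
}

// pickGCRun scans the blob level for the longest contiguous run of
// files whose dead fraction is at or above threshold. Because
// deadBytes is exact, both the reclaimed space (dead) and the size of
// the rewritten file (total - dead) are known before any work is done.
func pickGCRun(files []blobFileMeta, threshold float64) (start, end int, ok bool) {
	for i := 0; i < len(files); {
		if float64(files[i].deadBytes)/float64(files[i].totalBytes) < threshold {
			i++
			continue
		}
		j := i
		for j < len(files) && float64(files[j].deadBytes)/float64(files[j].totalBytes) >= threshold {
			j++
		}
		if j-i > end-start {
			start, end, ok = i, j, true
		}
		i = j
	}
	return start, end, ok
}

func main() {
	files := []blobFileMeta{
		{"000031.sst", 100, 10},  // mostly live after a previous GC
		{"000015.blog", 100, 80}, // mostly dead
		{"000020.blog", 100, 70}, // mostly dead
	}
	if start, end, ok := pickGCRun(files, 0.5); ok {
		var dead, total uint64
		for _, f := range files[start:end] {
			dead += f.deadBytes
			total += f.totalBytes
		}
		fmt.Printf("rewrite files[%d:%d]: reclaim %d bytes, new file ~%d bytes\n",
			start, end, dead, total-dead)
	}
}
```

The predicted output size also tells the scheduler how much work the rewrite costs, so runs can be prioritized by bytes reclaimed per byte rewritten.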
Jira issue: PEBBLE-183