db: blob storage / value separation #112
Description
Background
The WiscKey paper showed the value of segregating storage for large values in order to reduce write amplification. Badger is a Go KV store implementing that design. RocksDB has support for storing blobs (binary large objects) outside of the main LSM. Unfortunately, this blob-db is incompatible with the normal RocksDB API, making it unusable for us as-is. This issue outlines how I think blob storage should be added to Pebble.
Overview
The base WiscKey idea is that instead of storing large values inline in the LSM, they are stored in a separate value-log. The LSM merely contains a value-pointer. Storing large values outside of the LSM significantly reduces the size of the LSM which improves caching and reduces the overhead of compactions.
The preliminary blob support in RocksDB has outlined the basic structure for how this could work in Pebble. A new InternalKeyKindBlobIndex is added indicating that the value for a key is stored elsewhere. The value looks like:
+-------------+----------+----------+
| file number | offset   | size     |
+-------------+----------+----------+
| varint64    | varint64 | varint64 |
+-------------+----------+----------+
(Note that RocksDB also provides support for a TTL on blobs, but I don't think this is usable for CockroachDB, and therefore it is not needed in Pebble.)
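The blob index value described above can be sketched in Go. This is a hypothetical encoding (the struct and function names are mine, not Pebble's), assuming the three fields are simply concatenated varints as in the diagram:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blobIndex is a hypothetical decoded form of an InternalKeyKindBlobIndex
// value: the location and size of a blob stored outside the LSM.
type blobIndex struct {
	fileNum uint64
	offset  uint64
	size    uint64
}

// encode concatenates the three fields as varint64s, matching the
// layout in the diagram above.
func (b blobIndex) encode() []byte {
	buf := make([]byte, 0, 3*binary.MaxVarintLen64)
	buf = binary.AppendUvarint(buf, b.fileNum)
	buf = binary.AppendUvarint(buf, b.offset)
	buf = binary.AppendUvarint(buf, b.size)
	return buf
}

// decodeBlobIndex parses an encoded blob index value.
func decodeBlobIndex(buf []byte) (blobIndex, error) {
	var b blobIndex
	var n int
	if b.fileNum, n = binary.Uvarint(buf); n <= 0 {
		return b, fmt.Errorf("corrupt blob index: file number")
	}
	buf = buf[n:]
	if b.offset, n = binary.Uvarint(buf); n <= 0 {
		return b, fmt.Errorf("corrupt blob index: offset")
	}
	buf = buf[n:]
	if b.size, n = binary.Uvarint(buf); n <= 0 {
		return b, fmt.Errorf("corrupt blob index: size")
	}
	return b, nil
}

func main() {
	v := blobIndex{fileNum: 7, offset: 1000, size: 4096}
	dec, err := decodeBlobIndex(v.encode())
	fmt.Println(dec, err)
}
```

The varint encoding keeps the LSM-resident pointer small (a few bytes for typical file numbers and offsets) regardless of how large the blob itself is.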
Blobs are stored in a series of blob logs. A background process periodically walks over the blob logs, reading the entries and checking to see if the value is still "live" in the LSM. This process is called value GC. If the value is still "live", it is written to the active blob log, and the LSM is updated.
Critique
An LSM does not allow mutating existing entries. In order to change the value for a key, a new value needs to be written which masks the old value. In Pebble/RocksDB/Badger, each version of a key/value pair has an associated sequence number (timestamp in Badger). RocksDB/Pebble take advantage of these sequence numbers to provide immutable snapshots. But the immutability of existing values presents a problem for blob storage. How do we update the LSM when a value is moved from one blob log to another? AFAICT, RocksDB's blob-db punts on this issue and just updates the current version. Badger at least attempts to deal with the problem:
// IMPORTANT: We should never write an entry with an older timestamp
// for the same key, We need to maintain this invariant to search for
// the latest value of a key, or else we need to search in all tables
// and find the max version among them. To maintain this invariant,
// we also need to ensure that all versions of a key are always
// present in the same table from level 1, because compaction can push
// any table down.
//
// Update (Sep 22, 2018): To maintain the above invariant, and to
// allow keys to be moved from one value log to another (while
// reclaiming space during value log GC), we have logically moved this
// need to write "old versions after new versions" to the badgerMove
// keyspace. Thus, for normal gets, we can stop going down the LSM
// tree once we find any version of the key (note however that we will
// ALWAYS skip versions with ts greater than the key version).
// However, if that key has been moved, then for the corresponding
// movekey, we'll look through all the levels of the tree to ensure
// that we pick the highest version of the movekey present.
The way this logical move is implemented in Badger is to rewrite the key as !badger!move<key> (the literal string !badger!move is prepended to the key). I am not fond of this design, as it requires special logic on the read side in order to find the location of a value.
The Blob Level
Blob pointers uniquely identify a blob. Blob pointers need to be immutable after they are created: because of LSM immutability, we can't update a blob pointer in place without violating a core tenet of an LSM. A key observation: we don't have to update the blob pointer when a blob moves; we just need to be able to find the blob given the original blob pointer, even if the blob's actual location changes. A blob pointer is composed of <file-num,offset>. File numbers are monotonically increasing and never reused, so we don't have to worry about a blob pointer ever being reused. We just need to provide an index that allows locating a blob by its blob pointer. In an LSM, the readily available indexing structure is an sstable.
Putting this together, blobs are stored in a series of logs and sstables. When first created, there is a single log that is being appended to (blog is short for blob-log):
000007.blog
As blobs get written to this log, they are given blob pointers with file-num=7 and the offset where they are written in the log. As the log fills up, additional ones are added:
000007.blog 000010.blog 000015.blog
A background process periodically loops over these files and rewrites them, only retaining blobs that are still "live" in the LSM. The notable difference from Badger/blob-db is that rewritten blobs are stored in an sstable. For example, if 000007.blog contains 2 "live" blobs with blob pointers <7,0> and <7,1000>, GC will rewrite these blobs into an sstable where the keys in the sstable are <7,0> and <7,1000>, and the values are the blob data. It is likely that multiple adjacent logs / sstables would be processed into a single rewritten sstable. In this example, 000007.blog and 000010.blog could be rewritten into 000031.sst:
000031.sst 000015.blog
This sstable+log structure can be viewed as a single level in an LSM. Just as for a normal level, the metadata for this blob level is []fileMetadata which would be persisted to the MANIFEST along with all other LSM metadata. Locating a blob would involve a binary search for the file containing the blob pointer, just as retrieving a key in the main LSM involves a binary search for the sstable containing the key. If the file containing the blob is a log, we can seek to the desired offset to read the blob. Otherwise we retrieve the blob from the sstable using a normal sstable lookup.
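The lookup described above can be sketched as a binary search over the blob level's sorted file metadata. This is a simplified sketch: blobFileMeta and findBlobFile are hypothetical names, and each file is represented only by the smallest blob pointer it can contain (for a raw .blog that is <fileNum,0>; for a GC'd sstable, its smallest key):

```go
package main

import (
	"fmt"
	"sort"
)

// blobFileMeta is a simplified stand-in for the per-file metadata of
// the blob level. Files are sorted by their smallest blob pointer.
type blobFileMeta struct {
	smallestFileNum, smallestOffset uint64
	isSSTable                       bool // sstable lookup vs. seek-in-log
	name                            string
}

// findBlobFile binary searches the sorted blob-level metadata for the
// file that contains the blob pointer <fileNum, offset>, mirroring how
// the main LSM locates the sstable containing a key.
func findBlobFile(files []blobFileMeta, fileNum, offset uint64) *blobFileMeta {
	// Find the first file whose smallest pointer is strictly greater
	// than the target, then step back one file.
	i := sort.Search(len(files), func(i int) bool {
		f := files[i]
		return f.smallestFileNum > fileNum ||
			(f.smallestFileNum == fileNum && f.smallestOffset > offset)
	})
	if i == 0 {
		return nil // pointer predates every file in the level
	}
	return &files[i-1]
}

func main() {
	// The blob level from the example: 000031.sst is the GC'd sstable
	// covering blobs from logs 7 and 10; 000015.blog is still a raw log.
	files := []blobFileMeta{
		{7, 0, true, "000031.sst"},
		{15, 0, false, "000015.blog"},
	}
	fmt.Println(findBlobFile(files, 7, 1000).name)
	fmt.Println(findBlobFile(files, 15, 500).name)
}
```

Whether the located file is a log or an sstable only changes the final read step: a seek to the stored offset in the former case, a normal sstable lookup in the latter.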
Triggering GC
When should GC be triggered? Simply walking through the files in the blob level in order is one approach, but we can do better. We should GC a blob file when it has a significant fraction of "dead" data. We can keep an exact count of this dead data by tracking when blob pointers are dropped during compactions (because a tombstone covering the blob has reached the bottommost level). A compaction would keep a []blobPointer slice that it populates with dropped blob pointers. When the compaction is complete, the dropped blob pointers are sorted and the sorted blob pointers are walked in parallel with the in-memory blob file metadata, updating a fileMetadata.deadBytes field. Recall that this blob file metadata will be persisted to the MANIFEST as well, so the deadBytes field will be precise.
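The dead-byte accounting could look roughly like the following sketch. All names are hypothetical, and for simplicity each metadata entry is assumed to cover a single original log (a GC'd sstable covering several logs would need a range check instead of an equality check):

```go
package main

import (
	"fmt"
	"sort"
)

// blobPointer identifies a blob dropped during a compaction; size is
// the number of blob-file bytes the blob occupied.
type blobPointer struct {
	fileNum, offset, size uint64
}

// blobFileMeta tracks, per blob file, how many of its bytes are dead.
// In the real design this would be persisted to the MANIFEST, keeping
// deadBytes precise across restarts.
type blobFileMeta struct {
	fileNum   uint64
	deadBytes uint64
}

// applyDroppedPointers sorts the blob pointers dropped by a completed
// compaction and walks them in parallel with the (fileNum-sorted)
// blob-level metadata, charging each pointer's size to its file.
func applyDroppedPointers(files []blobFileMeta, dropped []blobPointer) {
	sort.Slice(dropped, func(i, j int) bool {
		if dropped[i].fileNum != dropped[j].fileNum {
			return dropped[i].fileNum < dropped[j].fileNum
		}
		return dropped[i].offset < dropped[j].offset
	})
	i := 0
	for _, p := range dropped {
		for i < len(files) && files[i].fileNum < p.fileNum {
			i++
		}
		if i < len(files) && files[i].fileNum == p.fileNum {
			files[i].deadBytes += p.size
		}
	}
}

func main() {
	files := []blobFileMeta{{fileNum: 7}, {fileNum: 10}}
	dropped := []blobPointer{
		{10, 0, 512}, {7, 1000, 2048}, {7, 0, 1024},
	}
	applyDroppedPointers(files, dropped)
	fmt.Println(files[0].deadBytes, files[1].deadBytes)
}
```

Because both slices are sorted, the merge is a single linear pass regardless of how many pointers a compaction drops.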
A background process would periodically wake up and examine the blob file metadata, looking for contiguous runs of files that will "compact" down to a much smaller size. Note that we'll be able to accurately predict both the space savings and the size of the new file.
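One possible trigger heuristic, sketched under the assumption that a run qualifies when every file in it exceeds a dead-fraction threshold (the names and the specific policy are hypothetical):

```go
package main

import "fmt"

// blobFileMeta carries the per-file sizes needed by the GC trigger.
type blobFileMeta struct {
	name       string
	totalBytes uint64
	deadBytes  uint64
}

// pickGCRun scans the blob level for the longest contiguous run of
// files whose dead fraction is at or above threshold. Because
// deadBytes is exact, both the reclaimed space (dead) and the size of
// the rewritten file (total - dead) are known before any work is done.
func pickGCRun(files []blobFileMeta, threshold float64) (start, end int, ok bool) {
	for i := 0; i < len(files); {
		if float64(files[i].deadBytes)/float64(files[i].totalBytes) < threshold {
			i++
			continue
		}
		j := i
		for j < len(files) && float64(files[j].deadBytes)/float64(files[j].totalBytes) >= threshold {
			j++
		}
		if j-i > end-start {
			start, end, ok = i, j, true
		}
		i = j
	}
	return start, end, ok
}

func main() {
	files := []blobFileMeta{
		{"000031.sst", 100, 10},  // mostly live after a previous GC
		{"000015.blog", 100, 80}, // mostly dead
		{"000020.blog", 100, 70}, // mostly dead
	}
	if start, end, ok := pickGCRun(files, 0.5); ok {
		var dead, total uint64
		for _, f := range files[start:end] {
			dead += f.deadBytes
			total += f.totalBytes
		}
		fmt.Printf("rewrite files[%d:%d]: reclaim %d bytes, new file ~%d bytes\n",
			start, end, dead, total-dead)
	}
}
```

The predicted output size also tells the scheduler how much work the rewrite costs, so runs can be prioritized by bytes reclaimed per byte rewritten.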
Jira issue: PEBBLE-183