sstable: support for replacing key suffixes during reads#1233
sstable: support for replacing key suffixes during reads#1233dt wants to merge 4 commits intocockroachdb:masterfrom
Conversation
sstable/reader.go
Outdated
| initialReadaheadSize = 64 << 10 /* 64KB */ | ||
| maxReadaheadSize = 256 << 10 /* 256KB */ | ||
|
|
||
| UserPropertyCockroachMVCCTimestampSuffix = "crdb.ts.synthesize" |
There was a problem hiding this comment.
One thing I've been debating is if this should be specific to Cockroach MVCC timestamps, as it is here, where the name is "crdb.ts." and the replaced value is a hard-coded representation of a placeholder cockroach mvcc timestamp, or if I should instead try to genericize it: add two props e.g. "pebble.suffix-replace.{from,to}" and then enable rewriting -- to to -- if from is set, and keep mentions of "timestamp" out?
There was a problem hiding this comment.
went ahead and did this a a generic "suffix from" to "suffix to" mapping, with no mention of mvcc or timestamps.
5a132f4 to
156f4c3
Compare
eee6f8f to
1a1a919
Compare
In CockroachDB we have processes that build and then ingest SSTables. These SSTables have timestamp-suffixed keys, and often take a non-trivial amount of time to construct and then transport to where they will be added. If these processes what the timestamps encoded on those keys to be the time _at which the file was actually added_ that can pose a significant challenge. You _could_, while adding, iterate the entire SST to re-encode every key with a new, addition-time timestamp suffix, but that too could take O(n) time for a larger SSTable. Instead, this adds the ability to simply make an O(1) size edit to the props, to note a suffix to replace with some other value. Using this, such a process can construct all of its keys with placeholder, then, when it actually knows the real addition time, simply map the placeholder to that value with these props. At read-time, specifically when a block is loaded, after it is decompressed but before it is cached, if the file specifies this replacement every key in the block is edited to replace the mapped suffix. The edited suffix of a key must not be shared with a prior key, and only the data and index blocks are edited, so filters if present must be only on the unedited prefix of keys (this is true for CRDB as we already do not include the MVCC timestamp suffix in filters).
If transform returns the same slice it was passed, which was already in the cache, we can leave it there rather than allocating a new slice from the cache, copying the result into it and freeing the old one.
Ideally this would take some File interface that included WriteAt but at the moment, the intended use in CockroachDB is operating on a temporary SST held in memory prior to being written to disk, so this []byte version is good enough for now.
petermattis
left a comment
There was a problem hiding this comment.
For CRDB timestamps, are you imagining the placeholder timestamp will be something like {walltime: 0, logical: 1} in order to ensure we encode a full size timestamp even if the replacement timestamp has a logical time of zero?
The edited suffix of a key must not be
shared with a prior key
How will you prevent the sharing of bytes of the timestamp suffix? The mvcc key is encoded as <user-key>\0<timestamp><timestamp-length-byte>. If we have the user-keys a and a\0 and the placeholder timestamp is \0\0\0\0\0\0\0\0\1\1\1\1, then our two keys will be encoded as a\0\0\0\0\0\0\0\0\0\1\1\1\1 and a\0\0\0\0\0\0\0\0\0\0\1\1\1\1. The shared prefix of the keys includes a large chunk of the timestamp. Am I missing something here? One possibility is that you could set the block-restart-interval to 1, though that will cause a bit of cache space wastage (on-disk compression will avoid on disk space wastage). Another thought is that we could compute the shared prefix of adjacent keys only only the portion returned by Comparer.Split.
Reviewable status: 0 of 9 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)
|
Yep, for my proof of concept, I used
Yeah, if a user key |
That could work, though it is a bit of a weird setting. My suggestion of only sharing the prefix returned by |
sumeerbhola
left a comment
There was a problem hiding this comment.
My suggestion of only sharing the prefix returned by Split would have a similar effect.
I prefer this option too, since its semantically cleaner. It would not need to be the default behavior for a sst writer.
Reviewed 2 of 8 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 2 files at r6.
Reviewable status: 5 of 9 files reviewed, 3 unresolved discussions (waiting on @dt and @sumeerbhola)
sstable/reader.go, line 41 at r1 (raw file):
Previously, dt (David Taylor) wrote…
went ahead and did this a a generic "suffix from" to "suffix to" mapping, with no mention of mvcc or timestamps.
Keeping it general is good in this case since it doesn't really require more code.
I am concerned about the possibility that when the suffix replacement property is edited, it violates some invariant in the file. I'd prefer that the original writing code was:
- passed the suffix that was later going to be replaced.
- recorded this replaceable suffix in the sstable
- when writing the sstable, checked that anything containing this replaceable suffix did not get affected by prefix encoding in a way that would prevent the replacement. I realize that there will be code changes elsewhere to prevent such violations but I think the writing code needs additionally check the effective shared prefix for every key to ensure that the behavior is as expected. If not, it can return an error.
- when editing the replacement property, verify that the suffix being replaced is the same as that recorded by the writer.
sstable/reader.go, line 293 at r3 (raw file):
for key, _ := iter.First(); key != nil; key, _ = iter.Next() { if bytes.HasSuffix(key.UserKey, from) { dest := len(iter.unsharedKey) - len(from) - 8
what is the -8 for?
testdata/event_listener, line 172 at r2 (raw file):
Previously, dt (David Taylor) wrote…
if the fact that
Readergot bigger by a slice header is bad, we could potentially just havemaybeSwapSuffixjust go read the UserProps map every time instead of doing it once on creation but my guess was we'd rather not do the whole map lookup to read every block
I'm not super worried about the size increase, since the ssblock cache is much bigger. But reading a block from disk and decompressing it is costly enough, that paying the cost of the map lookup is going to not matter. So I'd lean towards doing the map lookup.
jbowens
left a comment
There was a problem hiding this comment.
Reviewable status: 5 of 9 files reviewed, 4 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)
sstable/properties.go, line 204 at r6 (raw file):
// EditPropsInFile edits the contents of the passed sst file bytes to update the // values for each user property in the passed overrides. The updated properties // must exist and the new value must have the same length as the prior value.
Maybe a silly question: when is the MVCC timestamp known? Is it known before any potential sideloading, or would setting this property sometimes require modifying an on-disk sideloadeded sstable's properties?
I suppose an alternative would be to allow callers to pass the replacement value in their call to Ingest. Pebble could write it to disk within the manifest representation of the file metadata. I believe this is a little closer to what we do for ingested sstable's GlobalSeqNum. There exists a GlobalSeqNum sstable property, but we don't actually update the on-disk property. It's populated on sstable load from the manifest's LargestSeqNum.
In CockroachDB the timestamp would be known during evaluation of the |
I'm worried about the amount of work we currently do in |
|
We're not currently writing to disk at all during evalAddSStable -- that happens later, after it returns its ReplicatedResult, when that result gets proposed, right? I don't know if we can really do it any earlier either -- we don't know until we get into eval if we'll even actually propose it: we may decide to reject it outright and/or change what we propose, e.g. by iterating the contents of a "small" file into a regular write-batch to propose instead. I think a decent chunk of this does belong in eval, since it can depend on the replica (i.e. only once we have latches can we look for collisions in existing data, compute mvcc stats diffs, etc). I don't think editing the props bytes during eval is going to really change how much work we're doing in eval vs elsewhere. I have a mild preference for props over manifest for debugging since we can inspect a single SST more easily, and it really is an attribute of the SST that eval is choosing to propose -- i.e. must be exactly the same across all applications of that proposal -- unlike the seq no, which is more an attribute of each application/re-application when we call IngestExternalFile? |
IIUC, this is the code in
Rejections are rare, yes? We could make an assumption that it will succeed and write the file. And we know the size of the file up front too. I realized that even if we wrote the file before latching on the leaseholder, the raft replication would require the followers to write the file, while the latches are held, so we may not save much.
I agree that a self-contained file is mildly better, and the fact that is not a function of the engine to which it is being ingested is also compelling. And the fact that we can probably not optimize latching duration, I am ok with the current approach. |
|
Not sure if I'm being helpful here (I'm not sure exactly what the problem being discussed is), but the way it works right now is that the incoming SST is simply passed to replication, which has special casing (raft sideloading) that leads to the sst file being written to disk instead of to the log. At apply time (in So the status quo puts everything to disk while holding latches (latch is released only when the effect of the command is visible on the state machine). It's not easy to change that. |
In CockroachDB we have processes that build and then ingest SSTables.
These SSTables have timestamp-suffixed keys, and often take a non-trivial
amount of time to construct and then transport to where they will be added.
If these processes what the timestamps encoded on those keys to be the time
at which the file was actually added that can pose a significant challenge.
You could, while adding, iterate the entire SST to re-encode every key with
a new, addition-time timestamp suffix, but that too could take O(n) time for a
larger SSTable.
Instead, this adds the ability to simply make an O(1) size edit to the props, to
note a suffix to replace with some other value. Using this, such a process can
construct all of its keys with placeholder, then, when it actually knows the real
addition time, simply map the placeholder to that value with these props.
At read-time, specifically when a block is loaded, after it is decompressed but
before it is cached, if the file specifies this replacement every key in the block
is edited to replace the mapped suffix. The edited suffix of a key must not be
shared with a prior key, and only the data and index blocks are edited, so filters
if present must be only on the unedited prefix of keys (this is true for CRDB
as we already do not include the MVCC timestamp suffix in filters).
Also included here is a helper to make the above mentioned edits to props on
an in-memory representation of an SSTable (i.e. what Cockroach typically has
when during the evaluation of AddSSTableRequest when it would assign such
a timestamp). Ideally this helper would take a File-like API that supported writing
at an offset to be able to update on-disk files as well, but this can be done when
needed.