Skip to content

sstable: support for replacing key suffixes during reads#1233

Closed
dt wants to merge 4 commits intocockroachdb:masterfrom
dt:magic-ts
Closed

sstable: support for replacing key suffixes during reads#1233
dt wants to merge 4 commits intocockroachdb:masterfrom
dt:magic-ts

Conversation

@dt
Copy link
Copy Markdown
Contributor

@dt dt commented Aug 19, 2021

In CockroachDB we have processes that build and then ingest SSTables.
These SSTables have timestamp-suffixed keys, and often take a non-trivial
amount of time to construct and then transport to where they will be added.
If these processes what the timestamps encoded on those keys to be the time
at which the file was actually added that can pose a significant challenge.
You could, while adding, iterate the entire SST to re-encode every key with
a new, addition-time timestamp suffix, but that too could take O(n) time for a
larger SSTable.

Instead, this adds the ability to simply make an O(1) size edit to the props, to
note a suffix to replace with some other value. Using this, such a process can
construct all of its keys with placeholder, then, when it actually knows the real
addition time, simply map the placeholder to that value with these props.

At read-time, specifically when a block is loaded, after it is decompressed but
before it is cached, if the file specifies this replacement every key in the block
is edited to replace the mapped suffix. The edited suffix of a key must not be
shared with a prior key, and only the data and index blocks are edited, so filters
if present must be only on the unedited prefix of keys (this is true for CRDB
as we already do not include the MVCC timestamp suffix in filters).

Also included here is a helper to make the above mentioned edits to props on
an in-memory representation of an SSTable (i.e. what Cockroach typically has
when during the evaluation of AddSSTableRequest when it would assign such
a timestamp). Ideally this helper would take a File-like API that supported writing
at an offset to be able to update on-disk files as well, but this can be done when
needed.

@dt dt requested a review from sumeerbhola August 19, 2021 21:49
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@dt dt marked this pull request as draft August 19, 2021 21:57
initialReadaheadSize = 64 << 10 /* 64KB */
maxReadaheadSize = 256 << 10 /* 256KB */

UserPropertyCockroachMVCCTimestampSuffix = "crdb.ts.synthesize"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I've been debating is if this should be specific to Cockroach MVCC timestamps, as it is here, where the name is "crdb.ts." and the replaced value is a hard-coded representation of a placeholder cockroach mvcc timestamp, or if I should instead try to genericize it: add two props e.g. "pebble.suffix-replace.{from,to}" and then enable rewriting -- to to -- if from is set, and keep mentions of "timestamp" out?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went ahead and did this a a generic "suffix from" to "suffix to" mapping, with no mention of mvcc or timestamps.

@dt dt force-pushed the magic-ts branch 3 times, most recently from 5a132f4 to 156f4c3 Compare August 20, 2021 19:50
@dt dt changed the title [WIP] sstable: support for replacing key suffixes during reads sstable: support for replacing key suffixes during reads Aug 20, 2021
@dt dt marked this pull request as ready for review August 20, 2021 20:23
@dt dt force-pushed the magic-ts branch 2 times, most recently from eee6f8f to 1a1a919 Compare August 21, 2021 16:51
dt added 4 commits August 23, 2021 16:38
In CockroachDB we have processes that build and then ingest SSTables.
These SSTables have timestamp-suffixed keys, and often take a non-trivial
amount of time to construct and then transport to where they will be added.
If these processes what the timestamps encoded on those keys to be the time
_at which the file was actually added_ that can pose a significant challenge.
You _could_, while adding, iterate the entire SST to re-encode every key with
a new, addition-time timestamp suffix, but that too could take O(n) time for a
larger SSTable.

Instead, this adds the ability to simply make an O(1) size edit to the props, to
note a suffix to replace with some other value. Using this, such a process can
construct all of its keys with placeholder, then, when it actually knows the real
addition time, simply map the placeholder to that value with these props.

At read-time, specifically when a block is loaded, after it is decompressed but
before it is cached, if the file specifies this replacement every key in the block
is edited to replace the mapped suffix. The edited suffix of a key must not be
shared with a prior key, and only the data and index blocks are edited, so filters
if present must be only on the unedited prefix of keys (this is true for CRDB
as we already do not include the MVCC timestamp suffix in filters).
If transform returns the same slice it was passed, which was already in 
the cache, we can leave it there rather than allocating a new slice from 
the cache, copying the result into it and freeing the old one.
Ideally this would take some File interface that included WriteAt but at
the moment, the intended use in CockroachDB is operating on a temporary
SST held in memory prior to being written to disk, so this []byte version
is good enough for now.
Copy link
Copy Markdown
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For CRDB timestamps, are you imagining the placeholder timestamp will be something like {walltime: 0, logical: 1} in order to ensure we encode a full size timestamp even if the replacement timestamp has a logical time of zero?

The edited suffix of a key must not be
shared with a prior key

How will you prevent the sharing of bytes of the timestamp suffix? The mvcc key is encoded as <user-key>\0<timestamp><timestamp-length-byte>. If we have the user-keys a and a\0 and the placeholder timestamp is \0\0\0\0\0\0\0\0\1\1\1\1, then our two keys will be encoded as a\0\0\0\0\0\0\0\0\0\1\1\1\1 and a\0\0\0\0\0\0\0\0\0\0\1\1\1\1. The shared prefix of the keys includes a large chunk of the timestamp. Am I missing something here? One possibility is that you could set the block-restart-interval to 1, though that will cause a bit of cache space wastage (on-disk compression will avoid on disk space wastage). Another thought is that we could compute the shared prefix of adjacent keys only only the portion returned by Comparer.Split.

Reviewable status: 0 of 9 files reviewed, 2 unresolved discussions (waiting on @sumeerbhola)

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Sep 2, 2021

Yep, for my proof of concept, I used var IngestionMVCCTimestampPlaceholder = hlc.Timestamp{WallTime: 1 << 62, Logical: 1}, using a non-zero logical to ensure it was all 12b:
cockroachdb/cockroach@master...dt:magic-ts#diff-5ddef732446d3f6c4cb30f2775b73c734b7d40e8b3c0cdbdb61b1aced178d2b0R267

How will you prevent the sharing of bytes of the timestamp suffix?

Yeah, if a user key x is followed by another that is x plus a \0 and then any prefix of the placeholder, that'd an overlap with the to-be-replaced suffix of the prior key. I was thinking we could just limit the max length of the shared prefix in an option to Writer like MinimumUnsharedSuffix or something, and then set it to len(placeholder) for these magic-ts ssts?

@petermattis
Copy link
Copy Markdown
Collaborator

Yeah, if a user key x is followed by another that is x plus a \0 and then any prefix of the placeholder, that'd an overlap with the to-be-replaced suffix of the prior key. I was thinking we could just limit the max length of the shared prefix in an option to Writer like MinimumUnsharedSuffix or something, and then set it to len(placeholder) for these magic-ts ssts?

That could work, though it is a bit of a weird setting. My suggestion of only sharing the prefix returned by Split would have a similar effect. I'd be willing to bet that we essentially never share bytes in the timestamp suffix in normal sstables.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion of only sharing the prefix returned by Split would have a similar effect.

I prefer this option too, since its semantically cleaner. It would not need to be the default behavior for a sst writer.

Reviewed 2 of 8 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 1 of 2 files at r6.
Reviewable status: 5 of 9 files reviewed, 3 unresolved discussions (waiting on @dt and @sumeerbhola)


sstable/reader.go, line 41 at r1 (raw file):

Previously, dt (David Taylor) wrote…

went ahead and did this a a generic "suffix from" to "suffix to" mapping, with no mention of mvcc or timestamps.

Keeping it general is good in this case since it doesn't really require more code.

I am concerned about the possibility that when the suffix replacement property is edited, it violates some invariant in the file. I'd prefer that the original writing code was:

  • passed the suffix that was later going to be replaced.
  • recorded this replaceable suffix in the sstable
  • when writing the sstable, checked that anything containing this replaceable suffix did not get affected by prefix encoding in a way that would prevent the replacement. I realize that there will be code changes elsewhere to prevent such violations but I think the writing code needs additionally check the effective shared prefix for every key to ensure that the behavior is as expected. If not, it can return an error.
  • when editing the replacement property, verify that the suffix being replaced is the same as that recorded by the writer.

sstable/reader.go, line 293 at r3 (raw file):

		for key, _ := iter.First(); key != nil; key, _ = iter.Next() {
			if bytes.HasSuffix(key.UserKey, from) {
				dest := len(iter.unsharedKey) - len(from) - 8

what is the -8 for?


testdata/event_listener, line 172 at r2 (raw file):

Previously, dt (David Taylor) wrote…

if the fact that Reader got bigger by a slice header is bad, we could potentially just have maybeSwapSuffix just go read the UserProps map every time instead of doing it once on creation but my guess was we'd rather not do the whole map lookup to read every block

I'm not super worried about the size increase, since the ssblock cache is much bigger. But reading a block from disk and decompressing it is costly enough, that paying the cost of the map lookup is going to not matter. So I'd lean towards doing the map lookup.

@sumeerbhola sumeerbhola requested a review from jbowens September 27, 2021 20:01
@jbowens jbowens requested a review from a team September 28, 2021 20:07
Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 9 files reviewed, 4 unresolved discussions (waiting on @dt, @jbowens, and @sumeerbhola)


sstable/properties.go, line 204 at r6 (raw file):

// EditPropsInFile edits the contents of the passed sst file bytes to update the
// values for each user property in the passed overrides. The updated properties
// must exist and the new value must have the same length as the prior value.

Maybe a silly question: when is the MVCC timestamp known? Is it known before any potential sideloading, or would setting this property sometimes require modifying an on-disk sideloadeded sstable's properties?

I suppose an alternative would be to allow callers to pass the replacement value in their call to Ingest. Pebble could write it to disk within the manifest representation of the file metadata. I believe this is a little closer to what we do for ingested sstable's GlobalSeqNum. There exists a GlobalSeqNum sstable property, but we don't actually update the on-disk property. It's populated on sstable load from the manifest's LargestSeqNum.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Sep 28, 2021

when is the MVCC timestamp known? Is it known before any potential sideloading, or would setting this property sometimes require modifying an on-disk sideloadeded sstable's properties?

In CockroachDB the timestamp would be known during evaluation of the AddSSTable request, i.e. after we've acquired latches and everything, but before we actually decide there is a result of that request that we want to propose to raft (it'd be the batch timestamp at that point, just like a normal mvcc write). That means the "file" is still just a []byte we got in the request params, and we can still freely edit it without worrying about on-disk copies, raft-log immutability, etc, since we haven't gotten there yet. I think that means we don't need any help from the manifest.

@sumeerbhola
Copy link
Copy Markdown
Collaborator

I suppose an alternative would be to allow callers to pass the replacement value in their call to Ingest. Pebble could write it to disk within the manifest representation of the file metadata. I believe this is a little closer to what we do for ingested sstable's GlobalSeqNum.

That means the "file" is still just a []byte we got in the request params, and we can still freely edit it without worrying about on-disk copies, raft-log immutability, etc, since we haven't gotten there yet. I think that means we don't need any help from the manifest.

I'm worried about the amount of work we currently do in EvalAddSSTable while holding latches. Much of it doesn't rely on the state of the engine. And writing large sstables to disk after acquiring latches is not great. Using the manifest would give us a cheaper option, in that the writing could happen before latch acquisition. Has this not been previously a concern since a new index being built via adding of sstables will not have any user-facing operations? I am guessing this is not quite true because new rows being inserted into a table would also need to write to this new index and would encounter conflicting latches.

@dt
Copy link
Copy Markdown
Contributor Author

dt commented Sep 28, 2021

We're not currently writing to disk at all during evalAddSStable -- that happens later, after it returns its ReplicatedResult, when that result gets proposed, right? I don't know if we can really do it any earlier either -- we don't know until we get into eval if we'll even actually propose it: we may decide to reject it outright and/or change what we propose, e.g. by iterating the contents of a "small" file into a regular write-batch to propose instead. I think a decent chunk of this does belong in eval, since it can depend on the replica (i.e. only once we have latches can we look for collisions in existing data, compute mvcc stats diffs, etc). I don't think editing the props bytes during eval is going to really change how much work we're doing in eval vs elsewhere. I have a mild preference for props over manifest for debugging since we can inspect a single SST more easily, and it really is an attribute of the SST that eval is choosing to propose -- i.e. must be exactly the same across all applications of that proposal -- unlike the seq no, which is more an attribute of each application/re-application when we call IngestExternalFile?

@sumeerbhola
Copy link
Copy Markdown
Collaborator

We're not currently writing to disk at all during evalAddSStable -- that happens later, after it returns its ReplicatedResult, when that result gets proposed, right?

IIUC, this is the code in addSSTablePreApply which is part of applying to the state machine, after replication via raft?

we may decide to reject it outright and/or change what we propose, e.g. by iterating the contents of a "small" file into a regular write-batch to propose instead.
...
only once we have latches can we look for collisions in existing data, compute mvcc stats diffs, etc

Rejections are rare, yes? We could make an assumption that it will succeed and write the file. And we know the size of the file up front too.
We don't seem to compute MVCCState using the storage.ReadWriter. The only thing we do (unless we write a batch) is to checkForKeyCollisions if DisallowShadowing=true. Just curious: which paths set this to true and which to false (I couldn't easily figure this out based on looking at callers)?

I realized that even if we wrote the file before latching on the leaseholder, the raft replication would require the followers to write the file, while the latches are held, so we may not save much.
@tbg in case he has some ideas.

I have a mild preference for props over manifest for debugging since we can inspect a single SST more easily, and it really is an attribute of the SST that eval is choosing to propose -- i.e. must be exactly the same across all applications of that proposal -- unlike the seq no, which is more an attribute of each application/re-application when we call IngestExternalFile?

I agree that a self-contained file is mildly better, and the fact that is not a function of the engine to which it is being ingested is also compelling. And the fact that we can probably not optimize latching duration, I am ok with the current approach.

@tbg
Copy link
Copy Markdown
Member

tbg commented Oct 13, 2021

Not sure if I'm being helpful here (I'm not sure exactly what the problem being discussed is), but the way it works right now is that the incoming SST is simply passed to replication, which has special casing (raft sideloading) that leads to the sst file being written to disk instead of to the log. At apply time (in addSSTablePreApply), we ingest that SST file, and usually do so directly. So the bulk of the disk I/O is during sideloading, here:

https://github.com/cockroachdb/cockroach/blob/6e4c3ae05d92f4104b7b9fbb33f9d1cecf055fd3/pkg/kv/kvserver/replica_sideload.go#L123-L126

So the status quo puts everything to disk while holding latches (latch is released only when the effect of the command is visible on the state machine). It's not easy to change that.

@dt dt closed this Nov 10, 2021
@dt dt deleted the magic-ts branch December 30, 2021 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants