docs/RFCs: draft RFC for protected timestamps #41806

craig[bot] merged 1 commit into cockroachdb:master from
Conversation
This could use some more words about reconciliation and how that's going to hook up but I think I want some eyeballs on it before diving into those details.
tbg left a comment
Some half-baked comments I didn't want to let languish before heading into my meeting block
> to store key-value pairs. Old values need to be removed eventually.
> Today's default GC threshold is 25 hours. This threshold comes from an
> assumption about common backup strategies; we assume that customers frequently
> run a nightly incremental backup from the previous incremental backup and then
> An important element to all of this is that GC TTLs are defined in
> [Zone Configurations](https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html).
> This implies a to need to couple the protected timestamp subsystem to the zone
> It would be unreasonable if an attempt to protect a timestamp succeeded after
> data live at that timestamp has already been GC'd. The protected timestamp
> subsystem should only succeed if indeed it is impossible for the timestamp
> There are different choices we could make about which timestamp are available
> for protection at any given time. This proposal allows clients to protect
> timestamps for a set of spans up to the GC ttl ago. Imagine that you have a 60
From what you write above, it seems like you want the invariant that if the protection request comes back successfully, the data is actually protected, which means it has to actually touch all of the ranges containing any subset of the span to make sure the GC threshold isn't already violating what you're trying to protect (perhaps because the zone config's TTL was recently increased).
> // timestamp can be garbage collected until the this ProtectedTimestamp is
> // released.
> //
> // Protect will set the deadline on the provided transaction to ensure that
Setting a deadline on an existing txn object seems not ideal. Are you doing this only to avoid touching the ranges? I think if the zone config GC TTL was recently increased you'll still get burned, so you need to touch all ranges anyways (to verify that they see the ... latest? zone config, and have a GC TTL that is still compatible), so I hope you can drop the deadline instead.
ajwerner left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
docs/RFCS/20191009_gc_protected_timestamps.md, line 114 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> From what you write above, it seems like you want the invariant that if the protection request comes back successfully, the data is actually protected, which means it has to actually touch all of the ranges containing any subset of the span to make sure the GC threshold isn't already violating what you're trying to protect (perhaps because the zone config's TTL was recently increased).
I'm very eager to avoid an implementation that has to touch every range, I think this one is able to provide the invariant without touching every range, see my reply to your below comment.
docs/RFCS/20191009_gc_protected_timestamps.md, line 283 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> Setting a deadline on an existing txn object seems not ideal. Are you doing this only to avoid touching the ranges? I think if the zone config GC TTL was recently increased you'll still get burned, so you need to touch all ranges anyways (to verify that they see the ... latest? zone config, and have a GC TTL that is still compatible), so I hope you can drop the deadline instead.
I think the deadline is the lesser of these evils. The deadline in the common case is going to be the entire GC TTL. Touching every range seems like a big mistake.
You have identified a real problem. Say the GC TTL went from short to long and the protection is installed after it became long, but the replica decides to run GC because it has not yet updated its zone config to know that it should be long.
To deal with this case we can also write the MVCC timestamp of the zone config into the record and ensure that for GC to occur we also need a zone config that has an MVCC timestamp at least as high as that in the record. This doesn't seem like a big addition.
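That record-plus-zone-config-timestamp idea could be sketched like so (a minimal sketch; all type and function names here are hypothetical, not the RFC's actual schema):

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// ProtectionRecord is a hypothetical record shape: alongside the protected
// timestamp it stores the MVCC timestamp of the zone config that was current
// when the protection was installed.
type ProtectionRecord struct {
	ProtectedTS  Timestamp
	ZoneConfigTS Timestamp
}

// canConsultRecord sketches the proposed precondition: GC may proceed only if
// the replica's view of the zone config is at least as fresh as the one
// recorded at protect time, so it cannot act on a stale (shorter) GC TTL.
func canConsultRecord(replicaZoneConfigTS Timestamp, rec ProtectionRecord) bool {
	return replicaZoneConfigTS >= rec.ZoneConfigTS
}

func main() {
	rec := ProtectionRecord{ProtectedTS: 100, ZoneConfigTS: 90}
	fmt.Println(canConsultRecord(80, rec)) // false: zone config too stale to GC safely
	fmt.Println(canConsultRecord(95, rec)) // true: fresh enough to honor the record
}
```

The point of the check is that a replica that missed a recent TTL increase is forced to refresh its zone config before it can decide anything about GC.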
> specified timestamp are protected whereby `ProtectAfter` ensures that all
> values live at or after the timestamp are protected (note that `ProtectAfter`
> includes `ProtectAt` and protections marked as `ProtectAt` will be treated the
> sam in the initial implementation). Each protected timestamp may contain
> It would be unreasonable if an attempt to protect a timestamp succeeded after
> data live at that timestamp has already been GC'd. The protected timestamp
> subsystem should only succeed if indeed it is impossible for the timestamp
> which the client tries to protect is indeed protected. Furthermore it useful
> An important element to all of this is that GC TTLs are defined in
> [Zone Configurations](https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html).
> This implies a to need to couple the protected timestamp subsystem to the zone
> rest of the procedure.
>
> At this point it is safe to garbage collect up to now less the GC TTL less
> the time since the protected state was read (and a clock max offset).
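The rule quoted above can be written as a one-line formula (using illustrative integer seconds rather than real HLC timestamps; this is only a sketch of the stated rule, not the implementation):

```go
package main

import "fmt"

// safeGCThreshold sketches the quoted rule: after reading the protected
// timestamp state, a replica may GC up to
//
//	now - gcTTL - timeSinceProtectedStateRead - maxClockOffset
//
// All values are illustrative seconds, not real HLC timestamps.
func safeGCThreshold(now, gcTTL, sinceRead, maxOffset int64) int64 {
	return now - gcTTL - sinceRead - maxOffset
}

func main() {
	// A 25h (90000s) TTL, protected state read 10 minutes (600s) ago,
	// and 1s budgeted for clock offset.
	fmt.Println(safeGCThreshold(100000, 90000, 600, 1)) // 9399
}
```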
Do you mind writing this correctness proof out a bit more formally (no need to go overboard)? Mostly I'm just interested in why a max offset slipped in here and how the MVCC timestamp of the protected_ts_timestamps row relates to its ts field.
> # Reference-level explanation
>
> In this initial implementation we'll say that the main creators of protected
> timestamps are schema changes and backups. In both cases these clients these
> Zone configuration updates need to observe the protected timestamp
> metadata. This is important so that when protected timestamps are written
> they know that they have seen an up-to-date copy of the zone config. We do
> this by relying on the single-key linearizability offered by CockroachDB.
> relying on the single-key linearizability offered by CockroachDB
This is only guaranteed for a read after a write if the system is within clock offset bounds. Meanwhile, single-key linearizability is guaranteed between two writes regardless of clock skew. If we expect zone cfg updates to be infrequent then it may be worth performing an identity write on the protected timestamp metadata to eliminate the clock dependency.
> 5: put baz
> 4: put bar
> 3: --- protected ---
> 2: delete
Is it important in this example that delete is right after the protected ts? Doesn't this allow us to remove k@2 during the GC? Maybe not because we want a snapshot of the latest change to k? Is this handling of deletion tombstones after the GC threshold discussed at all in this RFC?
> this solution as written and more importantly the exact protection
> optimization?
>
> - Is it a bad idea to use SQL for the state as opposed to a pseudo-table like
system.rangelog is prior-art for adding a SQL dependency to KV, so I'm less personally opposed to it than I was before.
ajwerner left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 193 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> Do you mind writing this correctness proof out a bit more formally (no need to go overboard)? Mostly I'm just interested in why a max offset slipped in here and how the MVCC timestamp of the `protected_ts_timestamps` row relates to its `ts` field.
Yes, it has problems as currently written related to changing zone configs as Tobi pointed out. I'm working through them now.
knz left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 15 at r1 (raw file):
> we need to provide a mechanism for ongoing jobs to prevent relevant spans from being garbage collected.
There's something I'm missing in the rest of the text: a running example of a job and how it determines which ranges are relevant.
The particular example I'm interested in is a scenario like this:
- backfill job starts at t0
- SQL client adds new rows at t1
- SQL client deletes new rows at t2
- after t2 + GC TTL, backup job not complete yet
- GC removes new rows inserted at t2
- backfill job checkpoints
- backfill job resumes, "catches up" but fails to find the rows that started to exist at t1 and stopped to exist at t2
- backfill job cancels
Arguably it's kinda OK if backfill doesn't see rows that were eventually deleted, but what of the "revert range" operation? Does it properly work with the part of the key space that only existed between t1 and t2 + GC ttl?
Is the protected ts mechanism proposed here going to improve this?
- if yes, then I'd like to learn how the backup job is to discover new "relevant spans" that appear after it starts
- if no, then this should be spelled out as a known limitation
Note: a backup job has the same pitfall I think (if it works incrementally via checkpoints), how do you foresee a backup is to mark new spans as protected that did not exist by the time it was started?
docs/RFCS/20191009_gc_protected_timestamps.md, line 48 at r1 (raw file):
> Long-running OLAP transactions also merit consideration. Imagine a transaction is kept open for a long period of time, today if it attempts to read at a timestamp which has been garbage collected, the client will receive an error.
nit: provide a reminder of what kind of error a client is likely to receive, to help the reader along.
docs/RFCS/20191009_gc_protected_timestamps.md, line 56 at r1 (raw file):
> overly considered in this RFC but we do not want to preclude it by choosing solutions that are unlikely to be reasonably efficient if run once for every minutes-long query.
While I recognize the use case as interesting, it also runs into the concern I outlined above: how do you foresee a long-running txn will determine "relevant spans" to protect? How can we mark spans as protected that have not been queried by the SQL client yet?
Or are you suggesting that the mere existence of an open txn at ts t1 will place a protection on the entire kv space after t1?
docs/RFCS/20191009_gc_protected_timestamps.md, line 68 at r1 (raw file):
> of which long running jobs can safely prevent relevant data from being garbage collected. This state is then plumbed in to the `storage` layer's data GC process to provide the specified semantics. We'll then detail how clients can
... how KV/internal clients, or jobs, ...
(the text should highlight that this API is not meant for use by external/SQL clients)
docs/RFCS/20191009_gc_protected_timestamps.md, line 90 at r1 (raw file):
> in the process, and then we'll talk about how we anticipate the API will be integrated.
Add a paragraph:
"In the rest of this RFC, the term "client" designates an internal component of CockroachDB, such as the job subsytem or the SQL executor. The mechanisms presented here are invisible to external clients."
docs/RFCS/20191009_gc_protected_timestamps.md, line 137 at r1 (raw file):
> limits. It will be observed and modified in all transactions which create protected timestamps. We'll also observe this meta row in all transactions which update zone configurations.
Nathan pointed out in the temp table RFC that we want TTs to have custom zone configs that pin them to their gateway node.
Given that a SQL app that uses TT is likely to do so in every conn, is the mandatory hotspot that you're proposing here going to nefariously interplay with TTs?
docs/RFCS/20191009_gc_protected_timestamps.md, line 232 at r1 (raw file):
> protected timestamps to some other state and run periodic reconciliation, also great.
I would recommend that this RFC spells out / specifies an API to inspect all the protected ts ranges currently configured,
together with an admin RPC / CLI command to inspect them and clean them up manually.
As we've found out in the past, automation breaks sometimes and we value manual tools in our toolbox.
ajwerner left a comment
I'm reworking the basic guarantees of this proposal after a conversation with @dt offline.
I'm working to formalize the guarantees we want. Thanks for the initial feedback. I'll let y'all know when this is RFAL.
tl;dr is that in the face of changing GC TTLs we don't need as strong a guarantee. We now just want a claim that says that if a protected timestamp is installed for some time t, and it is not possible under any zone config (current or stale) held by a replica under any of those spans to have GC'd at that time, then no replica will be able to GC at that time. It's a much weaker claim but seems sufficient.
I'll then propose some ideas to validate whether a protected timestamp has successfully been applied
docs/RFCS/20191009_gc_protected_timestamps.md, line 15 at r1 (raw file):
Previously, knz (kena) wrote…
> we need to provide a mechanism for ongoing jobs to prevent relevant spans from being garbage collected.
>
> There's something I'm missing in the rest of the text: a running example of a job and how it determines which ranges are relevant.
>
> The particular example I'm interested in is a scenario like this:
>
> - backfill job starts at t0
> - SQL client adds new rows at t1
> - SQL client deletes new rows at t2
> - after t2 + GC TTL, backup job not complete yet
> - GC removes new rows inserted at t2
> - backfill job checkpoints
> - backfill job resumes, "catches up" but fails to find the rows that started to exist at t1 and stopped to exist at t2
> - backfill job cancels
>
> Arguably it's kinda OK if backfill doesn't see rows that were eventually deleted, but what of the "revert range" operation? Does it properly work with the part of the key space that only existed between t1 and t2 + GC ttl?
>
> Is the protected ts mechanism proposed here going to improve this?
>
> - if yes, then I'd like to learn how the backup job is to discover new "relevant spans" that appear after it starts
> - if no, then this should be spelled out as a known limitation
>
> Note: a backup job has the same pitfall I think (if it works incrementally via checkpoints), how do you foresee a backup is to mark new spans as protected that did not exist by the time it was started?
Sorry, I was confused, it is the IMPORT INTO job that does revert range. I'll rework this whole intro. I conflated the backfill with the IMPORT INTO. Will update
docs/RFCS/20191009_gc_protected_timestamps.md, line 232 at r1 (raw file):
Previously, knz (kena) wrote…
> I would recommend that this RFC spells out / specifies an API to inspect all the protected ts ranges currently configured,
> together with an admin RPC / CLI command to inspect them and clean them up manually.
>
> As we've found out in the past, automation breaks sometimes and we value manual tools in our toolbox.
Sure, the good news is that all of this state will be sitting in system tables which can be interacted with using SQL though probably I wouldn't want us doing that directly. I'll add to this RFC different approaches for manual intervention.
docs/RFCS/20191009_gc_protected_timestamps.md, line 283 at r1 (raw file):
Previously, ajwerner wrote…
> I think the deadline is the lesser of these evils. The deadline in the common case is going to be the entire GC TTL. Touching every range seems like a big mistake.
>
> You have identified a real problem. Say the GC TTL went from short to long and the protection is installed after it became long, but the replica decides to run GC because it has not yet updated its zone config to know that it should be long.
>
> To deal with this case we can also write the MVCC timestamp of the zone config into the record and ensure that for GC to occur we also need a zone config that has an MVCC timestamp at least as high as that in the record. This doesn't seem like a big addition.
Alright I think my previous claim was both completely wrong and if it were right would be more complicated than it needed to be in the face of changing zone configs. I'm reworking it.
docs/RFCS/20191009_gc_protected_timestamps.md, line 379 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> relying on the single-key linearizability offered by CockroachDB
>
> This is only guaranteed for a read after a write if the system is within clock offset bounds. Meanwhile, single-key linearizability is guaranteed between two writes regardless of clock skew. If we expect zone cfg updates to be infrequent then it may be worth performing an identity write on the protected timestamp metadata to eliminate the clock dependency.
ack.
docs/RFCS/20191009_gc_protected_timestamps.md, line 483 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> Is it important in this example that `delete` is right after the protected ts? Doesn't this allow us to remove `k@2` during the GC? Maybe not because we want a snapshot of the latest change to `k`? Is this handling of deletion tombstones after the GC threshold discussed at all in this RFC?
No but it's not clear that it necessarily needs to be at least for the first iteration. If we refuse to GC live values at some timestamp we can't do anything with deletion tombstones above them. The logic to remove deletion tombstones only applies when the previously live value they remove is GC'd.
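That coupling between a tombstone and the values beneath it can be sketched for a single key (a simplified per-key model for illustration, not the actual GC queue logic):

```go
package main

import "fmt"

// Version is one MVCC version of a key; Tombstone marks a deletion.
type Version struct {
	TS        int64
	Tombstone bool
}

// survivors sketches simplified per-key GC: every version above the GC
// threshold is kept, and the newest version at or below the threshold is kept
// only if it is a live value. A tombstone at or below the threshold goes away
// together with everything older, which is the coupling described above: the
// tombstone is only removed once the value it deletes is itself GC-able.
func survivors(versions []Version, gcThreshold int64) []Version {
	var out []Version
	for _, v := range versions { // versions sorted newest first
		if v.TS > gcThreshold {
			out = append(out, v)
			continue
		}
		if !v.Tombstone {
			out = append(out, v) // latest readable value must survive
		}
		break
	}
	return out
}

func main() {
	// The history from the quoted example: puts at 5 and 4, a delete at 2, a put at 1.
	h := []Version{{5, false}, {4, false}, {2, true}, {1, false}}
	fmt.Println(len(survivors(h, 3))) // 2: the puts at 5 and 4 survive
}
```

With a threshold of 1 the tombstone at 2 sits above the threshold and survives, matching the point that it cannot be removed while the value beneath it is still protected.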
docs/RFCS/20191009_gc_protected_timestamps.md, line 588 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> `system.rangelog` is prior-art for adding a SQL dependency to KV, so I'm less personally opposed to it than I was before.
Ack.
Force-pushed from 2743e07 to 7387661.
ajwerner left a comment
I've now reworked the RFC to provide weaker guarantees which simplifies the implementation and lessens its cost. This approach feels less generally problematic and I suspect that it will prove less controversial. I'm sure this RFC will still need some iteration but it's ready for another pass.
There is a prototype of this proposal sitting in #41996.
docs/RFCS/20191009_gc_protected_timestamps.md, line 15 at r1 (raw file):
Previously, ajwerner wrote…
> Sorry, I was confused, it is the IMPORT INTO job that does revert range. I'll rework this whole intro. I conflated the backfill with the IMPORT INTO. Will update
Err and by that I even meant CREATE TABLE AS SELECT.
docs/RFCS/20191009_gc_protected_timestamps.md, line 24 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> from the previous nightly backup
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 48 at r1 (raw file):
Previously, knz (kena) wrote…
> nit: provide a reminder of what kind of error a client is likely to receive, to help the reader along.
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 56 at r1 (raw file):
> How can we mark spans as protected that have not been queried by the SQL client yet?
IIUC we know all of the possible spans which could be queried at physical planning time. Some of these may end up being entire indexes or tables but I don't think there should be a need to protect the entire kv space.
docs/RFCS/20191009_gc_protected_timestamps.md, line 82 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> same
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 90 at r1 (raw file):
Previously, knz (kena) wrote…
> Add a paragraph:
> "In the rest of this RFC, the term "client" designates an internal component of CockroachDB, such as the job subsystem or the SQL executor. The mechanisms presented here are invisible to external clients."
Done. Added above
docs/RFCS/20191009_gc_protected_timestamps.md, line 100 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
> grammar off
Reworked.
docs/RFCS/20191009_gc_protected_timestamps.md, line 101 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> it is
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 107 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> a to
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 114 at r1 (raw file):
Previously, ajwerner wrote…
> I'm very eager to avoid an implementation that has to touch every range, I think this one is able to provide the invariant without touching every range, see my reply to your below comment.
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 137 at r1 (raw file):
Previously, knz (kena) wrote…
> Nathan pointed out in the temp table RFC that we want TTs to have custom zone configs that pin them to their gateway node.
> Given that a SQL app that uses TT is likely to do so in every conn, is the mandatory hotspot that you're proposing here going to nefariously interplay with TTs?
It would. That coordination between zone configs and protected timestamps has been removed in this formulation.
docs/RFCS/20191009_gc_protected_timestamps.md, line 193 at r1 (raw file):
Previously, ajwerner wrote…
> Yes, it has problems as currently written related to changing zone configs as Tobi pointed out. I'm working through them now.
The whole thing got simpler, I'll let you take another pass before I make things more rigorous.
docs/RFCS/20191009_gc_protected_timestamps.md, line 236 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> the clients, these clients
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 283 at r1 (raw file):
Previously, ajwerner wrote…
> Alright I think my previous claim was both completely wrong and if it were right would be more complicated than it needed to be in the face of changing zone configs. I'm reworking it.
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 379 at r1 (raw file):
Previously, ajwerner wrote…
> ack.
Done.
bdarnell left a comment
I wrote up a bunch of comments about the GC threshold and its centrality to the current GC process, but I realized I may be out of date in my knowledge of this code; someone (@tbg?) should confirm that this is still how it works.
I'd appreciate a high-level summary of the design somewhere. Specifically, the aspects of this design that I didn't grok until very deep in the document are that the protected TS data is persisted in KV in a global/system span, and the primary mechanism of protection is that the GC queue is responsible for querying this data before deciding what is GC'able.
docs/RFCS/20191009_gc_protected_timestamps.md, line 90 at r2 (raw file):
> * GC Threshold: When a range runs garbage collection it sets, in the replicated range state, a timestamp below which no reads may be performed.
In addition to disallowing reads, the GC threshold permits GCs. The range has complete data for everything above the GC threshold, while data below the threshold is allowed to be GC'd.
docs/RFCS/20191009_gc_protected_timestamps.md, line 94 at r2 (raw file):
> * Expiration: a row (henceforth a single MVCC version of a key) is said to expire if its MVCC timestamp is older than the leaseholder's current view of
There's an off-by-one subtlety here: a row cannot be GC'd if it could be returned by any valid read, no matter how old its MVCC timestamp is (so in practice we keep the latest value of any key that is below the GC threshold). We don't normally consider such rows to be "expired".
docs/RFCS/20191009_gc_protected_timestamps.md, line 96 at r2 (raw file):
> expire if its MVCC timestamp is older than the leaseholder's current view of `Now()` less the leaseholder's current view of its GC TTL. An "expired" row may still be read so long as its timestamp remains above the GC Threshold.
Aside: the fact that we currently allow expired rows to be read is an implementation detail. We must use the replicated GC threshold to ensure that decisions are made consistently across all replicas, and because setting that replicated value is relatively expensive, there is a considerable gap between the point of expiration and the GC threshold. In general, we'd like to disallow reading of expired rows (perhaps using closed timestamps to narrow the gap) as long as doing so wouldn't introduce other issues like inconsistencies between replicas.
docs/RFCS/20191009_gc_protected_timestamps.md, line 106 at r2 (raw file):
> contains data in those spans from garbage collecting data at that timestamp (open question about partial coverings, say Range contains [a, c) and [a, b) is protected, will [b,c) get collected?).
If we allow partial collection, the definition of GC Threshold will have to change, since it's currently one value for the whole Range.
docs/RFCS/20191009_gc_protected_timestamps.md, line 114 at r2 (raw file):
> timestamp are protected (note that `PROTECT_AFTER` protects a superset of the data which would be protected by `PROTECT_AT` and protections marked as `PROTECT_AT` will be treated the same in the initial implementation). Each
Similar to partial coverings, PROTECT_AT would require changes to the definition of the GC threshold.
docs/RFCS/20191009_gc_protected_timestamps.md, line 127 at r2 (raw file):
> GC TTLs are defined in [Zone Configurations](https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html). A replica is free is GC any data which is older than the current timestamp
s/free is/free to/
This isn't true today. The leaseholder makes all GC decisions and replicates them (and it goes through the GC threshold indirection instead of directly using the current timestamp).
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
> There are tradeoffs we can make about the cost of different interfaces for performing these protections. It is useful to clients that the interface to creating and removing protected timestamps be transactional (i.e. Records are
I'm worried about this use of transactions. Transactions that observe their commit timestamp are more prone to contention-related problems, and global tables like this will be read a lot (even with a fairly infrequent poll interval).
It's not clear to me what these writes would be transactionally combined with, but I think I'd prefer a non-transactional solution.
docs/RFCS/20191009_gc_protected_timestamps.md, line 172 at r2 (raw file):
> If a transaction writing a protection record commits at a timestamp before
I don't think "before which" is the right wording here, but I'm having a hard time processing this section (reading the doc in order) because you haven't defined the keys where these protection records live. I'd prefer to see keys and data structures defined up front.
docs/RFCS/20191009_gc_protected_timestamps.md, line 341 at r2 (raw file):
> The decision allows different users of protected timestamps to ensure that they are ultimately cleaned up in a manor which is appropriate for their use case. If
s/manor/manner/
docs/RFCS/20191009_gc_protected_timestamps.md, line 783 at r2 (raw file):
> a heartbeat. Heartbeating is a problem for jobs which can be paused and resumed.
Paused jobs should probably expire; we don't want paused jobs to block GC indefinitely. In that case an expiration-based model in which heartbeats just extend the expiration (by a context-dependent duration - while the job is running it could bump the expiration a little at a time, while a paused job would set it up to the maximum pause time) makes the most sense to me.
ajwerner left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 114 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
> Similar to partial coverings, PROTECT_AT would require changes to the definition of the GC threshold.
Yes indeed, this is briefly mentioned later on. I'll note that here and refer to there.
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
> Transactions that observe their commit timestamp are more prone to contention-related problems, and global tables like this will be read a lot (even with a fairly infrequent poll interval).
I agree with this statement wholeheartedly. This proposal neither requires nor suggests that these transactions read their commit timestamp. The idea is that these writes would be transactionally combined with the writes which create jobs.
Is there a part of the RFC that made it seem like these transactions would need to observe their commit timestamp? I'd like to make it extremely clear that transactions should not observe their commit timestamp in normal operation at all. The only place where we do that today outside of cluster_logical_timestamp() and AS OF SYSTEM TIME is underneath the merge protocol and this unfortunate call in TRUNCATE which I intend to remove for 20.1.
(Line 193 in 566f1b6)
bdarnell left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
> Is there a part of the RFC that made it seem like these transactions would need to observe their commit timestamp?
Just the fact that commit timestamps matter here in a way that is rarely the case; I assumed that that meant they'd need to observe the timestamp.
If they don't observe their timestamp, and you're instead just reading whatever timestamp they got assigned, that seems to open up a lot of weird cases where the commit timestamp got pushed far enough that it didn't protect anything. Is the idea just that you'd leave enough margin around the GC ttl that this wouldn't be a realistic problem? (just like today)
ajwerner left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
> you're instead just reading whatever timestamp they got assigned, that seems to open up a lot of weird cases where the commit timestamp got pushed far enough that it didn't protect anything.
Correct, the idea is that writing the protection record itself is best effort and if the use case really requires that the protection succeeded it can go verify the record after the transaction commits. I don't see any need to synchronously verify the protection for the use cases we currently envision but this RFC keeps the door to that open. If we're willing to consider a non-transactional protection mechanism it seems to me that a transactional, best-effort protection mechanism plus a non-transactional, post-hoc verification protocol can be used equivalently.
> Is the idea just that you'd leave enough margin around the GC ttl that this wouldn't be a realistic problem? (just like today)
Yes. Most commands will be protecting their statement timestamp which should have a margin around the GC TTL. If the GC TTL is 60 minutes and a command performs CREATE BACKUP AS OF SYSTEM TIME '-59m59s' and it ultimately fails to commit before the data expires, so be it. The worst case scenario is the protection fails. If the protection fails we're no worse off than we are today. Furthermore for the IMPORT INTO (it's not CREATE TABLE AS, it's IMPORT INTO) case where we need to ensure that we can revert the range before we perform the ingestion, we'll explicitly verify that the timestamp has been seen.
ajwerner
left a comment
Thanks for the review. I added a high level overview which hopefully presents the big picture sooner. I didn't majorly rework the order in which ideas are presented but am happy to do so if it will make things clearer.
docs/RFCS/20191009_gc_protected_timestamps.md, line 15 at r1 (raw file):
Previously, ajwerner wrote…
Err and by that I even meant CREATE TABLE AS SELECT.
It is indeed IMPORT INTO.
docs/RFCS/20191009_gc_protected_timestamps.md, line 68 at r1 (raw file):
Previously, knz (kena) wrote…
... how KV/internal clients, or jobs, ...
(the text should highlight that this API is not meant for use by external/SQL clients)
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 90 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
In addition to disallowing reads, the GC threshold permits GCs. The range has complete data for everything above the GC threshold, while data below the threshold is allowed to be GC'd.
Added language to that effect.
docs/RFCS/20191009_gc_protected_timestamps.md, line 94 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
There's an off-by-one subtlety here: a row cannot be GC'd if it could be returned by any valid read, no matter how old its MVCC timestamp is (so in practice we keep the latest value of any key that is below the GC threshold). We don't normally consider such rows to be "expired".
Added that important additional condition.
docs/RFCS/20191009_gc_protected_timestamps.md, line 96 at r2 (raw file):
In general, we'd like to disallow reading of expired rows (perhaps using closed timestamps to narrow the gap) as long as doing so wouldn't introduce other issues like inconsistencies between replicas.
I'm not sure that I subscribe to this mindset though I could be convinced. We can serve reads from data we have not yet attempted to GC. I understand not wanting to encourage clients to rely on the fact that the GC threshold often lags the expiration dramatically but I don't necessarily understand why we'd prefer to fail to service requests when we don't need to.
docs/RFCS/20191009_gc_protected_timestamps.md, line 106 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
If we allow partial collection, the definition of GC Threshold will have to change, since it's currently one value for the whole Range.
Definitely. I'm not eager to make that change. I removed this note here and expanded on the section which asks about this below.
docs/RFCS/20191009_gc_protected_timestamps.md, line 114 at r2 (raw file):
Previously, ajwerner wrote…
Yes indeed, this is briefly mentioned later on. I'll note that here and refer to there.
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 127 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/free is/free to/
This isn't true today. The leaseholder makes all GC decisions and replicates them (and it goes through the GC threshold indirection instead of directly using the current timestamp).
By GC here I meant set the GC threshold to and by replica I meant leaseholder. Updated.
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
Previously, ajwerner wrote…
you're instead just reading whatever timestamp they got assigned, that seems to open up a lot of weird cases where the commit timestamp got pushed far enough that it didn't protect anything.
Correct, the idea is that writing the protection record itself is best effort and if the use case really requires that the protection succeeded it can go verify the record after the transaction commits. I don't see any need to synchronously verify the protection for the use cases we currently envision but this RFC keeps the door to that open. If we're willing to consider a non-transactional protection mechanism it seems to me that a transactional, best-effort protection mechanism plus a non-transactional, post-hoc verification protocol can be used equivalently.
Is the idea just that you'd leave enough margin around the GC ttl that this wouldn't be a realistic problem? (just like today)
Yes. Most commands will be protecting their statement timestamp which should have a margin around the GC TTL. If the GC TTL is 60 minutes and a command performs
`CREATE BACKUP AS OF SYSTEM TIME '-59m59s'` and it ultimately fails to commit before the data expires, so be it. The worst case scenario is the protection fails. If the protection fails we're no worse off than we are today. Furthermore for the `IMPORT INTO` (it's not `CREATE TABLE AS`, it's `IMPORT INTO`) case where we need to ensure that we can revert the range before we perform the ingestion, we'll explicitly verify that the timestamp has been seen.
I added some language about
docs/RFCS/20191009_gc_protected_timestamps.md, line 172 at r2 (raw file):
I don't think "before which" is the right wording here
I've attempted to sharpen the language.
I'd prefer to see keys and data structures defined up front.
Interesting. The data structures feel like an implementation detail to me. The semantics are what I'm trying to socialize up front. It isn't clear to me what belongs in the guide-level vs. the reference-level explanation.
docs/RFCS/20191009_gc_protected_timestamps.md, line 341 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
s/manor/manner/
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 783 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Paused jobs should probably expire; we don't want paused jobs to block GC indefinitely. In that case an expiration-based model in which heartbeats just extend the expiration (by a context-dependent duration - while the job is running it could bump the expiration a little at a time, while a paused job would set it up to the maximum pause time) makes the most sense to me.
I don't disagree. That being said, expiring paused jobs doesn't seem particularly onerous and could fit into a reconciliation policy without much extra work. Perhaps @dt wants to weigh in.
Having jobs both heartbeat their leases and their protection records feels unnecessary.
nvb
left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 104 at r2 (raw file):
prevents any range which contains data in those spans from garbage collecting data at that timestamp
This is now best-effort, right?
docs/RFCS/20191009_gc_protected_timestamps.md, line 134 at r2 (raw file):
The GC Threshold is a part of a range's state which determines the lower bound for of timestamps for requests. The GC threshold is updated during the GC
"for of"
docs/RFCS/20191009_gc_protected_timestamps.md, line 167 at r2 (raw file):
operations. Verification in this proposal will need to perform `O(ranges)` requests (but not `O(ranges)` writes). It's worth noting that no use cases envisioned by this RFC need to perform verification directly or synchronously.
Would this verification compose nicely with the transactional interface to provide the strong guarantee if any caller ever did need it (at the expense of perf)? In other words, could the verification be performed synchronously if necessary?
docs/RFCS/20191009_gc_protected_timestamps.md, line 266 at r2 (raw file):
The following conditions which must be true in order to run GC at all.
nit: none of the conditions are predicates, so the "must be true" doesn't make sense.
Also, it's worth reiterating that this is all from the context of a single Replica (that's what "our" is)
docs/RFCS/20191009_gc_protected_timestamps.md, line 268 at r2 (raw file):
expired since
This is difficult to reason about. The second sentence here is more clear. Consider re-wording the first to say something like "for data which was not expired at the timestamp of our last reading of the protected TS state".
docs/RFCS/20191009_gc_protected_timestamps.md, line 272 at r2 (raw file):
timestamp state as `now` for the purposes of computing a score. 1) We cannot run GC for timestamps which are older than the lease start time.
One thing I'm getting tripped up on is what it means to "run GC for a timestamp". This surely doesn't mean run GC and set that timestamp as the new threshold. So does it mean run GC with now set to this timestamp?
docs/RFCS/20191009_gc_protected_timestamps.md, line 276 at r2 (raw file):
verifying client
It took me a few reads of this to understand what this meant. It's pretty easy to get it mixed up with a "protecting client" because we haven't actually formally introduced verification yet. Do you mind 1) linking to the "verification" section and 2) making it clear that this is a client of this particular Range.
docs/RFCS/20191009_gc_protected_timestamps.md, line 318 at r2 (raw file):
To do this we augment the volatile state on the leaseholder to include a lower-bound on the timestamp of the last read from the protected timestamp
Why is it a lower-bound?
"of the timestamp"
docs/RFCS/20191009_gc_protected_timestamps.md, line 593 at r2 (raw file):
The mechanism by which we achieve this invariant is a new request, `AdminVerifyProtectedTimestamp`. This will be distributed to all spans covered
How does this request ensure that it's not operating on a stale leaseholder with a potentially skewed clock? Does it operate at a certain timestamp?
docs/RFCS/20191009_gc_protected_timestamps.md, line 799 at r2 (raw file):
have a GC threshold of `t1`, when should we run GC? The most naive thing to do is just not run GC if we have a protected timestamp
At least for the first iteration of this, but depending on how much use this gets, we may need to quickly improve on this issue.
I probably just missed it, but where are the protected timestamp `Record`s going to be stored? Or, more specifically, where is the `State` proto (which contains all of the `Record`s) going to be stored? A new key in the system key space? I don't see a problem with that, I just didn't see it mentioned.

I ask the above because while reading this doc I was imagining that the protected timestamp state could be stored in the zone config. I think this approach has a downside in adding further functionality to zone configs. On the plus side, with cascading zone configs you could imagine setting the protected timestamp for a database or the entire cluster by specifying it in one zone config. And zone configs are already available at the range level when performing GC, so you wouldn't be adding a dependency on additional state. You might have already considered and dismissed this approach, but it seems worth at least calling out in the alternatives section.
Force-pushed from `82e83ca` to `8b7235a`.
ajwerner
left a comment
I'm realizing I somehow didn't push a set of updates.
I probably just missed it, but where are the protected timestamp Records going to be stored? Or, more specifically, where is the State proto (which contains all of the Records) going to be stored? A new key in the system key space? I don't see a problem with that, I just didn't see it mentioned.
From the just pushed High Level Overview I thought I had pushed yesterday:
The protection record will be
stored in a system table as an encoded protocol buffer which declares the
timestamps being protected, the relevant spans, and some metadata which might
help a background reconciliation job to delete orphaned protection records. The
From the implementation section:
This RFC chooses to store the state in the `kv` database. An initial
implementation was implemented on top of SQL but the lack of access to the MVCC
timestamps of rows and the accompanying complexity that created while trying to
provide invariants was deemed to not be worth it.
* There is a meta row which stores the aggregate counts and the version.
* The meta row is transactionally updated on every write.
* The meta row enables cheap caching at the cost of high contention; writes
are certain to be pushed, see the `Tracker`.
* The meta row key is defined as a constant in the `keys` package.
* The value is an encoded `ptpb.Metadata`.
* The records themselves are stored in a system table which has a well-known ID.
* The `protected_timestamp_records` table is defined in the `sqlbase` package.
It will be created on bootstrap and in a `sqlmigrations` migration.
* The primary key of the table is a UUID. The other `bytes` column contains
encoded `ptpb.Record` values.
I ask the above because while reading this doc I was imagining that the protected timestamp state could be stored in the zone config.
My initial problem with that approach is that there's no existing contract on when a zone config will be seen by a range's leaseholder. I'm not sure I adequately explored that approach.
Perhaps focusing on improving zone configuration propagation would have been a worthwhile avenue. An earlier iteration of this RFC attempted to coordinate writes to the protected timestamp state with zone configurations and was much more complicated but that approach also tried to achieve stronger semantics.
docs/RFCS/20191009_gc_protected_timestamps.md, line 104 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> prevents any range which contains data in those spans from garbage collecting data at that timestamp

This is now best-effort, right?
Yes, added more words here.
docs/RFCS/20191009_gc_protected_timestamps.md, line 167 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Would this verification compose nicely with the transactional interface to provide the strong guarantee if any caller ever did need it (at the expense of perf)? In other words, could the verification be performed synchronously if necessary?
I believe it would be straight-forward to implement a transaction guarantee. Here's a straw man:
Imagine a protocol whereby an intent were laid down on some replicated range key. This key's MVCC timestamp could be used as a lower bound for reading MVCC state. The guarantee would follow if an intent were successfully laid down on every range under all of the spans and no garbage collection were to occur while an intent existed on that key. When the transaction commits, the resolved value will have been written at the timestamp of the transaction and thus any read of protected timestamp state at that timestamp will include the record. Some care will need to be taken to deal with splits. One idea would be to make the RHS of a split initialize its value from the LHS. We could do something similar to what we did with descriptors by allowing an hlc timestamp to be populated into the record at that key and then backfill from the MVCC timestamp of the row if the record contains a zero value.
docs/RFCS/20191009_gc_protected_timestamps.md, line 272 at r2 (raw file):
So does it mean run GC with now set to this timestamp?
Yes, that's exactly what I meant it to mean. To run GC for a timestamp, in the parlance of this document, means to attempt to run a GC process which sets the GC Threshold to timestamp - TTL. I'll spell that out in this document.
docs/RFCS/20191009_gc_protected_timestamps.md, line 140 at r3 (raw file):
Quoted 4 lines of code…

> the same transaction which creates the jobs. The protection record will be
> stored in a system table as an encoded protocol buffer which declares the
> timestamps being protected, the relevant spans, and some metadata which might
> help a background reconciliation job to delete orphaned protection records. The
And I just realized I was reading a version that was several days old...
Good point about the lack of contract on when zone config updates are seen. I think it is worth calling this alternative out in the alternatives section as it seems reasonable on the surface.
bdarnell
left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 96 at r2 (raw file):
why we'd prefer to fail to service requests when we don't need to.
Because that's what the user asked us to do. They may have data retention/deletion obligations for example, that data is gone (or at least inaccessible) after some specific period of time. The fact that we still have the data and could serve it is not necessarily a good thing; we should aim to at least act as if data is instantaneously vaporized when it reaches the TTL.
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
Correct, the idea is that writing the protection record itself is best effort and if the use case really requires that the protection succeeded it can go verify the record after the transaction commits.
If it's best effort, why not create the job normally and create the protection record after the job is committed? I'm wary of tying this kind of special data to the transaction system; this turned out to be a mistake in the case of node liveness.
Concretely, if writes to the protected timestamp records were not transactionally linked to anything else, they could more safely observe their timestamps, and by moving the timestamp into regular user-data space we could remove some novel requirements like being able to observe the MVCC timestamp of a read value. (I'll sketch out an alternative design based on this idea in a separate comment)
The worst case scenario is the protection fails.
What does this look like? When is it detected and reported?
docs/RFCS/20191009_gc_protected_timestamps.md, line 172 at r2 (raw file):
Previously, ajwerner wrote…
I don't think "before which" is the right wording here
I've attempted to sharpen the language.
I'd prefer to see keys and data structures defined up front.
Interesting. The data structures feel like an implementation detail to me. The semantics are what I'm trying to socialize up front. It isn't clear to me what belongs in the guide-level vs. the reference-level explanation.
Keys and data structures may be too much detail, but the thing I had missed up to this point in the document is the fact that this is centralized rather than range-local state. I tend to start understanding designs by looking at the state: what's stored and where?
docs/RFCS/20191009_gc_protected_timestamps.md, line 735 at r4 (raw file):
> important.
>
> ## Rationale and Alternatives
Having all nodes poll a single range smells bad; we usually prefer gossip for things that belong in memory on all nodes.
It's true that gossip has no guarantees around receipt, but I don't think that's necessary here. As long as the gossip message includes the MVCC timestamp at which it read, we can do the right thing. (I wouldn't overload zone configs for this, but I think we could do something similar)
Specifically, I would

- Make a new `system.protected_ts` table. It would go in the "ungossiped" system table space, but be gossiped by a similar-but-separate mechanism. It's separate because we want to re-gossip this data periodically whether it's changed or not, while we only want to gossip the full system config space when there are changes. (Maybe this isn't worth it and we should just put it in the same system config space)
- When gossiping this table, also gossip the MVCC timestamp at which the read occurred
- This is a regular SQL table; inserts to this table use `cluster_logical_timestamp()` to insert their commit timestamp as a regular column. Inserts to this table are done in their own implicit transaction.
- The GC queue uses the MVCC read timestamp from the protected TS gossip message to decide what timestamps are eligible for GC. If gossip propagation is slow, it will limit the GC process but not cause other issues.
My goal here is to make the persistent part of the system as "normal" as possible. This is still a special table because it's gossiped, but that's transient and relatively easier to change. The proposal in the RFC creates a persistent non-SQL table with special requirements around the reading of MVCC commit timestamps and seems like it will be more of a maintenance burden longer-term.
ajwerner
left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r2 (raw file):
> The worst case scenario is the protection fails.
>
> What does this look like? When is it detected and reported?
In the case of a backup or an index backfill it will look exactly the same as it does today, an error relating to a GC Threshold:
> BACKUP users TO 'nodelocal:///bar' AS OF SYSTEM TIME '-10h';
pq: exporting 2 ranges: batch timestamp 1572945366.506057069,0 must be after replica GC threshold 1572981299.836595329,0
The idea in this proposal is not to make such errors impossible in the general case but rather to make them impossible in the case where the data has not yet expired.
In the case of an IMPORT INTO the protection failure will be detected during an attempt to add an SSTable to a range which could not be reverted to the required timestamp. In today's code, if this operation were to happen the IMPORT INTO would hang trying to roll itself back forever. Now it will just fail.
This proposal makes the failure graceful if it is to happen and will prevent it in the case that no data had yet expired.
If it's best effort, why not create the job normally and create the protection record after the job is committed?
You could do that. It's a bit more work for the job implementer, but sure, absolutely. That could be accomplished by just stripping the *client.Txn from the interface as currently presented and running the operations in their own transactions. In the prototype, mostly to ease testing, I added logic to run operations in their own transaction. If we required operations to be run like this, would it ease your concerns?
As the proposal currently stands I envision the following workflow:
```
job statement execution:
  begin;
  write job record;
  write protected timestamp;
  modify table descriptors;
  commit;

job implementation:
  optionally verify that record was applied
  run
```
If the interface were non-transactional it could be the following:
```
job statement execution:
  begin;
  write job record;
  modify table descriptors;
  commit;

job implementation:
  if job state has protected timestamp record associated with it:
    attempt to read record
    if record does not exist:
      write record
  else:
    synthesize record, write it into the job state
    write record
  optionally verify that record was applied
  run
```
"best effort" is probably the wrong terminology. To me best effort implies that there are no conditions under which the operation is guaranteed to succeed. That's not what this proposal offers. This proposal says that so long as the protection occurs prior to any data expiring, the protection is guaranteed to succeed. It's only "best effort" in the case where some data might have expired. It's also confusing because the expiration terminology includes all of TTLs which could have covered the relevant spans.
In the general case it's difficult to know whether a piece of data has ever expired under changing TTLs. The case that I keep coming back to is where a TTL starts out very short and then is changed to be very long. You might expect under the zone configuration to be able to protect some piece of timestamp which is well within the current TTL and furthermore we might have GC'd the previous zone config but it's possible that the data was already GC'd.
I'm wary of tying this kind of special data to the transaction system; this turned out to be a mistake in the case of node liveness.
That's a fair concern. Fortunately this state is outside of the critical path. What scenarios are driving your concerns?
Here's two failure modes I could come up with:
- the protected timestamp data becomes unavailable and because of that all garbage collection stops.
In general cockroach clusters cannot withstand long-term system range unavailability (think table leases for example). I can imagine several mitigations to this. One might be a cluster setting to disable protected timestamps.
- a bug leads to large amounts of traffic against this single range
Again a cluster setting to disable the subsystem seems reasonable. Furthermore I'm not clear how much we want to involve bugs in our threat model. We already know we have bugs in caching in the SQL schema land which can lead to large query volumes that are O(queries).
Node liveness is problematic because it needs to be available for leases and leases are a prerequisite for everything. I'm hopeful that this is much less risky because the only critical system functionality it will preclude is GC.
docs/RFCS/20191009_gc_protected_timestamps.md, line 266 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The following conditions which must be true in order to run GC at all.
nit: none of the conditions are predicates, so the "must be true" doesn't make sense.
Also, it's worth reiterating that this is all from the context of a single Replica (that's what "our" is)
Tried to reword this whole section to make it clearer.
docs/RFCS/20191009_gc_protected_timestamps.md, line 268 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> expired since

This is difficult to reason about. The second sentence here is more clear. Consider re-wording the first to say something like "for data which was not expired at the timestamp of our last reading of the protected TS state".
reworded.
docs/RFCS/20191009_gc_protected_timestamps.md, line 276 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> verifying client

It took me a few reads of this to understand what this meant. It's pretty easy to get it mixed up with a "protecting client" because we haven't actually formally introduced verification yet. Do you mind 1) linking to the "verification" section and 2) making it clear that this is a client of this particular Range.
I tried to make it clearer.
docs/RFCS/20191009_gc_protected_timestamps.md, line 318 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is it a lower-bound?
"of the timestamp"
See if this is clearer.
docs/RFCS/20191009_gc_protected_timestamps.md, line 593 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How does this request ensure that it's not operating on a stale leaseholder with a potentially skewed clock? Does it operate at a certain timestamp?
This is a good question. I believe the answer to the last question is yes but I'll need some help here to get the terminology right.
The `AdminVerifyProtectedTimestampRequest` needs a timestamp higher than the timestamp at which the record was written. This is certainly going to be the case because such a request cannot be created without reading the record.
ajwerner
left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 735 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Having all nodes poll a single range smells bad; we usually prefer gossip for things that belong in memory on all nodes.
It's true that gossip has no guarantees around receipt, but I don't think that's necessary here. As long as the gossip message includes the MVCC timestamp at which it read, we can do the right thing. (I wouldn't overload zone configs for this, but I think we could do something similar)
Specifically, I would

- Make a new `system.protected_ts` table. It would go in the "ungossiped" system table space, but be gossiped by a similar-but-separate mechanism. It's separate because we want to re-gossip this data periodically whether it's changed or not, while we only want to gossip the full system config space when there are changes. (Maybe this isn't worth it and we should just put it in the same system config space)
- When gossiping this table, also gossip the MVCC timestamp at which the read occurred
- This is a regular SQL table; inserts to this table use `cluster_logical_timestamp()` to insert their commit timestamp as a regular column. Inserts to this table are done in their own implicit transaction.
- The GC queue uses the MVCC read timestamp from the protected TS gossip message to decide what timestamps are eligible for GC. If gossip propagation is slow, it will limit the GC process but not cause other issues.
My goal here is to make the persistent part of the system as "normal" as possible. This is still a special table because it's gossiped, but that's transient and relatively easier to change. The proposal in the RFC creates a persistent non-SQL table with special requirements around the reading of MVCC commit timestamps and seems like it will be more of a maintenance burden longer-term.
One question I have is whether you feel that gossiping the data is really much better than gossiping metadata and having the nodes poll the data in the case when it changes. I suspect it's six one way, half dozen the other, especially as we're talking about kilobytes of data.

I'm very open to making the stored data "normal" sql tables; I have a branch with roughly the same interfaces typed using sql tables. In that schema the spans are interleaved into the record. Thoughts on that?
The issue I see with observing the timestamp is that it makes it difficult to inspect the aggregate dataset, either through a meta row or by scanning all of the records without getting pushed. If we had a select for update mechanism it would help here. I think it’s probably a good idea to limit the total size of data in this subsystem. If there’s no limits or accounting then this isn’t a problem.
ajwerner
left a comment
There was a problem hiding this comment.
Thanks for helping me think through the problems in this proposal!
Taking a step back I believe that we can achieve the same interface proposed in the current iteration on top of vanilla sql tables with transactions which don't need to observe their own timestamp or do anything fancy. I'm going to take a stab at implementing the exact same client facing interface with state backed by system sql tables. Record creation will still be transactional and not verified but there will be zero timestamp shenanigans.
I'm also thinking about augmenting the verification so that it can durably store the verification state on records.
Here was my thought process (other than Ben's thoughtful comments) that led to this new inspiration:
The big question w/regards to state are the medium (encoded protos in kv vs sql) and the layout. Then there's the question of caching and distribution.
sql table[s] or encoded KVs? The only real reason I opted for KVs was the desire to observe the row timestamps.
- KVs
  - pro: can access mvcc timestamps of individual keys.
  - pro: easier to be efficient (which we probably don't need to be).
  - con: unusual and less empathetic with our users and other devs
  - con: low level
  - con: will become a maintainability problem
- SQL
  - pro: easier to interact with and evolve over time
  - pro: lower maintenance burden
  - con: harder to use in an ergonomic way if you want to be efficient (which we probably don't need to be). Though obviously you can also interact with the sql encoded data as KVs if efficiency was important.
  - con: no access to row mvcc time without observing commit timestamp.
Fortunately this isn't a problem!
The reason I was using the MVCC timestamp was to be able to determine a timestamp after which reads of the protected timestamp state will always see some record. The MVCC timestamps provide a minimal causality token.
The only use of the causality token is to verify that the record actually protects data. We can use a much simpler causality token which isn't minimal. For the verification requests we can just observe the timestamp at which the record was read. Then, when verifying on a leaseholder, we ensure either that the current tracker state has the record or, if it doesn't, we tell the replica it can't GC until it either sees the record or the causality token timestamp has passed. This should get rid of any dependency on observing the commit timestamp directly.
Regardless of the above stuff:
- Meta table?
- Scan all records?
To answer those questions we need to talk about some other features and dependencies.
- Limits?
- Data distribution and caching?
We could hook the polling into the leaseholder replica of the table and gossip the state from there. We should be able to do it in such a way that it doesn't even need to poll when there are no writes, right? This feels pretty special.
Alternatively, we could just use the leaseholder property to periodically scan the durable state and not hook into the store in any low-level way. Eventually we could move this polling and gossiping to the jobs system.
This feels less special.
Data distribution:
- Gossip all the records?
- Gossip metadata and scan rows from each host?
Zone Configs:
As for using zone configs, I'll augment the document when I can fully articulate why I feel so bad about coupling this subsystem to that one, but here are some somewhat rambling thoughts:
It feels like we could store protections in zone configs. It is also true that we already have MVCC timestamps on zone configs (not exposed) due to an implementation detail of the way the system config range is gossiped. They could be used and augmented to provide the guarantees we need. I can also imagine wanting transactional guarantees about both the TTL and the protection state at the same time.
My unorganized concerns are:
- The zone config subsystem is already quite large and has many touch points. I really don't want to have to pull along all of the existing clients.
- The state of zone configs is a relatively public concept which is generally modified by users somewhat directly. It's less clear that protection records should be public.
- Zone configs don't seem to be the place to store all of this system state. We need a way to find and remove all of the zone configs which correspond to one logical protection, which implies some additional storage. If we're going to have additional storage somewhere outside of the zone configs, why tie them in at all?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @dt, @knz, @nvanbenschoten, and @tbg)
sumeerbhola
left a comment
This RFC came up in a discussion about doing GC in RocksDB/Pebble (e.g. in compactions) -- I have some basic questions about the policy specification.
docs/RFCS/20191009_gc_protected_timestamps.md, line 115 at r4 (raw file):
* Protection Mode: each record has a protection mode. There are two protection modes, `PROTECT_AT` and `PROTECT_AFTER`. `PROTECT_AT` ensures
what is the use case for PROTECT_AFTER?
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r4 (raw file):
storage layer. In this proposal, leaseholders will need to consult the global protected timestamp state in order to prove that it is safe to move the GC threshold. In practice this leads to a relatively lightweight protocol (no
this statement about "move the GC threshold" makes me think that the low-level mechanism is still "PROTECT_AFTER t" and the value of t is being adjusted based on the GC TTL and the protected timestamp records. That is, we are adding a more sophisticated policy configuration and eventually will augment the low-level mechanism, but not now. Is that correct? If so, it may be worth calling that out.
docs/RFCS/20191009_gc_protected_timestamps.md, line 177 at r4 (raw file):
GC occurs based on a heuristic which estimates how much garbage would be collected if GC were to run. Adapting this heuristic to closed timestamps is an
did you mean protected timestamps?
docs/RFCS/20191009_gc_protected_timestamps.md, line 180 at r4 (raw file):
open issue in this RFC. The naive approach, which is to prevent all GC on ranges which are covered by a protected timestamp record, will be initially pursued here. See open questions at the bottom of the document.
I was expecting PROTECT_AT t1 when starting a full backup at t1 and leaving it in place. Then incremental backup starting at t2 says PROTECT_AT t2, and when it finishes it removes PROTECT_AT t1 and so on. So there is always at least one PROTECT_AT in the system.
In this proposal, how is t1 going to be protected until the incremental backup at t2 finishes?
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @dt, @knz, @nvanbenschoten, @sumeerbhola, and @tbg)
docs/RFCS/20191009_gc_protected_timestamps.md, line 96 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
why we'd prefer to fail to service requests when we don't need to.
Because that's what the user asked us to do. They may have data retention/deletion obligations for example, that data is gone (or at least inaccessible) after some specific period of time. The fact that we still have the data and could serve it is not necessarily a good thing; we should aim to at least act as if data is instantaneously vaporized when it reaches the TTL.
Alright, that makes some sense. I can see a case where, in a pinch, somebody would prefer the protection mechanism to be less strict, but I suppose it'd be adequately flexible if we just forced the user to extend the GC TTL back over some timestamp which had previously expired but happens to still be live.
I haven't yet come up with a good answer on how or where to enforce this property. It seems to me that if we protect some timestamp and then shorten the GC TTL, it would be bad to somehow not respect the protected timestamp. It seems easier to enforce at verification time, where the verifier decides whether the TTLs permit the protection; alternatively it could be done above protection.
I suspect we're fine with allowing protection of a timestamp which expires before the commit timestamp but after the statement timestamp.
docs/RFCS/20191009_gc_protected_timestamps.md, line 115 at r4 (raw file):
PROTECT_AFTER
One reason is to support `WITH revision_history = true` for backups. You might want that to protect a time range, but realistically, if you're taking a revision history backup you probably want to later take another one that has the revision history in between, so you may as well just keep moving up a PROTECT_AFTER. I'm not clear on exactly why people take revision history backups. I can see why having the history is interesting; we just don't provide any tools to really usefully analyze the temporal dimension of keys. It would be cool to add support for temporal SQL where you could issue queries to detect predicates over changes in rows in a time range, but that's a whole other thing.
See https://www.cockroachlabs.com/docs/stable/backup.html#parameters
The more compelling use case I see is preserving CHANGEFEEDs in the face of pauses. Each feed maintains a cursor in time which it periodically checkpoints. If any data since the cursor has been GC'd when a feed is resumed, the feed will fail. I can imagine that good use cases of changefeeds going into Kafka will correspond to workloads with high rates of data updates; in those cases we'd like to keep the GC TTL small generally. When the sink is down or maintenance needs to be performed, it'd be nice to be able to pause a mission-critical feed and know you could resume it to catch everything you missed.
I'll add this to the doc.
docs/RFCS/20191009_gc_protected_timestamps.md, line 152 at r4 (raw file):
Previously, sumeerbhola wrote…
this statement about "move the GC threshold" makes me think that the low-level mechanism is still "PROTECT_AFTER t" and the value of t is being adjusted based on the GC TTL and the protected timestamp records. That is, we are adding a more sophisticated policy configuration and eventually will augment the low-level mechanism, but not now. Is that correct? If so, it may be worth calling that out.
Yes, that's correct. I can add a note about that here.
To move the GC threshold above a PROTECT_AT record, we'd need to move the record (or at least some part of it) into the replicated range state and then modify the GC code to take those records into account. I'm not sure how hard that itself is. The thing I haven't gotten too deep into is how these protected timestamps will interact with the GC heuristics. That being said, I don't have a great plan for dealing with the heuristic and PROTECT_AFTER either.
docs/RFCS/20191009_gc_protected_timestamps.md, line 177 at r4 (raw file):
Previously, sumeerbhola wrote…
did you mean protected timestamps?
I did, thanks.
docs/RFCS/20191009_gc_protected_timestamps.md, line 180 at r4 (raw file):
Previously, sumeerbhola wrote…
I was expecting PROTECT_AT t1 when starting a full backup at t1 and leaving it in place. Then incremental backup starting at t2 says PROTECT_AT t2, and when it finishes it removes PROTECT_AT t1 and so on. So there is always at least one PROTECT_AT in the system.
In this proposal, how is t1 going to be protected until the incremental backup at t2 finishes?
That's a great point and a completely unexplored detail. You're right, that's how it would have to work.
I'm envisioning something at the Bulk IO level where we implement perhaps something like a backup scheduler which is itself a long-running job which can hold on to these protected timestamps and schedule full or incremental backups. I need some design help from Bulk IO to decide how exactly the lifecycle of records are going to be defined and maintained.
Force-pushed 8b7235a to 1dbe88a.
ajwerner
left a comment
docs/RFCS/20191009_gc_protected_timestamps.md, line 96 at r2 (raw file):
Previously, ajwerner wrote…
Alright, that makes some sense. I can see a case where in a pinch somebody would prefer to have the protection mechanism be less strict but I suppose it'd be adequately flexible if we just forced the user to extend the GC TTL back over some timestamp which had previously expired but happens to still be live.
I haven't yet come up with a good answer on how or where to enforce this property. It seems to me like if we protect some timestamp and then shorten the GC ttl then it would be bad to somehow not respect the protected timestamp. It seems easier to do at verification time where the verifier decides whether the TTLs permit the protection. Alternative it could be done above protection.
I suspect we're cool with it being okay to protect some timestamp which expires before the commit timestamp but after the statement timestamp.
Moved this to open questions.
docs/RFCS/20191009_gc_protected_timestamps.md, line 115 at r4 (raw file):
Previously, ajwerner wrote…
PROTECT_AFTER
One reason is to support WITH revision_history = true for Backups. You might want that to protect a time range but realistically if you're taking a revision history backup you probably want to later take another one in the future that has the revision history in between so you may as well just keep moving up a protect after. I'm not clear on exactly why people would take revision history backups. I can see why having the history is interesting, we just don't provide any tools to really usefully analyze the temporal dimension of keys. It would be cool to add support for temporal SQL where you could issue queries to detect predicates over changes in rows in a time range but that's a whole other thing. See https://www.cockroachlabs.com/docs/stable/backup.html#parameters
The more compelling use case I see is for preserving CHANGEFEEDs in the face of pauses. Each feed maintains a cursor in time which it periodically checkpoints. If any data since the cursor has been GC'd when a feed is resumed, the feed will fail. I can imagine good use cases of changefeeds going in to Kafka will correspond to some workloads with high rates of data updates. In those cases we'd like to keep the GC TTL small generally. In cases where the sink is down or maintenance needs to be performed it'd be nice to be able to pause this mission critical feed and know you could resume it to catch everything you missed.
I'll add this to the doc.
Added some more words. The other, perhaps crazy, idea that would benefit from this would be to expose the temporal dimension of rows to SQL. That's a whole can of worms, but it fascinates me.
docs/RFCS/20191009_gc_protected_timestamps.md, line 177 at r4 (raw file):
Previously, ajwerner wrote…
I did, thanks.
Done.
docs/RFCS/20191009_gc_protected_timestamps.md, line 180 at r4 (raw file):
Previously, ajwerner wrote…
That's a great point and a completely unexplored detail. You're right, that's how it would have to work.
I'm envisioning something at the Bulk IO level where we implement perhaps something like a backup scheduler which is itself a long-running job which can hold on to these protected timestamps and schedule full or incremental backups. I need some design help from Bulk IO to decide how exactly the lifecycle of records are going to be defined and maintained.
Added a future work subsection about this.
nvb
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @dt, @knz, @sumeerbhola, and @tbg)
docs/RFCS/20191009_gc_protected_timestamps.md, line 153 at r2 (raw file):
Forking thread:
In the case of an IMPORT INTO the protection failure will be detected during an attempt to add an SSTable to a range which could not be reverted to the required timestamp. In today's code, if this operation were to happen the IMPORT INTO would hang trying to roll itself back forever.
We're going to need a partial revert range mechanism, right? Or at least some tracking during an IMPORT of which ranges have been imported into and which haven't, in order to prevent the RevertRequest from going to a Range that will then throw the same GC threshold error.
Or are we going to verify that the entire import keyspace is protected first before importing anything? I think I prefer that approach. What about @dt?
EDIT: this is addressed below. Might be worth spelling out all the difficulty of allowing verification to be concurrent with importing.
docs/RFCS/20191009_gc_protected_timestamps.md, line 276 at r2 (raw file):
Previously, ajwerner wrote…
I tried to make it clearer.
Thanks.
docs/RFCS/20191009_gc_protected_timestamps.md, line 318 at r2 (raw file):
Previously, ajwerner wrote…
See if this is clearer.
I'm still thrown off by the term "permitted". Isn't this "the minimum timestamp at which the leaseholder must read the protected timestamp state in order to run GC"?
docs/RFCS/20191009_gc_protected_timestamps.md, line 593 at r2 (raw file):
Previously, ajwerner wrote…
This is a good question. I believe the answer to the last question is yes but I'll need some help here to get the terminology right.
The `AdminVerifyProtectedTimestampRequest` needs a timestamp higher than the timestamp at which the record was written. This is certainly going to be the case because such a request cannot be created without reading the record.
I still think it's worth spelling out exactly how this is safe across both cooperative and non-cooperative lease transfers, especially if the verification can end up on a stale leaseholder. Is the idea that a new leaseholder (which missed the supposed verification) will necessarily need to read the protected timestamp state at a timestamp higher than the record's (and therefore observe it), so it doesn't need explicit verification?
In other words, how do we ensure that verifying on a stale leaseholder isn't making false promises?
docs/RFCS/20191009_gc_protected_timestamps.md, line 325 at r5 (raw file):
GC Threshold (i.e. `now - gc.ttlseconds`). * The GC timestamp must be newer the lease start time.
"newer than"
docs/RFCS/20191009_gc_protected_timestamps.md, line 610 at r5 (raw file):
Note that this does not solve the problem of keeping data alive between incremental ba
Looks like this got cut off.
This commit provides a draft RFC for a subsystem to protect values in spans of data alive at specific timestamps from being garbage collected. Release note: None
Force-pushed 1dbe88a to e601047.
ajwerner
left a comment
@nvanbenschoten thanks for taking a look. I'm hopeful that this is close. My biggest issue as it stands is how to think about the GC heuristic in the case where we do have a protected timestamp but also a large quantity of garbage which is older than that timestamp. In the short term I'm not too worried about that case, as I don't think these protected timestamps are going to be used that frequently, and in the use cases in which they are used this extra garbage lifetime is okay.
All that being said, coming up with a better heuristic which is compatible with PROTECT_AT is probably the biggest blocker to implementing PROTECT_AT. PROTECT_AT seems like a pretty worthwhile optimization if we care about short GC TTLs and a backup schedule.
I've cleaned up an implementation of the design put forth here that I'll point you at soon (I want to wait for #42521 to merge and rebase on it, and I need to adopt the gossiping).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @dt, @knz, @nvanbenschoten, @sumeerbhola, and @tbg)
docs/RFCS/20191009_gc_protected_timestamps.md, line 153 at r2 (raw file):
Yeah in response to this discussion I decided to pitch doing verification explicitly before starting an IMPORT INTO. That's what's implemented.
Might be worth spelling out all the difficulty of allowing verification to be concurrent with importing.
I added this to alternatives considered; see "Ensuring that AddSSTable requests can be reverted". It's not that it's difficult per se, just more complicated than it needs to be.
docs/RFCS/20191009_gc_protected_timestamps.md, line 172 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Keys and data structures may be too much detail, but the thing I had missed up to this point in the document is the fact that this is centralized rather than range-local state. I tend to start understanding designs by looking at the state: what's stored and where?
Okay, I've tried to call it out more explicitly several times.
Intro paragraph:
Protected timestamps records are stored in a regular SQL table which
is asynchronously broadcast into the `storage` layer's data GC process to
provide the specified semantics.
The protection record will be
stored in a system table which declares the timestamps being protected, the
relevant spans, and some metadata which might help a background reconciliation
job to delete orphaned protection records. The mere existence of a record does
not itself prove that the data has been protected; the protection semantics are
undefined if any data could have expired and has not been verified. However, if
the timestamp being protected is not already very close to expiring, the
protection will succeed. The exact semantics are outlined below. A synchronous
verification can be performed after the record is created for clients which need
strong guarantees.
docs/RFCS/20191009_gc_protected_timestamps.md, line 318 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm still thrown off by the term "permitted". Isn't this "the minimum timestamp at which the leaseholder must read the protected timestamp state in order to run GC"?
I've reworded this part and updated it to be informed by the implementation. I hope it's clearer.
docs/RFCS/20191009_gc_protected_timestamps.md, line 593 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I still think it's worth spelling out exactly how this is safe across both cooperative and non-cooperative lease transfers. Especially if the verification can end up on a stale leaseholder. Is the idea that a new leaseholder (which missed the supposed verification) will necessarily need to read the record at a higher timestamp than the record (and therefore observe it) at a timestamp, so it doesn't need explicit verification?
In other words, how do we ensure that verifying on a stale leaseholder isn't making false promises?
I referred back to the guide level explanation, largely because I'm lazy. I can restate it all or perhaps swap these two descriptions.
docs/RFCS/20191009_gc_protected_timestamps.md, line 799 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
At least for the first iteration of this, but depending on how much use this gets, we may need to quickly improve on this issue.
Yeah this is the biggest problem as far as I'm concerned.
docs/RFCS/20191009_gc_protected_timestamps.md, line 610 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Looks like this got cut off.
Whoops. Added some more words. This section could be more filled out.
nvb
left a comment
This LGTM as far as I'm concerned. Thanks for the multiple iterations. I think it's landed in a pretty good place.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @bdarnell, @dt, @knz, @nvanbenschoten, @sumeerbhola, and @tbg)
ajwerner
left a comment
Thanks for the review! I'm going to merge this later today as a draft if nobody adds any new comments. None of this is set in stone but this PR and its discussions are starting to get stale. I'll then put up the review for the implementation as reference and start putting up the individual sub-commits.
I think the biggest short term open question is whether we're sad about the protobuf API vs vanilla go structs. I'd like to leave that conversation for the review of the code.
ajwerner
left a comment
bors r=nvanbenschoten
I'll keep updating this as the implementation continues. Thanks for all the time everybody.
41806: docs/RFCs: draft RFC for protected timestamps r=nvanbenschoten a=ajwerner
This commit provides a draft RFC for a subsystem to protect values in spans of data alive at specific timestamps from being garbage collected. Release note: None
42053: changefeedccl: add scan boundaries based on change in set of columns r=danhhz a=aayushshah15
Currently, the changefeed poller detects a scan boundary when it detects that the last version of a table descriptor has a mutation but the current version doesn't. In the case of an `ALTER TABLE DROP COLUMN` statement, the point at which this happens is the point at which the schema change backfill completes. This is incorrect, since the dropped column is logically dropped before this point. This PR corrects the problem by instead checking that the last version of the descriptor has a mutation AND that the number of columns in the current table descriptor differs from the number of columns in the last table descriptor. Fixes #41961. Release note (bug fix): Changefeeds now emit backfill row updates for a dropped column when the table descriptor drops that column.
42494: storage: Implement teeing Engine to test/compare pebble and rocksdb r=itsbilal a=itsbilal
This change adds a new Engine implementation, TeePebbleRocksDB, along with associated objects (batches, snapshots, files, iterators). This engine type spins up instances of both pebble and rocksdb under the hood, writes to both of them in all write paths, and compares their outputs in the read path. A panic is thrown if there's a mismatch. This engine can be used in practice with `--storage-engine pebble+rocksdb`, or the related env variable `COCKROACH_STORAGE_ENGINE`, as in `COCKROACH_STORAGE_ENGINE=pebble+rocksdb make test ...`. Fixes #42381. Release note: None
42635: importccl: don't disable nullif option when escape character is specified r=spaskob a=spaskob
By default the string 'NULL' is parsed as a null value in DELIMITED IMPORT mode. It's useful to treat 'NULL' as just a normal string and parse it verbatim; this can be accomplished by providing the option `WITH fields_escaped_by = '\'` and using '\N' to specify null values, in which case the parser, upon seeing the string 'NULL', will treat it as a normal string. However, if the customer provides their own null string via `WITH nullif = 'my_null'`, it does not make sense to disregard it if they also use escaping for other purposes, for example escaping the field separator. A typical use case is when the empty string should be treated as null. Fixes: https://github.com/cockroachlabs/support/issues/345. Release note (bug fix): when a custom nullif is provided, always treat it as null.
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Spas Bojanov <spas@cockroachlabs.com>
Build succeeded