rfc: Bounded Staleness Reads by nvb · Pull Request #66020 · cockroachdb/cockroach

nvb · 2021-06-03T03:05:22Z

Bounded staleness reads are a form of historical read-only queries that use a dynamic, system-determined timestamp, subject to a user-provided staleness bound, to read from nearby replicas while minimizing data staleness. They provide a new way to perform follower reads off local replicas to minimize query latency in multi-region clusters.

Bounded staleness reads complement CockroachDB's existing mechanism for performing follower reads, which was originally proposed in this RFC and later adopted in this RFC. This original form of follower reads is more precisely classified as an exact staleness read, meaning that the read occurs at a statically chosen timestamp, regardless of the state of the system.

Exact staleness and bounded staleness reads can exist side-by-side, as there are trade-offs between the two in terms of cost, staleness, and applicability. In general, bounded staleness reads are more powerful because they minimize staleness while being tolerant to variable replication lag, but they come at the expense of being more costly and usable in fewer places.

Bounded staleness queries are limited in use to single-statement read-only queries, and only a subset of read-only queries at that. They will be accessed in the same way as exact bounded staleness reads - through a pair of new functions that can be passed to an AS OF SYSTEM TIME clause:

SELECT ... FROM ... AS OF SYSTEM TIME with_min_timestamp(TIMESTAMP)
SELECT ... FROM ... AS OF SYSTEM TIME with_max_staleness(INTERVAL)

The approach discussed in this RFC has a prototype in #62239 which, while not identical to what is proposed here, is similar and demonstrates the high-level changes that are needed to support bounded staleness reads.

The RFC lays out a three-stage progression towards the generalized implementation of bounded staleness reads. We only intend to implement the first step in the v21.2 release.

cockroach-teamcity · 2021-06-03T03:05:30Z

This change is

ajstorm

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

docs/RFCS/20210519_bounded_staleness_reads.md, line 31 at r1 (raw file):

Bounded staleness queries are limited in use to single-statement read-only
queries, and only a subset of read-only queries at that. They will be accessed
in the same way as exact bounded staleness reads - through a pair of new

Nit: "The will be accessed in the same way as exact bounded staleness reads, using the AS OF SYSTEM TIME clause, but with the addition of a pair of of new functions..."

It's a minor point, but in my reading of it it's only partially the same way.

docs/RFCS/20210519_bounded_staleness_reads.md, line 65 at r1 (raw file):

that they **minimize staleness** when possible. Whereas with exact staleness
reads, which must conservatively pick a historical query timestamp in advance
which provides a sufficiently high probability of being served locally, with

Nit: This seems to imply that the main (only?) reason to use exact staleness reads is for local access. I was under the impression though, that there are other motivations as well (e.g. conflict avoidance). Is it worth calling that out here that these benefits are specifically in cases where exact avoidance is being used for locality purposes?

docs/RFCS/20210519_bounded_staleness_reads.md, line 99 at r1 (raw file):

read-only transactions. This means that explicit transactions, those surrounded
by a `BEGIN` and `COMMIT` statement, can not use bounded staleness reads. The
reason for this limitation is that the selection of the transaction timestamp

Nit: Might just be me but I got hung up on which "transaction timestamp" you were referring to (one somehow used in explicit transactions, or one used for the bounded staleness read). If this is a reasonable thing to get hung up on, you might want to make it clearer here that you're referring to the timestamp for the bounded staleness read (and if you're not, I'm still confused :-P ).

docs/RFCS/20210519_bounded_staleness_reads.md, line 116 at r1 (raw file):

range. Because range splits can be added between any two rows in a table in
response to size and load factors, this effectively limits queries to those that
touch a single SQL row.

It might be nice here to tie back to the above stages and clarify what this means to users in each of the stages. Also, how does this impact index access. For example, is the first stage single rows, primary index access only?

docs/RFCS/20210519_bounded_staleness_reads.md, line 134 at r1 (raw file):

that if the closest replica (by latency) to the gateway is reachable, has a
closed timestamp at or above the staleness bound, and has no outstanding intents
at or below the staleness bound, the read will be served without blocking.

Does it make sense to provide externals which fail the query if locality and "non-blocking" can't be achieved? I could imagine cases where an customer would rather things to fail fast, than having latency increase by a multiple orders of magnitude.

docs/RFCS/20210519_bounded_staleness_reads.md, line 446 at r1 (raw file):

The change will introduce two new SQL builtin functions: 
- `with_min_timestamp(TIMESTAMP) -> TIMESTAMP`

This may be implied, but are there any limits to the value of TIMESTAMP? Does a value of 0 result in "give me whatever you can find" semantics?

docs/RFCS/20210519_bounded_staleness_reads.md, line 474 at r1 (raw file):

with the `min_timestamp_bound`, or it uses the `kv.BoundedStalenessNegotiator`
and the `min_timestamp_bound` to negotiate a query timestamp ahead of time and
then run the statement's batches with this timestamp.

Nit: runs

docs/RFCS/20210519_bounded_staleness_reads.md, line 587 at r1 (raw file):

needed for query processing, permission data may be needed for authentication).

This dramatically restricts the scope of the proposal, but it also limits its

Have we given any thought as to how we're going to document this limitation? It seems quite subtle, and requires a fair bit if system knowledge to fully comprehend.

docs/RFCS/20210519_bounded_staleness_reads.md, line 607 at r1 (raw file):

compromised availability after a schema change. This means that during periods
of strong read unavailability (e.g. gateway partitioned from leaseholder), a
schema change can lead to bounded staleness unavailability because it places a

How will this unavailability manifest? Queries fail? Hang?

petermattis

This was a nice read. A few tiny drive-by comments. I'm a bit too distant from these code areas to offer more substantial commentary.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

docs/RFCS/20210519_bounded_staleness_reads.md, line 20 at r1 (raw file):

and later adopted in [this RFC](20181227_follower_reads_implementation.md).
This original form of follower reads is more precisely classified as an
exact stateless read, meaning that the read occurs at a statically chosen

Nit: s/stateless/staleness/g (same typo in the commit message, btw)

docs/RFCS/20210519_bounded_staleness_reads.md, line 86 at r1 (raw file):

reads.

`with_max_staleness(INTERVAL)` defines a maximum staleness interval to perform

Is with_max_staleness(INTERVAL) syntactic sugar on top of with_min_timestamp(now() - INTERVAL)? I didn't see this called out anywhere explicitly, but it seems likely that it is given the KV enhancements only refer to a min_timestamp_bound.

docs/RFCS/20210519_bounded_staleness_reads.md, line 265 at r1 (raw file):

reducing it to an O(num_locks_in_range) operation.

One this is possible, we will be able to be more aggressive with our use of

Nit: s/One/Once/g

rytaft

Looks great, @nvanbenschoten!

Also, Becca, I'm interested to get your take on whether anything since you last saw this comes as a surprise. One area that you might want to focus on is the proposed implementation progression (see the Limitations section), as this will have an impact on the kinds of queries that the optimizer will need to allow now and in the future.

This doesn't seem too problematic, just means we'll either need to remove some of the checks or make them less strict for phases 2 and 3. For example, we'll disallow all joins until step 3, but at step 2 we can remove the check that a scan touches only a single row.

Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm and @nvanbenschoten)

docs/RFCS/20210519_bounded_staleness_reads.md, line 111 at r1 (raw file):

We define a **single-scan** query as one that contains only a single `ScanExpr`,
and by extension, only a single `TableReader`.

You could have multiple table readers for a scan that touches multiple nodes. Maybe you just need to add "since we're not using DistSQL, as mentioned below".

docs/RFCS/20210519_bounded_staleness_reads.md, line 116 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

It might be nice here to tie back to the above stages and clarify what this means to users in each of the stages. Also, how does this impact index access. For example, is the first stage single rows, primary index access only?

I think secondary indexes should be allowed in the first stage, as long they are covering (i.e., no index join is required)

docs/RFCS/20210519_bounded_staleness_reads.md, line 152 at r1 (raw file):

Next, we define the _local_resolved_timestamp_ for a given bounded staleness
read query as follows:

did you mean to use the python tag here and below too?

docs/RFCS/20210519_bounded_staleness_reads.md, line 163 at r1 (raw file):

This guarantee is subject to a limitation discussed in TODO.

Don't forget this TODO :)

docs/RFCS/20210519_bounded_staleness_reads.md, line 398 at r1 (raw file):

We may reconsider this decision in the future in response to customer feedback,
opting to introduce some form of asynchronous refresh mechanism. Alternatively,
we may also adding a TTL to these cache entries in addition to the LRU eviction

nit: we may also adding -> we may also consider adding

docs/RFCS/20210519_bounded_staleness_reads.md, line 477 at r1 (raw file):

The approach chosen depends on whether the query contains a single `ScanExpr`,
and by extension, a single `TableReader`. If so, the `Sender` will be handed the

would be good to add a reminder here that only local execution will be supported (i.e. no DistSQL)

docs/RFCS/20210519_bounded_staleness_reads.md, line 491 at r1 (raw file):

TODO(otan): where is the best place for this determination to live? In execution
or in the optimizer.

When you say "this determination", do you mean the call to the BoundedStalenessNegotiator? That seems like it should live in execution. The optimizer may want to be aware of the fact that negotiation will be required for costing purposes, but that seems lower priority.

docs/RFCS/20210519_bounded_staleness_reads.md, line 546 at r1 (raw file):

best-effort checks to see if a query might satisfy the requirements for bounded
staleness. For example, if a query joins multiple tables, it definitely cannot
use bounded staleness. Although in theory a join could be eliminated by a

definitely cannot use bounded staleness (in step 1)

docs/RFCS/20210519_bounded_staleness_reads.md, line 559 at r1 (raw file):

makes sense to add checks here to catch problems at prepare time. In addition to
the checks already included in TryPlaceholderFastPath, we’ll need to add an
additional check that the resulting PlaceholderScanExpr scans at most one row.

additional check (for step 1)

docs/RFCS/20210519_bounded_staleness_reads.md, line 568 at r1 (raw file):

makes use of hints to disallow certain types of plans, so we can reuse some of
that infrastructure here to avoid building plans that won’t be allowed for
bounded staleness. For example, we can use existing hints to disallow lookup and

For example, in step 1 we can use...

docs/RFCS/20210519_bounded_staleness_reads.md, line 715 at r1 (raw file):

Bounded staleness reads will be licensed under the CCL and will require a valid
enterprise license to use. This parallels exact staleness reads, which are
allowed under the BSL but will not result in followers unless run under the CCL

nit: result in followers -> result in follower reads?

tbg · 2021-06-04T07:53:52Z

docs/RFCS/20210519_bounded_staleness_reads.md

+
+We define a **single-range** query as one that touches data on only a single KV
+range. Because range splits can be added between any two rows in a table in
+response to size and load factors, this effectively limits queries to those that


This is just an implementational concern though, right? You're saying "because we're not going to really be able to check that all the things we're going to read are within a range we will be pessimistically assuming that all ranges are single-row". At that point it seems like stage 1 is really just single-row queries, and we should add single-range queries as a separate stage if we're even going to ever have them.

tbg · 2021-06-04T07:54:46Z

docs/RFCS/20210519_bounded_staleness_reads.md

+Bounded staleness reads strive to provide this guarantee. They make the claim
+that if the closest replica (by latency) to the gateway is reachable, has a
+closed timestamp at or above the staleness bound, and has no outstanding intents
+at or below the staleness bound, the read will be served without blocking.


no outstanding intents overlapping the requested range of keys

tbg · 2021-06-04T07:57:05Z

docs/RFCS/20210519_bounded_staleness_reads.md

+
+#### GetResolvedTimestamp efficiency
+
+`GetResolvedTimestamp` will initially be fairly expensive due to its scan for


In the short term, we should also be able to leverage the MVCCStats. When !ContainsEstimates && IntentCount == 0, then there aren't any intents.

nvb

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @petermattis, @rytaft, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 65 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: This seems to imply that the main (only?) reason to use exact staleness reads is for local access. I was under the impression though, that there are other motivations as well (e.g. conflict avoidance). Is it worth calling that out here that these benefits are specifically in cases where exact avoidance is being used for locality purposes?

Good point. Bounded staleness can also be used for conflict avoidance, but again, while minimizing staleness. I added a note.

docs/RFCS/20210519_bounded_staleness_reads.md, line 86 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is with_max_staleness(INTERVAL) syntactic sugar on top of with_min_timestamp(now() - INTERVAL)? I didn't see this called out anywhere explicitly, but it seems likely that it is given the KV enhancements only refer to a min_timestamp_bound.

Yes, it is. I added a mention to this.

docs/RFCS/20210519_bounded_staleness_reads.md, line 111 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

You could have multiple table readers for a scan that touches multiple nodes. Maybe you just need to add "since we're not using DistSQL, as mentioned below".

Good point. Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 115 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is just an implementational concern though, right? You're saying "because we're not going to really be able to check that all the things we're going to read are within a range we will be pessimistically assuming that all ranges are single-row". At that point it seems like stage 1 is really just single-row queries, and we should add single-range queries as a separate stage if we're even going to ever have them.

Yes, that's all correct. There's a decision to be made about whether we support all single-range queries because the KV layer can support it, or whether we limit support (in the optimizer) to guaranteed single-row queries so that users don't start relying on single-range queries and have their queries break once a range splits. I think the latter option is a better choice and that was the intention here. So we could call these "single-row" queries, but that then introduces the ambiguity Adam mentions about secondary indexes and index joins. I've clarified this.

docs/RFCS/20210519_bounded_staleness_reads.md, line 116 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

I think secondary indexes should be allowed in the first stage, as long they are covering (i.e., no index join is required)

Clarified.

docs/RFCS/20210519_bounded_staleness_reads.md, line 134 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Does it make sense to provide externals which fail the query if locality and "non-blocking" can't be achieved? I could imagine cases where an customer would rather things to fail fast, than having latency increase by a multiple orders of magnitude.

I went back and forth on this. I think it might. But I think this also might be the kind of policy that sounds good on the surface but is better served in practice with a statement timeout. If a user doesn't want to wait more than 10ms to serve a query, do they care why it's not returning?

docs/RFCS/20210519_bounded_staleness_reads.md, line 134 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

no outstanding intents overlapping the requested range of keys

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 152 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

did you mean to use the python tag here and below too?

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 163 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Don't forget this TODO :)

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 256 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In the short term, we should also be able to leverage the MVCCStats. When !ContainsEstimates && IntentCount == 0, then there aren't any intents.

Would we be ok with trusting this stats information? Based on #66268, it seems like we're a little hesitant.

docs/RFCS/20210519_bounded_staleness_reads.md, line 446 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

This may be implied, but are there any limits to the value of TIMESTAMP? Does a value of 0 result in "give me whatever you can find" semantics?

It's a good question. Beyond some point, there's an interaction with the GC threshold. So I think a value of 0 results in "give me whatever you can find unless its below the GC threshold, in which case, return an error". It will be a good test case to add.

docs/RFCS/20210519_bounded_staleness_reads.md, line 477 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

would be good to add a reminder here that only local execution will be supported (i.e. no DistSQL)

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 491 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

When you say "this determination", do you mean the call to the BoundedStalenessNegotiator? That seems like it should live in execution. The optimizer may want to be aware of the fact that negotiation will be required for costing purposes, but that seems lower priority.

Ack. Thanks.

docs/RFCS/20210519_bounded_staleness_reads.md, line 546 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

definitely cannot use bounded staleness (in step 1)

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 559 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

additional check (for step 1)

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 568 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

For example, in step 1 we can use...

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 587 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Have we given any thought as to how we're going to document this limitation? It seems quite subtle, and requires a fair bit if system knowledge to fully comprehend.

This is an interesting question. It implies that we are going to tout bounded staleness as an availability mechanism during loss of quorum. That is part of why we are building this, but it's also quite subtle, as you mention. It's also notably not something that other systems make guarantees about.

I have a hard time saying much here before we document and make stronger guarantees about availability across other parts of the system - especially as it relates to system ranges and the system database.

docs/RFCS/20210519_bounded_staleness_reads.md, line 607 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

How will this unavailability manifest? Queries fail? Hang?

Queries hanging unless they have a statement timeout. In which case, failing.

otan

i've had a brief glance (slipped my plate today) and it's changed since i read the doc, let me pick it up again better tomorrow.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @nvanbenschoten, @petermattis, @rytaft, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 491 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Ack. Thanks.

I agree with execution here, but I think becca would know better as it's a little outside my layer of knowledge.

docs/RFCS/20210519_bounded_staleness_reads.md, line 448 at r2 (raw file):

The change will introduce two new SQL builtin functions: 
- `with_min_timestamp(TIMESTAMP) -> TIMESTAMP`
- `with_max_staleness(INTERVAL) -> INTERVAL`

just double checking - these builtins don't return anything when used by themselves right, e.g. SELECT with_min_timestamp(TIMESTAMP)?
rather, it only makes sense when processing by a row by row basis? any reason why this shouldnot be a syntax thing instead, e.g. SELECT .... MIN STALENESS ... (as we do with AOST)

docs/RFCS/20210519_bounded_staleness_reads.md, line 463 at r2 (raw file):

and execution.

TODO(otan): is there anything more to say here? Do we add support for bounded

hmm, the session variable is interesting. not opposed.

docs/RFCS/20210519_bounded_staleness_reads.md, line 497 at r2 (raw file):

costing purposes, but that seems lower priority."

TODO(otan): the KV discussion proposes that we ignore a query's `kv.Txn` and

no strong opinions atm, might come up with it over the coming weeks.

docs/RFCS/20210519_bounded_staleness_reads.md, line 518 at r2 (raw file):

Unavailability](#schema-unavailability) section.

TODO(otan): how do we feel about this?

I go back to AOST as well - what do we do there?
It seems somewhat insane to support reading a row based on a past schema - at the very least for the beginning, I don't see it worth supporting back-in-time schema reads.

docs/RFCS/20210519_bounded_staleness_reads.md, line 522 at r2 (raw file):

#### Observability into query timestamp

TODO(otan): how will bounded staleness reads interact with the `cluster_logical_timestamp`

hmm, how does it interact with AOST today? AFAICT, nothing?
is it sane to do the sane thing?

docs/RFCS/20210519_bounded_staleness_reads.md, line 525 at r2 (raw file):

builtin function? Should it be disallowed in bounded staleness transactions?

TODO(otan): will we provide any other observaibility into the timestamp of a

should we do something akin to crdb_internal_mvcc_timestamp? a notice is trickier to manage imo.

rytaft

Reviewed 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @nvanbenschoten, @petermattis, and @tbg)

ajstorm

Forgot to mention this in with my first round of comments, but I also enjoyed reading this. Very well written, and a fun read.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 134 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I went back and forth on this. I think it might. But I think this also might be the kind of policy that sounds good on the surface but is better served in practice with a statement timeout. If a user doesn't want to wait more than 10ms to serve a query, do they care why it's not returning?

I don't have a strong feeling one way or another here, but it's possible that customers may have a mix of queries, some of which they're prepared to let run longer than others? Would a statement_timeout setting work in that case? From a UX perspective, provided that we could get down to something simple here, it might be nice to have locality built into the bounded staleness interface. Coming at this from a different angle, do we know what Twitter wants here?

docs/RFCS/20210519_bounded_staleness_reads.md, line 446 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It's a good question. Beyond some point, there's an interaction with the GC threshold. So I think a value of 0 results in "give me whatever you can find unless its below the GC threshold, in which case, return an error". It will be a good test case to add.

SGTM. Might be nice to mention this interplay with the GC threshold somewhere in here.

docs/RFCS/20210519_bounded_staleness_reads.md, line 587 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is an interesting question. It implies that we are going to tout bounded staleness as an availability mechanism during loss of quorum. That is part of why we are building this, but it's also quite subtle, as you mention. It's also notably not something that other systems make guarantees about.

I have a hard time saying much here before we document and make stronger guarantees about availability across other parts of the system - especially as it relates to system ranges and the system database.

That makes sense to me too. We might want to leave some breadcrumbs here for the docs folks (and ourselves as well) that we don't want to tout this as an availability mechanism for the time being.

docs/RFCS/20210519_bounded_staleness_reads.md, line 607 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Queries hanging unless they have a statement timeout. In which case, failing.

Is it worth making that explicit here that "bounded staleness unavailability" means hangs (in the absence of statement timeouts). It also seems to require that a statement timeout is set to make this work reliably. Do we need to document that somewhere in the RFC?

otan · 2021-06-15T22:02:26Z

docs/RFCS/20210519_bounded_staleness_reads.md, line 448 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

just double checking - these builtins don't return anything when used by themselves right, e.g. SELECT with_min_timestamp(TIMESTAMP)?
rather, it only makes sense when processing by a row by row basis? any reason why this shouldnot be a syntax thing instead, e.g. SELECT .... MIN STALENESS ... (as we do with AOST)

ignore this!

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 89 at r2 (raw file):

to request a read from nearby followers, if possible, while placing some limit
on how stale results can be. `with_max_staleness(INTERVAL)` is syntactic sugar
on top of `with_min_timestamp(now() - INTERVAL)`.

maybe note that while it is defined as a syntactic sugar for this second expression, the second expression is not allowed because the evaluation logic of this clause is primitive and does not support general sql expressions.

nvb

@rytaft @otan I've updated this RFC with the takeaways from our discussion last week. Specifically, I've made the following three changes:

bounded-staleness reads are no longer issued through a non-transactional KV API. Instead, they are issued through a kv.Txn that is configured either lazily or eagerly with the negotiated timestamp. This means even fewer changes at the SQL layer and fits more naturally with the mental model that I imagine people would have.
a result of ^ is that cluster_logical_timestamp now provides a natural method to observe the dynamic timestamp of bounded staleness reads. This will look something like SELECT *, cluster_logical_timestamp() FROM t AS OF SYSTEM TIME with_max_staleness('10s') WHERE id = 1.
I added a note about the eventual (in stage 3) representation of eager timestamp negotiation as a proper SQL operator that sits at the top of a query operator tree.

With those changes, this RFC is ready to be taken out of "draft" status. Also, since we've already had a large number of people look at this, I'll give it a few days and then move it into the final comment period if nothing else substantial arises.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @otan, @petermattis, @rytaft, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 134 at r1 (raw file):

Would a statement_timeout setting work in that case?

Yes, a statement timeout would work in that case.

My sense is that anything more general here is a separate topic. I can imaging have a session setting that disallows cross-region communication. This would apply to more than just bounded staleness reads. It could also apply to writes, to exact staleness reads, and to locality optimized search.

docs/RFCS/20210519_bounded_staleness_reads.md, line 587 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

That makes sense to me too. We might want to leave some breadcrumbs here for the docs folks (and ourselves as well) that we don't want to tout this as an availability mechanism for the time being.

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 607 at r1 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Is it worth making that explicit here that "bounded staleness unavailability" means hangs (in the absence of statement timeouts). It also seems to require that a statement timeout is set to make this work reliably. Do we need to document that somewhere in the RFC?

Unavailability means hangs until connectivity is restored in the absence of statement timeouts throughout CRDB, so it's not specific to this RFC. I'll add a note.

docs/RFCS/20210519_bounded_staleness_reads.md, line 497 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

no strong opinions atm, might come up with it over the coming weeks.

I switched this over to what we discussed last week - issuing the batch through a kv.Txn and setting the kv.Txn's fixed timestamp after negotiation.

docs/RFCS/20210519_bounded_staleness_reads.md, line 522 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

hmm, how does it interact with AOST today? AFAICT, nothing?
is it sane to do the sane thing?

Using the kv.Txn directly solves this issue.

docs/RFCS/20210519_bounded_staleness_reads.md, line 525 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

should we do something akin to crdb_internal_mvcc_timestamp? a notice is trickier to manage imo.

cluster_logical_timestamp should work here. Added.

knz

Reviewed 2 of 3 files at r1, 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @nvanbenschoten, @otan, @petermattis, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 299 at r3 (raw file):

with abandoned intents. `GetResolvedTimestamp` will attach any intents that it
finds that are below the range's closed timestamp to its
`LocalResult.EncounteredIntents` set. This will cause the range to

This should discuss a limit to ensure that a range with abnormally high intent counts don't die with OOM due to excessively large EncounteredIntents.

Of course the operation can't just scan a range and stop scanning when the limit is reached (because there could be a later intent with a timestamp that's newer). Instead, I think the scan should use a max-k sorter where k is the configured limit.

docs/RFCS/20210519_bounded_staleness_reads.md, line 378 at r3 (raw file):

sending `GetResolvedTimestamp` requests on every cross-range bounded staleness
read, the `BoundedStalenessNegotiator` aggressively caches resolved timestamp
spans in an LRU interval cache.

How large is this cache? Is it configurable?

docs/RFCS/20210519_bounded_staleness_reads.md, line 602 at r3 (raw file):

to system ranges. This is true both at the KV layer (e.g. range addressing may
be needed for routing) and at the SQL layer (e.g. table descriptors may be
needed for query processing, permission data may be needed for authentication).

for authorization

NB: the descriptors are cached; wouldn't this generally mean that after a "warm up" phase we're still able to achieve the availability goals?

docs/RFCS/20210519_bounded_staleness_reads.md, line 635 at r3 (raw file):

It is open for discussion whether we want to do anything more sophisticated here,
like introduce a retry loop that steps back across schema versions until it finds
a version that

sentence is incomplete

docs/RFCS/20210519_bounded_staleness_reads.md, line 868 at r3 (raw file):

### Impact on Availability

There is very little discussion of the impact of stale reads on available. This

nit: availability

rytaft

Reviewed 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @nvanbenschoten, @otan, @petermattis, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 519 at r3 (raw file):

called after timestamp negotiation. Otherwise, the query will return an error.

This means that the way to access the dynamic timestamp of a bounded staleness

nit: bounded staleness -> bounded staleness read?

docs/RFCS/20210519_bounded_staleness_reads.md, line 590 at r3 (raw file):

phase before query execution for multi-scan statements. This will be modeled as
a new `TimestampNegotiation` (name pending) operator that sits at the top of a
query plan. The operator will be configured with the query's maximal set of

You might add "... sits at the top of a query plan, similar to a CTE (i.e., WITH expression)."

ajstorm

too.

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @mgartner, @nvanbenschoten, @otan, @petermattis, @RaduBerinde, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 134 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would a statement_timeout setting work in that case?

Yes, a statement timeout would work in that case.

My sense is that anything more general here is a separate topic. I can imaging have a session setting that disallows cross-region communication. This would apply to more than just bounded staleness reads. It could also apply to writes, to exact staleness reads, and to locality optimized search.

Perfect. I was under the impression that the statement_timeout value was cluster wide (due to the default_statement_timeout cluster variable), but now I see that you can set it on individual sessions (which makes complete sense). I agree that we're good here.

mgartner

Very cool! The high-level and optimizer-related sections .

From the PR description:

We only intend to implement the first step in the v21.1 release.

Did you mean 21.2?

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, @petermattis, @RaduBerinde, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 525 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

cluster_logical_timestamp should work here. Added.

Do we expect users to use this function? If not, maybe it should be prefixed with crdb_internal_?

docs/RFCS/20210519_bounded_staleness_reads.md, line 148 at r3 (raw file):

closest_replicas(gw, keys) = map(lambda range: closest_replica(gw, range), relevant_ranges(keys))

min_closed_timestamp(gw, keys) = min(closest_replicas(gw, keys),  lambda repl: repl.closed_timesatmp)

nit: repl.closed_timesatmp => repl.closed_timestamp

awoods187

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, @petermattis, @RaduBerinde, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 740 at r3 (raw file):

```sql
SELECT * FROM t AS OF SYSTEM TIME with_exact_staleness('30s')

I like this idea but think it could be split into a future release

docs/RFCS/20210519_bounded_staleness_reads.md, line 755 at r3 (raw file):

staleness instead of exact staleness? The answer is "probably not", because of
the various [limitations](#limitations) placed on bounded staleness reads.

I dont think we should change the existing follower read built in

RaduBerinde

SQL part LGTM.

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, @petermattis, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 576 at r3 (raw file):

makes use of hints to disallow certain types of plans, so we can reuse some of
that infrastructure here to avoid building plans that won’t be allowed for
bounded staleness. For example, in step 1 we can use existing hints to disallow

When time comes, we should be careful to document that using this feature limits the kinds of plans that the optimizer generates. It might be confusing to a user to see that a query isn't using an obvious index.

otan

Reviewed 2 of 3 files at r1.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @mgartner, @nvanbenschoten, @petermattis, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 525 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Do we expect users to use of this function? If not, maybe it should be prefixed with crdb_internal_?

i do think we want users to use it, and i think it's already available as a function.

rytaft

Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @mgartner, @nvanbenschoten, @petermattis, @rmloveland, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 576 at r3 (raw file):

Previously, RaduBerinde wrote…

When time comes, we should be careful to document that using this feature limits the kinds of plans that the optimizer generates. It might be confusing to a user to see that a query isn't using an obvious index.

Good point. Not sure who will be documenting this feature (maybe @rmloveland?), but something to keep in mind for sure.

Bounded staleness reads are a form of historical read-only queries that use a dynamic, system-determined timestamp, subject to a user-provided staleness bound, to read from nearby replicas while minimizing data staleness. They provide a new way to perform follower reads off local replicas to minimize query latency in multi-region clusters. Bounded staleness reads complement CockroachDB's existing mechanism for performing follower reads, which was originally proposed in [this RFC](20180603_follower_reads.md) and later adopted in [this RFC](20181227_follower_reads_implementation.md). This original form of follower reads is more precisely classified as an exact staleness read, meaning that the read occurs at a statically chosen timestamp, regardless of the state of the system. Exact staleness and bounded staleness reads can exist side-by-side, as there are trade-offs between the two in terms of cost, staleness, and applicability. In general, bounded staleness reads are more powerful because they minimize staleness while being tolerant to variable replication lag, but they come at the expense of being more costly and usable in fewer places. Bounded staleness queries are limited in use to single-statement read-only queries, and only a subset of read-only queries at that. They will be accessed in the same way as exact bounded staleness reads - through a pair of new functions that can be passed to an `AS OF SYSTEM TIME` clause: - `SELECT ... FROM ... AS OF SYSTEM TIME with_min_timestamp(TIMESTAMP)` - `SELECT ... FROM ... AS OF SYSTEM TIME with_max_staleness(INTERVAL)` The approach discussed in this RFC has a prototype in cockroachdb#62239 which, while not identical to what is proposed here, is similar and demonstrates the high-level changes that are needed to support bounded staleness reads.

nvb

We only intend to implement the first step in the v21.1 release.

Did you mean 21.2?

Yes, done.

Thanks for all the reviews!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @aliher1911, @awoods187, @knz, @otan, @petermattis, @rytaft, and @tbg)

docs/RFCS/20210519_bounded_staleness_reads.md, line 525 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

i do think we want users to use it, and i think it's already available as a function.

Right, this function already exists.

docs/RFCS/20210519_bounded_staleness_reads.md, line 148 at r3 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: repl.closed_timesatmp => repl.closed_timestamp

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 299 at r3 (raw file):

Previously, knz (kena) wrote…

This should discuss a limit to ensure that a range with abnormally high intent counts don't die with OOM due to excessively large EncounteredIntents.

Of course the operation can't just scan a range and stop scanning when the limit is reached (because there could be a later intent with a timestamp that's newer). Instead, I think the scan should use a max-k sorter where k is the configured limit.

This is a good point, but I don't think it's any more related to this RFC than it is to any other read operation. Instead, it's closely related to #64783, which was recently addressed by @aliher1911.

docs/RFCS/20210519_bounded_staleness_reads.md, line 378 at r3 (raw file):

Previously, knz (kena) wrote…

How large is this cache? Is it configurable?

I imagine this will be sized similarly to the RangeCache, which defaults to 1e6 entries and is configurable using the kv.range_descriptor_cache.size cluster setting. I haven't determined whether it would be best to use the RangeCache directly for this storage (see "Tracking Resolved Timestamp in RangeCache"), in which case we will inherit this size limit, or whether we'll need to introduce a similar cache. As that section discusses, part of that decision depends on the separated lock table and making GetResolvedTimestamp efficient.

docs/RFCS/20210519_bounded_staleness_reads.md, line 519 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: bounded staleness -> bounded staleness read?

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 590 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

You might add "... sits at the top of a query plan, similar to a CTE (i.e., WITH expression)."

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 602 at r3 (raw file):
Done.

the descriptors are cached; wouldn't this generally mean that after a "warm up" phase we're still able to achieve the availability goals?

Yes, that is true! Though as we're learning with certain customers, we need availability guarantees, not just average cases.

docs/RFCS/20210519_bounded_staleness_reads.md, line 635 at r3 (raw file):

Previously, knz (kena) wrote…

sentence is incomplete

Done.

docs/RFCS/20210519_bounded_staleness_reads.md, line 740 at r3 (raw file):

Previously, awoods187 (Andy Woods) wrote…

I like this idea but think it could be split into a future release

Agreed.

docs/RFCS/20210519_bounded_staleness_reads.md, line 755 at r3 (raw file):

Previously, awoods187 (Andy Woods) wrote…

I dont think we should change the existing follower read built in

Agreed.

docs/RFCS/20210519_bounded_staleness_reads.md, line 868 at r3 (raw file):

Previously, knz (kena) wrote…

nit: availability

Done.

craig · 2021-06-30T21:36:06Z

Build succeeded:

GitHub CI (Cockroach)

rmloveland

Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale)

docs/RFCS/20210519_bounded_staleness_reads.md, line 576 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Good point. Not sure who will be documenting this feature (maybe @rmloveland?), but something to keep in mind for sure.

Thanks Becca! Filed cockroachdb/docs#10869 to track

nvb force-pushed the nvanbenschoten/boundedStalenessRFC branch from 749b0ae to 7b8c5f0 Compare June 3, 2021 03:07

ajstorm reviewed Jun 3, 2021

View reviewed changes

petermattis reviewed Jun 3, 2021

View reviewed changes

rytaft reviewed Jun 3, 2021

View reviewed changes

tbg reviewed Jun 4, 2021

View reviewed changes

nvb force-pushed the nvanbenschoten/boundedStalenessRFC branch from 7b8c5f0 to a8a2488 Compare June 14, 2021 20:48

nvb commented Jun 14, 2021

View reviewed changes

otan reviewed Jun 15, 2021

View reviewed changes

rytaft reviewed Jun 15, 2021

View reviewed changes

ajstorm requested review from petermattis and tbg June 15, 2021 19:13

ajstorm reviewed Jun 15, 2021

View reviewed changes

ajwerner reviewed Jun 16, 2021

View reviewed changes

tbg removed their request for review June 21, 2021 11:49

nvb force-pushed the nvanbenschoten/boundedStalenessRFC branch from a8a2488 to bc59a72 Compare June 22, 2021 03:06

blathers-crl bot requested a review from otan June 22, 2021 03:06

nvb commented Jun 22, 2021

View reviewed changes

nvb marked this pull request as ready for review June 22, 2021 03:07

nvb requested a review from a team as a code owner June 22, 2021 03:07

knz reviewed Jun 22, 2021

View reviewed changes

rytaft approved these changes Jun 22, 2021

View reviewed changes

rytaft requested review from a team, RaduBerinde and mgartner June 22, 2021 12:20

ajstorm requested a review from tbg June 22, 2021 13:18

ajstorm approved these changes Jun 22, 2021

View reviewed changes

mgartner approved these changes Jun 22, 2021

View reviewed changes

awoods187 approved these changes Jun 24, 2021

View reviewed changes

RaduBerinde approved these changes Jun 24, 2021

View reviewed changes

otan approved these changes Jun 24, 2021

View reviewed changes

rytaft approved these changes Jun 25, 2021

View reviewed changes

nvb force-pushed the nvanbenschoten/boundedStalenessRFC branch from bc59a72 to 6710a05 Compare June 30, 2021 20:27

nvb commented Jun 30, 2021

View reviewed changes

craig bot merged commit bb541aa into cockroachdb:master Jun 30, 2021

rmloveland mentioned this pull request Jul 1, 2021

Document that bounded staleness limits plans the optimizer can generate cockroachdb/docs#10869

Closed

rmloveland reviewed Jul 1, 2021

View reviewed changes

nvb deleted the nvanbenschoten/boundedStalenessRFC branch July 1, 2021 17:00


		#### GetResolvedTimestamp efficiency

		`GetResolvedTimestamp` will initially be fairly expensive due to its scan for

Conversation

nvb commented Jun 3, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jun 3, 2021

Uh oh!

ajstorm left a comment

Choose a reason for hiding this comment

Uh oh!

petermattis left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

tbg Jun 4, 2021

Choose a reason for hiding this comment

Uh oh!

tbg Jun 4, 2021

Choose a reason for hiding this comment

Uh oh!

tbg Jun 4, 2021

Choose a reason for hiding this comment

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

otan left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

ajstorm left a comment

Choose a reason for hiding this comment

Uh oh!

otan commented Jun 15, 2021

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

ajstorm left a comment

Choose a reason for hiding this comment

Uh oh!

mgartner left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

awoods187 left a comment

Choose a reason for hiding this comment

Uh oh!

RaduBerinde left a comment

Choose a reason for hiding this comment

Uh oh!

otan left a comment

Choose a reason for hiding this comment

Uh oh!

rytaft left a comment

Choose a reason for hiding this comment

Uh oh!

nvb left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Jun 30, 2021

Uh oh!

rmloveland left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

nvb commented Jun 3, 2021 •

edited

Loading

mgartner left a comment •

edited

Loading