storage: implement follower reads using replica closed timestamps #21056
spencerkimball wants to merge 1 commit into cockroachdb:master
Conversation
I haven't written any unit tests yet. I wanted to get this out so people can take a gander at the approach first.
76578f1 to 9f657fc
I cannot tell from the commit message - where does this PR want to place us on the trade-off between allowing long running txns (that want to write at arbitrarily old timestamps) and the desire to serve reads from followers?
It limits us to 10s, which is already the practical reality due to the
timestamp cache.
The timestamp cache is more complicated; it has a size component too. Like,
if you don't do any reads on a node, the low water mark won't move up. I
believe 10s is some sort of a minimum it guarantees to hold, not a maximum.
No?
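As a toy illustration of the low-water-mark behavior under discussion (this is not CockroachDB's actual timestamp cache; the bounded model and all names here are invented), a size-limited cache only ratchets its low-water mark upward as reads evict old entries, so a node that serves no reads keeps an old mark:

```go
package main

import "fmt"

// tsCache is a toy model of a bounded read-timestamp cache. Real
// implementations key entries by key span; a fixed-size list of read
// timestamps is enough to show the low-water behavior.
type tsCache struct {
	lowWater int64   // no write may commit at or below this timestamp
	entries  []int64 // recent read timestamps, oldest first
	maxSize  int
}

// recordRead notes a read; if the cache is full, the oldest entry is
// evicted and the low-water mark ratchets up to it. The mark therefore
// only advances when reads occur: an idle cache keeps an old mark.
func (c *tsCache) recordRead(ts int64) {
	c.entries = append(c.entries, ts)
	if len(c.entries) > c.maxSize {
		evicted := c.entries[0]
		c.entries = c.entries[1:]
		if evicted > c.lowWater {
			c.lowWater = evicted
		}
	}
}

// forwardWrite pushes a proposed write timestamp above the low-water mark.
func (c *tsCache) forwardWrite(ts int64) int64 {
	if ts <= c.lowWater {
		return c.lowWater + 1
	}
	return ts
}

func main() {
	c := &tsCache{maxSize: 2}
	c.recordRead(10)
	c.recordRead(20)
	c.recordRead(30) // evicts 10; low-water mark becomes 10
	fmt.Println(c.lowWater, c.forwardWrite(5))
}
```

This models why the 10s figure reads as a minimum the cache holds rather than a hard maximum: with few reads, nothing forces the mark forward.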
If we introduce a blanket limit of 10s, we close the door to interactive
use of (non-readonly) transactions by humans in shells, which I think would
be quite annoying.
I've given some thought to these issues. There are pros and cons to all approaches, and nothing I've considered would be an unqualified win across the board. There are huge pitfalls with allowing arbitrarily long-running transactions which can alter long-past history. That path leads through chasms which I don't believe can be bridged. See the bullet below for
9f657fc to 94e088f
pkg/kv/replica_slice.go, line 95 at r1 (raw file):
Wow, can't believe we were still sorting based on "attributes"? WTF?
Comments from Reviewable
c3f38e3 to d016af5
@andreimatei perhaps the interactive CLI should default to
d016af5 to eb3d026
I don't think that doing special things for our CLI helps much; we want humans to interact with our database through all sorts of tools, not just our client. I think a feature that comes with such a tradeoff has to be under some sort of control/opt-in.

Btw Spencer, you know about this RFC, right? https://reviewable.io/reviews/cockroachdb/cockroach/19222#-
Yeah, this needs to go through the RFC process (either taking over #19222 or a new doc). Some of the higher-level pieces of the RFC template are very relevant here: in particular the guide-level explanation, how this will be exposed to users, and how it will be monitored and tested. I think there's quite a bit of follow-up work here (such as making the 10s limit configurable), and we need to understand what level of polish we're committing to before starting down this path. I'm particularly concerned about how this degrades when the read cannot be served locally, either because the local replica is behind or because it hits an unresolved intent (the latter is tricky because we will try to resolve the intent remotely, and then try the read again locally even though the resolution may not have propagated yet).

Do we have any data about this one way or another? How big does a workload need to be before the timestamp cache fills up and we have to restart older transactions? Is exceeding the limit a retryable or non-retryable error? (If it's retryable, then transactions that are too large to ever complete may spin endlessly. This is already a problem now if the timestamp cache gets too big, but enforcing the limit in all cases would make it worse.) The imposition of a global limit like this needs its own discussion instead of being buried in this PR. (I'm generally in favor of limiting the duration of write transactions, and opposed to defaulting to SNAPSHOT in the CLI.)
See #21140 for a solution to long-running transactions.

@bdarnell I would default to not making the max safe timestamp interval configurable. For one, doing so seems premature. For another, I think there may be a better way to solve the main use case which would otherwise make a case for configurability. That use case is lower-latency change data feeds. I actually think that those should work not by setting a range-wide max safe timestamp interval lower, but by having the change feed reader update the timestamp cache of the range it's reading from. Lower-latency change feeds would need to consume from the leaseholder.
Reviewing the commit message, since it acts as a mini-RFC:
This paragraph is unclear. Define the "max safe" terminology before using it, and be more clear about how the "max safe log index" fits in. If I'm understanding correctly, I'd rewrite this paragraph as "The max safe timestamp is a timestamp beneath which no new writes will be proposed, and therefore followers can serve reads older than this timestamp without consulting with the leaseholder. A raft leader may send a (max safe timestamp, log index) pair in its heartbeat which indicates that the follower's max safe timestamp should be updated to the given value when all raft log entries including the given index have been applied." Right?
What is the "max safe interval"? This "min proposal timestamp" is not otherwise referenced in the commit message; how is it used, and how does it relate to the max safe timestamp?
What is "more interesting" about this? Continuing to rely on the old maxSafeTS as it slips further out of date limits the usefulness of this feature - we need some way of advancing the maxSafeTS without unquiescing the range.
Heartbeats are sent by the leader, so does this mean that the leader marks a range as "unsafe" when it is sending a heartbeat for a range whose lease it does not hold?
How long does the range stay "unsafe" after a lease change?
Heartbeats are not sent to or from removed replicas, so what does this mean?
So far, the only way in which maxSafeTS can advance is when a leader+leaseholder sends it in the heartbeat, so there is no way in which a non-leaseholder could report a maxSafeTS.
So maxSafeTS is revocable? What if followers have already served reads at that timestamp?
Who is "the receiver"? What happens if the receiver is dead? If a node crashes, does it need to recover its "unsafe" map on restart somehow?
My biggest concern with maxSafeTS generally is not addressed in the commit message: what happens when a range unquiesces? For quiescence to make sense, followers must be able to advance their maxSafeTS without additional periodic communication with the leader, but then we need some way to guarantee that when the range unquiesces, the leader will not make any proposals that could conflict with any of its followers' maxSafeTimestamps.
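To make the (max safe timestamp, log index) rule described earlier in this review concrete, here is a toy sketch (not CockroachDB code; all names are invented) of a follower buffering the pair from a leader heartbeat and only adopting the timestamp once its applied index has caught up to the given log index:

```go
package main

import "fmt"

// pendingCT is a (max safe timestamp, log index) pair received in a
// leader heartbeat. The timestamp may only take effect once the
// follower has applied all raft log entries up to and including index.
type pendingCT struct {
	ts    int64
	index uint64
}

// follower models the per-replica state relevant to follower reads.
type follower struct {
	appliedIndex uint64
	maxSafeTS    int64 // reads at or below this need no leaseholder
	pending      []pendingCT
}

// onHeartbeat records a new pair; it takes effect immediately only if
// the replica is already caught up.
func (f *follower) onHeartbeat(ts int64, index uint64) {
	f.pending = append(f.pending, pendingCT{ts, index})
	f.maybeAdvance()
}

// onApply is called after the replica applies a raft entry.
func (f *follower) onApply(index uint64) {
	f.appliedIndex = index
	f.maybeAdvance()
}

// maybeAdvance adopts every buffered timestamp whose log index the
// replica has now applied.
func (f *follower) maybeAdvance() {
	for len(f.pending) > 0 && f.pending[0].index <= f.appliedIndex {
		if f.pending[0].ts > f.maxSafeTS {
			f.maxSafeTS = f.pending[0].ts
		}
		f.pending = f.pending[1:]
	}
}

// canServeRead reports whether a read at ts can be served locally.
func (f *follower) canServeRead(ts int64) bool { return ts <= f.maxSafeTS }

func main() {
	f := &follower{appliedIndex: 5}
	f.onHeartbeat(100, 7) // not yet applied through index 7
	fmt.Println(f.canServeRead(100))
	f.onApply(7)
	fmt.Println(f.canServeRead(100))
}
```

The buffering step is the crux: adopting the timestamp before applying through the index would let a follower serve a read that misses a write below the timestamp.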
eb3d026 to 35a04e1
No worries! Always appreciate writing tests.
7e0c9b8 to 95454cd
OK, got the unit tests fully working and stress tested.
08b5545 to 40750b4
Added some explanatory comments (that will need some more smoothing out) and addressed parts of the comments. More to come. @spencerkimball I assume you're done with that branch; if not, please let's chat.

- pkg/kv/dist_sender.go, line 69 at r7 (bdarnell): Done.
- pkg/kv/dist_sender.go, line 70 at r7 (bdarnell): Done.
- pkg/kv/dist_sender.go, line 448 at r7 (bdarnell): Done.
- pkg/kv/dist_sender.go, line 477 at r7 (bdarnell): Done.
- pkg/roachpb/batch.go, line 39 at r7 (bdarnell): Moved the check to
- pkg/storage/metrics.go, line 497 at r7 (nvanbenschoten): Done. I removed this one since
- pkg/storage/raft_transport.go, line 572 at r7 (nvanbenschoten): Done.
- pkg/storage/replica.go, line 504 at r7 (bdarnell): Done.
- pkg/storage/replica.go, line 2609 at r7 (nvanbenschoten): Added commentary.
- pkg/storage/replica.go, line 2614 at r7 (nvanbenschoten): The condition here is now
- pkg/storage/replica.go, line 2627 at r7 (nvanbenschoten): Ditto.
- pkg/storage/replica.go, line 2652 at r7 (bdarnell): Done.
- pkg/storage/replica.go, line 2660 at r7 (bdarnell): Done.
- pkg/storage/replica.go, line 2706 at r7 (nvanbenschoten): Looks like this was done.
- pkg/storage/replica.go, line 2720 at r7 (bdarnell): Seems obsolete.
- pkg/storage/replica.go, line 2924 at r7 (bdarnell): Done.
- pkg/storage/replica.go, line 2935 at r7 (bdarnell): Done.
- pkg/storage/replica.go, line 3448 at r7 (bdarnell): Added a TODO to do this. Want to double check that there's no problem there.
- pkg/storage/replica.go, line 3475 at r7 (nvanbenschoten): Done.
- pkg/storage/replica.go, line 3490 at r7 (nvanbenschoten): Done.
- pkg/storage/replica.go, line 4345 at r7 (nvanbenschoten): Done.
- pkg/storage/replica.go, line 506 at r8 (tschottdorf): Done.
I am done; I was just finally getting the build green.
8c15094 to 2b04b53
I responded to all of the comments. For the active reviewers (@nvanbenschoten and @bdarnell, I think), as a heads up: while I have edited this PR lightly, I took lots of your comments and simply put them into code as TODOs in my name. My plan (or hope, rather) is to polish this up to the point where it can merge (as default-off code, after the stability week sha is picked), similar to the range merge work, and then refactor/improve individual pieces in targeted PRs.

Reviewed 1 of 20 files at r1, 10 of 21 files at r3, 1 of 8 files at r7, 2 of 11 files at r8, 9 of 18 files at r9, 4 of 6 files at r10, 7 of 7 files at r11.

- pkg/storage/raft.proto, line 56 at r7 (nvanbenschoten): Done.
- pkg/storage/raft.proto, line 90 at r7 (bdarnell): Added a comment.
- pkg/storage/store.go, line 416 at r7:
- pkg/storage/store.go, line 3196 at r7: // TODO(tschottdorf): can this already be set here? I think so, because of the
- pkg/storage/store.go, line 4289 at r7 (nvanbenschoten): No, this is driven by a single loop with a ticker, but I agree that this isn't great. Added an assertion plus a comment for now because
- pkg/storage/store.go, line 4327 at r7 (bdarnell): I think I've found a concrete problem: TODO(tschottdorf): I think the liveness has a problem: Note that the read timestamp here must be really close to current time, because We can address this by not attempting follower reads unless the requested time-
- pkg/storage/store.go, line 4363 at r7 (nvanbenschoten): Good point. Added a TODO.
One thing we're going to have to be careful about here is resolving intents that a read finds while performing a follower read. I fear that without any extra logic, we could fall into busy loops where a follower read finds and resolves an intent but continues to stumble into it again and again, because the follower it is reading from doesn't end up in the intent resolution quorum (it has fallen behind) and as a result doesn't synchronously reflect the result of the intent resolution locally. In that case, the read will need to either continue to wait on the follower, remember that it has successfully resolved the intent so that it can ignore the WriteIntentError on its next try, or eventually be redirected to the leaseholder. I think an argument could be made to always redirect to the leaseholder when a read hits a

There might be other instances of this class of race/busy loop possible with follower reads as well.
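A sketch of the second mitigation mentioned above (remembering already-resolved intents, with a leaseholder fallback to break the busy loop) might look roughly like this. Everything here is hypothetical, not the PR's actual code; the error type and callbacks are stand-ins:

```go
package main

import (
	"errors"
	"fmt"
)

// errWriteIntent stands in for a WriteIntentError carrying the intent's key.
type errWriteIntent struct{ key string }

func (e *errWriteIntent) Error() string { return "write intent on " + e.key }

// readFollowerWithRetry retries a follower read, remembering which
// intents it has already resolved. If the same intent reappears (the
// lagging follower hasn't seen the resolution yet), it redirects to the
// leaseholder instead of spinning.
func readFollowerWithRetry(
	read func() (string, error),
	resolve func(key string) error,
	readLeaseholder func() (string, error),
) (string, error) {
	resolved := map[string]bool{}
	for {
		v, err := read()
		var wie *errWriteIntent
		if !errors.As(err, &wie) {
			return v, err // success, or an unrelated error
		}
		if resolved[wie.key] {
			// Already resolved remotely but not yet visible locally:
			// fall back to the leaseholder rather than busy-loop.
			return readLeaseholder()
		}
		if err := resolve(wie.key); err != nil {
			return "", err
		}
		resolved[wie.key] = true
	}
}

func main() {
	// Simulate a lagging follower that never sees the resolution.
	v, err := readFollowerWithRetry(
		func() (string, error) { return "", &errWriteIntent{key: "a"} },
		func(string) error { return nil },
		func() (string, error) { return "value-from-leaseholder", nil },
	)
	fmt.Println(v, err)
}
```

The memoization bounds the loop to one resolution attempt per intent key; whether redirecting is preferable to waiting is exactly the open question in the comment above.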
@nvanbenschoten that's a good point, I added it to a comment in DistSender in the latest revision. |
2c2994e to 88c7d2b
@nvanbenschoten I've brought the code into a form that I think is mergeable (though it will need lots of follow-up work). Per-range closed timestamp information is not collected, and we also never communicate ranges as quiesced, unless you run with a nonzero COCKROACH_CLOSED_TIMESTAMP_INTERVAL.
…mestamps

This change lays the ground work for follower reads by adding the prototype implementation into our main codebase. Most of the logic is disabled by default. It is only exercised during specific unit tests or when running with a nonzero COCKROACH_CLOSED_TIMESTAMP_INTERVAL.

The code contains several potential correctness anomalies and makes no attempt at handling lagging followers gracefully. It should not be used outside of testing and in fact should not be used at all (though we may want to write roachtests early).

The [follower reads RFC] and many TODOs in this code hint at the upcoming changes. Most prominently, known correctness gotchas will be addressed, but the role of quiescence and coalesced heartbeats will be untangled from the main proposal, which hopefully can clarify the code somewhat as well. In the meantime, the commit message below documents what is implemented here, even though it is subject to change:

Nodes send a closed timestamp with coalesced heartbeats. Receipt of a heartbeat from a node which is the leaseholder for the range means a closed timestamp can be trusted to apply to each follower replica which has committed at or over a min lease applied index, a new value supplied with coalesced heartbeats.

Nodes keep track of their "min proposal timestamp" (MinPropTS), which is an HLC timestamp. On every heartbeat, the MinPropTS is persisted locally to ensure monotonicity on node restart. At startup, a node reads the last persisted MinPropTS, and forwards the HLC clock to the MPT timestamp + max safe interval if necessary. Nodes check MinPropTS on command evaluation; a command's timestamp is forwarded if less than MinPropTS.

Things get more interesting when a range quiesces. Replicas of quiesced ranges no longer receive info on coalesced heartbeats. However, if a replica is quiesced, we can continue to rely on the most recent store-wide closed timestamp supplied with coalesced heartbeats, so long as the liveness epoch (reported with heartbeats) remains stable and no heartbeats are skipped. This can continue for as long as a range is quiesced, but requires that the leaseholder notify all followers on the first heartbeat after a range is unquiesced.

Note there is no concern that, on leaseholder change, the new leaseholder allows a write at an earlier timestamp than a previously reported closed timestamp. This is due to the low water timestamp in the timestamp cache being reset on leaseholder transfer to prevent rewriting history in general.

Release note: None

[follower reads RFC]: cockroachdb#26362
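The MinPropTS bookkeeping described in the commit message (persist it on heartbeat, forward the HLC clock at startup, forward command timestamps below it) can be sketched as follows. This is an illustrative toy model, not the actual implementation: timestamps are plain integers standing in for HLC values, and all names are invented.

```go
package main

import "fmt"

// node models MinPropTS bookkeeping with integers in place of HLC
// timestamps.
type node struct {
	clock     int64 // toy HLC
	minPropTS int64
	persisted int64 // last MinPropTS written to disk
}

// maxSafeInterval stands in for the closed-timestamp interval (the 10s
// figure discussed in this thread).
const maxSafeInterval = 10

// heartbeat advances and persists MinPropTS so it stays monotonic
// across restarts.
func (n *node) heartbeat() {
	n.minPropTS = n.clock
	n.persisted = n.minPropTS
}

// restart rebuilds a node from its persisted state, forwarding the
// clock to persisted MinPropTS + interval if necessary, so no new
// proposal can land under a timestamp that was closed before the crash.
func restart(persisted, wallClock int64) *node {
	n := &node{clock: wallClock, minPropTS: persisted, persisted: persisted}
	if min := persisted + maxSafeInterval; n.clock < min {
		n.clock = min
	}
	return n
}

// evaluate forwards a command's timestamp if it is below MinPropTS.
func (n *node) evaluate(ts int64) int64 {
	if ts < n.minPropTS {
		return n.minPropTS
	}
	return ts
}

func main() {
	n := &node{clock: 100}
	n.heartbeat()
	// Crash, then come back with a wall clock that regressed to 95.
	n2 := restart(n.persisted, 95)
	fmt.Println(n2.clock, n2.evaluate(90))
}
```

The clock forwarding at restart is what makes the persisted value sufficient: even if the wall clock regressed, no command can evaluate under a previously promised timestamp.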
Reviewed 5 of 18 files at r9, 2 of 6 files at r10, 1 of 7 files at r11, 10 of 11 files at r12.

- pkg/kv/dist_sender.go, line 410 at r12: Something's not adding up here. Transactional reads will only be turned into follower reads if they occur at timestamps beneath the closed timestamp of a range. How would txn2 write at timestamp 99 if the closed timestamp has already advanced past timestamp 100? Wouldn't it be pushed forward?
- pkg/kv/dist_sender.go, line 451 at r12: Do we get any benefit from preferring followers over leaseholders other than distributing load? If not, then I don't expect this to make much of a difference, because we should already be evenly distributing leaseholders. In that case, we're just undermining a small degree of locality present on each node (for instance, a well-populated block cache).
- pkg/storage/client_test.go, line 1225 at r12: consider pulling all of the shared code between
- pkg/storage/replica.go, line 504 at r7 (tschottdorf): I'd also leave it next to
- pkg/storage/replica.go, line 2706 at r7 (tschottdorf): ⚔️
- pkg/storage/replica.go, line 502 at r12: can't be served at ts
- pkg/storage/store.go, line 3181 at r7 (tschottdorf): It's been a while and I'm having a hard time remembering what I meant. I think I meant that the "unconfirmed" Also, the commentary would still be useful.
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf

NB: this is extracted from #21056; please don't add new commentary on the tech note there.

----

Follower reads are consistent reads at historical timestamps from follower replicas. They make the non-leader replicas in a range suitable sources for historical reads. The key enabling technology is the propagation of **closed timestamp heartbeats** from the range leaseholder to followers. The closed timestamp heartbeat (CT heartbeat) is more than just a timestamp. It is a set of conditions that, if met, guarantee that a follower replica has all the state necessary to satisfy reads at or before the CT.

Consistent historical reads are useful for analytics queries and in particular allow such queries to be carried out more efficiently and, with appropriate configuration, away from foreground traffic. But historical reads are also key to a proposal for [reference-like tables](#26301) aimed at cutting down on foreign key check latencies, particularly in geo-distributed clusters; they help recover a reasonably recent consistent snapshot of a cluster after a loss of quorum; and they are one of the ingredients for [Change Data Capture](#25229).

Release note: None

27699: storage: fix stopper race in compactor r=petermattis a=tschottdorf

Starting workers without a surrounding task is unfortunately often not the right thing to do when the worker accesses other state that might become invalidated once the stopper begins to stop. In this particular case, the compactor might end up accessing the engine even though it had already been closed. I wasn't able to repro this failure in the first place, but I'm pretty sure this:

Fixes #27232.

Release note: None

27704: issues: fix email fallback r=petermattis a=tschottdorf

This was not my email address.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
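The RFC summary's framing of the CT heartbeat as "a set of conditions that, if met" allow a follower to serve the read can be sketched as a predicate over the follower's local state. This is a toy model: the exact condition set, the epoch check, and all field names here are assumptions drawn from the descriptions in this thread, not the RFC's final design.

```go
package main

import "fmt"

// ctUpdate is the information a store receives with a coalesced
// heartbeat: a closed timestamp paired with a min lease applied index
// (MLAI) and the sender's liveness epoch.
type ctUpdate struct {
	closedTS int64
	mlai     uint64
	epoch    int64
}

// replica is the local follower state the conditions are checked against.
type replica struct {
	leaseAppliedIndex uint64
	leaseholderEpoch  int64
}

// canServeRead checks the conditions under which the closed timestamp
// can be trusted: the update came from the leaseholder's current epoch,
// the replica has applied at least the MLAI, and the read is at or
// below the closed timestamp.
func canServeRead(r replica, u ctUpdate, readTS int64) bool {
	return u.epoch == r.leaseholderEpoch &&
		r.leaseAppliedIndex >= u.mlai &&
		readTS <= u.closedTS
}

func main() {
	u := ctUpdate{closedTS: 100, mlai: 7, epoch: 3}
	r := replica{leaseAppliedIndex: 6, leaseholderEpoch: 3}
	fmt.Println(canServeRead(r, u, 90)) // replica hasn't applied the MLAI yet
	r.leaseAppliedIndex = 7
	fmt.Println(canServeRead(r, u, 90))
}
```

Note that every conjunct is checked locally on the follower; that is what lets the read avoid a round trip to the leaseholder entirely.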
Release note: performance improvement allowing all replicas to be used
for servicing historical reads. This is especially important in geo-
diverse contexts, where using local replicas for any non-trivial SQL
is a big win for latency.