
storage: implement follower reads using replica closed timestamps#21056

Closed
spencerkimball wants to merge 1 commit into cockroachdb:master from spencerkimball:max-safe-timestamp

Conversation

@spencerkimball
Member

@spencerkimball spencerkimball commented Dec 26, 2017

Nodes send a closed timestamp with coalesced heartbeats. Receipt of
a heartbeat from a node which is the leaseholder for the range means
a closed timestamp can be trusted to apply to each follower replica
which has committed at or over a min lease applied index, a new value
supplied with coalesced heartbeats.

Nodes keep track of their "min proposal timestamp" (MinPropTS), which
is an HLC timestamp. On every heartbeat, the MinPropTS is persisted
locally to ensure monotonicity on node restart. At startup, a node
reads the last persisted MinPropTS, and forwards the HLC clock to the
MPT timestamp + max safe interval if necessary. Nodes check MinPropTS
on command evaluation; a command's timestamp is forwarded if less than
MinPropTS.
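A minimal sketch of that evaluation-time check, using simplified stand-ins (`HLCTimestamp` and `forwardPastMinPropTS` are illustrative names, not the actual cockroach APIs):

```go
package main

import "fmt"

// HLCTimestamp is a simplified stand-in for an HLC timestamp
// (wall time plus logical counter).
type HLCTimestamp struct {
	Wall    int64
	Logical int32
}

func (t HLCTimestamp) Less(o HLCTimestamp) bool {
	return t.Wall < o.Wall || (t.Wall == o.Wall && t.Logical < o.Logical)
}

// forwardPastMinPropTS models the command-evaluation rule above: a
// command whose timestamp is less than the node's MinPropTS has its
// timestamp forwarded to MinPropTS; otherwise it is left unchanged.
func forwardPastMinPropTS(cmdTS, minPropTS HLCTimestamp) HLCTimestamp {
	if cmdTS.Less(minPropTS) {
		return minPropTS
	}
	return cmdTS
}

func main() {
	minProp := HLCTimestamp{Wall: 100}
	fmt.Println(forwardPastMinPropTS(HLCTimestamp{Wall: 50}, minProp))  // forwarded to MinPropTS
	fmt.Println(forwardPastMinPropTS(HLCTimestamp{Wall: 150}, minProp)) // unchanged
}
```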

Things get more interesting when a range quiesces. Replicas of
quiesced ranges no longer receive info on coalesced heartbeats.
However, if a replica is quiesced, we can continue to rely on the most
recent store-wide closed timestamp supplied with coalesced heartbeats,
so long as the liveness epoch (reported with heartbeats) remains stable
and no heartbeats are skipped. This can continue for as long as a range
is quiesced, but requires that the leaseholder notifies all followers
on the first heartbeat after a range is unquiesced.
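The quiescence rule above can be sketched as follows; `storeClosedTS` and `stillTrusted` are hypothetical names, and a heartbeat sequence number stands in for "no heartbeats are skipped":

```go
package main

import "fmt"

// storeClosedTS is a toy stand-in for the store-wide closed timestamp
// carried on coalesced heartbeats, along with the sender's liveness
// epoch and a heartbeat sequence number.
type storeClosedTS struct {
	closedWall int64
	epoch      int64
	seq        int64
}

// stillTrusted models the rule above: a quiesced replica keeps trusting
// the previously received store-wide closed timestamp only while the
// liveness epoch is unchanged and no heartbeat was skipped.
func stillTrusted(prev, next storeClosedTS) bool {
	return next.epoch == prev.epoch && next.seq == prev.seq+1
}

func main() {
	prev := storeClosedTS{closedWall: 100, epoch: 3, seq: 7}
	fmt.Println(stillTrusted(prev, storeClosedTS{closedWall: 110, epoch: 3, seq: 8})) // true
	fmt.Println(stillTrusted(prev, storeClosedTS{closedWall: 110, epoch: 4, seq: 8})) // false: epoch changed
	fmt.Println(stillTrusted(prev, storeClosedTS{closedWall: 110, epoch: 3, seq: 9})) // false: heartbeat skipped
}
```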

Note there is no concern that on leaseholder change, the new
leaseholder allows a write at an earlier timestamp than a previously
reported closed timestamp. This is due to the low water timestamp in
the timestamp cache being reset on leaseholder transfer to prevent
rewriting history in general.

Release note: performance improvement allowing all replicas to be used
for servicing historical reads. This is especially important in geo-
diverse contexts, where using local replicas for any non-trivial SQL
is a big win for latency.

@spencerkimball spencerkimball requested review from a team, bdarnell and tbg on December 26, 2017 22:34
@cockroach-teamcity
Member

This change is Reviewable

@spencerkimball
Member Author

I haven't written any unittests yet. Wanted to get this out so people can take a gander at the approach first.

@spencerkimball spencerkimball force-pushed the max-safe-timestamp branch 2 times, most recently from 76578f1 to 9f657fc on December 26, 2017 22:49
@andreimatei
Contributor

I cannot tell from the commit message - where does this PR want to place us on the trade-off between allowing long running txns (that want to write at arbitrarily old timestamps) and the desire to serve reads from followers?

@spencerkimball
Member Author

spencerkimball commented Dec 27, 2017 via email

@andreimatei
Contributor

andreimatei commented Dec 27, 2017 via email

@andreimatei
Contributor

andreimatei commented Dec 27, 2017 via email

@spencerkimball
Member Author

spencerkimball commented Dec 27, 2017

I've given some thought to these issues. There are pros and cons to all approaches, and nothing I've considered would be an unqualified win across the board. There are huge pitfalls with allowing arbitrarily long-running transactions which can alter long-past history. That path leads through chasms which I don't believe can be bridged. See the bullet below for SNAPSHOT, which I think is a much better solution.

  • The timestamp cache does support long running transactions, but only with toy workloads. Its seeming ability to handle long-running transactions is illusory and simply cannot be relied on.
  • With SNAPSHOT isolation, you'd still be able to have long-running transactions with this mechanism. I think using SNAPSHOT for these kinds of cases is a reasonable compromise.
  • We could still make the interactive shell work with long-running transactions if there is a single node – there is no need to send any max safe timestamps or increment the min proposal timestamp if there are no heartbeats. That may actually be most of the solution for what people desire here. Most of the time, people worry about the behavior of transactions within the CLI in test / evaluation settings.

@spencerkimball
Member Author

pkg/kv/replica_slice.go, line 95 at r1 (raw file):

}

// SortByCommonAttributePrefix rearranges the ReplicaSlice by comparing the

Wow, can't believe we were still sorting based on "attributes"? WTF?


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the max-safe-timestamp branch 2 times, most recently from c3f38e3 to d016af5 on December 27, 2017 13:26
@CLAassistant

CLAassistant commented Dec 27, 2017

CLA assistant check
All committers have signed the CLA.

@spencerkimball
Member Author

@andreimatei perhaps the interactive CLI should default to SNAPSHOT isolation...?

@andreimatei
Contributor

@andreimatei perhaps the interactive CLI should default to SNAPSHOT isolation...?

I don't think that doing special things for our CLI helps too much; we want humans to interact with our database through all sorts of tools, not just our client. I think a feature that comes with such a tradeoff has to be under some sort of control/opt-in.

Btw Spencer, you know about this RFC, right? https://reviewable.io/reviews/cockroachdb/cockroach/19222#-
That one discusses the opt-in and other considerations (e.g. the clock offset issues). We should probably move the conversation there...

@bdarnell
Contributor

Yeah, this needs to go through the RFC process (either taking over #19222 or a new doc). Some of the higher-level pieces of the RFC template are very relevant here, in particular the guide-level explanation and how this will be exposed to users, and how it will be monitored and tested. I think there's quite a bit of follow-up work here (such as making the 10s limit configurable) and we need to understand what level of polish we're committing to before starting down this path.

I'm particularly concerned about how this degrades when the read cannot be served locally, either because the local replica is behind or it hits an unresolved intent (the latter is tricky because we will try to resolve the intent remotely, and then try the read again locally even though the resolution may not have propagated yet).

The timestamp cache does support long running transactions, but only with toy workloads. Its seeming ability to handle long-running transactions is illusory and simply cannot be relied on.

Do we have any data about this one way or another? How big does a workload need to be before the timestamp cache fills up and we have to restart older transactions? Is exceeding the limit a retryable or non-retryable error (if it's retryable, then transactions that are too large to ever complete may spin endlessly. This is already a problem now if the timestamp cache gets too big, but enforcing the limit in all cases would make it worse). The imposition of a global limit like this needs its own discussion instead of being buried in this PR. (I'm generally in favor of limiting the duration of write transactions, and opposed to defaulting to SNAPSHOT in the CLI)

@spencerkimball
Member Author

spencerkimball commented Jan 2, 2018

See #21140 for a solution to long-running SERIALIZABLE transactions. It should address the CLI issue which @andreimatei raises here.

@bdarnell I would default to not making the max safe timestamp interval configurable. For one, doing so seems premature. For another, I think there may be a better way to solve the main use case which still makes a case for configurability. That use case is lower-latency change data feeds. I actually think that those should work not by setting a range-wide max safe timestamp interval lower, but by having the change feed reader update the timestamp cache of the range it's reading from. Lower-latency change feeds would need to consume from the leaseholder.

@bdarnell
Contributor

bdarnell commented Jan 8, 2018

Reviewing the commit message, since it acts as a mini-RFC:

Nodes send a max safe timestamp with coalesced heartbeats. Receipt of
a heartbeat from a node which is both the leaseholder and Raft leader
means the maxSafeTS can be trusted to apply to each follower replica
which has committed at or over a max safe log index, a new value
supplied with heartbeats.

This paragraph is unclear. Define the "max safe" terminology before using it, and be more clear about how the "max safe log index" fits in. If I'm understanding correctly, I'd rewrite this paragraph as "The max safe timestamp is a timestamp beneath which no new writes will be proposed, and therefore followers can serve reads older than this timestamp without consulting with the leaseholder. A raft leader may send a (max safe timestamp, log index) pair in its heartbeat which indicates that the follower's max safe timestamp should be updated to the given value when all raft log entries including the given index have been applied." Right?
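If that reading is right, the follower-side check could be sketched like this (`closedInfo` and `canServeFollowerRead` are illustrative names, not the actual cockroach types):

```go
package main

import "fmt"

// closedInfo is a toy stand-in for the (max safe timestamp, log index)
// pair described above.
type closedInfo struct {
	closedWall int64  // closed/max-safe timestamp (wall time only, for brevity)
	logIndex   uint64 // log entries up to this index must be applied first
}

// canServeFollowerRead returns whether a follower that has applied log
// entries up to `applied` may serve a read at `readWall` without
// consulting the leaseholder.
func canServeFollowerRead(readWall int64, applied uint64, ci closedInfo) bool {
	if applied < ci.logIndex {
		return false // haven't caught up to the promised prefix of the log
	}
	return readWall <= ci.closedWall
}

func main() {
	ci := closedInfo{closedWall: 100, logIndex: 7}
	fmt.Println(canServeFollowerRead(90, 7, ci))  // true: caught up, read below closed ts
	fmt.Println(canServeFollowerRead(90, 5, ci))  // false: behind on the log
	fmt.Println(canServeFollowerRead(110, 9, ci)) // false: read above closed ts
}
```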

Nodes keep track of their "min proposal timestamp" (MinPropTS), which
is an HLC timestamp. On every heartbeat, the MinPropTS is persisted
locally to ensure monotonicity on node restart. At startup, a node
reads the last persisted MinPropTS, and forwards the HLC clock to the
MPT timestamp + max safe interval if necessary. Nodes check MinPropTS
on command evaluation; a command's timestamp is forwarded if less than
MinPropTS.

What is the "max safe interval"? This "min proposal timestamp" is not otherwise referenced in the commit message; how is it used and how does it relate to the max safe timestamp?

Things get more interesting when a range quiesces. Replicas of
quiesced ranges no longer receive heartbeats. However, if a replica is
quiesced, we can continue to rely on the most recent maxSafeTS
supplied in coalesced heartbeats, so long as the liveness epoch
(reported with heartbeats) remains stable.

What is "more interesting" about this? Continuing to rely on the old maxSafeTS as it slips further out of date limits the usefulness of this feature - we need some way of advancing the maxSafeTS without unquiescing the range.

Because Raft leadership can
change while leaseholdership remains stable, the coalesced heartbeat
protocol includes "unsafe" range IDs to which the node's maxSafeTS no
longer applies. Any range where the leaseholder has to forward
commands to the Raft leader,

Heartbeats are sent by the leader, so does this mean that the leader marks a range as "unsafe" when it is sending a heartbeat for a range whose lease it does not hold?

or where the lease has changed, is

How long does the range stay "unsafe" after a lease change?

reported as unsafe. Removed replicas are also reported as unsafe.

Heartbeats are not sent to or from removed replicas, so what does this mean?

Nodes maintain a map from node/store ID to a maxSafeTS, which contains
a maxSafe timestamp and a map from range ID to "safe" log indexes. Any
replica which has committed to the "safe" log index can serve follower
reads at timestamps earlier than the maxSafeTS, if the node which
reported it is the valid leaseholder (requires an epoch lease).

So far, the only way in which maxSafeTS can advance is when a leader+leaseholder sends it in the heartbeat, so there is no way in which a non-leaseholder could report a maxSafeTS.

In
the event a node reports that its maxSafeTS no longer applies to a
range ID, that range ID is removed from the set.

So maxSafeTS is revokable? What if followers have already served from that timestamp?

As previously mentioned, ranges are marked as "unsafe" on lease
update, any leaseholder-not-leader writes, and on replica removal.
Untrusted ranges are kept in a map and sent with coalesced
heartbeats. They remain in the untrusted map until they're
explicitly acknowledged by the receiver to avoid allowing the
intended recipient to continue trusting the range on a missed
heartbeat.

Who is "the receiver"? What happens if the receiver is dead? If a node crashes, does it need to recover its "unsafe" map on restart somehow?

Note there is no concern that on leaseholder change, the new
leaseholder allows a write at an earlier timestamp than a previously
reported max safe timestamp. This is due to the low water timestamp in
the timestamp cache being reset on leaseholder transfer to prevent
rewriting history in general.

Release note: performance improvement allowing all replicas to be used
for servicing historical reads. This is especially important in geo-
diverse contexts, where using local replicas for any non-trivial SQL
is a big win for latency.

My biggest concern with maxSafeTS generally is not addressed in the commit message: what happens when a range unquiesces? For quiescence to make sense, followers must be able to advance their maxSafeTS without additional periodic communication with the leader, but then we need some way to guarantee that when the range unquiesces, the leader will not make any proposals that could conflict with any of its followers maxSafeTimestamps.


Review status: 0 of 21 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@tbg
Member

tbg commented Jun 4, 2018

No worries! Always appreciate writing tests.

@spencerkimball
Member Author

OK, got the unittests fully working and stress tested.

@spencerkimball spencerkimball force-pushed the max-safe-timestamp branch 4 times, most recently from 08b5545 to 40750b4 on June 11, 2018 17:29
@tbg
Member

tbg commented Jun 11, 2018

Added some explanatory comments (that will need some more smoothing out) and addressed parts of the comments. More to come.

@spencerkimball I assume you're done with that branch. If not, please let's chat.


Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 69 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

None of the other distsender metrics have a .count suffix. These metrics are already a bit of a mess but distsender.batches.followereligible might be more consistent with what's here.

Done.


pkg/kv/dist_sender.go, line 70 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/reads /read-/

"Treated as follower reads" is ambiguous - it sounds like it might mean "served from a follower", but it actually means "sent to the nearest node regardless of its lease status". Let's call this "eligible for follower reads".

Done.


pkg/kv/dist_sender.go, line 448 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/safe/closed/

Done.


pkg/kv/dist_sender.go, line 477 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Something to consider for the future: If we're eligible for follower reads, we may want to add some jitter in OptimizeReplicaOrder so load will be spread out among replicas with similar latencies (or maybe even actively prefer followers over leaseholders)

Done.


pkg/roachpb/batch.go, line 39 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This imposes an odd ordering constraint that GetActiveTimestamp may only be called before SetActiveTimestamp. Something's not right with this interface. (but the backwards-compatibility constraints are ugly since the current call to SetActiveTimestamp is on the store side and it will reject any attempts to move it earlier).

Moved the check to SetActiveTimestamp. This is a little weaker but looks a lot better.


pkg/storage/metrics.go, line 497 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Expand this description a bit. By "enforced", I assume you mean "enforced by pushing a txn".

Done. I removed this one since EnforceMinProposal already captures this adequately (we basically always have a txn, so having two almost identical metrics wasn't useful).


pkg/storage/raft_transport.go, line 572 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This change seems strange to me. I think it's trying to allow messages sent to req.RangeID == 0 without coalesced heartbeats in cases where the message contains new follower read info. That's fine, but then is len(req.Ranges) == 0 sufficient to determine this?

Done.


pkg/storage/replica.go, line 504 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This timestamp should be named instead of embedded.

Done.


pkg/storage/replica.go, line 2609 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What's the point of doing this?

Also, like Ben said above, embedding the timestamp in this closed struct makes this more confusing.

Added commentary.


pkg/storage/replica.go, line 2614 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, this should return true.

The condition here is now closedTS.Less(readTS) { return false } which looks correct.


pkg/storage/replica.go, line 2627 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't see how it's even possible to hit this case.

Ditto.


pkg/storage/replica.go, line 2652 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's surprising that a can method might have the (slow) effect of acquiring the lease. Call this out more clearly in the comment.

Done.


pkg/storage/replica.go, line 2660 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could use a comment about how we use the last confirmed closed timestamp as a fast path, and if we don't pass this check we'll try again with the unconfirmed timestamp after we have determined our lease status.

Done.


pkg/storage/replica.go, line 2706 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Follower reads should be treated the same as inconsistent reads with respect to command queue and timestamp cache handling (see beginCmds and endCmds.done). This is introducing a separate set of conditional checks which is bound to cause issues eventually.

Looks like this was done.


pkg/storage/replica.go, line 2720 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd make this condition if endCmds == nil just in case we introduce some endCmds cleanup for follower reads.

Seems obsolete.


pkg/storage/replica.go, line 2924 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd use the same verb for both of these variables, something like enforcedTSCache and enforcedMinProposal

Done.


pkg/storage/replica.go, line 2935 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Replace "timestamp cache" here with an indication of why the txn was updated.

Done.


pkg/storage/replica.go, line 3448 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we can move this up above if crt := p.command.ReplicatedEvalResult.ChangeReplicas to avoid the repetition.

Added a TODO to do this. Want to double check that there's no problem there.


pkg/storage/replica.go, line 3475 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

false /* quiesced */ (or false /* quiescing */ if we change the field name)

Done.


pkg/storage/replica.go, line 3490 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: s/r.mu.lastAssignedLeaseIndex/mlai/

Done.


pkg/storage/replica.go, line 4345 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

true /* quiescing */

Done.


pkg/storage/replica.go, line 506 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Note to self: better comment

Done.


Comments from Reviewable

@spencerkimball
Member Author

I am done; was just getting the build green finally.


Review status: :shipit: complete! 0 of 0 LGTMs obtained


Comments from Reviewable

@tbg tbg force-pushed the max-safe-timestamp branch 3 times, most recently from 8c15094 to 2b04b53 on June 13, 2018 16:35
@tbg
Member

tbg commented Jun 13, 2018

I responded to all of the comments. For the active reviewers (@nvanbenschoten and @bdarnell I think) as a heads up, while I have edited this PR lightly, I took lots of your comments and simply put them into code as TODOs in my name. My plan (or hope, rather) is to polish this up to the point where it can merge (as default-off code, after the stability week sha is picked) similar to the range merge work, and then refactor/improve individual pieces in targeted PRs.


Reviewed 1 of 20 files at r1, 10 of 21 files at r3, 1 of 8 files at r7, 2 of 11 files at r8, 9 of 18 files at r9, 4 of 6 files at r10, 7 of 7 files at r11.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/raft.proto, line 56 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What do you think about changing this name to quiescing? Right now this does a bad job conveying the importance of the first ClosedTSInfo that sets this to true.

Done.


pkg/storage/raft.proto, line 90 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

As I said in the doc, I don't see a benefit to putting these in the coalesced heartbeat and would use a separate message instead.

Added a comment.


pkg/storage/store.go, line 416 at r7 (raw file):
I agree, and it is almost trivial, but only almost, so I added this comment:

// TODO(tschottdorf): this could just be an int64 (i.e. not
// tracked per store), but this also serves a more subtle use case:
// Assume a peer store has only quiesced ranges; it wouldn't receive
// any update from this node proactively. We need to somehow keep
// track of the fact that that peer still needs to receive a store
// closed timestamp regularly. This map is used for that purpose
// by making sure that we send to every store within in each round
// of updates, even if all we're sharing is the closed timestamp.
// This approach comes with the problem that when a store disappears,
// we'll try sending to it "forever". There needs to be some expiration;
// we should probably just send to all live stores instead. Left as a
// separate PR.


pkg/storage/store.go, line 3196 at r7 (raw file):
I think this could happen:

// TODO(tschottdorf): can this already be set here? I think so, because of the
// following:
// 1. range quiesces, sends update, triggers this line at recipient
// 2. range unquiesces and immediately quiesces again, without applying any commands
// 3. this triggers another quiesced update here (I think).
//
// The origin store could avoid such duplicates, but there doesn't seem to be a
// huge point, except that it would allow an assertion here.


pkg/storage/store.go, line 4289 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The double locking here is a little scary. Can sendQueuedHeartbeats be called concurrently? If so, can updates to s.coalescedMu.closedSequences race and cause regressions in the map?

No, this is driven by a single loop with a ticker, but I agree that this isn't great. Added an assertion plus a comment for now because closedSequences will see a follow-up refactor anyway.


pkg/storage/store.go, line 4327 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The caller must verify that lh is the valid leaseholder

It feels wrong to talk about being a valid leaseholder without reference to a timestamp.

The synchronization here is questionable. The caller verified that lh was the leaseholder at some time, and then at some unspecified later time we're using that to look things up in this map. What if a new heartbeat message came in between the caller validating the lease and this method reading from closedMu.leaseholders? I can't find a specific hole here but I also can't articulate why it's correct, which worries me about concurrency.

I think I've found a concrete problem:

TODO(tschottdorf): I think the liveness check has a problem:

  1. the read timestamp gets fed into the HLC
  2. we call redirectOnOrAcquireLease which finds that the lease is live at
    a higher HLC timestamp (so far so good)
  3. we call this method, but things hang for a while. If we proceeded, we'd find
    that the range unquiesced and that the closed ts does not cover our read.
  4. the leaseholder transfers its lease away at some timestamp below our read timestamp.
  5. the new leaseholder commits a few commands, but never manages to send us a closed TS
    update, and also never gets the log replicated our way
  6. the lease transfers back to the old leaseholder, which quiesces and sends
    a closed TS that does cover our read.
  7. this method resumes and returns that higher closed timestamp.
  8. the caller can check whatever they want on the lease; it looks good. It will
    go ahead and serve a follower read, which misses the writes carried out by
    the interim leaseholder.

Note that the read timestamp here must be really close to current time, because
that's what's used for the lease transfers. However also note that we try to use
follower reads whenever we can, not just when a read timestamp lags by a few
seconds. So this just needs a bit of clock skew, where the leaseholder lags
slightly behind, and an intricate sequence of events.

We can address this by not attempting follower reads unless the requested
timestamp is well in the past (the critical threshold is the MaxOffset).


pkg/storage/store.go, line 4363 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If s.minPropMu.curRef == 0, can't we return newMinProposal as closed?

Good point. Added a TODO.


Comments from Reviewable

@tbg tbg force-pushed the max-safe-timestamp branch from 2b04b53 to 4b3cb61 on June 13, 2018 18:10
@nvb
Contributor

nvb commented Jun 18, 2018

One thing we're going to have to be careful about here is resolving intents that a read finds while performing a follower read. I fear that without any extra logic, we could fall into busy loops: a follower read finds and resolves an intent, but continues to stumble into it again and again because the follower it is reading from has fallen behind, doesn't end up in the intent resolution quorum, and as a result doesn't synchronously reflect the result of the intent resolution locally. In that case, the read will either need to continue to wait on the follower, remember that it has successfully resolved the intent (so it can ignore the WriteIntentError on its next try), or eventually be redirected to the leaseholder. I think an argument could be made to always redirect to the leaseholder when a read hits a WriteIntentError, since it will need to end up there anyway when resolving the intent.

There might be other instances of this class of race/busy loop possible with follower reads as well.
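The redirect-to-leaseholder option could be sketched as follows; all types here are toy stand-ins, not the DistSender API:

```go
package main

import (
	"errors"
	"fmt"
)

var errWriteIntent = errors.New("WriteIntentError")

// replica is a stand-in for a replica a read could be routed to.
type replica struct {
	leaseholder bool
	hasIntent   bool // follower still reflects an unresolved intent
}

func (r replica) read() (string, error) {
	// A lagging follower keeps surfacing the intent; the leaseholder
	// resolves it and serves the read.
	if r.hasIntent && !r.leaseholder {
		return "", errWriteIntent
	}
	return "value", nil
}

// sendRead tries the nearby follower first; on a WriteIntentError it
// redirects to the leaseholder instead of spinning on the same follower.
func sendRead(follower, leaseholder replica) (string, error) {
	if v, err := follower.read(); err == nil {
		return v, nil
	} else if !errors.Is(err, errWriteIntent) {
		return "", err
	}
	return leaseholder.read()
}

func main() {
	v, err := sendRead(replica{hasIntent: true}, replica{leaseholder: true})
	fmt.Println(v, err) // value <nil>
}
```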

@tbg
Member

tbg commented Jun 20, 2018

@nvanbenschoten that's a good point, I added it to a comment in DistSender in the latest revision.

@tbg tbg force-pushed the max-safe-timestamp branch 3 times, most recently from 2c2994e to 88c7d2b on June 20, 2018 11:56
@tbg
Member

tbg commented Jun 20, 2018

@nvanbenschoten I've brought the code into a form that I think is mergable (though it will need lots of follow-up work). Per-range closed timestamp information is not collected and we also never communicate ranges as quiesced, unless you run with a nonzero COCKROACH_CLOSED_TIMESTAMP_INTERVAL. PTAL

…mestamps

This change lays the groundwork for follower reads by adding the
prototype implementation into our main codebase.

Most of the logic is disabled by default. It is only exercised during
specific unit tests or when running with a nonzero
COCKROACH_CLOSED_TIMESTAMP_INTERVAL.

The code contains several potential correctness anomalies and makes
no attempt at handling lagging followers gracefully. It should not
be used outside of testing and in fact should not be used at all
(though we may want to write roachtests early).

The [follower reads RFC] and many TODOs in this code hint at the
upcoming changes. Most prominently, known correctness gotchas will be
addressed, but the role of quiescence and coalesced heartbeats will be
untangled from the main proposal, which hopefully can clarify the code
somewhat as well. In the meantime, the commit message below documents
what is implemented here, even though it is subject to change:

Nodes send a closed timestamp with coalesced heartbeats. Receipt of
a heartbeat from a node which is the leaseholder for the range means
a closed timestamp can be trusted to apply to each follower replica
which has committed at or over a min lease applied index, a new value
supplied with coalesced heartbeats.

Nodes keep track of their "min proposal timestamp" (MinPropTS), which
is an HLC timestamp. On every heartbeat, the MinPropTS is persisted
locally to ensure monotonicity on node restart. At startup, a node
reads the last persisted MinPropTS, and forwards the HLC clock to the
MPT timestamp + max safe interval if necessary. Nodes check MinPropTS
on command evaluation; a command's timestamp is forwarded if less than
MinPropTS.

Things get more interesting when a range quiesces. Replicas of quiesced
ranges no longer receive info on coalesced heartbeats.  However, if a
replica is quiesced, we can continue to rely on the most recent
store-wide closed timestamp supplied with coalesced heartbeats, so long
as the liveness epoch (reported with heartbeats) remains stable and no
heartbeats are skipped. This can continue for as long as a range is
quiesced, but requires that the leaseholder notifies all followers on
the first heartbeat after a range is unquiesced.

Note there is no concern that on leaseholder change, the new leaseholder
allows a write at an earlier timestamp than a previously reported closed
timestamp. This is due to the low water timestamp in the timestamp cache
being reset on leaseholder transfer to prevent rewriting history in
general.

Release note: None

[follower reads RFC]: cockroachdb#26362
@nvb
Contributor

nvb commented Jun 21, 2018

Reviewed 5 of 18 files at r9, 2 of 6 files at r10, 1 of 7 files at r11, 10 of 11 files at r12.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/kv/dist_sender.go, line 410 at r12 (raw file):

	// 2. txn1 reads via follower read, so leader's timestamp cache does not see the update.
	// 3. does something else, who knows, commits.
	// 3. txn2 starts at timestamp 99 and writes a value invalidating the read on the lease

Something's not adding up here. Transactional reads will only be turned into follower reads if they occur at timestamps beneath the closed timestamp of a range. How would txn2 write at timestamp 99 if the closed timestamp has already advanced past timestamp 100? Wouldn't it be pushed forward?


pkg/kv/dist_sender.go, line 451 at r12 (raw file):

		//    fail repeatedly (though that is lower priority).
		// 2. we may want to add some jitter here to spread out load among replicas
		//    with similar latencies (or even actively prefer followers over lease-

Do we get any benefit from preferring followers over leaseholders other than distributing load? If not then I don't expect this to make much of a difference because we should already be evenly distributing leaseholders. In that case, we're just undermining a small degree of locality present on each node (for instance, a well-populated block cache).


pkg/storage/client_test.go, line 1225 at r12 (raw file):

// nodes to ensure they are considered live at the updated clock time.
func (m *multiTestContext) advanceClock(ctx context.Context, inc time.Duration) error {
	for i, clock := range m.clocks {

consider pulling all of the shared code between expireLeases and advanceClocks into a helper.


pkg/storage/replica.go, line 504 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

ts is still kind of vague. Any ideas for a better name that helps distinguish it from confTS? nextTS? pendingTS? unconfTS?

I'd also leave it next to minLeaseAppliedIndex, which I'd also give the same prefix to indicate that "the unconfirmed timestamp becomes confirmed when this lease applied index is met."


pkg/storage/replica.go, line 2706 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks like this was done.

⚔️


pkg/storage/replica.go, line 502 at r12 (raw file):

		// closed holds the follower read information for the range:
		// 1. the MLAI (minimum lease applied index); follower reads can't be
		//    served until this replica's LeaseAppliedIndex has reached the MLAI.

can't be served at ts


pkg/storage/store.go, line 3181 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Isn't this a canSet==true case? This first case is the one in which the replica is immediately good to go for the incoming info. The second case is the one in which it isn't good for this new info, but it's good for the last one (or even before that) -- here we should canSet = false since we only use the previous closed timestamp, not the current one (maybe your comment got moved from the original line). I'll add commentary at some point.

It's been a while and I'm having a hard time remembering what I meant. I think I meant that the "unconfirmed" closed.ts is no longer needed, so why not reset it to make clear that no timestamps are unconfirmed?

Also, the commentary would still be useful.


Comments from Reviewable

craig bot pushed a commit that referenced this pull request Jul 18, 2018
26362: RFC: follower reads r=bdarnell,nvanbenschoten a=tschottdorf

NB: this is extracted from #21056; please don't add new commentary on the
tech note there.

----

Follower reads are consistent reads at historical timestamps from follower
replicas. They make the non-leader replicas in a range suitable sources for
historical reads.

The key enabling technology is the propagation of **closed timestamp
heartbeats** from the range leaseholder to followers. The closed timestamp
heartbeat (CT heartbeat) is more than just a timestamp. It is a set of
conditions that, if met, guarantee that a follower replica has all the state
necessary to satisfy reads at or before the CT.
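
As a rough sketch of "a set of conditions" rather than a bare timestamp, the heartbeat payload and the follower-side check might look like the Go below. Field and function names here are assumptions for illustration, not the RFC's actual schema; the epoch condition reflects the requirement that the leaseholder's liveness epoch stay stable.

```go
package main

import "fmt"

// ctHeartbeat is an illustrative model of a closed timestamp heartbeat:
// it carries the conditions a follower must meet, not just a timestamp.
type ctHeartbeat struct {
	closedWallTime int64  // closed timestamp (simplified to a wall time)
	mlai           uint64 // min lease applied index the follower must reach
	epoch          int64  // leaseholder's liveness epoch; must stay stable
}

// followerState tracks what a follower knows locally.
type followerState struct {
	appliedIndex  uint64 // how far the replica's state machine has applied
	observedEpoch int64  // epoch seen on the most recent heartbeat
}

// conditionsMet reports whether the follower satisfies every condition of
// the CT heartbeat and may therefore serve reads at or below the closed ts.
func conditionsMet(hb ctHeartbeat, fs followerState, readWallTime int64) bool {
	return fs.appliedIndex >= hb.mlai && // caught up past the MLAI
		fs.observedEpoch == hb.epoch && // liveness epoch unchanged
		readWallTime <= hb.closedWallTime // read is at or below the closed ts
}

func main() {
	hb := ctHeartbeat{closedWallTime: 100, mlai: 5, epoch: 2}
	fmt.Println(conditionsMet(hb, followerState{appliedIndex: 6, observedEpoch: 2}, 90)) // all conditions hold
	fmt.Println(conditionsMet(hb, followerState{appliedIndex: 6, observedEpoch: 3}, 90)) // epoch changed
}
```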

Consistent historical reads are useful for analytics queries and in particular
allow such queries to be carried out more efficiently and, with appropriate
configuration, away from foreground traffic. But historical reads are also key
to a proposal for [reference-like tables](#26301) aimed at cutting
down on foreign key check latencies particularly in geo-distributed clusters;
they help recover a reasonably recent consistent snapshot of a cluster after a
loss of quorum; and they are one of the ingredients for [Change Data
Capture](#25229).

Release note: None

27699: storage: fix stopper race in compactor r=petermattis a=tschottdorf

Starting workers without a surrounding task is unfortunately often not
the right thing to do when the worker accesses other state that might
become invalidated once the stopper begins to stop. In this particular
case, the compactor might end up accessing the engine even though it
had already been closed.

I wasn't able to repro this failure in the first place, but I'm pretty sure
this:
Fixes #27232.

Release note: None

27704: issues: fix email fallback r=petermattis a=tschottdorf

This was not my email address.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@tbg
Member

tbg commented Jul 30, 2018

Closing in favor of the RFC #26362 and implementation work that will be referenced from #16593.

@tbg tbg closed this Jul 30, 2018
@spencerkimball spencerkimball deleted the max-safe-timestamp branch October 22, 2018 01:10