read from a follower with timestamp bound #16593

@tbg

I ended up thinking about this tonight due to a related problem, so here are
some notes. The difficulty is making this configurable per zone. I might've
missed something.

Goals

  • assuming a read timestamp far enough in the "past", (usually) be able to read
    from any replica (think: analytics, time travel queries, backups, queries
    that can't or don't need to pay the latency to a far-away lease holder).
  • configurable at the level of zone configs (i.e. per table).

Sketch of implementation

Add a field max_write_age to the zone configs (a value of zero behaves like
MaxUint64, i.e. no limit). The idea is that the timestamp caches of the
affected ranges have a low watermark that never trails now - max_write_age.
Note that this effectively limits how long transactions can keep writing to
approximately max_write_age. In turn, for a read-only transaction, once the
current HLC timestamp has passed read_timestamp + max_write_age + max_offset,
any replica can serve the read.

  1. Add a field max_write_age to the lease proto.
  2. Whenever a lease is proposed, max_write_age is populated with the value
    the proposer believes is current.
  3. Lease extensions must not alter max_write_age. If a lease holder realizes
    that the ZoneConfig's max_write_age has changed, it must let the old lease
    expire (or transfer it away) and request a new one. In practice, this is
    only needed when max_write_age increases. There is room for optimization
    here: the replica could extend the lease with the new max_write_age, but all
    members must enforce the smaller max_write_age for as long as the "old"
    version has not expired.
  4. Make DistSender aware of max_write_age. When considering a read-only
    BatchRequest with a timestamp eligible for a follower-served read, consider
    followers, prioritizing those in close proximity.
  5. A follower which receives a read-only batch first checks if the current
    lease is active (not whether it holds the lease itself). If not, it behaves as
    it would today (requests the lease). Otherwise, if it is not the lease holder,
    it checks if the batch timestamp is eligible for a follower-served read based
    on the information in the lease and the current timestamp. If so, it serves it
    (it does not need to update the timestamp cache).
  6. On writes that violate now - max_write_age < write_ts, behave as if there
    were a timestamp cache entry at now.

An interesting observation is that this can also be modified to allow serving
read queries when Raft completely breaks down (think all inter-DC connections
fail): a replica can always serve what is "safe" based on the last known lease.
There is much more work to do to get these replicas to agree on a timestamp,
though. The resulting syntax could be something along the lines of

SELECT (...) AS OF SYSTEM TIME STALE

and DistSender would consult its cache to find the minimal timestamp covered
by all leases (but even that timestamp may not work).

Caveats

  • This relies on clocks and thus on MaxOffset, plus not having goroutines
    stalled in inconvenient locations (such a stall would violate MaxOffset too,
    but would be very unlikely to be caught): if a write passes the check but
    then gets delayed until the check no longer holds, followers may serve reads
    that are then invalidated by the write once it proceeds. (This does not seem
    more fragile than what we already have with our read leases, though.)
  • If a Range isn't split along ZoneConfig boundaries, the more restrictive
    max_write_age will be in effect.

Labels

  • A-kv-client: Relating to the KV client and the KV interface.
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
