read from a follower with timestamp bound #16593
Description
I ended up thinking about this tonight due to a related problem, so here are
some notes. The difficulty is making this zone configurable. Might've missed
something.
Goals
- assuming a read timestamp far enough in the "past", (usually) be able to read
from any replica. (think: analytics, time travel queries, backups, queries
that can't or don't need to pay the latency to a far-away lease holder).
- configurable on the level of zone configs (i.e. table)
Sketch of implementation
Add a field max_write_age to the zone configs (a value of zero behaves like
MaxUint64). The idea is that the timestamp caches of the affected ranges have
a low watermark that does not trail (now-max_write_age). Note that this
effectively limits how long transactions can write to approximately
max_write_age. In turn, when running a read-only transaction, once the
current HLC timestamp has passed read_timestamp + max_write_age + max_offset,
any replica can serve reads.
- add a field max_write_age to the lease proto.
- whenever a lease is proposed, max_write_age is populated with the value
the proposer believes is current.
- lease extensions must not alter max_write_age. If a lease holder realizes
that the ZoneConfig's max_write_age has changed, it must request a new lease
(in practice, it only has to do this in case max_write_age increases) and let
the old one expire (or transfer its lease away). There is room for optimization
here: the replica could extend the lease with the new max_write_age, but all
members must enforce the smaller max_write_age for as long as the "old"
version is not expired.
- Make DistSender aware of max_write_age. When considering a read-only
BatchRequest with a timestamp eligible for a follower-served read, consider
followers, prioritizing those in close proximity.
- A follower which receives a read-only batch first checks if the current
lease is active (not whether it holds the lease itself). If not, it behaves as
it would today (requests the lease). Otherwise, if it is not the lease holder,
it checks if the batch timestamp is eligible for a follower-served read based
on the information in the lease and the current timestamp. If so, it serves it
(it does not need to update the timestamp cache).
- on writes that violate now - max_write_age < write_ts, behave as if there
were a timestamp cache entry at now.
An interesting observation is that this can also be modified to allow serving
read queries when Raft completely breaks down (think all inter-DC connections
fail): a replica can always serve what is "safe" based on the last known lease.
There is much more work to do to get these replicas to agree on a timestamp,
though. The resulting syntax could be something along the lines of
SELECT (...) AS OF SYSTEM TIME STALE
and DistSender would consult its cache to find the minimal timestamp covered
by all leases (but even that timestamp may not work).
Caveats
- This relies on clocks and thus on MaxOffset plus not having goroutines
stalled in inconvenient locations (such a stall would violate MaxOffset too,
but be very unlikely to be caught): If a write passes the check but then gets
delayed until it doesn't hold any more, followers may serve reads that are
then invalidated by the proceeding write. (This does not seem more fragile
than what we already have with our read leases though).
- if a Range isn't split along a ZoneConfig, the more restrictive
max_write_age will be in effect.