kv: add ResolveTimestampRequest #73399
Description
Is your feature request related to a problem? Please describe.
We have these RangeFeeds which send a stream of events and then periodically send checkpoints telling the client when they've seen all events over a span up to some timestamp. The checkpoints trail the present by a while (3s) for reasons that arguably relate to cockroach's lack of buffered writes (#72614) and read pessimism (#52768, though we could do in-memory with broadcasted verification and 2PC like spanner).
Because the closed timestamp doesn't track the present, it's easy to have a scenario where a write at timestamp t2 commits and the event is sent to the watching RangeFeed, and then a separate transaction commits several seconds later (in wall time) at an earlier timestamp t1 < t2. Until the timestamp is resolved, history is mutable. However, if a transactional Scan operation occurs over that keyspan, then, for all intents and purposes other than the RangeFeed, the timestamp is now resolved up to the timestamp of the Scan. That all happens via the TimestampCache.
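To make the TimestampCache point concrete, here's a toy sketch of the mechanism: a read at timestamp t records a high-water mark, and any later attempt to write below that mark is disallowed (in practice the write's timestamp gets pushed). The type and method names here are illustrative stand-ins, not CockroachDB's actual API.

```go
package main

import "fmt"

// toyTSCache is a simplified stand-in for the TimestampCache: it tracks,
// per key, the maximum timestamp at which a read has been served.
type toyTSCache struct {
	reads map[string]int64
}

func newToyTSCache() *toyTSCache {
	return &toyTSCache{reads: make(map[string]int64)}
}

// recordScan bumps the cache for every key a transactional Scan at ts observed.
func (c *toyTSCache) recordScan(keys []string, ts int64) {
	for _, k := range keys {
		if ts > c.reads[k] {
			c.reads[k] = ts
		}
	}
}

// canCommitAt reports whether a write to key may commit at ts without
// rewriting history that a reader has already observed.
func (c *toyTSCache) canCommitAt(key string, ts int64) bool {
	return ts > c.reads[key]
}

func main() {
	c := newToyTSCache()
	// A Scan at t=10 over keys a,b effectively resolves their history up to 10.
	c.recordScan([]string{"a", "b"}, 10)
	fmt.Println(c.canCommitAt("a", 5))  // false: below the read high-water mark
	fmt.Println(c.canCommitAt("a", 11)) // true: above it
}
```

The real cache operates over spans and interacts with timestamp pushes rather than outright rejection, but the invariant is the same: once read at t, the keyspan cannot change below t.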
Sometimes it'd be really nice if the committing of a transaction were immediately followed by the resolving of some key spans. In the multi-tenant zone config design (RFC #66348, which hopefully will merge soon) we have an asynchronous task which reconciles changes to descriptors and zone configs into the system tenant. The reconciliation in the current implementation occurs only after all of the watched data has been checkpointed (this allows us to simplify a bunch of stuff).
It's not hard to imagine reasons why the SQL statements which change these configurations would want to know when the reconciliation has actually happened. Here are some:
- In a serverless setting, there's a risk that the pod will scale down before reconciliation happens.
- If we're trying to protect data using protected timestamps, the invariants all depend on when the protected timestamp actually makes it to the host cluster. If the operation issued by the client has no direct relationship to the reconciliation, it's hard to say much of anything about protected timestamps actually working.
Describe the solution you'd like
This issue proposes that we add a new non-transactional KV request ResolveTimestampRequest which takes a span and a timestamp. The operation is a writing request which scans the entire span of any range it overlaps with (the reason for resolving the whole range is that we don't have a finer-grained notion of a closed timestamp, and that seems fine for my purposes). Semantically, the request would operate mostly like an MVCC scan at its request timestamp that throws away all of the data it reads, followed by a replicated command which moves the closed timestamp up to the request timestamp. The systems we have in place for concurrency control should take care of the rest of the semantics.
The downstream effect of such a request is that all listening RangeFeeds will end up getting a checkpoint.
Describe alternatives you've considered
We could alternatively move the desire to resolve timestamps into a zone config for ranges such that all writes to some of these ranges auto-resolves to the present. That seems worse and too tightly coupled.
Additional context
The proposal then is that we'd have the schema changes which intend to lead to changes to zone configs issue such a request immediately upon committing (this can be parallelized with the wait-for-version checks if you want) and then expect reconciliation to occur promptly. We'll probably need to build a mechanism to determine the implied zone config changes of a SQL statement to make any limits work happen anyway.
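The commit-then-resolve flow above, with the resolve running in parallel with the version-wait, could be sketched as below. Both `resolveTimestamp` and `waitForVersion` are hypothetical placeholders for the KV request and the schema-change version checks, not real APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// resolveTimestamp is a placeholder for issuing the proposed
// ResolveTimestampRequest over the affected spans at the commit timestamp.
func resolveTimestamp(span string, ts int64) string {
	return fmt.Sprintf("resolved %s@%d", span, ts)
}

// waitForVersion is a placeholder for the schema change's usual
// wait-for-version checks.
func waitForVersion() string {
	return "versions converged"
}

func main() {
	commitTS := int64(42) // timestamp at which the schema change committed

	// Run the resolve in parallel with the version wait, as suggested above.
	var wg sync.WaitGroup
	results := make([]string, 2)
	wg.Add(2)
	go func() { defer wg.Done(); results[0] = resolveTimestamp("/Table/50", commitTS) }()
	go func() { defer wg.Done(); results[1] = waitForVersion() }()
	wg.Wait()

	// Once both complete, reconciliation can be expected promptly, since the
	// resolve forces a RangeFeed checkpoint at commitTS.
	fmt.Println(results[0] + "; " + results[1])
}
```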
Jira issue: CRDB-11574