kv: QueryResolvedTimestamp iterator snapshot may be captured before closed timestamp #70974
Description
In QueryResolvedTimestampRequest's evaluation function, we have the following logic:
cockroach/pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, lines 51 to 57 at 5f53feb:

```go
// Grab the closed timestamp from the local replica. We do this before
// iterating over intents to ensure that we observe any and all intents
// written before the closed timestamp went into effect. This is important
// because QueryResolvedTimestamp requests are often run without acquiring
// latches (see roachpb.INCONSISTENT) and often also on follower replicas,
// so latches won't help them to synchronize with writes.
closedTS := cArgs.EvalCtx.GetClosedTimestamp(ctx)
```
@sumeerbhola noticed in #70852 that:
unrelated to this PR, I'm worried by this comment @nvanbenschoten
- We have discussed that we would like a Reader that returns true for ConsistentIterators to contain a version of the engine from when the Reader was created. I think that would not be acceptable in this latchless situation where we want to see the engine state after the closed timestamp has been retrieved. Is my understanding correct?
- Even with the current lazy creation of the first cloneable iterator, we could have multiple requests in a batch, and a request in the batch that preceded this QueryResolvedTimestampRequest could have created that iterator that we will subsequently clone. So the iterator we create in createMinIntentTimestamp could reflect the engine state prior to this closed timestamp. Is there something preventing this?
@nvanbenschoten responded:
This is a very good catch. You are correct that this is improperly relying on the Reader's LSM snapshot being lazily captured at the time of the iterator creation. This will be completely broken when we begin capturing Reader snapshots eagerly and is broken today in cases where we have multiple QueryResolvedTimestampRequests in the same batch.
What's even more interesting is that this isn't actually causing issues today, for a very subtle reason. We do issue multiple QueryResolvedTimestampRequest in the same batch to the same range here, so we should hit the second problem. This means that for all but the first request in the batch, we could observe a closed timestamp that was carried in a later log entry than the last entry applied to the observed version of the state machine in computeMinIntentTimestamp. However, because of the way that these QueryResolvedTimestamp responses are combined, we happen to avoid any problems. Each response carries min(closed_ts, min_intent_ts), and we combine the responses by taking the min of all responses (min(min(closed_ts_1, min_intent_ts_1), min(closed_ts_2, min_intent_ts_2), ...)). So because the closed timestamp increases monotonically, we effectively ignore all closed timestamps except the first one and we end up with the correct resolved timestamp.
But regardless, we should fix this. My thinking about how to do that in a way that composes with #55461 is a little hazy. We've discussed the need to snapshot additional replica state for read-only operations that may evaluate after dropping latches. We could capture the pre-snapshot closed timestamp in this bundle to ensure that it is properly sequenced with the engine snapshot. But the closed timestamp isn't negligibly cheap to grab, so we probably wouldn't want to grab it if none of the requests needed it.
Since this isn't causing issues, I'm tempted to say that we should try to solve this as part of #55461, but if you have thoughts, I'd be happy to dive deeper into this now. I opened #70974 to continue the discussion.
Jira issue: CRDB-10315