
kv: closed timestamp can starve read-write txns that take longer than 600ms #51294

@nvb

Description


Now that we've dropped kv.closed_timestamp.target_duration down to 3s, we've seen an increase in read-write mutations that get starved by the closed timestamp. The earlier understanding was that these transactions would only be bumped by the closed timestamp if they took over 3s to commit, and that, if bumped, they would then have another 3s to refresh and complete their final batch. The first part is true, but the second part was a misunderstanding.

If a transaction takes more than 3s it will be bumped up to the current closed timestamp's value. It will then hit a retry error on its EndTxn batch. It will refresh at the old closed timestamp value and then try to commit again. If this commit is again bumped by the closed timestamp, we can enter a refresh loop that will terminate after 5 attempts and then be kicked back to the client. Since we're only ever refreshing up to the current closed timestamp, the transaction actually only has kv.closed_timestamp.target_duration * kv.closed_timestamp.close_fraction = 600ms to refresh and commit.

The easiest way to reproduce this is to perform a large DELETE statement in an implicit txn. If the DELETE takes long enough, it can get stuck in a starvation loop. This looks something like:

1. read all rows: Batch{Scan, Scan, ...}
2. try to delete all rows and commit: Batch{Delete, Delete, ..., EndTxn}
3. first two steps take over 3s so batch is pushed to `now - 3s` and hits a retry error when evaluating its EndTxn
4. refresh @ now - 3s
5. retry Batch{Delete, Delete, ..., EndTxn}
6. steps 4 and 5 took over 600ms so batch is pushed to `now - 3s`
7. retry up to 5 times before giving up and either returning the error to the client or retrying in conn_executor
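The loop above can be sketched as a toy model. Everything here is illustrative (the retry limit, durations, and names are taken from the description above, not from CRDB source):

```go
package main

import "fmt"

func main() {
	const maxRefreshAttempts = 5 // the refresh loop gives up after 5 attempts
	refreshWindow := 600         // ms between closed timestamp advances
	batchDuration := 900         // ms the retried Delete batch takes

	for attempt := 1; attempt <= maxRefreshAttempts; attempt++ {
		// The refresh at the old closed timestamp value succeeds, and the
		// batch would commit if it re-evaluated within the window...
		if batchDuration <= refreshWindow {
			fmt.Println("committed on attempt", attempt)
			return
		}
		// ...but it again takes longer than the window, so it is pushed by
		// the (now newer) closed timestamp and hits another retry error.
		fmt.Println("attempt", attempt, "pushed by closed timestamp again")
	}
	fmt.Println("giving up: return error to client or retry in conn_executor")
}
```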

This starvation is clearly undesirable, but it doesn't seem fundamental. There seem to be a few different things we could do here to improve the situation:

  1. if the closed timestamp bumped victims up to the present time instead of its current value, requests would be given more time to refresh and complete: 3s instead of 600ms. This would avoid a large class of these issues, in exchange for increasing the chance that refreshes fail under contention and increasing the contention footprint of transactions that get bumped by the closed timestamp.
  2. we noticed in storage: adaptive closed timestamp target duration #36478 that we don't need to respect the closed timestamp when re-writing intents. This means that we could avoid getting bumped by the closed timestamp in step 5 when retrying the batch of DELETEs. OR, we could revive an optimization we had previously removed and avoid issuing that prefix of the batch entirely after the refresh since it had already succeeded. We probably don't want to do this second part because it gets confusing and may not be as useful in the context of parallel commits.
  3. we also noticed in storage: adaptive closed timestamp target duration #36478 that EndTxn requests don't need to respect the closed timestamp. This means that we could ignore the closed timestamp entirely when issuing an EndTxn.

So in an optimal world, step 5 would be able to ignore the closed timestamp entirely as long as its intents were previously written in step 2.

Labels

A-kv-transactions: Relating to MVCC and the transactional model.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
