storage: SNAPSHOT isolation violation with write timestamp cache

I think the write timestamp cache is currently implemented incorrectly and, in conjunction with the RefreshSpans mechanism, can lead to serializability violations. It is also incorrect for snapshot isolation transactions even before the introduction of RefreshSpans. I haven't reproduced either error yet; I'm just writing up my notes so far.

The problematic code is in applyTimestampCache:

https://github.com/cockroachdb/cockroach/blob/7c8d6f63928ed0169d361dfd63d70a43f4b24411/pkg/storage/replica.go#L2348-L2379

We push the transaction's timestamp based on both the read and write timestamp caches. If the write timestamp cache had the higher value, we also set the WriteTooOld flag on the transaction. This means that the presence or absence of a read can change the outcome of the WriteTooOld flag.

Pre-2.0, the WriteTooOld flag was mostly optional for serializable transactions: they had to restart when their timestamp was pushed whether this flag was set or not. However, with RefreshSpans, a serializable transaction without WriteTooOld could issue its Refresh requests, see no new individual write intents, allowing it to commit at the new timestamp even though it has jumped past a write. (For snapshot transactions, the problem is more straightforward: the WriteTooOld flag directly determines whether a pushed snapshot transaction must restart. I haven't examined whether this could be the cause of #23176 yet).

I think the following set of transactions would demonstrate the error:
- INSERT INTO kv VALUES ($1, 1) ON CONFLICT DO UPDATE SET v = v+1
- DELETE FROM kv
- SELECT * FROM kv

The INSERT ON CONFLICT becomes a read-modify-write, so we can expand that into two possible orderings:

1. Read for INSERT
1. DELETE
1. Write for INSERT. The WriteTooOld flag is set and the transaction restarts.
1. Read for INSERT, seeing the outcome of the delete.
1. Write for INSERT

or

1. Read for INSERT
1. DELETE
1. SELECT
1. Write for INSERT. The timestamp gets pushed but the WriteTooOld flag is not set
1. Refresh for INSERT
1. INSERT commits at a timestamp after the DELETE, with value reflecting its read from before the DELETE.

I think the timestamp cache checks need to be updated to set the WriteTooOld flag if the OrigTimestamp is less than the write timestamp cache value (in addition to pushing).

We also need to add deletions to our jepsen tests to cover cases like this.

	// Forward the timestamp if there's been a more recent read (by someone else).
	rTS, rTxnID := r.store.tsCache.GetMaxRead(header.Key, header.EndKey)
	if ba.Txn != nil {
	if ba.Txn.ID != rTxnID {
	nextTS := rTS.Next()
	if ba.Txn.Timestamp.Less(nextTS) {
	txn := ba.Txn.Clone()
	bumped = txn.Timestamp.Forward(nextTS) \|\| bumped
	ba.Txn = &txn
	}
	}
	} else {
	bumped = ba.Timestamp.Forward(rTS.Next()) \|\| bumped
	}

	// On more recent writes, forward the timestamp and set the
	// write too old boolean for transactions. Note that currently
	// only EndTransaction and DeleteRange requests update the
	// write timestamp cache.
	wTS, wTxnID := r.store.tsCache.GetMaxWrite(header.Key, header.EndKey)
	if ba.Txn != nil {
	if ba.Txn.ID != wTxnID {
	if !wTS.Less(ba.Txn.Timestamp) {
	txn := ba.Txn.Clone()
	bumped = txn.Timestamp.Forward(wTS.Next()) \|\| bumped
	txn.WriteTooOld = true
	ba.Txn = &txn
	}
	}
	} else {
	bumped = ba.Timestamp.Forward(wTS.Next()) \|\| bumped
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: SNAPSHOT isolation violation with write timestamp cache #26460

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

storage: SNAPSHOT isolation violation with write timestamp cache #26460

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions