Skip to content

storage: SNAPSHOT isolation violation with write timestamp cache #26460

@bdarnell

Description

@bdarnell

I think the write timestamp cache is currently implemented incorrectly and, in conjunction with the RefreshSpans mechanism, can lead to serializability violations. It is also incorrect for snapshot isolation transactions even before the introduction of RefreshSpans. I haven't reproduced either error yet; I'm just writing up my notes so far.

The problematic code is in applyTimestampCache:

// Forward the timestamp if there's been a more recent read (by someone else).
rTS, rTxnID := r.store.tsCache.GetMaxRead(header.Key, header.EndKey)
if ba.Txn != nil {
if ba.Txn.ID != rTxnID {
nextTS := rTS.Next()
if ba.Txn.Timestamp.Less(nextTS) {
txn := ba.Txn.Clone()
bumped = txn.Timestamp.Forward(nextTS) || bumped
ba.Txn = &txn
}
}
} else {
bumped = ba.Timestamp.Forward(rTS.Next()) || bumped
}
// On more recent writes, forward the timestamp and set the
// write too old boolean for transactions. Note that currently
// only EndTransaction and DeleteRange requests update the
// write timestamp cache.
wTS, wTxnID := r.store.tsCache.GetMaxWrite(header.Key, header.EndKey)
if ba.Txn != nil {
if ba.Txn.ID != wTxnID {
if !wTS.Less(ba.Txn.Timestamp) {
txn := ba.Txn.Clone()
bumped = txn.Timestamp.Forward(wTS.Next()) || bumped
txn.WriteTooOld = true
ba.Txn = &txn
}
}
} else {
bumped = ba.Timestamp.Forward(wTS.Next()) || bumped
}

We push the transaction's timestamp based on both the read and write timestamp caches. If the write timestamp cache had the higher value, we also set the WriteTooOld flag on the transaction. This means that the presence or absence of a read can change the outcome of the WriteTooOld flag.

Pre-2.0, the WriteTooOld flag was mostly optional for serializable transactions: they had to restart when their timestamp was pushed whether this flag was set or not. However, with RefreshSpans, a serializable transaction without WriteTooOld could issue its Refresh requests, see no new individual write intents, allowing it to commit at the new timestamp even though it has jumped past a write. (For snapshot transactions, the problem is more straightforward: the WriteTooOld flag directly determines whether a pushed snapshot transaction must restart. I haven't examined whether this could be the cause of #23176 yet).

I think the following set of transactions would demonstrate the error:

  • INSERT INTO kv VALUES ($1, 1) ON CONFLICT DO UPDATE SET v = v+1
  • DELETE FROM kv
  • SELECT * FROM kv

The INSERT ON CONFLICT becomes a read-modify-write, so we can expand that into two possible orderings:

  1. Read for INSERT
  2. DELETE
  3. Write for INSERT. The WriteTooOld flag is set and the transaction restarts.
  4. Read for INSERT, seeing the outcome of the delete.
  5. Write for INSERT

or

  1. Read for INSERT
  2. DELETE
  3. SELECT
  4. Write for INSERT. The timestamp gets pushed but the WriteTooOld flag is not set
  5. Refresh for INSERT
  6. INSERT commits at a timestamp after the DELETE, with value reflecting its read from before the DELETE.

I think the timestamp cache checks need to be updated to set the WriteTooOld flag if the OrigTimestamp is less than the write timestamp cache value (in addition to pushing).

We also need to add deletions to our jepsen tests to cover cases like this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-kv-transactionsRelating to MVCC and the transactional model.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions