kv: immediately propagate WriteTooOld error from DeleteRange requests #56458

craig[bot] merged 3 commits into cockroachdb:master
Conversation
32f2a87 to 1ada116
I tested this out and it looks like we're ok. The reason for this is because, as of da5dc5b, we don't actually defer the WTO status to future batches. So that means that if a DeleteRange request hits a WriteTooOld error, it will refresh and will be immediately re-issued at a refreshed timestamp. This re-issue will then succeed and will delete all keys.
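The refresh-and-reissue flow described above can be sketched as a toy loop. This is a minimal sketch only: `sendBatch`, `runWithRefresh`, and the hard-coded timestamps are hypothetical stand-ins for the real client-side retry machinery, not CockroachDB code.

```go
package main

import (
	"errors"
	"fmt"
)

// errWriteTooOld stands in for a WriteTooOldError, which in the real
// system also carries the timestamp the write was pushed to.
var errWriteTooOld = errors.New("write too old")

// sendBatch simulates a server that rejects the batch while its
// timestamp is at or below a conflicting value at ts 5, and succeeds
// once the batch is re-issued above it.
func sendBatch(ts int64) error {
	if ts <= 5 {
		return errWriteTooOld
	}
	return nil
}

// runWithRefresh is a toy version of the client-side behavior: on a
// WriteTooOld error, refresh (verify prior reads still hold) and
// immediately re-issue the whole batch at the bumped timestamp,
// rather than deferring the condition to a future batch.
func runWithRefresh(ts int64) (int64, error) {
	for {
		err := sendBatch(ts)
		if err == nil {
			return ts, nil
		}
		if errors.Is(err, errWriteTooOld) {
			ts = 6 // refreshed timestamp, just above the conflict
			continue
		}
		return ts, err
	}
}

func main() {
	ts, err := runWithRefresh(3)
	fmt.Println(ts, err) // re-issued once, succeeds at ts 6
}
```

Because the batch is always re-issued immediately, the second attempt evaluates the full DeleteRange at the refreshed timestamp and deletes every key.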
andreimatei left a comment
The reason for this is because, as of da5dc5b, we don't actually defer the WTO status to future batches.
Makes sense. But then I'm thinking that we don't need part of this patch - setting the isRead flag on DeleteRange. If we don't set that flag, the DeleteRange would leave some intents behind when it returns the wto flag, and that seems to me to be a good thing. We'd be taking advantage of the fact that DeleteRange always writes the same values (i.e. tombstones) regardless of the timestamps at which it evaluates. In fact, it seems to me that it'd be good for the DeleteRange to continue past the first value at a higher ts and continue putting down tombstones for the next keys. We'd then be relying on the client always retrying the batch - see below.
Having said this, I am surprised that DeleteRange doesn't have the read flag given that it's pretty clearly a read since it counts keys. What else does the read flag do other than this wto behavior?
In any case, this all makes me think that we should spell out the contract around the wto flag more. Is the client free to defer dealing with it or not? Right now it is (as explained here), but the client never chooses to do that. If you agree that it's convenient, maybe we close the door to the client not retrying the batch (or we introduce another flag on the response specifying if the retry is mandatory or not).
Or, if we don't do that, let's put a comment on WriteTooOldError saying that, if a non-read request returns that (an individual request, not a batch), then it must make damn sure that it performed all its work because it might not be retried.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @itsbilal)
It also returns keys when the `ReturnKeys` flag is set.
Well, it causes the behavior at lines 117 to 128 in 9c67b26.
I don't think we should add any restrictions to the client. The server should either return a WriteTooOld error if any of the requests in a batch can't handle deferring a WTO condition that they created, or it should set the wto flag on the response.
Done.
andreimatei left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)
f6ce8b7 to a6dfb8f
TFTR! @itsbilal, do you have any interest in taking a pass over the
The purpose of the "fail on more recent" scanning mode is to detect cases where a subsequent write to a key being scanned would hit a WriteTooOld error (see `if readTimestamp.LessEq(metaTimestamp)` in `mvccPutInternal`). This is useful for locking scans and soon will be used for MVCCDeleteRange. The mode was working correctly for values above the read timestamp, but not for values at the same time as the read timestamp. This commit fixes this, ensuring that values at the same timestamp result in a WriteTooOld error. I think an argument could be made that "more recent" is no longer quite accurate, but I don't have a better suggestion for the name. Maybe "fail on equal or more recent", but that's starting to become pretty verbose.
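The comparison at the heart of this scanning mode can be illustrated with a one-field stand-in for `hlc.Timestamp`. This is a sketch only: `conflicts` is a hypothetical helper, not the real `mvccPutInternal` code. The `LessEq` (rather than `Less`) is what makes values at exactly the read timestamp conflict too, which is the case fixed here.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp.
type Timestamp struct {
	WallTime int64
}

// LessEq mirrors hlc.Timestamp.LessEq for this one-field sketch.
func (t Timestamp) LessEq(o Timestamp) bool { return t.WallTime <= o.WallTime }

// conflicts reports whether a scan in "fail on more recent" mode should
// raise a WriteTooOld error, given the scan's read timestamp and the
// timestamp of an existing value on the key. Note LessEq, not Less: a
// value at exactly the read timestamp also conflicts.
func conflicts(readTimestamp, metaTimestamp Timestamp) bool {
	return readTimestamp.LessEq(metaTimestamp)
}

func main() {
	read := Timestamp{WallTime: 10}
	fmt.Println(conflicts(read, Timestamp{WallTime: 11})) // value above read ts: true
	fmt.Println(conflicts(read, Timestamp{WallTime: 10})) // value at read ts: true (the fixed case)
	fmt.Println(conflicts(read, Timestamp{WallTime: 9}))  // value below read ts: false
}
```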
Fixes cockroachdb#56111.

This commit fixes a bug that allowed non-serializable state to leak from the return value of a DeleteRange request. DeleteRange requests were not returning WriteTooOld errors on conflicting tombstones at higher timestamps, so they were allowing conflicting DELETEs to influence their returned keys. This, in turn, permitted cases where a DELETE's "rows updated" return value would disagree with previously observed state in a transaction.

Worse than this, I think there might have even been a bug with deferring the WriteTooOld status from a DeleteRange request, because unlike most MVCC operations, MVCCDeleteRange does not complete before returning a WriteTooOldError. I think this means that it could have partially succeeded and then returned a WriteTooOldError. If the write too old status was deferred, this would result in the DeleteRange request failing to delete some keys. This is why "blind-write" requests must complete entirely even if they intend to return a WriteTooOldError (see `mvccPutInternal`). We don't particularly like this behavior, but it's been around for years.

The fix for this issue is twofold. First, this commit fixes MVCCDeleteRange. It uses the relatively new `FailOnMoreRecent` flag in its initial scan to ensure that a WriteTooOld error is returned even in the presence of tombstones at equal or higher timestamps. This has the side effect of letting us get rid of some strange code added in ee80e3d which scanned at MaxTimestamp. Second, this commit fixes the DeleteRange request by giving it the `isRead` flag. This ensures that the request is never considered a blind-write and therefore never defers a "write too old" status.

I intend to backport this to release-20.1 and release-20.2.

Release notes (bug fix): DELETE statements no longer have a chance of returning an incorrect number of deleted rows in transactions that will eventually need to restart due to contention.
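The partial-execution hazard described above can be modeled with a toy store. Everything here is hypothetical (the real logic lives in `MVCCDeleteRange`): the toy operation stops at the first conflicting key, so if the caller deferred the WriteTooOld condition instead of retrying, the keys after the conflict would silently survive.

```go
package main

import (
	"fmt"
	"sort"
)

// store is a toy MVCC keyspace: key -> timestamp of its latest value.
type store map[string]int64

// deleteRange deletes keys in [start, end) at writeTS, but stops at the
// first key with a value at or above writeTS, mimicking an MVCC operation
// that does not complete all of its work before returning WriteTooOld.
func deleteRange(s store, start, end string, writeTS int64) (deleted []string, writeTooOld bool) {
	var keys []string
	for k := range s {
		if k >= start && k < end {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	for _, k := range keys {
		if s[k] >= writeTS {
			// Partial execution: keys after this one are never touched.
			return deleted, true
		}
		delete(s, k)
		deleted = append(deleted, k)
	}
	return deleted, false
}

func main() {
	s := store{"a": 1, "b": 5, "c": 1} // "b" has a conflicting value at ts 5
	deleted, wto := deleteRange(s, "a", "z", 3)
	fmt.Println(deleted, wto, len(s)) // only "a" was deleted; "c" survives
}
```

If the WriteTooOld condition were deferred rather than forcing a retry, "c" would never be deleted, which is exactly why the request must either complete all of its work or be guaranteed a re-issue.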
a6dfb8f to 24f8c29
bors r+
Build succeeded.