
kv: immediately propagate WriteTooOld error from DeleteRange requests#56458

Merged
craig[bot] merged 3 commits into cockroachdb:master from nvb:nvanbenschoten/deleteRangeWTO
Nov 19, 2020

Conversation

@nvb
Contributor

@nvb nvb commented Nov 9, 2020

Fixes #56111.

This commit fixes a bug that allowed non-serializable state to leak from
the return value of a DeleteRange request. DeleteRange requests were not
returning WriteTooOld errors on conflicting tombstones at higher
timestamps, so they were allowing conflicting DELETEs to influence their
returned keys. This, in turn, was permitting cases where a DELETE's
rows-updated return value would disagree with previously observed state
in a transaction.

Worse than this, I think there might have even been a bug with deferring
the WriteTooOld status from a DeleteRange request, because unlike most
MVCC operations, MVCCDeleteRange does not complete before returning a
WriteTooOldError. I think this means that it could have partially
succeeded and then returned a WriteTooOldError. If the write too old
status was deferred, this would result in the DeleteRange request
failing to delete some keys. This is why "blind-write" requests must
complete entirely even if they intend to return a WriteTooOldError (see
mvccPutInternal). We don't particularly like this behavior, but it's
been around for years.

The fix for this issue is twofold. First, this commit fixes
MVCCDeleteRange. It uses the relatively new FailOnMoreRecent flag in
its initial scan to ensure that a WriteTooOld error is returned even in
the presence of tombstones at equal or higher timestamps. This has the
side-effect of letting us get rid of some strange code added in ee80e3d
which scanned at MaxTimestamp. Second, this commit fixes the DeleteRange
request by giving it the isRead flag. This ensures that the request is
never considered a blind-write and therefore never defers a "write too
old" status.
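As a minimal, self-contained sketch of why the ordinary scan missed these conflicts (the `version` type and `scanKey` helper here are invented for illustration; the real logic lives in pkg/storage's MVCC scanner): a plain scan at the read timestamp simply skips versions above it, including tombstones, whereas the fail-on-more-recent mode reports them as conflicts.

```go
package main

import "fmt"

// version is a simplified MVCC version: a value written at a timestamp.
// An empty value models a tombstone. These names are hypothetical.
type version struct {
	ts    int64
	value string // "" means a tombstone
}

// scanKey returns the visible value for one key's version history (sorted
// newest first) at readTS. With failOnMoreRecent set, any version at a
// timestamp >= readTS -- including a tombstone -- is reported as a conflict
// (which the real code surfaces as a WriteTooOldError) instead of being
// silently skipped.
func scanKey(history []version, readTS int64, failOnMoreRecent bool) (string, bool) {
	for _, v := range history {
		if v.ts >= readTS {
			if failOnMoreRecent {
				return "", true // conflict: would raise WriteTooOld
			}
			continue // ordinary scan: newer versions are invisible
		}
		return v.value, false
	}
	return "", false
}

func main() {
	// Key written at ts=5, then deleted (tombstone) at ts=20.
	h := []version{{20, ""}, {5, "a"}}
	v, conflict := scanKey(h, 10, false)
	fmt.Println(v, conflict) // plain scan at ts=10 still sees "a", no conflict
	_, conflict = scanKey(h, 10, true)
	fmt.Println(conflict) // fail-on-more-recent reports the conflict
}
```

This is why, without the flag, a conflicting tombstone at a higher timestamp could silently shrink the set of keys the DeleteRange reported.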

I intend to backport this to release-20.1 and release-20.2.

Release notes (bug fix): DELETE statements no longer have a chance of
returning an incorrect number of deleted rows in transactions that will
eventually need to restart due to contention.

@nvb nvb requested review from andreimatei and itsbilal November 9, 2020 20:43
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/deleteRangeWTO branch from 32f2a87 to 1ada116 on November 9, 2020 23:27
@nvb
Contributor Author

nvb commented Nov 10, 2020

Worse than this, I think there might have even been a bug with deferring
the WriteTooOld status from a DeleteRange request, because unlike most
MVCC operations, MVCCDeleteRange does not complete before returning a
WriteTooOldError. I think this means that it could have partially
succeeded and then returned a WriteTooOldError. If the write too old
status was deferred, this would result in the DeleteRange request
failing to delete some keys.

I tested this out and it looks like we're ok. The reason is that, as of da5dc5b, we don't actually defer the WTO status to future batches. That means that if a DeleteRange request hits a WriteTooOld error, it will refresh and be immediately re-issued at the refreshed timestamp. The re-issued request will then succeed and delete all keys.
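A sketch of the refresh-and-reissue loop described here, under a deliberately simplified model (the `issueDeleteRange` helper, its conflict rule, and the error type are invented for illustration; the real client logic lives in the kv client's transaction coordinator):

```go
package main

import (
	"errors"
	"fmt"
)

// writeTooOldError carries the timestamp of the conflicting write, loosely
// modeled on the real WriteTooOldError.
type writeTooOldError struct{ actualTS int64 }

func (e *writeTooOldError) Error() string {
	return fmt.Sprintf("WriteTooOldError: conflicting write at ts %d", e.actualTS)
}

// issueDeleteRange stands in for sending the batch. It conflicts with an
// existing tombstone at conflictTS; on success it deletes all three keys.
func issueDeleteRange(readTS, conflictTS int64) (deleted int, err error) {
	if readTS <= conflictTS {
		return 0, &writeTooOldError{actualTS: conflictTS}
	}
	return 3, nil
}

// runWithRefresh models the client loop: on WriteTooOld, refresh the read
// timestamp above the conflict and immediately re-issue the request, so the
// retry deletes every key rather than leaving a partial result behind.
func runWithRefresh(readTS, conflictTS int64) (int, int64) {
	for {
		n, err := issueDeleteRange(readTS, conflictTS)
		var wto *writeTooOldError
		if errors.As(err, &wto) {
			readTS = wto.actualTS + 1 // refresh past the conflict
			continue
		}
		return n, readTS
	}
}

func main() {
	n, ts := runWithRefresh(10, 20)
	fmt.Println(n, ts) // full delete succeeds at the refreshed timestamp
}
```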

Contributor

@andreimatei andreimatei left a comment


The reason for this is because, as of da5dc5b, we don't actually defer the WTO status to future batches.

Makes sense. But then I'm thinking that we don't need part of this patch - setting the isRead flag on DeleteRange. If we don't set that flag, the DeleteRange would leave some intents behind when it returns the wto flag, and that seems to me to be a good thing. We'd be taking advantage of the fact that DeleteRange always writes the same values (i.e. tombstones) regardless of the timestamps at which it evaluates. In fact, it seems to me that it'd be good for the DeleteRange to continue past the first value at a higher ts and continue putting down tombstones for the next keys. We'd then be relying on the client always retrying the batch - see below.
Having said this, I am surprised that DeleteRange doesn't have the read flag given that it's pretty clearly a read since it counts keys. What else does the read flag do other than this wto behavior?

In any case, this all makes me think that we should spell out the contract around the wto flag more. Is the client free to defer dealing with it or not? Right now it is (as explained here), but it never chooses to do that. If you agree that it's convenient, maybe we close the door on the client not retrying the batch (or we introduce another flag on the response specifying whether the retry is mandatory).
Or, if we don't do that, let's put a comment on WriteTooOldError saying that, if a non-read request returns that (an individual request, not a batch), then it must make damn sure that it performed all its work because it might not be retried.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @itsbilal)

@nvb
Contributor Author

nvb commented Nov 11, 2020

Having said this, I am surprised that DeleteRange doesn't have the read flag given that it's pretty clearly a read since it counts keys.

It also returns keys when the ReturnKeys option is set. So I think it was just a blatant omission for it not to have the isRead flag.

What else does the read flag do other than this wto behavior?

Well, it causes IsBlindWrite to return false, which is how we've moved toward defining which requests can defer WriteTooOld errors and which cannot.

// IsBlindWrite returns true iff the request is a blind-write. A request is a
// blind-write if it is a write that does not observe any key-value state when
// modifying that state (such as puts and deletes). This is in contrast with
// read-write requests, which do observe key-value state when modifying the
// state and may base their modifications off of this existing state (such as
// conditional puts and increments).
//
// As a result of being "blind", blind-writes are allowed to be more freely
// re-ordered with other writes. In practice, this means that they can be
// evaluated with a read timestamp below another write and a write timestamp
// above that write without needing to re-evaluate. This allows the WriteTooOld
// error that is generated during such an occurrence to be deferred.
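The classification that the quoted comment describes can be sketched as follows (a simplified model: the `request` struct and flag fields here are illustrative, not the real bitmask flags in the kv package):

```go
package main

import "fmt"

// request is a toy model of a KV request with read/write flags.
// The real implementation uses a flags bitmask; these names are invented.
type request struct {
	name    string
	isRead  bool
	isWrite bool
}

// isBlindWrite mirrors the quoted contract: a write that observes no
// key-value state while modifying it. Giving DeleteRange the isRead flag
// makes this return false, so its WriteTooOld status can no longer be
// deferred.
func isBlindWrite(r request) bool {
	return r.isWrite && !r.isRead
}

func main() {
	put := request{name: "Put", isWrite: true}
	cput := request{name: "ConditionalPut", isRead: true, isWrite: true}
	delRange := request{name: "DeleteRange", isRead: true, isWrite: true} // after this PR
	fmt.Println(isBlindWrite(put), isBlindWrite(cput), isBlindWrite(delRange))
}
```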

In any case, this all makes me think that we should spell out the contract around the wto flag more. Is the client free to defer dealing with it or not? Right now it is (as explained here, but it never chooses to do that. If you agree that it's convenient, maybe we close the door to the client not retrying the batch (or we introduce another flag on the response specifying if the retry is mandatory or not).

I don't think we should add any restrictions to the client. The server should either return a WriteTooOld error if any of the requests in a batch can't handle deferring a WTO condition that they created, or it should set the wto flag and the client should be able to deal with that whenever it wants to.

Or, if we don't do that, let's put a comment on WriteTooOldError saying that, if a non-read request returns that (an individual request, not a batch), then it must make damn sure that it performed all its work because it might not be retried.

Done.

Contributor

@andreimatei andreimatei left a comment


:LGTM:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)

@nvb nvb force-pushed the nvanbenschoten/deleteRangeWTO branch 2 times, most recently from f6ce8b7 to a6dfb8f on November 13, 2020 23:36
@nvb
Contributor Author

nvb commented Nov 17, 2020

TFTR!

@itsbilal do you have any interest in taking a pass over the pkg/storage changes in this PR? The changes to MVCCDeleteRange are somewhat large, even though only a few lines of code were touched.

nvb added 3 commits November 19, 2020 15:13
The purpose of the "fail on more recent" scanning mode is to detect
cases where a subsequent write to a key being scanned would hit a
WriteTooOld error (see `if readTimestamp.LessEq(metaTimestamp)` in
`mvccPutInternal`). This is useful for locking scans and soon will be
used for MVCCDeleteRange.

The mode was working correctly for values above the read timestamp,
but not for values at the same time as the read timestamp. This commit
fixes this, ensuring that values at the same timestamp result in a
WriteTooOld error.

I think an argument could be made that "more recent" is no longer
quite accurate, but I don't have a better suggestion for the name.
Maybe "fail on equal or more recent", but that's starting to become
pretty verbose.
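The fix boils down to matching the inclusive comparison in the `if readTimestamp.LessEq(metaTimestamp)` check quoted above. A sketch with plain integer timestamps (the real comparison is on hlc timestamps; this is just the comparison semantics):

```go
package main

import "fmt"

// conflicts mirrors the mvccPutInternal check: a writer at readTS conflicts
// with an existing version at metaTS when readTS <= metaTS, i.e. LessEq,
// not a strict Less. The fixed scanning mode must use the same inclusive
// comparison so that a value at exactly the read timestamp also triggers
// WriteTooOld.
func conflicts(readTS, metaTS int64) bool {
	return readTS <= metaTS // readTimestamp.LessEq(metaTimestamp)
}

func main() {
	fmt.Println(conflicts(10, 11)) // more recent: conflict
	fmt.Println(conflicts(10, 10)) // equal: also a conflict (the fixed case)
	fmt.Println(conflicts(10, 9))  // older: no conflict
}
```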
@nvb nvb force-pushed the nvanbenschoten/deleteRangeWTO branch from a6dfb8f to 24f8c29 on November 19, 2020 20:13
@nvb
Contributor Author

nvb commented Nov 19, 2020

bors r+

@craig
Contributor

craig bot commented Nov 19, 2020

Build succeeded:

@craig craig bot merged commit cec4342 into cockroachdb:master Nov 19, 2020
@nvb nvb deleted the nvanbenschoten/deleteRangeWTO branch November 23, 2020 02:49
Successfully merging this pull request may close these issues.

Bug caused by delete operations within a transaction
