
kv: immediately propagate WriteTooOld error from DeleteRange requests#56458

Merged
craig[bot] merged 3 commits into cockroachdb:master from nvb:nvanbenschoten/deleteRangeWTO
Nov 19, 2020

Conversation

@nvb
Contributor

@nvb nvb commented Nov 9, 2020

Fixes #56111.

This commit fixes a bug that allowed non-serializable state to leak from
the return value of a DeleteRange request. DeleteRange requests were not
returning WriteTooOld errors on conflicting tombstones at higher
timestamps, so they were allowing conflicting DELETEs to influence their
returned keys. This, in turn, was permitting cases where a DELETE's
rows-updated return value would disagree with previously observed state
in a transaction.

Worse than this, I think there might have even been a bug with deferring
the WriteTooOld status from a DeleteRange request, because unlike most
MVCC operations, MVCCDeleteRange does not complete before returning a
WriteTooOldError. I think this means that it could have partially
succeeded and then returned a WriteTooOldError. If the write too old
status was deferred, this would result in the DeleteRange request
failing to delete some keys. This is why "blind-write" requests must
complete entirely even if they intend to return a WriteTooOldError (see
mvccPutInternal). We don't particularly like this behavior, but it's
been around for years.

The fix for this issue is twofold. First, this commit fixes
MVCCDeleteRange. It uses the relatively new FailOnMoreRecent flag in
its initial scan to ensure that a WriteTooOld error is returned even in
the presence of tombstones at equal or higher timestamps. This has the
side-effect of letting us get rid of some strange code added in ee80e3d
which scanned at MaxTimestamp. Second, this commit fixes the DeleteRange
request by giving it the isRead flag. This ensures that the request is
never considered a blind-write and therefore never defers a "write too
old" status.
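As a minimal, self-contained sketch of why the ordinary scan missed these conflicts (the `version` type and `scanKey` helper here are invented for illustration; the real logic lives in pkg/storage's MVCC scanner): a plain scan at the read timestamp simply skips versions above it, including tombstones, whereas the fail-on-more-recent mode reports them as conflicts.

```go
package main

import "fmt"

// version is a simplified MVCC version: a value written at a timestamp.
// An empty value models a tombstone. These names are hypothetical.
type version struct {
	ts    int64
	value string // "" means a tombstone
}

// scanKey returns the visible value for one key's version history (sorted
// newest first) at readTS. With failOnMoreRecent set, any version at a
// timestamp >= readTS -- including a tombstone -- is reported as a conflict
// (which the real code surfaces as a WriteTooOldError) instead of being
// silently skipped.
func scanKey(history []version, readTS int64, failOnMoreRecent bool) (string, bool) {
	for _, v := range history {
		if v.ts >= readTS {
			if failOnMoreRecent {
				return "", true // conflict: would raise WriteTooOld
			}
			continue // ordinary scan: newer versions are invisible
		}
		return v.value, false
	}
	return "", false
}

func main() {
	// Key written at ts=5, then deleted (tombstone) at ts=20.
	h := []version{{20, ""}, {5, "a"}}
	v, conflict := scanKey(h, 10, false)
	fmt.Println(v, conflict) // plain scan at ts=10 still sees "a", no conflict
	_, conflict = scanKey(h, 10, true)
	fmt.Println(conflict) // fail-on-more-recent reports the conflict
}
```

This is why, without the flag, a conflicting tombstone at a higher timestamp could silently shrink the set of keys the DeleteRange reported.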

I intend to backport this to release-20.1 and release-20.2.

Release notes (bug fix): DELETE statements no longer have a chance of
returning an incorrect number of deleted rows in transactions that will
eventually need to restart due to contention.

@nvb nvb requested review from andreimatei and itsbilal November 9, 2020 20:43
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/deleteRangeWTO branch from 32f2a87 to 1ada116 on November 9, 2020 23:27
@nvb
Contributor Author

nvb commented Nov 10, 2020

Worse than this, I think there might have even been a bug with deferring
the WriteTooOld status from a DeleteRange request, because unlike most
MVCC operations, MVCCDeleteRange does not complete before returning a
WriteTooOldError. I think this means that it could have partially
succeeded and then returned a WriteTooOldError. If the write too old
status was deferred, this would result in the DeleteRange request
failing to delete some keys.

I tested this out and it looks like we're ok. The reason is that, as of da5dc5b, we don't actually defer the WTO status to future batches. That means that if a DeleteRange request hits a WriteTooOld error, it will refresh and be immediately re-issued at the refreshed timestamp. The re-issued request will then succeed and delete all keys.
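A sketch of the refresh-and-reissue loop described here, under a deliberately simplified model (the `issueDeleteRange` helper, its conflict rule, and the error type are invented for illustration; the real client logic lives in the kv client's transaction coordinator):

```go
package main

import (
	"errors"
	"fmt"
)

// writeTooOldError carries the timestamp of the conflicting write, loosely
// modeled on the real WriteTooOldError.
type writeTooOldError struct{ actualTS int64 }

func (e *writeTooOldError) Error() string {
	return fmt.Sprintf("WriteTooOldError: conflicting write at ts %d", e.actualTS)
}

// issueDeleteRange stands in for sending the batch. It conflicts with an
// existing tombstone at conflictTS; on success it deletes all three keys.
func issueDeleteRange(readTS, conflictTS int64) (deleted int, err error) {
	if readTS <= conflictTS {
		return 0, &writeTooOldError{actualTS: conflictTS}
	}
	return 3, nil
}

// runWithRefresh models the client loop: on WriteTooOld, refresh the read
// timestamp above the conflict and immediately re-issue the request, so the
// retry deletes every key rather than leaving a partial result behind.
func runWithRefresh(readTS, conflictTS int64) (int, int64) {
	for {
		n, err := issueDeleteRange(readTS, conflictTS)
		var wto *writeTooOldError
		if errors.As(err, &wto) {
			readTS = wto.actualTS + 1 // refresh past the conflict
			continue
		}
		return n, readTS
	}
}

func main() {
	n, ts := runWithRefresh(10, 20)
	fmt.Println(n, ts) // full delete succeeds at the refreshed timestamp
}
```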

Contributor

@andreimatei andreimatei left a comment


The reason for this is because, as of da5dc5b, we don't actually defer the WTO status to future batches.

Makes sense. But then I'm thinking that we don't need part of this patch - setting the isRead flag on DeleteRange. If we don't set that flag, the DeleteRange would leave some intents behind when it returns the wto flag, and that seems to me to be a good thing. We'd be taking advantage of the fact that DeleteRange always writes the same values (i.e. tombstones) regardless of the timestamps at which it evaluates. In fact, it seems to me that it'd be good for the DeleteRange to continue past the first value at a higher ts and continue putting down tombstones for the next keys. We'd then be relying on the client always retrying the batch - see below.
Having said this, I am surprised that DeleteRange doesn't have the read flag given that it's pretty clearly a read since it counts keys. What else does the read flag do other than this wto behavior?

In any case, this all makes me think that we should spell out the contract around the wto flag more. Is the client free to defer dealing with it or not? Right now it is (as explained here), but it never chooses to do that. If you agree that it's convenient, maybe we close the door on the client not retrying the batch (or we introduce another flag on the response specifying whether the retry is mandatory).
Or, if we don't do that, let's put a comment on WriteTooOldError saying that, if a non-read request returns that (an individual request, not a batch), then it must make damn sure that it performed all its work because it might not be retried.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @itsbilal)

@nvb
Contributor Author

nvb commented Nov 11, 2020

Having said this, I am surprised that DeleteRange doesn't have the read flag given that it's pretty clearly a read since it counts keys.

It also returns keys when the ReturnKeys option is set. So I think it was just a blatant omission for it not to have the isRead flag.

What else does the read flag do other than this wto behavior?

Well, it causes IsBlindWrite to return false, which is how we've moved toward defining which requests can defer WriteTooOld errors and which cannot.

// IsBlindWrite returns true iff the request is a blind-write. A request is a
// blind-write if it is a write that does not observe any key-value state when
// modifying that state (such as puts and deletes). This is in contrast with
// read-write requests, which do observe key-value state when modifying the
// state and may base their modifications off of this existing state (such as
// conditional puts and increments).
//
// As a result of being "blind", blind-writes are allowed to be more freely
// re-ordered with other writes. In practice, this means that they can be
// evaluated with a read timestamp below another write and a write timestamp
// above that write without needing to re-evaluate. This allows the WriteTooOld
// error that is generated during such an occurrence to be deferred.
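The classification that the quoted comment describes can be sketched as follows (a simplified model: the `request` struct and flag fields here are illustrative, not the real bitmask flags in the kv package):

```go
package main

import "fmt"

// request is a toy model of a KV request with read/write flags.
// The real implementation uses a flags bitmask; these names are invented.
type request struct {
	name    string
	isRead  bool
	isWrite bool
}

// isBlindWrite mirrors the quoted contract: a write that observes no
// key-value state while modifying it. Giving DeleteRange the isRead flag
// makes this return false, so its WriteTooOld status can no longer be
// deferred.
func isBlindWrite(r request) bool {
	return r.isWrite && !r.isRead
}

func main() {
	put := request{name: "Put", isWrite: true}
	cput := request{name: "ConditionalPut", isRead: true, isWrite: true}
	delRange := request{name: "DeleteRange", isRead: true, isWrite: true} // after this PR
	fmt.Println(isBlindWrite(put), isBlindWrite(cput), isBlindWrite(delRange))
}
```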

In any case, this all makes me think that we should spell out the contract around the wto flag more. Is the client free to defer dealing with it or not? Right now it is (as explained here, but it never chooses to do that. If you agree that it's convenient, maybe we close the door to the client not retrying the batch (or we introduce another flag on the response specifying if the retry is mandatory or not).

I don't think we should add any restrictions to the client. The server should either return a WriteTooOld error if any of the requests in a batch can't handle deferring a WTO condition that they created, or it should set the wto flag and the client should be able to deal with that whenever it wants to.

Or, if we don't do that, let's put a comment on WriteTooOldError saying that, if a non-read request returns that (an individual request, not a batch), then it must make damn sure that it performed all its work because it might not be retried.

Done.

Contributor

@andreimatei andreimatei left a comment


:LGTM:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @itsbilal)

@nvb nvb force-pushed the nvanbenschoten/deleteRangeWTO branch 2 times, most recently from f6ce8b7 to a6dfb8f on November 13, 2020 23:36
@nvb
Contributor Author

nvb commented Nov 17, 2020

TFTR!

@itsbilal do you have any interest in taking a pass over the pkg/storage changes in this PR? The changes to MVCCDeleteRange are somewhat large, even though only a few lines of code were touched.

nvb added 3 commits November 19, 2020 15:13
The purpose of the "fail on more recent" scanning mode is to detect
cases where a subsequent write to a key being scanned would hit a
WriteTooOld error (see `if readTimestamp.LessEq(metaTimestamp)` in
`mvccPutInternal`). This is useful for locking scans and soon will be
used for MVCCDeleteRange.

The mode was working correctly for values above the read timestamp,
but not for values at the same time as the read timestamp. This commit
fixes this, ensuring that values at the same timestamp result in a
WriteTooOld error.

I think an argument could be made that "more recent" is no longer
quite accurate, but I don't have a better suggestion for the name.
Maybe "fail on equal or more recent", but that's starting to become
pretty verbose.
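The fix boils down to matching the inclusive comparison in the `if readTimestamp.LessEq(metaTimestamp)` check quoted above. A sketch with plain integer timestamps (the real comparison is on hlc timestamps; this is just the comparison semantics):

```go
package main

import "fmt"

// conflicts mirrors the mvccPutInternal check: a writer at readTS conflicts
// with an existing version at metaTS when readTS <= metaTS, i.e. LessEq,
// not a strict Less. The fixed scanning mode must use the same inclusive
// comparison so that a value at exactly the read timestamp also triggers
// WriteTooOld.
func conflicts(readTS, metaTS int64) bool {
	return readTS <= metaTS // readTimestamp.LessEq(metaTimestamp)
}

func main() {
	fmt.Println(conflicts(10, 11)) // more recent: conflict
	fmt.Println(conflicts(10, 10)) // equal: also a conflict (the fixed case)
	fmt.Println(conflicts(10, 9))  // older: no conflict
}
```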
@nvb nvb force-pushed the nvanbenschoten/deleteRangeWTO branch from a6dfb8f to 24f8c29 on November 19, 2020 20:13
@nvb
Contributor Author

nvb commented Nov 19, 2020

bors r+

@craig
Contributor

craig bot commented Nov 19, 2020

Build succeeded:

@craig craig bot merged commit cec4342 into cockroachdb:master Nov 19, 2020
@nvb nvb deleted the nvanbenschoten/deleteRangeWTO branch November 23, 2020 02:49
Successfully merging this pull request may close these issues.

Bug caused by delete operations within a transaction
