
storage/engine: return WriteIntentError for intents in uncertainty intervals#40600

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/uncertainIntent
Sep 9, 2019

Conversation

@nvb
Contributor

@nvb nvb commented Sep 9, 2019

This commit fixes the most common failure case in all of the following Jepsen failures. I'm not closing them here though, because they also should be failing due to #36431.
Would fix #37394.
Would fix #37545.
Would fix #37930.
Would fix #37932.
Would fix #37956.
Would fix #38126.
Would fix #38763.

Before this fix, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads because an unresolved intent doesn't indicate that its transaction hasn’t been committed and isn’t a causal ancestor of the scan. This was causing the jepsen/multi-register tests to fail, which I had previously incorrectly attributed entirely to #36431.

This commit fixes this by returning WriteIntentErrors for intents when they are above the read timestamp of a scan but below the max timestamp of a scan. This could have also been fixed by returning ReadWithinUncertaintyIntervalErrors in this situation. Both would eventually have the same effect, but it seems preferable to kick off concurrency control immediately in this situation and only fall back to uncertainty handling for committed values. If the intent ends up being aborted, this could allow the read to avoid moving its timestamp.
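For illustration, the decision this change makes during MVCC iteration can be sketched as follows. This is a minimal sketch, not CockroachDB's actual API: `intentOutcome` and the simplified `Timestamp` type are hypothetical stand-ins for the real logic in `pkg/storage/engine`.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }

// intentOutcome sketches how a scan treats an intent after this fix.
// readTS is the scan's read timestamp and maxTS is its max (uncertainty)
// timestamp, with readTS <= maxTS.
func intentOutcome(intentTS, readTS, maxTS Timestamp) string {
	switch {
	case !readTS.Less(intentTS):
		// Intent at or below the read timestamp: an ordinary
		// write-read conflict.
		return "WriteIntentError"
	case !maxTS.Less(intentTS):
		// Intent above the read timestamp but inside the uncertainty
		// interval. Before this fix the intent was ignored (risking a
		// stale read); after it, the scan returns a WriteIntentError
		// here as well.
		return "WriteIntentError (uncertainty interval)"
	default:
		// Intent above the uncertainty interval: not visible.
		return "ignore"
	}
}

func main() {
	readTS, maxTS := Timestamp{10}, Timestamp{20}
	fmt.Println(intentOutcome(Timestamp{5}, readTS, maxTS))
	fmt.Println(intentOutcome(Timestamp{15}, readTS, maxTS))
	fmt.Println(intentOutcome(Timestamp{25}, readTS, maxTS))
}
```

Note the boundary behavior: an intent exactly at `maxTS` still falls inside the uncertainty interval and conflicts.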

This commit will need to be backported all the way back to v2.0.

Release note (bug fix): Consider intents in a read's uncertainty interval to be uncertain just as if they were committed values. This removes the potential for stale reads when a causally dependent transaction runs into the not-yet resolved intents from a causal ancestor.

@nvb nvb requested a review from petermattis September 9, 2019 15:49
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb force-pushed the nvanbenschoten/uncertainIntent branch from af351fa to 8f05723 Compare September 9, 2019 16:32
@nvb
Contributor Author

nvb commented Sep 9, 2019

[Screenshot: Screen Shot 2019-09-09 at 12 32 43 PM]

🎉

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)


c-deps/libroach/mvcc.h, line 267 at r1 (raw file):

    // visible to it at or below the txn's read timestamp. However, if the txn's
    // read timestamp is less than the max timestamp seen by the txn then intents
    // are visible to the txn all the way up to its max timestamp.

Can this be restated as: "Intents for other transactions are visible at max(txn.max_timestamp, read_timestamp)"? And indicate that txn.max_timestamp includes the uncertainty interval. I guess I'm finding some of the current wording confusing.


pkg/storage/engine/mvcc.go, line 885 at r1 (raw file):

		maxVisibleTimestamp := timestamp
		if checkUncertainty {
			maxVisibleTimestamp.Forward(txn.MaxTimestamp)

You could just assign maxVisibleTimestamp = txn.MaxTimestamp, right?
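The suggested simplification works because a transaction's read timestamp never exceeds its MaxTimestamp, so forwarding the read timestamp to MaxTimestamp always yields MaxTimestamp. A sketch, with a simplified `Timestamp` type mimicking `hlc.Timestamp.Forward` (advance to the later of the two values):

```go
package main

import "fmt"

// Timestamp mimics hlc.Timestamp's Forward semantics in simplified form.
type Timestamp struct{ WallTime int64 }

// Forward advances t to o if o is later; it never moves t backwards.
func (t *Timestamp) Forward(o Timestamp) {
	if t.WallTime < o.WallTime {
		t.WallTime = o.WallTime
	}
}

func main() {
	// Since readTS <= maxTS always holds for a transaction,
	// readTS.Forward(maxTS) and a direct assignment agree.
	readTS, maxTS := Timestamp{10}, Timestamp{20}
	v := readTS
	v.Forward(maxTS)
	fmt.Println(v.WallTime == maxTS.WallTime)
}
```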


pkg/storage/engine/mvcc_test.go, line 823 at r1 (raw file):

			// Case 1: One value in the past, one value in the future of read
			// and ahead of MaxTimestamp of read. Neither should interfere.

These extra test cases and diagrams are great. Only big comment is that you're always testing distinct timestamps. What happens when max-timestamp == val-timestamp?

@nvb nvb force-pushed the nvanbenschoten/uncertainIntent branch from 8f05723 to d20419d Compare September 9, 2019 17:31
Contributor Author

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)


c-deps/libroach/mvcc.h, line 267 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can this be restated as: "Intents for other transactions are visible at max(txn.max_timestamp, read_timestamp)"? And indicate that txn.max_timestamp includes the uncertainty interval. I guess I'm finding some of the current wording confusing.

Done.


pkg/storage/engine/mvcc.go, line 885 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You could just assign maxVisibleTimestamp = txn.MaxTimestamp, right?

Done.


pkg/storage/engine/mvcc_test.go, line 823 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

These extra test cases and diagrams are great. Only big comment is that you're always testing distinct timestamps. What happens when max-timestamp == val-timestamp?

Good point. Added test cases for this.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/engine/mvcc_test.go, line 823 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Good point. Added test cases for this.

Great. These test cases are fugly, but let's leave that to a future cleanup.

@nvb
Contributor Author

nvb commented Sep 9, 2019

bors r=petermattis

craig bot pushed a commit that referenced this pull request Sep 9, 2019
40600: storage/engine: return WriteIntentError for intents in uncertainty intervals r=petermattis a=nvanbenschoten


40603: make: pass TESTFLAGS to roachprod-stress, not GOFLAGS r=petermattis a=nvanbenschoten

Passing the testflags through the GOFLAGS env var was causing the following
error:
```
stringer -output=pkg/sql/opt/rule_name_string.go -type=RuleName pkg/sql/opt/rule_name.go pkg/sql/opt/rule_name.og.go
stringer: go [list -f {{context.GOARCH}} {{context.Compiler}} -tags= -- unsafe]: exit status 1: go: parsing $GOFLAGS: non-flag "storage.test"
Makefile:1496: recipe for target 'pkg/sql/opt/rule_name_string.go' failed
make: *** [pkg/sql/opt/rule_name_string.go] Error 1
make: *** Waiting for unfinished jobs....
```

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Contributor

craig bot commented Sep 9, 2019

Build succeeded

@craig craig bot merged commit d20419d into cockroachdb:master Sep 9, 2019
@nvb nvb deleted the nvanbenschoten/uncertainIntent branch September 9, 2019 19:21
nvb added a commit to nvb/cockroach that referenced this pull request Sep 11, 2019
Now that cockroachdb#40600 is merged, this seems to be pretty stable. I still
expect it to fail due to cockroachdb#36431, but I've never managed to repro that.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 11, 2019
40686: roachtest: stop skipping jepsen/multi-register r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvb nvb mentioned this pull request Sep 17, 2019
53 tasks
nvb added a commit to nvb/cockroach that referenced this pull request Nov 25, 2020
This commit is a partial revert of cockroachdb#40600 and cockroachdb#46830. It solves the same
problem that those PRs were solving, but in a different way.

Those two PRs were handling the case where a reading transaction
observes an intent in its uncertainty interval. Before those fixes, we
were not considering intents in a scan's uncertainty interval to be
uncertain. This had the potential to cause stale reads because an
unresolved intent doesn't indicate that its transaction hasn’t been
committed and is not a causal ancestor of the scan.

The approach the first PR took was to throw a WriteIntentError on
intents in a scan's uncertainty interval. Effectively, it made scans
consider intents in their uncertainty interval to be write-read
conflicts. This had the benefit that the scan would push the intent and
eventually resolve the conflict, either by aborting the intent, pushing
it out of the read's uncertainty interval, or waiting for it to commit.
In this last case (which is by far the most common), the scan would
then throw an uncertainty error, because it now had a committed value in
its uncertainty interval.

The second PR introduced some awkward code to deal with the fallout from
this decision. Once we started trying to keep the lock table in sync
with the MVCC state, we needed to use latches to ensure proper
synchronization between any operations that could conflict because such
conflicts would influence the state of the lock table. This meant that
we needed to start holding latches across a read's uncertainty interval,
because intent writes in this interval were considered write-read
conflicts. This led to some moderately gross code and always felt a little
wrong.

Now that we are complicating the logic around uncertainty intervals even
further, this becomes even more of a burden. This motivates a reworking
of these interactions. This commit replaces the logic that considers
intents in a transaction's uncertainty interval to be write-read
conflicts with logic that considers such intents to be... uncertain.
Doing so means that a transaction will immediately refresh/restart
above the uncertain timestamp and will only then begin conflicting
with the intent.

This has a number of nice benefits:
1. it keeps all considerations of read uncertainty intervals down in
   MVCC iteration logic. The lock table no longer needs to be aware of it.
   This is a big win now and an even bigger win once we introduce synthetic
   timestamps.
2. read latches no longer need to be acquired up to the edge of a read's
   uncertainty interval, they only need to be acquired up to its read
   timestamp. Similarly, the lock table only needs to consider true
   write-read conflicts.
3. readers that are almost certainly bound to hit an uncertainty error
   and need to restart will now do so sooner. In rare cases, this may
   result in wasted work. In the vast majority of cases, this will allow
   the reader to be more responsive to the commit of its conflict.
4. uncertainty errors will only be thrown for locks in the uncertainty
   interval of a read that are protecting a provisional write (intents).
   Before, any form of a lock in a read's uncertainty interval would be
   considered a write-read conflict, which was pessimistic and not needed
   for correctness.

   In a future with a fully segregated lock table, the change in semantic
   meaning here becomes even more clear. Instead of detecting the lock
   associated with an intent in a read's uncertainty interval and declaring
   a write-read conflict, the read will instead pass through the lock table
   untouched and will detect the provisional value associated with an intent
   and declare uncertainty. This seems closer to what we're actually trying
   to say about these interactions.
5. partially unblocks a change like cockroachdb#52610. Now that latches and lock
   consider the same time ranges for write-read conflicts, merging the
   latch spans and lock spans will be easier.

Before making this change, I intend to validate the hypothesis that it
will not affect performance (or may even slightly improve performance)
by running it on the YCSB benchmark suite.
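The before/after behavior described in this message can be sketched side by side. This is an illustrative sketch only: `outcome` and the simplified `Timestamp` type are hypothetical names, not the actual kvserver API.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) LessEq(o Timestamp) bool { return t.WallTime <= o.WallTime }

// outcome sketches how a scan handles an intent. With revised=false
// (the approach this PR originally took), an intent in the uncertainty
// interval is a write-read conflict; with revised=true (the later rework),
// it is treated as uncertain, so the reader refreshes/restarts first.
func outcome(intentTS, readTS, maxTS Timestamp, revised bool) string {
	switch {
	case intentTS.LessEq(readTS):
		// At or below the read timestamp: a true write-read conflict
		// in both schemes.
		return "WriteIntentError"
	case intentTS.LessEq(maxTS):
		if revised {
			return "ReadWithinUncertaintyIntervalError"
		}
		return "WriteIntentError"
	default:
		return "ignore"
	}
}

func main() {
	readTS, maxTS := Timestamp{10}, Timestamp{20}
	fmt.Println(outcome(Timestamp{15}, readTS, maxTS, false))
	fmt.Println(outcome(Timestamp{15}, readTS, maxTS, true))
}
```

The only case that changes is an intent strictly inside the uncertainty interval; conflicts at or below the read timestamp and values above the interval behave identically in both schemes.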
nvb added a commit to nvb/cockroach that referenced this pull request Jan 6, 2021
nvb added a commit to nvb/cockroach that referenced this pull request Jan 15, 2021
craig bot pushed a commit that referenced this pull request Jan 16, 2021
57136: kv: consider intent in uncertainty interval to be uncertain, not a write-read conflict r=sumeerbhola a=nvanbenschoten
