storage: propagate errors from contentionQueue, catch stalls in roachtest by nvb · Pull Request #37199 · cockroachdb/cockroach

nvb · 2019-04-30T02:20:44Z

Informs #36089.

The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089.

Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the contentionQueue.

The second commit bumps the TxnLivenessThreshold for clusters running kv/contention/nodes=4 to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement.

The fourth commit instructs pushers in the txnwait.Queue to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations.

cockroach-teamcity · 2019-04-30T02:20:51Z

This change is

nvb · 2019-04-30T02:22:43Z

NOTE: I saw this get stuck once after bumping the TxnLivenessThreshold in the roachtest. That indicates that there is another stall lingering. I'm not planning on merging this PR until I gain a better understanding of what's going on there.

ajwerner

Nice find! I know you're not planning on merging this yet but

Reviewed 3 of 3 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained

nvb · 2019-05-01T13:32:53Z

I think I tracked down the other stall here. It has to do with a rare interleaving of requests in the contentionQueue. I found this all by looking at the /debug/requests page after bumping the page's maxActiveTraces to a more reasonable 300. The expanded view of the active traces revealed the following deadlock:

txn 8703f023
    pushing ee447be3
    behind f081497c in the queue
    stuck after noticing pushee aborted
        waiting for f081497c to notice
    aborted itself

txn ee447be3
    pushing 8703f023
    behind f081497c in the queue
    stuck after noticing pushee aborted
        waiting for f081497c to notice
    aborted itself

txn f081497c
    pushing both 8703f023 and ee447be3
    front of queue in both
    stuck pushing both txns

all other txns: stuck on f081497c

The first two transactions are stuck in the contentionQueue after observing each other as aborted. The don't realize that they themselves are aborted because transactions in the contentionQueue don't query their own disposition. They also don't immediately exit the queue when they determine that the front of the queue is aborted. Instead, they wait on their predecessor in the queue to notice to avoid a thundering herd.

So what's going wrong? Why isn't this resolving itself?

If `8703f023` and `ee447be3` are aborted, why isn't `f081497c` noticing?

It is stuck in the TxnWaitQueue, waiting for updates to either of these transactions. Specifically, it is waiting for one of two things:

updates to its pushees' transaction records
its pushees' expirations, which are set to 5m!

Why aren't there any updates to `8703f023` and `ee447be3`'s transaction records

They are stuck in the contentionQueue and haven't noticed that they are aborted. This hasn't been propagated back to their transaction coordinator so no one has rolled back their transaction records.

Why isn't deadlock detection working?

Because both 8703f023 and ee447be3 are already aborted, so the two transactions aren't sitting in each other's TxnWaitQueues.

Why do `8703f023` and `ee447be3` think each other are aborted but `f081497c` is still in the `TxnWaitQueue`?

There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation! Each of these transactions would consider the other aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is still sitting in the TxnWaitQueue, and it would be stuck there until the pushee expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest ever decides that a transaction is aborted.

The second way to get into this deadlock scenario is more tricky, but it shows up more in this workload. A contention point when designing #33523 was deciding how pushes of transactions originating from intents with txnMeta.timestamp >> txn.OrigTimestamp. The problem is that when a transaction is aborted, the write timestamp cache is only bumped to its current timestamp. However, that transaction can keep writing if it isn't immediately informed of the abort, potentially at higher timestamps. So if a contending transaction observes one of these intents with a higher timestamp and pushes, it may not realize that the transaction was already aborted because it compares its timestamp against the write timestamp cache and find that its timestamp is above the cache. This is discussed in the following three places:

cockroach/pkg/storage/batcheval/transaction.go

Lines 134 to 141 in 5c1497a

    
           // Determine whether the transaction record could ever actually be written 
        
           // in the future. We provide the TxnMeta's timestamp (which we read from an 
        
           // intent) as the upper bound on the transaction's minimum timestamp. This 
        
           // may be greater than the transaction's actually original epoch-zero 
        
           // timestamp, in which case we're subjecting ourselves to false positives 
        
           // where we don't discover that a transaction is uncommittable, but never 
        
           // false negatives where we think that a transaction is uncommittable even 
        
           // when it's not and could later complete.

and

cockroach/pkg/storage/batcheval/cmd_push_txn.go

Lines 161 to 167 in 5c1497a

    
           // Performing this detection could have false positives where we determine 
        
           // that a record could still be written and conclude that we're in the first 
        
           // case. However, it cannot have false negatives where we determine that a 
        
           // record can not be written and conclude that we're in the second or third 
        
           // case. This is important, because it means that we may end up failing to 
        
           // push a finalized transaction but will never determine that a transaction 
        
           // is finalized when it still could end up committing.

and

cockroach/pkg/storage/replica_tscache.go

Lines 286 to 292 in 5c1497a

    
           // for the provided transaction information. Callers should provide an upper 
        
           // bound on the transaction's minimum timestamp across all epochs (typically 
        
           // the original timestamp of its first epoch). If this is not exact then the 
        
           // method may return false positives (i.e. it determines that the record could 
        
           // be created when it actually couldn't) but will never return false negatives 
        
           // (i.e. it will never determine that the record could not be created when it 
        
           // actually could).

It turns out that this scenario can cause issues when combined with the dependencies enforced by the contentionQueue. If the first transaction in the queue falls into this trap of falsely assuming a pushee is not aborted and later transactions don't, the later transactions can get stuck waiting on the first transaction to notice that the pushee has already been aborted. If the pushee itself is stuck in a second contentionQueue on the same situation then it will never rollback its transaction record and inform the first transaction in the queue that it is indeed aborted. So everything gets stuck. It turns out that it is not ok for some pushees to consider a transaction to be pending while other pushees consider a transaction to be aborted.

The solution to this isn't immediately clear. I see two options at the moment:

we begin storing the EpochZeroOrigTimestamp in intents. This allows all pushers to have perfect accuracy about whether or not a transaction has been aborted by any other pusher. In other words, it allows us to get rid of this nastiness. Interestingly, it would come for free with the proposal to change transaction IDs to be a combination of the epoch-zero-orig-timestamp and a gateway ID.
we go back to always leaving an ABORTED transaction record before a pushee can consider a transaction to be aborted. This feels safer (although there is still come complication around QueryTxn requests that find a missing txn record), but is also a large step backward away from a proposal like this. It also returns to letting pushers create transaction records as long as they are in the ABORTED state.

@tbg @andreimatei I'm curious whether either of you two has a strong preference here.

On the positive side, with QueryTxn requests updating the TxnWaitQueue when they observe a transaction to be aborted and with the first of these options implemented, the test seems to pass without any stalls (after about 6 hrs of running).

tbg · 2019-05-01T14:26:55Z

Great writeup! Before I dive in, I'm curious if any of the stuff @darinpp added to the txnWaitQ helped (#35921) or would've helped figure out where "something" was stuck. (Or, if you repro this scenario again, I'd be grateful if you gave it a try and reported back, in particular the "have been waiting" log message).

I just took a look at the contentionQ code to remind myself how it works. Oof. I'd be surprised if this were the last bug.

There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation!

We query the pushee periodically, but I just realized that this is based on the TxnExpiration, which is artificially high in this test. Just pointing this out for anyone else wondering.

I like option 1. We can optimize for the common case in which Timestamp == Epoch0Timestamp by adding the convention that the epoch 0 timestamp defaults to the timestamp to avoid writing more than we do today (if we also decide to make the field nullable).

nvb · 2019-05-01T16:03:16Z

I'm curious if any of the stuff @darinpp added to the txnWaitQ helped (#35921) or would've helped figure out where "something" was stuck. (Or, if you repro this scenario again, I'd be grateful if you gave it a try and reported back, in particular the "have been waiting" log message).

The "have been waiting" log message in txnqueue.go was very helpful and helped point me at the offending transactions once I knew to look for them. The metrics themselves would have pointed me at the blockage, but I'm not sure whether they would have helped in the actual debugging. The problem with these kinds of issues is that most of the questions a debugger would like to ask are "what sequence of events led to this situation?", not "what situation are we in now that we're stalled?".

I also think the reason why the "have been waiting" log message was more useful than the metrics is because the log message actually includes transaction ids. Without that, it's not really possible to come to any conclusions about the state of a deadlock. With that in mind, a "have been waiting" log message in contentionQueue.go would be helpful.

I like option 1. We can optimize for the common case in which Timestamp == Epoch0Timestamp by adding the convention that the epoch 0 timestamp defaults to the timestamp to avoid writing more than we do today (if we also decide to make the field nullable).

👍 I was leaning that way myself. It will be really nice to remove this uncertainty. I don't think the migration will be too difficult either, as the new field will need to be optional no matter what.

tbg · 2019-05-01T16:11:22Z

The "have been waiting" log message in txnqueue.go was very helpful and helped point me at the offending transactions once I knew to look for them. The metrics themselves would have pointed me at the blockage, but I'm not sure whether they would have helped in the actual debugging. The problem with these kinds of issues is that most of the questions a debugger would like to ask are "what sequence of events led to this situation?", not "what situation are we in now that we're stalled?".

👍 that's about as good as I hoped it was going to get. The metrics can't do the debugging for you but looks like they gave you a good starting point to logspy in.

andreimatei · 2019-05-01T17:32:17Z

I like option 1 too. Option 2 - writing more ABORTED txn records - seems indeed that we'd be going backwards. I hope we get to where we never write any ABORTED records.

tbg · 2019-06-04T13:33:32Z

Friendly ping, was just looking through our roachtests and this seems like a nice wart to iron out.

nvb · 2019-06-04T14:03:24Z

Yep, this is on my plate for after we stabilize parallel commits. The two additional items that I need to polish off are:

QueryTxn requests update the TxnWaitQueue when they observe a transaction to be aborted
we begin storing the EpochZeroOrigTimestamp in intents. This allows all pushers to have perfect accuracy about whether or not a transaction has been aborted by any other pusher

… observed From cockroachdb#37199 (comment): > There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation! Each of these transactions would consider the other aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is still sitting in the TxnWaitQueue, and it would be stuck there until the pushee expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest ever decides that a transaction is aborted. Release note: None

nvb · 2019-07-03T18:56:38Z

I've updated this PR to address the first additional item that we need to fix. The second one is larger, so I'm going to put that into a different PR. I've downgraded this to no longer say "Fixes" because it doesn't include the full fix.

PTAL @ajwerner.

ajwerner

Thanks for the great writeup. My mental model on the TxnWaitQueue is still fuzzier than I'd like but this change is straightforward enough. Are there any downsides to notifying all of the pushees upon a single abort?

Reviewed 3 of 5 files at r3, 1 of 2 files at r4, 2 of 2 files at r5, 2 of 3 files at r6, 10 of 10 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained

Informs cockroachdb#36089. Before this commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the contentionQueue. Release note: None

…hold Informs cockroachdb#36089. This commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it and the test will fail to achieve its minimum QPS requirement. Release note: None

…sult Drive by fix. Release note: None

… observed From cockroachdb#37199 (comment): > There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation! Each of these transactions would consider the other aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is still sitting in the TxnWaitQueue, and it would be stuck there until the pushee expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest ever decides that a transaction is aborted. Release note: None

The test was already called queue_test.go and package is called txnwait, so the prefix on the name is unnecessary. Only code movement. Release note: None

We had three different defaultHeartbeatInterval constants and it was unclear which values corresponded to which heartbeats without reading their documentation. This commit renames these constants to avoid confusion. Release note: None

Until cockroachdb#36089 is fixed, we know this will fail regularly. This is especially true now that we extended TxnLivenessThreshold. Release note: None

nvb · 2019-07-08T15:52:30Z

Are there any downsides to notifying all of the pushees upon a single abort?

I don't think so. Once the pushee is aborted, all of the pushers would like to know ASAP.

Thanks for the review.

bors r+

@solongordon

37199: storage: propagate errors from contentionQueue, catch stalls in roachtest r=nvanbenschoten a=nvanbenschoten Informs #36089. The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089. Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the `contentionQueue`. The second commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement. The fourth commit instructs pushers in the `txnwait.Queue` to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations. 38397: sql: deflake TestLogic//crdb_internal/max_retry_counter r=knz a=knz Fixes #38062. Release note: None 38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany This commit fixes NULLs in the TopK sorter by avoiding use of the vec copy method, which has a bug. Instead, we add a set method to the vec comparator, and use the templatized comparator to perform the sets that the TopK sorter needs. To facilitate this, we add an UnsetNull method to the Nulls object. However, use of this method results in HasNull() maybe returning true even if the vector doesn't have nulls. This behavior already occurs when selection vectors are used. Based on discussions with @solongordon and @asubiotto, this behavior is OK, and future PR's will attempt to make this behavior better, and address the bugs within the Vec Copy method. 38725: cli/dump: more clearly inform the user upon tables with no visible columns r=knz a=knz Informs #37768. Informs #28948. This is coming up quite often on support, lately again on gitter and forum https://forum.cockroachlabs.com/t/error-while-dumping-core-backup/2901/3. This PR aims to lessen the burden on support and propose a clear "next action" for the user. Before: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: pq: at or near "from": syntax error Failed running "dump" ``` After: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: table "defaultdb.public.t" has no visible columns HINT: To proceed with the dump, either omit this table from the list of tables to dump, drop the table, or add some visible columns. -- See: #37768 Failed running "dump" ``` Release note (cli change): `cockroach dump` will now more clearly refer to issue #37768 when it encounters a table with no visible columns, which (currently) cannot be dumped successfully. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>

craig · 2019-07-08T16:25:37Z

Build succeeded

GitHub CI (Cockroach)

Fixes cockroachdb#36089. Informs cockroachdb#37199. This commit addresses the second concern from cockroachdb#37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. Release note: None

@tbg

38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten Fixes #36089. Informs #37199. This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. @tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

nvb requested review from a team and ajwerner April 30, 2019 02:20

tbg mentioned this pull request Apr 30, 2019

roachtest: TC timing out with test downloading debug.zip after acceptance/bank/cluster-recovery timeout #37200

Closed

ajwerner approved these changes Apr 30, 2019

View reviewed changes

nvb mentioned this pull request May 1, 2019

roachtest: tpcc/headroom/n4cpu16 failed #37163

Closed

nvb force-pushed the nvanbenschoten/cqErr branch from 93cdbb8 to b3bc9fd Compare July 3, 2019 18:38

nvb force-pushed the nvanbenschoten/cqErr branch from b3bc9fd to 541b3e3 Compare July 3, 2019 18:51

nvb force-pushed the nvanbenschoten/cqErr branch from 541b3e3 to 01294fe Compare July 3, 2019 18:55

nvb force-pushed the nvanbenschoten/cqErr branch from 01294fe to 1df7a72 Compare July 3, 2019 19:19

ajwerner approved these changes Jul 8, 2019

View reviewed changes

nvb added 7 commits July 8, 2019 11:48

storage: assert that MaybeWatchForMerge is false in handleLocalEvalRe…

960b97b

…sult Drive by fix. Release note: None

storage/txnwait: move txnqueue.go to queue.go

24dcc6c

The test was already called queue_test.go and package is called txnwait, so the prefix on the name is unnecessary. Only code movement. Release note: None

base: differentiate various "heartbeat intervals"

2312515

We had three different defaultHeartbeatInterval constants and it was unclear which values corresponded to which heartbeats without reading their documentation. This commit renames these constants to avoid confusion. Release note: None

roachtest: skip kv/contention/nodes=4

82aeef4

Until cockroachdb#36089 is fixed, we know this will fail regularly. This is especially true now that we extended TxnLivenessThreshold. Release note: None

nvb force-pushed the nvanbenschoten/cqErr branch from 1df7a72 to 82aeef4 Compare July 8, 2019 15:50

craig bot merged commit 82aeef4 into cockroachdb:master Jul 8, 2019

nvb deleted the nvanbenschoten/cqErr branch July 8, 2019 21:34

nvb mentioned this pull request Jul 9, 2019

storage: persist minimum transaction timestamps in intents #38782

Merged

Conversation

nvb commented Apr 30, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Apr 30, 2019

Uh oh!

nvb commented Apr 30, 2019

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

nvb commented May 1, 2019

So what's going wrong? Why isn't this resolving itself?

If 8703f023 and ee447be3 are aborted, why isn't f081497c noticing?

Why aren't there any updates to 8703f023 and ee447be3's transaction records

Why isn't deadlock detection working?

Why do 8703f023 and ee447be3 think each other are aborted but f081497c is still in the TxnWaitQueue?

Uh oh!

tbg commented May 1, 2019

Uh oh!

nvb commented May 1, 2019

Uh oh!

tbg commented May 1, 2019

Uh oh!

andreimatei commented May 1, 2019

Uh oh!

tbg commented Jun 4, 2019

Uh oh!

nvb commented Jun 4, 2019

Uh oh!

nvb commented Jul 3, 2019

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

nvb commented Jul 8, 2019

Uh oh!

craig bot commented Jul 8, 2019

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nvb commented Apr 30, 2019 •

edited

Loading

If `8703f023` and `ee447be3` are aborted, why isn't `f081497c` noticing?

Why aren't there any updates to `8703f023` and `ee447be3`'s transaction records

Why do `8703f023` and `ee447be3` think each other are aborted but `f081497c` is still in the `TxnWaitQueue`?