storage: propagate errors from contentionQueue, catch stalls in roachtest#37199
storage: propagate errors from contentionQueue, catch stalls in roachtest#37199craig[bot] merged 7 commits intocockroachdb:masterfrom
Conversation
|
NOTE: I saw this get stuck once after bumping the |
ajwerner
left a comment
There was a problem hiding this comment.
Nice find! I know you're not planning on merging this yet but
Reviewed 3 of 3 files at r1, 2 of 2 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained
|
I think I tracked down the other stall here. It has to do with a rare interleaving of requests in the The first two transactions are stuck in the So what's going wrong? Why isn't this resolving itself?If
|
|
Great writeup! Before I dive in, I'm curious if any of the stuff @darinpp added to the txnWaitQ helped (#35921) or would've helped figure out where "something" was stuck. (Or, if you repro this scenario again, I'd be grateful if you gave it a try and reported back, in particular the "have been waiting" log message). I just took a look at the contentionQ code to remind myself how it works. Oof. I'd be surprised if this were the last bug.
We query the pushee periodically, but I just realized that this is based on the TxnExpiration, which is artificially high in this test. Just pointing this out for anyone else wondering. I like option 1. We can optimize for the common case in which Timestamp == Epoch0Timestamp by adding the convention that the epoch 0 timestamp defaults to the timestamp to avoid writing more than we do today (if we also decide to make the field nullable). |
The "have been waiting" log message in I also think the reason why the "have been waiting" log message was more useful than the metrics is because the log message actually includes transaction ids. Without that, it's not really possible to come to any conclusions about the state of a deadlock. With that in mind, a "have been waiting" log message in
👍 I was leaning that way myself. It will be really nice to remove this uncertainty. I don't think the migration will be too difficult either, as the new field will need to be optional no matter what. |
👍 that's about as good as I hoped it was going to get. The metrics can't do the debugging for you but looks like they gave you a good starting point to logspy in. |
|
I like option 1 too. Option 2 - writing more ABORTED txn records - seems indeed that we'd be going backwards. I hope we get to where we never write any ABORTED records. |
|
Friendly ping, was just looking through our roachtests and this seems like a nice wart to iron out. |
|
Yep, this is on my plate for after we stabilize parallel commits. The two additional items that I need to polish off are:
|
… observed From cockroachdb#37199 (comment): > There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation! Each of these transactions would consider the other aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is still sitting in the TxnWaitQueue, and it would be stuck there until the pushee expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest ever decides that a transaction is aborted. Release note: None
|
I've updated this PR to address the first additional item that we need to fix. The second one is larger, so I'm going to put that into a different PR. I've downgraded this to no longer say "Fixes" because it doesn't include the full fix. PTAL @ajwerner. |
ajwerner
left a comment
There was a problem hiding this comment.
Thanks for the great writeup. My mental model on the TxnWaitQueue is still fuzzier than I'd like but this change is straightforward enough. Are there any downsides to notifying all of the pushees upon a single abort?
Reviewed 3 of 5 files at r3, 1 of 2 files at r4, 2 of 2 files at r5, 2 of 3 files at r6, 10 of 10 files at r8.
Reviewable status:complete! 1 of 0 LGTMs obtained
Informs cockroachdb#36089. Before this commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the contentionQueue. Release note: None
…hold Informs cockroachdb#36089. This commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it and the test will fail to achieve its minimum QPS requirement. Release note: None
…sult Drive by fix. Release note: None
… observed From cockroachdb#37199 (comment): > There actually appear to be two ways for us to get in this deadlock scenario. The first is due to a timestamp cache rotation. Imagine that 8703f023 and ee447be3 queried each other's transaction records after f081497c queried their transaction records. Between this time, nothing was written. However, there was a timestamp cache rotation! Each of these transactions would consider the other aborted because of ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY. If the abort was noticed during a QueryTxnRequest, no one would notify f081497c, who is still sitting in the TxnWaitQueue, and it would be stuck there until the pushee expired. We can fix this case by informing the TxnWaitQueue if a QueryTxnRequest ever decides that a transaction is aborted. Release note: None
The test was already called queue_test.go and package is called txnwait, so the prefix on the name is unnecessary. Only code movement. Release note: None
We had three different defaultHeartbeatInterval constants and it was unclear which values corresponded to which heartbeats without reading their documentation. This commit renames these constants to avoid confusion. Release note: None
Until cockroachdb#36089 is fixed, we know this will fail regularly. This is especially true now that we extended TxnLivenessThreshold. Release note: None
I don't think so. Once the pushee is aborted, all of the pushers would like to know ASAP. Thanks for the review. bors r+ |
37199: storage: propagate errors from contentionQueue, catch stalls in roachtest r=nvanbenschoten a=nvanbenschoten Informs #36089. The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089. Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the `contentionQueue`. The second commit bumps the TxnLivenessThreshold for clusters running `kv/contention/nodes=4` to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement. The fourth commit instructs pushers in the `txnwait.Queue` to inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations. 38397: sql: deflake TestLogic//crdb_internal/max_retry_counter r=knz a=knz Fixes #38062. Release note: None 38654: exec: Handle NULLS in TopK sorter r=rohany a=rohany This commit fixes NULLs in the TopK sorter by avoiding use of the vec copy method, which has a bug. Instead, we add a set method to the vec comparator, and use the templatized comparator to perform the sets that the TopK sorter needs. To facilitate this, we add an UnsetNull method to the Nulls object. However, use of this method results in HasNull() maybe returning true even if the vector doesn't have nulls. This behavior already occurs when selection vectors are used. Based on discussions with @solongordon and @asubiotto, this behavior is OK, and future PR's will attempt to make this behavior better, and address the bugs within the Vec Copy method. 38725: cli/dump: more clearly inform the user upon tables with no visible columns r=knz a=knz Informs #37768. Informs #28948. This is coming up quite often on support, lately again on gitter and forum https://forum.cockroachlabs.com/t/error-while-dumping-core-backup/2901/3. This PR aims to lessen the burden on support and propose a clear "next action" for the user. Before: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: pq: at or near "from": syntax error Failed running "dump" ``` After: ``` kena@kenax ~/cockroach % ./cockroach dump --insecure defaultdb CREATE TABLE t (, FAMILY "primary" (rowid) ); Error: table "defaultdb.public.t" has no visible columns HINT: To proceed with the dump, either omit this table from the list of tables to dump, drop the table, or add some visible columns. -- See: #37768 Failed running "dump" ``` Release note (cli change): `cockroach dump` will now more clearly refer to issue #37768 when it encounters a table with no visible columns, which (currently) cannot be dumped successfully. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com> Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Build succeeded |
Fixes cockroachdb#36089. Informs cockroachdb#37199. This commit addresses the second concern from cockroachdb#37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. Release note: None
Fixes cockroachdb#36089. Informs cockroachdb#37199. This commit addresses the second concern from cockroachdb#37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. Release note: None
Fixes cockroachdb#36089. Informs cockroachdb#37199. This commit addresses the second concern from cockroachdb#37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. Release note: None
Fixes cockroachdb#36089. Informs cockroachdb#37199. This commit addresses the second concern from cockroachdb#37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. Release note: None
38782: storage: persist minimum transaction timestamps in intents r=nvanbenschoten a=nvanbenschoten Fixes #36089. Informs #37199. This commit addresses the second concern from #37199 (comment) by implementing its suggestion #1. It augments the TxnMeta struct to begin storing the transaction's minimum timestamp. This allows pushers to have perfect accuracy into whether an intent is part of a transaction that can eventually be committed or whether it has been aborted by any other pusher and uncommittable. This allows us to get rid of the false positive cases where a pusher incorrectly detects that a transaction is still active and begins waiting on it. In this worst case, this could lead to transactions waiting for the entire transaction expiration for a contending txn. @tbg I'm assigning you because you reviewed most of the lazy transaction record stuff (especially #33523), but let me know if you'd like me to find a different reviewer. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Informs #36089.
The PR is split into a series of commits. The first fixes part of a bug that was causing #36089 to fail (thanks to #36748) and the second improves the test to have a more obvious failure condition for this class of bug in the future. The third, fifth, and sixth clean up code. Finally, the fourth fixes another bug that could cause issues with #36089.
Before the first commit, requests could get stuck repeatedly attempting to push a transaction only to repeatedly find that they themselves were already aborted. The error would not propagate up to the transaction coordinator and the request would get stuck. This commit fixes this behavior by correctly propagating errors observed by the
contentionQueue.The second commit bumps the TxnLivenessThreshold for clusters running
kv/contention/nodes=4to 10 minutes. This is sufficiently large such that if at any point a transaction is abandoned then all other transactions will begin waiting for it, throughput will drop to 0 for 10 straight minutes, and the test will fail to achieve its minimum QPS requirement.The fourth commit instructs pushers in the
txnwait.Queueto inform all other pushers that are waiting for the same transaction when it observes an ABORTED transaction. I never saw this cause issues with #36089, but it seems very possible that it could given frequent tscache rotations.