kvserver: IsCompleteTransaction might panic with certain batch sequences#80832
Conversation
8f76eaa to
74d27e4
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)
-- commits line 5 at r1:
Do we have a theory about why a sequence number might be missing or duplicated for operations in a batch. Request sequence numbers are assigned by the txnSeqNumAllocator. Are there flaws in its sequence number allocation logic?
and adds more information in case the panic still occurs.
It doesn't look like we've added any new information to the panic. This seems like a good idea to help us understand the failure mode better, if we ever see it again.
|
Also, even though things are now tracked in Jira, we'll still want to cross-reference PRs like this with their corresponding Github issue (#79188). |
2187f81 to
02c0d2c
Compare
nvb
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @shralex)
pkg/roachpb/batch.go line 379 at r2 (raw file):
} } panic(fmt.Errorf("unreachable. Batch requests: %s", ba.Requests))
nit: fmt.Sprintf should be sufficient here.
pkg/roachpb/batch.go line 379 at r2 (raw file):
} } panic(fmt.Errorf("unreachable. Batch requests: %s", ba.Requests))
Does the string format of []RequestUnion print the sequence number of each request? Let's make sure we put enough information in here so that we can have a good chance of diagnosing the issue next time this happens.
02c0d2c to
0b61ab3
Compare
shralex
left a comment
There was a problem hiding this comment.
Done
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we have a theory about why a sequence number might be missing or duplicated for operations in a batch. Request sequence numbers are assigned by the
txnSeqNumAllocator. Are there flaws in its sequence number allocation logic?
Added all we know to the PR description
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
and adds more information in case the panic still occurs.
It doesn't look like we've added any new information to the panic. This seems like a good idea to help us understand the failure mode better, if we ever see it again.
Added
pkg/roachpb/batch.go line 379 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
fmt.Sprintfshould be sufficient here.
Done
pkg/roachpb/batch.go line 379 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Does the string format of
[]RequestUnionprint the sequence number of each request? Let's make sure we put enough information in here so that we can have a good chance of diagnosing the issue next time this happens.
You were right. Changed this, now if the test does "{[]Request{withSeq(put, 1), withSeq(etC, -1)}, true}," it will print:
panic: unreachable. Batch requests: put:<header:<sequence:1 > value:<timestamp:<> > > ,end_txn:<header:<sequence:-1 > commit:true >
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @shralex)
pkg/roachpb/batch.go line 379 at r3 (raw file):
} } // Unreachable
nit: Missing punctuation on this comment. // Unreachable.
pkg/roachpb/batch.go line 383 at r3 (raw file):
for i, args := range ba.Requests { sb.WriteString(args.String()) if i < len(ba.Requests) - 1 {
nit: moving this up and making the check if i > 0 { may be slightly easier to reason about.
In IsCompleteTransaction it seems theoretically possible that a panic could happen if some sequence numbers are missing (0) or duplicated for operations in the batch. This PR returns false in such case, and adds more information in case the panic still occurs. Release note: None Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-14627
0b61ab3 to
aab78b7
Compare
|
bors r+ |
|
@nvanbenschoten thanks for the reviews! |
80832: kvserver: IsCompleteTransaction might panic with certain batch sequences r=shralex a=shralex It's unclear how this panic happened. One possibility is that EntTxn had a negative sequence number. Another hypothesis is that ba.Requests was concurrently mutated due to a data race. This happened once, so for now adding more info to the panic. Release note: None Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-14627 Co-authored-by: shralex <shralex@gmail.com>
|
Build failed: |
|
bors r+ |
|
Build failed (retrying...): |
|
Build failed (retrying...): |
|
Build succeeded: |
It's unclear how this panic happened. One possibility is that EntTxn had a negative sequence number. Another hypothesis is that ba.Requests was concurrently mutated due to a data race. This happened once, so for now adding more info to the panic.
Release note: None
Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-14627