Skip to content

kvserver: IsCompleteTransaction might panic with certain batch sequences#80832

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
shralex:complete_transaction_panic
May 14, 2022
Merged

kvserver: IsCompleteTransaction might panic with certain batch sequences#80832
craig[bot] merged 1 commit intocockroachdb:masterfrom
shralex:complete_transaction_panic

Conversation

@shralex
Copy link
Copy Markdown
Contributor

@shralex shralex commented Apr 30, 2022

It's unclear how this panic happened. One possibility is that EntTxn had a negative sequence number. Another hypothesis is that ba.Requests was concurrently mutated due to a data race. This happened once, so for now adding more info to the panic.

Release note: None

Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-14627

@shralex shralex requested a review from nvb April 30, 2022 21:27
@shralex shralex requested a review from a team as a code owner April 30, 2022 21:27
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@shralex shralex changed the title kvserver: IsCompleteTransaction might panic with certain batch sequences DRAFT: kvserver: IsCompleteTransaction might panic with certain batch sequences May 1, 2022
@shralex shralex force-pushed the complete_transaction_panic branch 2 times, most recently from 8f76eaa to 74d27e4 Compare May 7, 2022 04:55
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)


-- commits line 5 at r1:
Do we have a theory about why a sequence number might be missing or duplicated for operations in a batch. Request sequence numbers are assigned by the txnSeqNumAllocator. Are there flaws in its sequence number allocation logic?


-- commits line 7 at r1:

and adds more information in case the panic still occurs.

It doesn't look like we've added any new information to the panic. This seems like a good idea to help us understand the failure mode better, if we ever see it again.

@nvb
Copy link
Copy Markdown
Contributor

nvb commented May 9, 2022

Also, even though things are now tracked in Jira, we'll still want to cross-reference PRs like this with their corresponding Github issue (#79188).

@shralex shralex force-pushed the complete_transaction_panic branch 3 times, most recently from 2187f81 to 02c0d2c Compare May 10, 2022 02:27
@shralex shralex changed the title DRAFT: kvserver: IsCompleteTransaction might panic with certain batch sequences kvserver: IsCompleteTransaction might panic with certain batch sequences May 10, 2022
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shralex)


pkg/roachpb/batch.go line 379 at r2 (raw file):

		}
	}
	panic(fmt.Errorf("unreachable. Batch requests: %s", ba.Requests))

nit: fmt.Sprintf should be sufficient here.


pkg/roachpb/batch.go line 379 at r2 (raw file):

		}
	}
	panic(fmt.Errorf("unreachable. Batch requests: %s", ba.Requests))

Does the string format of []RequestUnion print the sequence number of each request? Let's make sure we put enough information in here so that we can have a good chance of diagnosing the issue next time this happens.

@shralex shralex force-pushed the complete_transaction_panic branch from 02c0d2c to 0b61ab3 Compare May 10, 2022 03:10
Copy link
Copy Markdown
Contributor Author

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


-- commits line 5 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we have a theory about why a sequence number might be missing or duplicated for operations in a batch. Request sequence numbers are assigned by the txnSeqNumAllocator. Are there flaws in its sequence number allocation logic?

Added all we know to the PR description


-- commits line 7 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

and adds more information in case the panic still occurs.

It doesn't look like we've added any new information to the panic. This seems like a good idea to help us understand the failure mode better, if we ever see it again.

Added


pkg/roachpb/batch.go line 379 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: fmt.Sprintf should be sufficient here.

Done


pkg/roachpb/batch.go line 379 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does the string format of []RequestUnion print the sequence number of each request? Let's make sure we put enough information in here so that we can have a good chance of diagnosing the issue next time this happens.

You were right. Changed this, now if the test does "{[]Request{withSeq(put, 1), withSeq(etC, -1)}, true}," it will print:
panic: unreachable. Batch requests: put:<header:<sequence:1 > value:<timestamp:<> > > ,end_txn:<header:<sequence:-1 > commit:true >

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @shralex)


pkg/roachpb/batch.go line 379 at r3 (raw file):

		}
	}
	// Unreachable

nit: Missing punctuation on this comment. // Unreachable.


pkg/roachpb/batch.go line 383 at r3 (raw file):

	for i, args := range ba.Requests {
		sb.WriteString(args.String())
		if i < len(ba.Requests) - 1 {

nit: moving this up and making the check if i > 0 { may be slightly easier to reason about.

In IsCompleteTransaction it seems theoretically possible that a
panic could happen if some sequence numbers are missing (0) or duplicated for operations
in the batch. This PR returns false in such case, and adds more information in
case the panic still occurs.

Release note: None

Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-14627
@shralex shralex force-pushed the complete_transaction_panic branch from 0b61ab3 to aab78b7 Compare May 10, 2022 10:15
@shralex
Copy link
Copy Markdown
Contributor Author

shralex commented May 10, 2022

bors r+

@shralex
Copy link
Copy Markdown
Contributor Author

shralex commented May 10, 2022

@nvanbenschoten thanks for the reviews!

craig bot pushed a commit that referenced this pull request May 10, 2022
80832: kvserver: IsCompleteTransaction might panic with certain batch sequences r=shralex a=shralex

It's unclear how this panic happened. One possibility is that EntTxn had a negative sequence number. Another hypothesis is that ba.Requests was concurrently mutated due to a data race. This happened once, so for now adding more info to the panic.

Release note: None

Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-14627

Co-authored-by: shralex <shralex@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 10, 2022

Build failed:

@shralex
Copy link
Copy Markdown
Contributor Author

shralex commented May 13, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 13, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 13, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 14, 2022

Build succeeded:

@craig craig bot merged commit 16c484a into cockroachdb:master May 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachpb: v21.2.6: IsCompleteTransaction panicked from supposedly unreachable line

3 participants