workload/schemachange: complete error screening by jayshrivastava · Pull Request #57491 · cockroachdb/cockroach

jayshrivastava · 2020-12-03T18:42:14Z

workload/schemachange: complete error screening

This change adds error screening for the following ops:

createIndex
setColumnNotNull
dropConstraint
dropIndex
setColumnDefault
dropColumnStored
setColumnType
renameIndex

Since all ops now have error screening, this change removes the overhead due to supporting ops that screen for errors and do not.

Closes: #56119
Release note: None

workload/schemachange: refactor type resolution

Previously, the type resolver used to generate types.T structs
from type name strings did not have logic for checking if schema names
were present. This was causing enums to only exist in the public schema.

Furthermore, the use of tree.ResolvableTypeReference in functions
such as randType would make it difficult to inspect the type to see
if a schema name were present.

This change updates the type resolver to correctly handle type names
which are prefixed with schemas. Furthermore, the type resolver will
now fetch OIDs which can be used to check for type equivalence when
screening for type-related errors. This change also refactors the
operation generator to use a helper function for type resolution and
the tree.TypeName struct to represent type names.

Release note: None

workload/schemachange: cleanup error states on op generation failure

Previously, if an op func failed, then the op state would not
get reset on subsequent iterations of the loop inside randOp.
This meant that any expected execution errors or commit errors
from the failed op generation would linger and potentially affect
the next generated op.

This change adds some simple mechanisms to ensure the expected
errors get reset on op generation failure.

Release note: None

workload/schemachange: allow validate after write

Transactions do not support schema changes after writes in the
same transaction. Previously, a FeatureNotSupported error was
anticipated when a validate operation followed an insertRow operation,
but this error would never occur. This change removes that behavior
because a validate should be allowed to occur after an insert
in a transaction without expecting any errors.

Release note: None

workload/schemachange: use regex to fetch tables

Previously, the clause LIKE 'table%' was used to fetch tables.
This pattern was capabable of fetching system tables, such as
information_schema.tables. Generally, this workload should not
alter the system tables, and should only test using the tables it
creates. Tables created by the workload can be fetched using
the clause ~ 'table[0-9]+', which is the word 'table' followed
by a sequence number.

Release note: None

workload/schemachange: use nested transactions for op gen and err screening

Previously, it was possible for transactions to become aborted
while generating operations or screening for errors. When a
transaction became aborted and an operation generation function
returned an error, a new operation generation function would be
called with the same transaction. Because the transaction was
aborted, all subsequent attempts to use it would fail, so this
situation resulted in an infinite loop.

This change updates the operation generator to use savepoints
before each call to an operation generation function. If
the operation generation function returns an error, the
transaction is rolled back to the savepoint so the transaction
can be reused.

If there are any errors caused by using savepoints, the workload
will terminate with a fatal error.

Closes: #56120

Release note: None

cockroach-teamcity · 2020-12-03T18:42:37Z

This change is

ajwerner

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 2 of 2 files at r10, 3 of 3 files at r11, 1 of 1 files at r12, 1 of 1 files at r13, 2 of 2 files at r14, 1 of 1 files at r15.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)

pkg/workload/schemachange/operation_generator.go, line 1954 at r5 (raw file):

		return nil, err
	}
	typ, err := tree.ResolveType(

this is cute, I like it.

pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):

	// function succeeds and the op statement is constructed, then candidateExpectedCommitErrors
	// are added to expectedCommitErrors.

Does this policy always make sense? I wonder if this will mask errors at commit time we don't expect. I understand figuring out the set of cases might not be super easy but it might be worth an experiment

pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Quoted 4 lines of code…

		if strings.Contains(err.Error(), pgcode.InFailedSQLTransaction.String()) {
			log.WriteString(fmt.Sprintf("Aborted: %s -> %v\n", op, err))
			return "SELECT 1", log.String(), err
		}

This deserves more commentary. This is the case where we've become aborted during op generation. Why does this happen? Why SELECT 1 and not just ""? Should this fail out the workload?

pkg/workload/schemachange/schemachange.go, line 406 at r1 (raw file):

		// Check for any expected errors.
		if w.opGen.expectedCommitErrors.empty() || !w.opGen.expectedCommitErrors.contains(pgcode.MakeCode(pgErr.Code)) {

nit: why the empty check? shouldn't contains just return false if it's empty?

pkg/workload/schemachange/schemachange.go, line 95 at r14 (raw file):

		s.flags.IntVar(&s.sequenceOwnedByPct, `seq-owned-pct`, defaultSequenceOwnedByPct,
			`Percentage of times that a sequence is owned by column upon creation.`)
		s.flags.BoolVar(&s.retryOpGenFailures, `retry-op-gen`, true,

I'd like to hear more justification for this behavior.

Release note: None

Previously, the type resolver used to generate types.T structs from type name strings did not have logic for checking if schema names were present. This was causing enums to only exist in the public schema. Furthermore, the use of tree.ResolvableTypeReference in functions such as randType would make it difficult to inspect the type to see if a schema name were present. This change updates the type resolver to correctly handle type names which are prefixed with schemas. Furthermore, the type resolver will now fetch OIDs which can be used to check for type equivalence when screening for type-related errors. This change also refactors the operation generator to use a helper function for type resolution and the tree.TypeName struct to represent type names. Release note: None

…ning Release note: None

Release note: None

Previously, certain ops screened for errors while others did not. For this reason, there was some overhead in terms of states and storage to deal with errors differently depending on the op that produced the error. Now, all ops screen for errors, so extra states and mechanisms can be removed. Release note: None

Previously, if an op func failed, then the op state would not get reset on subsequent iterations of the loop inside randOp. This meant that any expected execution errors or commit errors from the failed op generation would linger and potentially affect the next generated op. This change adds some simple mechanisms to ensure the expected errors get reset on op generation failure. Release note: None

Transactions do not support schema changes after writes in the same transaction. Previously, a FeatureNotSupported error was anticipated when a validate operation followed an insertRow operation, but this error would never occur. This change removes that behavior because a validate should be allowed to occur after an insert in a transaction without expecting any errors. Release note: None

Previously, the clause `LIKE 'table%'` was used to fetch tables. This pattern was capabable of fetching system tables, such as information_schema.tables. Generally, this workload should not alter the system tables, and should only test using the tables it creates. Tables created by the workload can be fetched using the clause `~ 'table[0-9]+'`, which is the word 'table' followed by a sequence number. Release note: None

…eening Previously, it was possible for transactions to become aborted while generating operations or screening for errors. When a transaction became aborted and an operation generation function returned an error, a new operation generation function would be called with the same transaction. Because the transaction was aborted, all subsequent attempts to use it would fail, so this situation resulted in an infinite loop. This change updates the operation generator to use savepoints before each call to an operation generation function. If the operation generation function returns an error, the transaction is rolled back to the savepoint so the transaction can be reused. If there are any errors caused by using savepoints, the workload will terminate with a fatal error. Closes: cockroachdb#56120 Release note: None

Previously, the createSequence op checked for an UndefinedTable error twice. This change ensures that it only checks for it once. Release note: None

…rence Previously, tree.TypeName, did not implement tree.ResolvableTypeReference even though the struct represented a type reference. This change adds the SQLString() method to tree.TypeName such that it implements tree.ResolvableTypeReference. Previously, tree.UnresolvedObjectName implemented tree.ResolvableTypeReference, but the SQLString() method did not resolve type strings correctly. For example, the type DECIMAL would resolve to "DECIMAL", which cannot be parsed correctly to a valid type. This change updates the SQLString() method on tree.UnresolvedObjectName to not wrap type names with quotations. Release note: None

jayshrivastava

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/sql/sem/tree/type_name.go, line 249 at r37 (raw file):

func (name *UnresolvedObjectName) SQLString() string {
	// FmtBareIdentifiers prevents the TypeName string from being wrapped in quotations.
	return AsStringWithFlags(name, FmtBareIdentifiers)

I noticed that tree.TypeName and tree.UnresolvedObjectName don't implement ResolvableTypeReference very well, so I thought I would go ahead and make them better. I added some more details in the commit message.

pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):

Previously, ajwerner wrote…

	// function succeeds and the op statement is constructed, then candidateExpectedCommitErrors
	// are added to expectedCommitErrors.
Does this policy always make sense? I wonder if this will mask errors at commit time we don't expect. I understand figuring out the set of cases might not be super easy but it might be worth an experiment

With respect to masking commit errors, there's two cases of things that could happen:

Op 1 expects error 1 on commit, but op 1 will actually cause a different error on commit. Op 2 adds the correct commit error to the set. Then, the txn commits op1/op2 and the error caused by op 1 will not be unexpected because op 2 added it to the expected commit error set. This can occur, and I think the best way to solve this problem is to just try to commit the txn as soon as there are any expected commit errors in the set. I added some logic + a comment torunInTxn() in schemachange.go that does this.
Op 1 expects no errors, but a commit error will occur because of it. Op 2 adds the right commit error to the set. Then, the txn commits and the unexpected error from txn 1 will be masked. This is also possible, but there's no way to catch it. I think we just need to hope that a similar transaction gets randomly constructed eventually where op 1 is the only op in the txn that affects expected commit errors.

pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Previously, ajwerner wrote…

		if strings.Contains(err.Error(), pgcode.InFailedSQLTransaction.String()) {
			log.WriteString(fmt.Sprintf("Aborted: %s -> %v\n", op, err))
			return "SELECT 1", log.String(), err
		}
This deserves more commentary. This is the case where we've become aborted during op generation. Why does this happen? Why SELECT 1 and not just ""? Should this fail out the workload?

I refactored this to use savepoints, which I feel are better than using regex to see if the txn became aborted. In the comment, I have an example of a transaction becoming aborted due to a bug in system tables.

We need to decide what the best wrt debugging the workload (ie. if generating a sql statement fails, should we terminate the workload?). Generally, I've only noticed 3 situations where generating a statement fails.

Failure to meet the expected isolation level (which is the error I described above). There's an issue for this bug. Once it's fixed, we won't have that problem.
There are not enough rows in the database to generate a statement (if you call addColumn with no tables so randTable() returns no rows). This can safely be ignored, and we can try generating another op.
You are developing/debugging a new op locally. With verbosity >= 2, it's easy to spot if your new op function is failing (which is usually due to broken sql introspections). However, it's possible that you miss an edge case locally and a broken op gets merged to master.

I think we should make the workload terminate if we see an op function return any error other than a no rows error, but not yet. This is because the first reason I mentioned above will manifest itself in confusing ways, so it's better to fix it first rather than to debug things caused by it. I also think it would be beneficial to squash all the bugs the workload has found as well. Once we know those bugs are fixed, then we can turn our attention towards verifying that the workload has no edge cases that have been overlooked.

pkg/workload/schemachange/schemachange.go, line 95 at r14 (raw file):

Previously, ajwerner wrote…

I'd like to hear more justification for this behavior.

Let's talk about this in the operation_generator.go thread.

pkg/workload/schemachange/schemachange.go, line 404 at r31 (raw file):

			errors.Errorf("***UNEXPECTED SUCCESS"),
			errRunInTxnFatalSentinel,
		)

I realized this case was missing, so I added it.

ajwerner

The SAVEPOINT thing is nice.

Reviewed 3 of 6 files at r16.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @jayshrivastava)

pkg/sql/sem/tree/type_name.go, line 249 at r37 (raw file):

Previously, jayshrivastava (Jayant Shrivastava) wrote…

I noticed that tree.TypeName and tree.UnresolvedObjectName don't implement ResolvableTypeReference very well, so I thought I would go ahead and make them better. I added some more details in the commit message.

👍

pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):

Previously, jayshrivastava (Jayant Shrivastava) wrote…

With respect to masking commit errors, there's two cases of things that could happen:

Op 1 expects error 1 on commit, but op 1 will actually cause a different error on commit. Op 2 adds the correct commit error to the set. Then, the txn commits op1/op2 and the error caused by op 1 will not be unexpected because op 2 added it to the expected commit error set. This can occur, and I think the best way to solve this problem is to just try to commit the txn as soon as there are any expected commit errors in the set. I added some logic + a comment torunInTxn() in schemachange.go that does this.

Op 1 expects no errors, but a commit error will occur because of it. Op 2 adds the right commit error to the set. Then, the txn commits and the unexpected error from txn 1 will be masked. This is also possible, but there's no way to catch it. I think we just need to hope that a similar transaction gets randomly constructed eventually where op 1 is the only op in the txn that affects expected commit errors.

Thanks

pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Previously, jayshrivastava (Jayant Shrivastava) wrote…

I refactored this to use savepoints, which I feel are better than using regex to see if the txn became aborted. In the comment, I have an example of a transaction becoming aborted due to a bug in system tables.

We need to decide what the best wrt debugging the workload (ie. if generating a sql statement fails, should we terminate the workload?). Generally, I've only noticed 3 situations where generating a statement fails.

Failure to meet the expected isolation level (which is the error I described above). There's an issue for this bug. Once it's fixed, we won't have that problem.

There are not enough rows in the database to generate a statement (if you call addColumn with no tables so randTable() returns no rows). This can safely be ignored, and we can try generating another op.

You are developing/debugging a new op locally. With verbosity >= 2, it's easy to spot if your new op function is failing (which is usually due to broken sql introspections). However, it's possible that you miss an edge case locally and a broken op gets merged to master.

I think we should make the workload terminate if we see an op function return any error other than a no rows error, but not yet. This is because the first reason I mentioned above will manifest itself in confusing ways, so it's better to fix it first rather than to debug things caused by it. I also think it would be beneficial to squash all the bugs the workload has found as well. Once we know those bugs are fixed, then we can turn our attention towards verifying that the workload has no edge cases that have been overlooked.

Can we have a separate issue to ensure that we fix this?

jayshrivastava

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Previously, ajwerner wrote…

Can we have a separate issue to ensure that we fix this?

#57949 👍

jayshrivastava · 2020-12-15T14:54:38Z

bors r=ajwerner

craig · 2020-12-15T15:29:34Z

Build succeeded:

GitHub CI (Cockroach)

jayshrivastava marked this pull request as ready for review December 3, 2020 18:49

jayshrivastava requested a review from ajwerner December 3, 2020 18:49

jayshrivastava changed the title ~~Screen errors in schemachange wl 3~~ workload/schemachange: complete error screening Dec 3, 2020

jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch 6 times, most recently from 4e5f229 to a311330 Compare December 4, 2020 19:10

jayshrivastava mentioned this pull request Dec 7, 2020

roachtest: scbench/randomload/nodes=3/ops=10000/conc=20 failed #56230

Closed

jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch from a311330 to 75b49ac Compare December 8, 2020 20:27

jayshrivastava mentioned this pull request Dec 8, 2020

roachtest: schemachange/random-load failed #56235

Closed

ajwerner reviewed Dec 10, 2020

View reviewed changes

jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch from 75b49ac to 3c08a6e Compare December 10, 2020 20:56

jayshrivastava added 10 commits December 11, 2020 11:03

workload/schemachange: screen for errors in createIndex op

61e9278

Release note: None

workload/schemachange: screen for errors in setColumnNotNull op

93bd7a1

Release note: None

workload/schemachange: screen for errors in dropConstraint op

64f2da0

Release note: None

workload/schemachange: screen for errors in dropIndex op

c6e0034

Release note: None

workload/schemachange: implement setColumnDefault op with error scree…

4e894c3

…ning Release note: None

workload/schemachange: screen for errors in dropColumnStored op

8b35c62

Release note: None

workload/schemachange: screen for errors in setColumnType op

3a8df87

Release note: None

workload/schemachange: screen for errors in renameIndex op

7c27938

Release note: None

jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch 3 times, most recently from f40f85f to fe4914c Compare December 11, 2020 20:48

jayshrivastava added 2 commits December 11, 2020 17:39

jayshrivastava added 4 commits December 11, 2020 17:39

workload/schemachange: remove code duplication in createSequence

b759e0a

Previously, the createSequence op checked for an UndefinedTable error twice. This change ensures that it only checks for it once. Release note: None

jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch from fe4914c to dbdfcb7 Compare December 11, 2020 22:39

jayshrivastava commented Dec 11, 2020

View reviewed changes

jayshrivastava requested a review from ajwerner December 14, 2020 14:53

ajwerner approved these changes Dec 15, 2020

View reviewed changes

jayshrivastava commented Dec 15, 2020

View reviewed changes

jayshrivastava mentioned this pull request Dec 15, 2020

workload/schemachange: classify and handle errors in all operations #56119

Closed

29 tasks

craig bot merged commit 33091b3 into cockroachdb:master Dec 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workload/schemachange: complete error screening#57491

workload/schemachange: complete error screening#57491
craig[bot] merged 16 commits intocockroachdb:masterfrom
jayshrivastava:screen-errors-in-schemachange-wl-3

jayshrivastava commented Dec 3, 2020 •

edited

Loading

Uh oh!

cockroach-teamcity commented Dec 3, 2020

Uh oh!

ajwerner left a comment

Uh oh!

jayshrivastava left a comment

Uh oh!

ajwerner left a comment

Uh oh!

jayshrivastava left a comment

Uh oh!

jayshrivastava commented Dec 15, 2020

Uh oh!

craig bot commented Dec 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jayshrivastava commented Dec 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Dec 3, 2020

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

jayshrivastava left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

jayshrivastava left a comment

Choose a reason for hiding this comment

Uh oh!

jayshrivastava commented Dec 15, 2020

Uh oh!

craig bot commented Dec 15, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jayshrivastava commented Dec 3, 2020 •

edited

Loading