Skip to content

workload/schemachange: complete error screening#57491

Merged
craig[bot] merged 16 commits intocockroachdb:masterfrom
jayshrivastava:screen-errors-in-schemachange-wl-3
Dec 15, 2020
Merged

workload/schemachange: complete error screening#57491
craig[bot] merged 16 commits intocockroachdb:masterfrom
jayshrivastava:screen-errors-in-schemachange-wl-3

Conversation

@jayshrivastava
Copy link
Copy Markdown
Contributor

@jayshrivastava jayshrivastava commented Dec 3, 2020

workload/schemachange: complete error screening

This change adds error screening for the following ops:

  • createIndex
  • setColumnNotNull
  • dropConstraint
  • dropIndex
  • setColumnDefault
  • dropColumnStored
  • setColumnType
  • renameIndex

Since all ops now have error screening, this change removes the overhead due to supporting ops that screen for errors and do not.

Closes: #56119
Release note: None

workload/schemachange: refactor type resolution

Previously, the type resolver used to generate types.T structs
from type name strings did not have logic for checking if schema names
were present. This was causing enums to only exist in the public schema.

Furthermore, the use of tree.ResolvableTypeReference in functions
such as randType would make it difficult to inspect the type to see
if a schema name were present.

This change updates the type resolver to correctly handle type names
which are prefixed with schemas. Furthermore, the type resolver will
now fetch OIDs which can be used to check for type equivalence when
screening for type-related errors. This change also refactors the
operation generator to use a helper function for type resolution and
the tree.TypeName struct to represent type names.

Release note: None

workload/schemachange: cleanup error states on op generation failure

Previously, if an op func failed, then the op state would not
get reset on subsequent iterations of the loop inside randOp.
This meant that any expected execution errors or commit errors
from the failed op generation would linger and potentially affect
the next generated op.

This change adds some simple mechanisms to ensure the expected
errors get reset on op generation failure.

Release note: None

workload/schemachange: allow validate after write

Transactions do not support schema changes after writes in the
same transaction. Previously, a FeatureNotSupported error was
anticipated when a validate operation followed an insertRow operation,
but this error would never occur. This change removes that behavior
because a validate should be allowed to occur after an insert
in a transaction without expecting any errors.

Release note: None

workload/schemachange: use regex to fetch tables

Previously, the clause LIKE 'table%' was used to fetch tables.
This pattern was capabable of fetching system tables, such as
information_schema.tables. Generally, this workload should not
alter the system tables, and should only test using the tables it
creates. Tables created by the workload can be fetched using
the clause ~ 'table[0-9]+', which is the word 'table' followed
by a sequence number.

Release note: None

workload/schemachange: use nested transactions for op gen and err screening

Previously, it was possible for transactions to become aborted
while generating operations or screening for errors. When a
transaction became aborted and an operation generation function
returned an error, a new operation generation function would be
called with the same transaction. Because the transaction was
aborted, all subsequent attempts to use it would fail, so this
situation resulted in an infinite loop.

This change updates the operation generator to use savepoints
before each call to an operation generation function. If
the operation generation function returns an error, the
transaction is rolled back to the savepoint so the transaction
can be reused.

If there are any errors caused by using savepoints, the workload
will terminate with a fatal error.

Closes: #56120

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@jayshrivastava jayshrivastava marked this pull request as ready for review December 3, 2020 18:49
@jayshrivastava jayshrivastava changed the title Screen errors in schemachange wl 3 workload/schemachange: complete error screening Dec 3, 2020
@jayshrivastava jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch 6 times, most recently from 4e5f229 to a311330 Compare December 4, 2020 19:10
@jayshrivastava jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch from a311330 to 75b49ac Compare December 8, 2020 20:27
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 2 of 2 files at r10, 3 of 3 files at r11, 1 of 1 files at r12, 1 of 1 files at r13, 2 of 2 files at r14, 1 of 1 files at r15.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/workload/schemachange/operation_generator.go, line 1954 at r5 (raw file):

		return nil, err
	}
	typ, err := tree.ResolveType(

this is cute, I like it.


pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):

	// function succeeds and the op statement is constructed, then candidateExpectedCommitErrors
	// are added to expectedCommitErrors.

Does this policy always make sense? I wonder if this will mask errors at commit time we don't expect. I understand figuring out the set of cases might not be super easy but it might be worth an experiment


pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Quoted 4 lines of code…
		if strings.Contains(err.Error(), pgcode.InFailedSQLTransaction.String()) {
			log.WriteString(fmt.Sprintf("Aborted: %s -> %v\n", op, err))
			return "SELECT 1", log.String(), err
		}

This deserves more commentary. This is the case where we've become aborted during op generation. Why does this happen? Why SELECT 1 and not just ""? Should this fail out the workload?


pkg/workload/schemachange/schemachange.go, line 406 at r1 (raw file):

		// Check for any expected errors.
		if w.opGen.expectedCommitErrors.empty() || !w.opGen.expectedCommitErrors.contains(pgcode.MakeCode(pgErr.Code)) {

nit: why the empty check? shouldn't contains just return false if it's empty?


pkg/workload/schemachange/schemachange.go, line 95 at r14 (raw file):

		s.flags.IntVar(&s.sequenceOwnedByPct, `seq-owned-pct`, defaultSequenceOwnedByPct,
			`Percentage of times that a sequence is owned by column upon creation.`)
		s.flags.BoolVar(&s.retryOpGenFailures, `retry-op-gen`, true,

I'd like to hear more justification for this behavior.

@jayshrivastava jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch from 75b49ac to 3c08a6e Compare December 10, 2020 20:56
Previously, the type resolver used to generate types.T structs
from type name strings did not have logic for checking if schema names
were present. This was causing enums to only exist in the public schema.

Furthermore, the use of tree.ResolvableTypeReference in functions
such as randType would make it difficult to inspect the type to see
if a schema name were present.

This change updates the type resolver to correctly handle type names
which are prefixed with schemas. Furthermore, the type resolver will
now fetch OIDs which can be used to check for type equivalence when
screening for type-related errors. This change also refactors the
operation generator to use a helper function for type resolution and
the tree.TypeName struct to represent type names.

Release note: None
Previously, certain ops screened for errors while others
did not. For this reason, there was some overhead in terms
of states and storage to deal with errors differently depending
on the op that produced the error.

Now, all ops screen for errors, so extra states and mechanisms can
be removed.

Release note: None
@jayshrivastava jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch 3 times, most recently from f40f85f to fe4914c Compare December 11, 2020 20:48
Previously, if an op func failed, then the op state would not
get reset on subsequent iterations of the loop inside randOp.
This meant that any expected execution errors or commit errors
from the failed op generation would linger and potentially affect
the next generated op.

This change adds some simple mechanisms to ensure the expected
errors get reset on op generation failure.

Release note: None
Transactions do not support schema changes after writes in the
same transaction. Previously, a FeatureNotSupported error was
anticipated when a validate operation followed an insertRow operation,
but this error would never occur. This change removes that behavior
because a validate should be allowed to occur after an insert
in a transaction without expecting any errors.

Release note: None
Previously, the clause `LIKE 'table%'` was used to fetch tables.
This pattern was capabable of fetching system tables, such as
information_schema.tables. Generally, this workload should not
alter the system tables, and should only test using the tables it
creates. Tables created by the workload can be fetched using
the clause `~ 'table[0-9]+'`, which is the word 'table' followed
by a sequence number.

Release note: None
…eening

Previously, it was possible for transactions to become aborted
while generating operations or screening for errors. When a
transaction became aborted and an operation generation function
returned an error, a new operation generation function would be
called with the same transaction. Because the transaction was
aborted, all subsequent attempts to use it would fail, so this
situation resulted in an infinite loop.

This change updates the operation generator to use savepoints
before each call to an operation generation function. If
the operation generation function returns an error, the
transaction is rolled back to the savepoint so the transaction
can be reused.

If there are any errors caused by using savepoints, the workload
will terminate with a fatal error.

Closes: cockroachdb#56120

Release note: None
Previously, the createSequence op checked for an UndefinedTable
error twice. This change ensures that it only checks for it once.

Release note: None
…rence

Previously, tree.TypeName, did not implement tree.ResolvableTypeReference
even though the struct represented a type reference. This change adds
the SQLString() method to tree.TypeName such that it implements
tree.ResolvableTypeReference.

Previously, tree.UnresolvedObjectName implemented
tree.ResolvableTypeReference, but the SQLString() method did not
resolve type strings correctly. For example, the type DECIMAL
would resolve to "DECIMAL", which cannot be parsed correctly
to a valid type. This change updates the SQLString()
method on tree.UnresolvedObjectName to not wrap type names
with quotations.
Release note: None
@jayshrivastava jayshrivastava force-pushed the screen-errors-in-schemachange-wl-3 branch from fe4914c to dbdfcb7 Compare December 11, 2020 22:39
Copy link
Copy Markdown
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/sql/sem/tree/type_name.go, line 249 at r37 (raw file):

func (name *UnresolvedObjectName) SQLString() string {
	// FmtBareIdentifiers prevents the TypeName string from being wrapped in quotations.
	return AsStringWithFlags(name, FmtBareIdentifiers)

I noticed that tree.TypeName and tree.UnresolvedObjectName don't implement ResolvableTypeReference very well, so I thought I would go ahead and make them better. I added some more details in the commit message.


pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):

Previously, ajwerner wrote…
	// function succeeds and the op statement is constructed, then candidateExpectedCommitErrors
	// are added to expectedCommitErrors.

Does this policy always make sense? I wonder if this will mask errors at commit time we don't expect. I understand figuring out the set of cases might not be super easy but it might be worth an experiment

With respect to masking commit errors, there's two cases of things that could happen:

  1. Op 1 expects error 1 on commit, but op 1 will actually cause a different error on commit. Op 2 adds the correct commit error to the set. Then, the txn commits op1/op2 and the error caused by op 1 will not be unexpected because op 2 added it to the expected commit error set. This can occur, and I think the best way to solve this problem is to just try to commit the txn as soon as there are any expected commit errors in the set. I added some logic + a comment torunInTxn() in schemachange.go that does this.

  2. Op 1 expects no errors, but a commit error will occur because of it. Op 2 adds the right commit error to the set. Then, the txn commits and the unexpected error from txn 1 will be masked. This is also possible, but there's no way to catch it. I think we just need to hope that a similar transaction gets randomly constructed eventually where op 1 is the only op in the txn that affects expected commit errors.


pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Previously, ajwerner wrote…
		if strings.Contains(err.Error(), pgcode.InFailedSQLTransaction.String()) {
			log.WriteString(fmt.Sprintf("Aborted: %s -> %v\n", op, err))
			return "SELECT 1", log.String(), err
		}

This deserves more commentary. This is the case where we've become aborted during op generation. Why does this happen? Why SELECT 1 and not just ""? Should this fail out the workload?

I refactored this to use savepoints, which I feel are better than using regex to see if the txn became aborted. In the comment, I have an example of a transaction becoming aborted due to a bug in system tables.

We need to decide what the best wrt debugging the workload (ie. if generating a sql statement fails, should we terminate the workload?). Generally, I've only noticed 3 situations where generating a statement fails.

  1. Failure to meet the expected isolation level (which is the error I described above). There's an issue for this bug. Once it's fixed, we won't have that problem.
  2. There are not enough rows in the database to generate a statement (if you call addColumn with no tables so randTable() returns no rows). This can safely be ignored, and we can try generating another op.
  3. You are developing/debugging a new op locally. With verbosity >= 2, it's easy to spot if your new op function is failing (which is usually due to broken sql introspections). However, it's possible that you miss an edge case locally and a broken op gets merged to master.

I think we should make the workload terminate if we see an op function return any error other than a no rows error, but not yet. This is because the first reason I mentioned above will manifest itself in confusing ways, so it's better to fix it first rather than to debug things caused by it. I also think it would be beneficial to squash all the bugs the workload has found as well. Once we know those bugs are fixed, then we can turn our attention towards verifying that the workload has no edge cases that have been overlooked.


pkg/workload/schemachange/schemachange.go, line 95 at r14 (raw file):

Previously, ajwerner wrote…

I'd like to hear more justification for this behavior.

Let's talk about this in the operation_generator.go thread.


pkg/workload/schemachange/schemachange.go, line 404 at r31 (raw file):

			errors.Errorf("***UNEXPECTED SUCCESS"),
			errRunInTxnFatalSentinel,
		)

I realized this case was missing, so I added it.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

The SAVEPOINT thing is nice.

Reviewed 3 of 6 files at r16.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @jayshrivastava)


pkg/sql/sem/tree/type_name.go, line 249 at r37 (raw file):

Previously, jayshrivastava (Jayant Shrivastava) wrote…

I noticed that tree.TypeName and tree.UnresolvedObjectName don't implement ResolvableTypeReference very well, so I thought I would go ahead and make them better. I added some more details in the commit message.

👍


pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):

Previously, jayshrivastava (Jayant Shrivastava) wrote…

With respect to masking commit errors, there's two cases of things that could happen:

  1. Op 1 expects error 1 on commit, but op 1 will actually cause a different error on commit. Op 2 adds the correct commit error to the set. Then, the txn commits op1/op2 and the error caused by op 1 will not be unexpected because op 2 added it to the expected commit error set. This can occur, and I think the best way to solve this problem is to just try to commit the txn as soon as there are any expected commit errors in the set. I added some logic + a comment torunInTxn() in schemachange.go that does this.

  2. Op 1 expects no errors, but a commit error will occur because of it. Op 2 adds the right commit error to the set. Then, the txn commits and the unexpected error from txn 1 will be masked. This is also possible, but there's no way to catch it. I think we just need to hope that a similar transaction gets randomly constructed eventually where op 1 is the only op in the txn that affects expected commit errors.

Thanks


pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Previously, jayshrivastava (Jayant Shrivastava) wrote…

I refactored this to use savepoints, which I feel are better than using regex to see if the txn became aborted. In the comment, I have an example of a transaction becoming aborted due to a bug in system tables.

We need to decide what the best wrt debugging the workload (ie. if generating a sql statement fails, should we terminate the workload?). Generally, I've only noticed 3 situations where generating a statement fails.

  1. Failure to meet the expected isolation level (which is the error I described above). There's an issue for this bug. Once it's fixed, we won't have that problem.
  2. There are not enough rows in the database to generate a statement (if you call addColumn with no tables so randTable() returns no rows). This can safely be ignored, and we can try generating another op.
  3. You are developing/debugging a new op locally. With verbosity >= 2, it's easy to spot if your new op function is failing (which is usually due to broken sql introspections). However, it's possible that you miss an edge case locally and a broken op gets merged to master.

I think we should make the workload terminate if we see an op function return any error other than a no rows error, but not yet. This is because the first reason I mentioned above will manifest itself in confusing ways, so it's better to fix it first rather than to debug things caused by it. I also think it would be beneficial to squash all the bugs the workload has found as well. Once we know those bugs are fixed, then we can turn our attention towards verifying that the workload has no edge cases that have been overlooked.

Can we have a separate issue to ensure that we fix this?

Copy link
Copy Markdown
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):

Previously, ajwerner wrote…

Can we have a separate issue to ensure that we fix this?

#57949 👍

@jayshrivastava
Copy link
Copy Markdown
Contributor Author

bors r=ajwerner

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Dec 15, 2020

Build succeeded:

@craig craig bot merged commit 33091b3 into cockroachdb:master Dec 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

workload/schemachange: handle aborted transactions workload/schemachange: classify and handle errors in all operations

3 participants