workload/schemachange: complete error screening #57491
craig[bot] merged 16 commits into cockroachdb:master
Conversation
ajwerner
left a comment
Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 4 of 4 files at r5, 2 of 2 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 1 of 1 files at r9, 2 of 2 files at r10, 3 of 3 files at r11, 1 of 1 files at r12, 1 of 1 files at r13, 2 of 2 files at r14, 1 of 1 files at r15.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/workload/schemachange/operation_generator.go, line 1954 at r5 (raw file):
return nil, err
}
typ, err := tree.ResolveType(
this is cute, I like it.
pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):
// function succeeds and the op statement is constructed, then candidateExpectedCommitErrors
// are added to expectedCommitErrors.
Does this policy always make sense? I wonder if this will mask errors at commit time we don't expect. I understand figuring out the set of cases might not be super easy but it might be worth an experiment
pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):
if strings.Contains(err.Error(), pgcode.InFailedSQLTransaction.String()) {
	log.WriteString(fmt.Sprintf("Aborted: %s -> %v\n", op, err))
	return "SELECT 1", log.String(), err
}
This deserves more commentary. This is the case where we've become aborted during op generation. Why does this happen? Why SELECT 1 and not just ""? Should this fail out the workload?
pkg/workload/schemachange/schemachange.go, line 406 at r1 (raw file):
// Check for any expected errors.
if w.opGen.expectedCommitErrors.empty() || !w.opGen.expectedCommitErrors.contains(pgcode.MakeCode(pgErr.Code)) {
nit: why the empty check? shouldn't contains just return false if it's empty?
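As a minimal illustration of the nit above (hypothetical errorCodeSet type; the workload's real set is keyed by pgcode.Code values), a map lookup already yields false on an empty set, so the two forms of the guard are equivalent:

```go
package main

import "fmt"

// errorCodeSet is a hypothetical stand-in for the workload's expected-error
// set; plain strings stand in for pgcode.Code values.
type errorCodeSet map[string]bool

func (s errorCodeSet) empty() bool { return len(s) == 0 }

func (s errorCodeSet) contains(code string) bool { return s[code] }

func main() {
	expected := errorCodeSet{}
	// A lookup in an empty map returns the zero value, so contains already
	// reports false on an empty set and the empty() guard is redundant.
	fmt.Println(expected.empty() || !expected.contains("42P01")) // true
	fmt.Println(!expected.contains("42P01"))                     // true
}
```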
pkg/workload/schemachange/schemachange.go, line 95 at r14 (raw file):
s.flags.IntVar(&s.sequenceOwnedByPct, `seq-owned-pct`, defaultSequenceOwnedByPct, `Percentage of times that a sequence is owned by column upon creation.`)
s.flags.BoolVar(&s.retryOpGenFailures, `retry-op-gen`, true,
I'd like to hear more justification for this behavior.
Previously, certain ops screened for errors while others did not. As a result, extra state and storage were needed to handle errors differently depending on the op that produced them. Now, all ops screen for errors, so that extra state and those mechanisms can be removed. Release note: None
Previously, the createSequence op checked for an UndefinedTable error twice. This change ensures that it only checks for it once. Release note: None
Previously, tree.TypeName did not implement tree.ResolvableTypeReference even though the struct represents a type reference. This change adds the SQLString() method to tree.TypeName so that it implements tree.ResolvableTypeReference. Previously, tree.UnresolvedObjectName implemented tree.ResolvableTypeReference, but its SQLString() method did not render type strings correctly. For example, the type DECIMAL would resolve to "DECIMAL", which cannot be parsed into a valid type. This change updates the SQLString() method on tree.UnresolvedObjectName to not wrap type names in quotations. Release note: None
jayshrivastava
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/sql/sem/tree/type_name.go, line 249 at r37 (raw file):
func (name *UnresolvedObjectName) SQLString() string {
	// FmtBareIdentifiers prevents the TypeName string from being wrapped in quotations.
	return AsStringWithFlags(name, FmtBareIdentifiers)
I noticed that tree.TypeName and tree.UnresolvedObjectName don't implement ResolvableTypeReference very well, so I thought I would go ahead and make them better. I added some more details in the commit message.
pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):
Previously, ajwerner wrote…
// function succeeds and the op statement is constructed, then candidateExpectedCommitErrors
// are added to expectedCommitErrors.
Does this policy always make sense? I wonder if this will mask errors at commit time we don't expect. I understand figuring out the set of cases might not be super easy but it might be worth an experiment
With respect to masking commit errors, there are two cases of things that could happen:

1. Op 1 expects error 1 on commit, but op 1 will actually cause a different error on commit. Op 2 adds the correct commit error to the set. Then, the txn commits op1/op2 and the error caused by op 1 will not be unexpected because op 2 added it to the expected commit error set. This can occur, and I think the best way to solve this problem is to just try to commit the txn as soon as there are any expected commit errors in the set. I added some logic + a comment to runInTxn() in schemachange.go that does this.

2. Op 1 expects no errors, but a commit error will occur because of it. Op 2 adds the right commit error to the set. Then, the txn commits and the unexpected error from op 1 will be masked. This is also possible, but there's no way to catch it. I think we just need to hope that a similar transaction gets randomly constructed eventually where op 1 is the only op in the txn that affects expected commit errors.
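The early-commit policy described in case 1 can be sketched as follows (all names hypothetical; this simulates the control flow rather than the workload's actual runInTxn()):

```go
package main

import "fmt"

// op is a simplified stand-in for a generated schema-change statement;
// expectedCommitErr holds a pgcode string the op is expected to raise
// at COMMIT time ("" if none).
type op struct {
	stmt              string
	expectedCommitErr string
}

// runOpsInTxn sketches the policy: stop adding ops to the transaction as
// soon as the expected-commit-error set becomes non-empty, so a later op
// cannot register a code that masks an earlier op's unexpected failure.
func runOpsInTxn(ops []op) (executed []string) {
	expected := map[string]bool{}
	for _, o := range ops {
		executed = append(executed, o.stmt)
		if o.expectedCommitErr != "" {
			expected[o.expectedCommitErr] = true
		}
		if len(expected) > 0 {
			break // commit now, before another op widens the expected set
		}
	}
	return executed
}

func main() {
	ops := []op{
		{stmt: "ALTER TABLE t ADD COLUMN c INT", expectedCommitErr: "42701"},
		{stmt: "DROP TABLE t2"},
	}
	// Only the first op runs before COMMIT; the second is never attempted.
	fmt.Println(runOpsInTxn(ops))
}
```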
pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):
Previously, ajwerner wrote…
if strings.Contains(err.Error(), pgcode.InFailedSQLTransaction.String()) {
	log.WriteString(fmt.Sprintf("Aborted: %s -> %v\n", op, err))
	return "SELECT 1", log.String(), err
}
This deserves more commentary. This is the case where we've become aborted during op generation. Why does this happen? Why SELECT 1 and not just ""? Should this fail out the workload?
I refactored this to use savepoints, which I feel are better than using regex to see if the txn became aborted. In the comment, I have an example of a transaction becoming aborted due to a bug in system tables.
We need to decide what's best with respect to debugging the workload (i.e., if generating a SQL statement fails, should we terminate the workload?). Generally, I've only noticed 3 situations where generating a statement fails:
- Failure to meet the expected isolation level (which is the error I described above). There's an issue for this bug. Once it's fixed, we won't have that problem.
- There are not enough rows in the database to generate a statement (if you call addColumn with no tables so randTable() returns no rows). This can safely be ignored, and we can try generating another op.
- You are developing/debugging a new op locally. With verbosity >= 2, it's easy to spot if your new op function is failing (which is usually due to broken sql introspections). However, it's possible that you miss an edge case locally and a broken op gets merged to master.
I think we should make the workload terminate if we see an op function return any error other than a no rows error, but not yet. This is because the first reason I mentioned above will manifest itself in confusing ways, so it's better to fix it first rather than to debug things caused by it. I also think it would be beneficial to squash all the bugs the workload has found as well. Once we know those bugs are fixed, then we can turn our attention towards verifying that the workload has no edge cases that have been overlooked.
pkg/workload/schemachange/schemachange.go, line 95 at r14 (raw file):
Previously, ajwerner wrote…
I'd like to hear more justification for this behavior.
Let's talk about this in the operation_generator.go thread.
pkg/workload/schemachange/schemachange.go, line 404 at r31 (raw file):
errors.Errorf("***UNEXPECTED SUCCESS"),
errRunInTxnFatalSentinel,
)
I realized this case was missing, so I added it.
ajwerner
left a comment
The SAVEPOINT thing is nice.
Reviewed 3 of 6 files at r16.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @jayshrivastava)
pkg/sql/sem/tree/type_name.go, line 249 at r37 (raw file):
Previously, jayshrivastava (Jayant Shrivastava) wrote…
I noticed that tree.TypeName and tree.UnresolvedObjectName don't implement ResolvableTypeReference very well, so I thought I would go ahead and make them better. I added some more details in the commit message.
👍
pkg/workload/schemachange/operation_generator.go, line 54 at r11 (raw file):
Previously, jayshrivastava (Jayant Shrivastava) wrote…
With respect to masking commit errors, there's two cases of things that could happen:
Op 1 expects error 1 on commit, but op 1 will actually cause a different error on commit. Op 2 adds the correct commit error to the set. Then, the txn commits op1/op2 and the error caused by op 1 will not be unexpected because op 2 added it to the expected commit error set. This can occur, and I think the best way to solve this problem is to just try to commit the txn as soon as there are any expected commit errors in the set. I added some logic + a comment to runInTxn() in schemachange.go that does this.
Op 1 expects no errors, but a commit error will occur because of it. Op 2 adds the right commit error to the set. Then, the txn commits and the unexpected error from op 1 will be masked. This is also possible, but there's no way to catch it. I think we just need to hope that a similar transaction gets randomly constructed eventually where op 1 is the only op in the txn that affects expected commit errors.
Thanks
pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):
Previously, jayshrivastava (Jayant Shrivastava) wrote…
I refactored this to use savepoints, which I feel are better than using regex to see if the txn became aborted. In the comment, I have an example of a transaction becoming aborted due to a bug in system tables.
We need to decide what the best wrt debugging the workload (ie. if generating a sql statement fails, should we terminate the workload?). Generally, I've only noticed 3 situations where generating a statement fails.
- Failure to meet the expected isolation level (which is the error I described above). There's an issue for this bug. Once it's fixed, we won't have that problem.
- There are not enough rows in the database to generate a statement (if you call addColumn with no tables so randTable() returns no rows). This can safely be ignored, and we can try generating another op.
- You are developing/debugging a new op locally. With verbosity >= 2, it's easy to spot if your new op function is failing (which is usually due to broken sql introspections). However, it's possible that you miss an edge case locally and a broken op gets merged to master.
I think we should make the workload terminate if we see an op function return any error other than a no rows error, but not yet. This is because the first reason I mentioned above will manifest itself in confusing ways, so it's better to fix it first rather than to debug things caused by it. I also think it would be beneficial to squash all the bugs the workload has found as well. Once we know those bugs are fixed, then we can turn our attention towards verifying that the workload has no edge cases that have been overlooked.
Can we have a separate issue to ensure that we fix this?
jayshrivastava
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/workload/schemachange/operation_generator.go, line 231 at r14 (raw file):
Previously, ajwerner wrote…
Can we have a separate issue to ensure that we fix this?
#57949 👍
bors r=ajwerner

Build succeeded
workload/schemachange: complete error screening
This change adds error screening for the following ops:
Since all ops now have error screening, this change removes the overhead due to supporting ops that screen for errors and do not.
Closes: #56119
Release note: None
workload/schemachange: refactor type resolution
Previously, the type resolver used to generate types.T structs
from type name strings did not have logic for checking if schema names
were present. This was causing enums to only exist in the public schema.
Furthermore, the use of tree.ResolvableTypeReference in functions
such as randType would make it difficult to inspect the type to see
if a schema name were present.
This change updates the type resolver to correctly handle type names
which are prefixed with schemas. Furthermore, the type resolver will
now fetch OIDs which can be used to check for type equivalence when
screening for type-related errors. This change also refactors the
operation generator to use a helper function for type resolution and
the tree.TypeName struct to represent type names.
Release note: None
workload/schemachange: cleanup error states on op generation failure
Previously, if an op func failed, then the op state would not
get reset on subsequent iterations of the loop inside randOp.
This meant that any expected execution errors or commit errors
from the failed op generation would linger and potentially affect
the next generated op.
This change adds some simple mechanisms to ensure the expected
errors get reset on op generation failure.
Release note: None
workload/schemachange: allow validate after write
Transactions do not support schema changes after writes in the
same transaction. Previously, a FeatureNotSupported error was
anticipated when a validate operation followed an insertRow operation,
but this error would never occur. This change removes that behavior
because a validate should be allowed to occur after an insert
in a transaction without expecting any errors.
Release note: None
workload/schemachange: use regex to fetch tables
Previously, the clause LIKE 'table%' was used to fetch tables.
This pattern was capable of fetching system tables, such as
information_schema.tables. Generally, this workload should not
alter the system tables, and should only test using the tables it
creates. Tables created by the workload can be fetched using
the clause ~ 'table[0-9]+', which is the word 'table' followed
by a sequence number.
Release note: None
workload/schemachange: use nested transactions for op gen and err screening
Previously, it was possible for transactions to become aborted
while generating operations or screening for errors. When a
transaction became aborted and an operation generation function
returned an error, a new operation generation function would be
called with the same transaction. Because the transaction was
aborted, all subsequent attempts to use it would fail, so this
situation resulted in an infinite loop.
This change updates the operation generator to use savepoints
before each call to an operation generation function. If
the operation generation function returns an error, the
transaction is rolled back to the savepoint so the transaction
can be reused.
If there are any errors caused by using savepoints, the workload
will terminate with a fatal error.
Closes: #56120
Release note: None