workload/schemachange: classify schema change errors by spaskob · Pull Request #48338 · cockroachdb/cockroach

spaskob · 2020-05-04T12:16:18Z

Previously this workload will not fail even when a schema change
operation returns an error. Now we extract the SQLSTATE code from
the error and if it is of class 09, i.e. Triggered Action Exception
or XX, i.e. Internal Error we abort the workload and report the error.
Also previously we aborted the txn at the first operation error, now
we continue unless the error is one of the classes above.
We also add a roachtest schemachange/random-load that runs
the workload and fails if unclassified errors are encountered.

Fixes #23286.

Release note: none.

cockroach-teamcity · 2020-05-04T12:16:25Z

This change is

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)

pkg/workload/schemachange/schemachange.go, line 274 at r1 (raw file):

}

func (w *schemaChangeWorker) runInTxn(tx *pgx.Tx) (string, bool, error) {

I have mixed feelings about all of the string inspection happening here. Have you trying pulling a pgconn.PgError out of the error returned from tx.Exec and then pulling its code out?

blathers-crl · 2020-05-04T13:25:47Z

❌ The GitHub CI (Cockroach) build has failed on 1be29086.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

blathers-crl · 2020-05-04T16:43:04Z

❌ The GitHub CI (Cockroach) build has failed on 517c8ef9.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

blathers-crl · 2020-05-04T19:53:54Z

❌ The GitHub CI (Cockroach) build has failed on 4439ba64.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

spaskob

Addressed the feedback and fixed the TC failures are due the fact that the workload actually finds current issues and fails so this PR will have to be merged after I fix the long tail of these problems. They seem to be related to my recent changed of validation - certain validations for duplicates don't get triggered and we hit the assertion failures in ValidateTable but at least it;'s working as expected and is finding bugs!

Reviewable status: complete! 0 of 0 LGTMs obtained

blathers-crl · 2020-05-05T17:20:58Z

❌ The GitHub CI (Cockroach) build has failed on badfefe4.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

blathers-crl · 2020-05-06T18:27:14Z

❌ The GitHub CI (Cockroach) build has failed on 7a8f6ef2.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)

pkg/cmd/roachtest/schemachange_random_load.go, line 48 at r4 (raw file):

	runCmd := []string{
		"./workload run",
		fmt.Sprintf("schemachange  --concurrency %d --max-ops %d --verbose=1", maxOps,

These arguments seem backwards

Also seems weird that this is just a single string.

pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

}

func (w *schemaChangeWorker) runInTxn(tx *pgx.Tx) (string, error) {

This rollback and error classification seems like it might be happening at the wrong level. It seems like any error needs to be dealt with for the whole transaction. It feels like this function should just return the error and then run should rollback on probably all of the errors but then decide whether to tolerate the error or not.

pkg/workload/schemachange/schemachange.go, line 272 at r4 (raw file):

}

func (w *schemaChangeWorker) runInTxn(tx *pgx.Tx) (string, bool, error) {

Can you name the bool return param?

pkg/workload/schemachange/schemachange.go, line 304 at r4 (raw file):

		if seriousErr := handleOpError(err); seriousErr != nil {
			if w.tolerateErrors {
				opsBuf.WriteString(fmt.Sprintf("Serious error: %v\n", seriousErr))

Don’t you still want to bail out of this transaction?

pkg/workload/schemachange/schemachange.go, line 308 at r4 (raw file):

				return opsBuf.String(), rollback, seriousErr
			}
		}

If rollback ever becomes true, do you really want to proceed to the next iteration of the loop? The next command is likely to fail now that you’re in an error state.

spaskob

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)

pkg/cmd/roachtest/schemachange_random_load.go, line 48 at r4 (raw file):

Previously, ajwerner wrote…

These arguments seem backwards

Also seems weird that this is just a single string.

good catch - made it more maintainable

pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

Previously, ajwerner wrote…

This rollback and error classification seems like it might be happening at the wrong level. It seems like any error needs to be dealt with for the whole transaction. It feels like this function should just return the error and then run should rollback on probably all of the errors but then decide whether to tolerate the error or not.

well I tried alternatives before coming up with this and also considered what you are suggesting - the trouble is we are still not clear on what errors we want to catch exactly and until we have some clarity as a team on what ultimately should be the purpose of this tool, I prefer to leave it as it is and not sink too much cost upfront. A few open questions:

do we want to abort txn on first failure or continue to understand failure scenarios under cascaded errors?
what kind of errors do we want to generate on purpose
how to actually find out if something broke - currently we don't and we need some kind of database internals consistency checks after each txn commits or is rolled back

So there is too much uncertainty about this tool so I'd rather punt on the issue your highlighting for now.

pkg/workload/schemachange/schemachange.go, line 272 at r4 (raw file):

Previously, ajwerner wrote…

Can you name the bool return param?

What do you mean? Use a named return value? I thought that was discouraged unless it's really necessary or you are asking for a comment for the function?

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)

pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

do we want to abort txn on first failure or continue to understand failure scenarios under cascaded errors?

In SQL once you get an error, unless you ROLLBACK TO SAVEPOINT your transaction is toast and all future statements are invalid. Given this fact, I'm not sure the rest of your points make as much sense. If you want the higher level to be able to cope with the error better, we should wrap the surrounding context and pass it up the stack. Dealing with the error here makes the code hard to follow. It seems to me that this lower level should inform the high level of the relevant context of the error but should leave things like rollback decisions to the higher level based on the context communicated from the lower level.

Maybe in the future we introduce savepoints in the run call, I think that's probably an interesting space for bugs.

pkg/workload/schemachange/schemachange.go, line 272 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

What do you mean? Use a named return value? I thought that was discouraged unless it's really necessary or you are asking for a comment for the function?

It's encouraged to use the named return values, it makes reading signatures easier. It's highly discouraged to do naked returns (which are enabled by using return values).

spaskob

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)

pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

Previously, ajwerner wrote…

do we want to abort txn on first failure or continue to understand failure scenarios under cascaded errors?

In SQL once you get an error, unless you ROLLBACK TO SAVEPOINT your transaction is toast and all future statements are invalid. Given this fact, I'm not sure the rest of your points make as much sense. If you want the higher level to be able to cope with the error better, we should wrap the surrounding context and pass it up the stack. Dealing with the error here makes the code hard to follow. It seems to me that this lower level should inform the high level of the relevant context of the error but should leave things like rollback decisions to the higher level based on the context communicated from the lower level.

Maybe in the future we introduce savepoints in the run call, I think that's probably an interesting space for bugs.

I agree about terminating the txn on first error - you are right. As for handling the errors in the caller, here is my point this function has 2 type of errors:

ones that force terminating the whole workload:
- unable to create a new op
- having encountered a serious error (this one is debatable as we may still want to proceed, save the serious error and report to the user in the end)
ones that just force rollback but the workload continues.

The caller will needs to distinguish b/w the two types and also create the log of operations which kinda depends on the guts of this function. So I am leaning towards rolling back in side this function if we have to and the caller just dealing with questions around should we terminate the whole workload.

WDYT?

ajwerner · 2020-05-18T14:35:25Z

So I am leaning towards rolling back in side this function if we have to and the caller just dealing with questions around should we terminate the whole workload.

That sounds good to me.

spaskob · 2020-05-27T14:13:02Z

PTAL

ajwerner

Reviewed 1 of 2 files at r6, 1 of 2 files at r7.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @spaskob)

pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I agree about terminating the txn on first error - you are right. As for handling the errors in the caller, here is my point this function has 2 type of errors:

ones that force terminating the whole workload:

unable to create a new op

having encountered a serious error (this one is debatable as we may still want to proceed, save the serious error and report to the user in the end)

ones that just force rollback but the workload continues.

The caller will needs to distinguish b/w the two types and also create the log of operations which kinda depends on the guts of this function. So I am leaning towards rolling back in side this function if we have to and the caller just dealing with questions around should we terminate the whole workload.

WDYT?

👍

pkg/workload/schemachange/schemachange.go, line 293 at r5 (raw file):

			if _, err = tx.Exec(op); err != nil {
				histBin = "txnRbk"
				log.WriteString(fmt.Sprintf("***FAIL: %v\n", err))

Should this ***FAIL be so loud here? Don't we expect these pretty frequently? It is good to log the error, just seems to indicate that

Also, maybe it makes sense to append the ROLLBACK to the logs in run and similarly add COMMIT upon success there as well.

spaskob

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/workload/schemachange/schemachange.go, line 293 at r5 (raw file):

Previously, ajwerner wrote…

Should this ***FAIL be so loud here? Don't we expect these pretty frequently? It is good to log the error, just seems to indicate that

Also, maybe it makes sense to append the ROLLBACK to the logs in run and similarly add COMMIT upon success there as well.

This is at verbosity level 1 and above so only loud if you ask for it - I find it helpful for grepping and classifying the errors. At verbosity 0 only serious errors will be printed.
As for the sites where to ROLLBACK; I've tried both more than once, there's always pros and cons. I'd rather not touch it again.

Previously this workload will not fail even when a schema change operation returns an error. Now we extract the SQLSTATE code from the error and if it is of class `09`, i.e. Triggered Action Exception or `XX`, i.e. Internal Error we report the error to the workload runner which will fail if `tolerate-errors=false`. This PR also refactors the `runInTxn` and `run` functions. `runInTxn` now handles txn rollback upon first error encountered and marks the return errors as fatal, meaning the workload should be aborted or rollback when rollback was performed. Additionally schema change operations are times in `runInTxn` without counting the time for producing a random operation since random operation creation may require a few failed attempts. Release note: none.

This is a small refactor so that the roachtest can specify the concurrency in addition to the max_ops for the schemachange workload Release note: none.

This PR adds a simple roachtest that runs the schemachange workload on a 3-node cluster. The purpose is to detect any regression in the error reporting of schema changes. Since the workload still discovers failing errors that will take some time to be fixed the workload can be run with option `tolerate-errors=true` to prevent aborting on these errors. Release note: none.

Release note (bug fix): If `ctx.err()` is nil the cli will panic when a workload returns an error.

spaskob · 2020-06-02T17:08:56Z

bors r+

craig · 2020-06-02T18:34:03Z

Build succeeded

GitHub CI (Cockroach)

spaskob force-pushed the roachtest-sc-no-uuxxx branch from c28b990 to 1be2908 Compare May 4, 2020 12:23

spaskob requested a review from ajwerner May 4, 2020 12:24

ajwerner reviewed May 4, 2020

View reviewed changes

spaskob force-pushed the roachtest-sc-no-uuxxx branch from 1be2908 to 517c8ef Compare May 4, 2020 16:38

spaskob force-pushed the roachtest-sc-no-uuxxx branch from 517c8ef to 4439ba6 Compare May 4, 2020 19:14

spaskob force-pushed the roachtest-sc-no-uuxxx branch from 4439ba6 to badfefe Compare May 5, 2020 16:45

spaskob commented May 5, 2020

View reviewed changes

spaskob force-pushed the roachtest-sc-no-uuxxx branch from badfefe to 7a8f6ef Compare May 6, 2020 18:03

spaskob force-pushed the roachtest-sc-no-uuxxx branch 2 times, most recently from 772156c to 1d279b5 Compare May 8, 2020 13:16

ajwerner reviewed May 8, 2020

View reviewed changes

spaskob force-pushed the roachtest-sc-no-uuxxx branch from 1d279b5 to f6ddf0d Compare May 8, 2020 15:37

spaskob commented May 8, 2020

View reviewed changes

ajwerner reviewed May 8, 2020

View reviewed changes

spaskob commented May 18, 2020

View reviewed changes

spaskob force-pushed the roachtest-sc-no-uuxxx branch from f6ddf0d to c442309 Compare May 27, 2020 14:12

ajwerner approved these changes May 27, 2020

View reviewed changes

spaskob commented May 27, 2020

View reviewed changes

spaskob force-pushed the roachtest-sc-no-uuxxx branch 4 times, most recently from 5969961 to 415e3b3 Compare May 29, 2020 19:55

spaskob force-pushed the roachtest-sc-no-uuxxx branch from 415e3b3 to 168b9b0 Compare June 1, 2020 19:51

Spas Bojanov added 4 commits June 2, 2020 12:08

roachtest/schemachange: specify concurrency

642c03e

This is a small refactor so that the roachtest can specify the concurrency in addition to the max_ops for the schemachange workload Release note: none.

workload: check for nil error

02aa0c9

Release note (bug fix): If `ctx.err()` is nil the cli will panic when a workload returns an error.

spaskob force-pushed the roachtest-sc-no-uuxxx branch from 168b9b0 to 02aa0c9 Compare June 2, 2020 16:09

craig bot merged commit 464e5c4 into cockroachdb:master Jun 2, 2020

thoszhang mentioned this pull request Jun 4, 2020

workload: schema change "nemesis" needs to classify errors as failing vs not failing #47151

Closed

Conversation

spaskob commented May 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented May 4, 2020

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

blathers-crl bot commented May 4, 2020

Uh oh!

blathers-crl bot commented May 4, 2020

Uh oh!

blathers-crl bot commented May 4, 2020

Uh oh!

spaskob left a comment

Choose a reason for hiding this comment

Uh oh!

blathers-crl bot commented May 5, 2020

Uh oh!

blathers-crl bot commented May 6, 2020

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

spaskob left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

spaskob left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner commented May 18, 2020

Uh oh!

spaskob commented May 27, 2020

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

spaskob left a comment

Choose a reason for hiding this comment

Uh oh!

spaskob commented Jun 2, 2020

Uh oh!

craig bot commented Jun 2, 2020

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

spaskob commented May 4, 2020 •

edited

Loading