Skip to content

workload/schemachange: classify schema change errors#48338

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
spaskob:roachtest-sc-no-uuxxx
Jun 2, 2020
Merged

workload/schemachange: classify schema change errors#48338
craig[bot] merged 4 commits intocockroachdb:masterfrom
spaskob:roachtest-sc-no-uuxxx

Conversation

@spaskob
Copy link
Copy Markdown
Contributor

@spaskob spaskob commented May 4, 2020

Previously this workload will not fail even when a schema change
operation returns an error. Now we extract the SQLSTATE code from
the error and if it is of class 09, i.e. Triggered Action Exception
or XX, i.e. Internal Error we abort the workload and report the error.
Also previously we aborted the txn at the first operation error, now
we continue unless the error is one of the classes above.
We also add a roachtest schemachange/random-load that runs
the workload and fails if unclassified errors are encountered.

Fixes #23286.

Release note: none.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from c28b990 to 1be2908 Compare May 4, 2020 12:23
@spaskob spaskob requested a review from ajwerner May 4, 2020 12:24
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/workload/schemachange/schemachange.go, line 274 at r1 (raw file):

}

func (w *schemaChangeWorker) runInTxn(tx *pgx.Tx) (string, bool, error) {

I have mixed feelings about all of the string inspection happening here. Have you trying pulling a pgconn.PgError out of the error returned from tx.Exec and then pulling its code out?

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 4, 2020

❌ The GitHub CI (Cockroach) build has failed on 1be29086.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from 1be2908 to 517c8ef Compare May 4, 2020 16:38
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 4, 2020

❌ The GitHub CI (Cockroach) build has failed on 517c8ef9.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from 517c8ef to 4439ba6 Compare May 4, 2020 19:14
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 4, 2020

❌ The GitHub CI (Cockroach) build has failed on 4439ba64.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from 4439ba6 to badfefe Compare May 5, 2020 16:45
Copy link
Copy Markdown
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the feedback and fixed the TC failures are due the fact that the workload actually finds current issues and fails so this PR will have to be merged after I fix the long tail of these problems. They seem to be related to my recent changed of validation - certain validations for duplicates don't get triggered and we hit the assertion failures in ValidateTable but at least it;'s working as expected and is finding bugs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 5, 2020

❌ The GitHub CI (Cockroach) build has failed on badfefe4.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from badfefe to 7a8f6ef Compare May 6, 2020 18:03
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented May 6, 2020

❌ The GitHub CI (Cockroach) build has failed on 7a8f6ef2.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch 2 times, most recently from 772156c to 1d279b5 Compare May 8, 2020 13:16
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)


pkg/cmd/roachtest/schemachange_random_load.go, line 48 at r4 (raw file):

	runCmd := []string{
		"./workload run",
		fmt.Sprintf("schemachange  --concurrency %d --max-ops %d --verbose=1", maxOps,

These arguments seem backwards

Also seems weird that this is just a single string.


pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

}

func (w *schemaChangeWorker) runInTxn(tx *pgx.Tx) (string, error) {

This rollback and error classification seems like it might be happening at the wrong level. It seems like any error needs to be dealt with for the whole transaction. It feels like this function should just return the error and then run should rollback on probably all of the errors but then decide whether to tolerate the error or not.


pkg/workload/schemachange/schemachange.go, line 272 at r4 (raw file):

}

func (w *schemaChangeWorker) runInTxn(tx *pgx.Tx) (string, bool, error) {

Can you name the bool return param?


pkg/workload/schemachange/schemachange.go, line 304 at r4 (raw file):

		if seriousErr := handleOpError(err); seriousErr != nil {
			if w.tolerateErrors {
				opsBuf.WriteString(fmt.Sprintf("Serious error: %v\n", seriousErr))

Don’t you still want to bail out of this transaction?


pkg/workload/schemachange/schemachange.go, line 308 at r4 (raw file):

				return opsBuf.String(), rollback, seriousErr
			}
		}

If rollback ever becomes true, do you really want to proceed to the next iteration of the loop? The next command is likely to fail now that you’re in an error state.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from 1d279b5 to f6ddf0d Compare May 8, 2020 15:37
Copy link
Copy Markdown
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/cmd/roachtest/schemachange_random_load.go, line 48 at r4 (raw file):

Previously, ajwerner wrote…

These arguments seem backwards

Also seems weird that this is just a single string.

good catch - made it more maintainable


pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

Previously, ajwerner wrote…

This rollback and error classification seems like it might be happening at the wrong level. It seems like any error needs to be dealt with for the whole transaction. It feels like this function should just return the error and then run should rollback on probably all of the errors but then decide whether to tolerate the error or not.

well I tried alternatives before coming up with this and also considered what you are suggesting - the trouble is we are still not clear on what errors we want to catch exactly and until we have some clarity as a team on what ultimately should be the purpose of this tool, I prefer to leave it as it is and not sink too much cost upfront. A few open questions:

  • do we want to abort txn on first failure or continue to understand failure scenarios under cascaded errors?
  • what kind of errors do we want to generate on purpose
  • how to actually find out if something broke - currently we don't and we need some kind of database internals consistency checks after each txn commits or is rolled back

So there is too much uncertainty about this tool so I'd rather punt on the issue your highlighting for now.


pkg/workload/schemachange/schemachange.go, line 272 at r4 (raw file):

Previously, ajwerner wrote…

Can you name the bool return param?

What do you mean? Use a named return value? I thought that was discouraged unless it's really necessary or you are asking for a comment for the function?

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

do we want to abort txn on first failure or continue to understand failure scenarios under cascaded errors?

In SQL once you get an error, unless you ROLLBACK TO SAVEPOINT your transaction is toast and all future statements are invalid. Given this fact, I'm not sure the rest of your points make as much sense. If you want the higher level to be able to cope with the error better, we should wrap the surrounding context and pass it up the stack. Dealing with the error here makes the code hard to follow. It seems to me that this lower level should inform the high level of the relevant context of the error but should leave things like rollback decisions to the higher level based on the context communicated from the lower level.

Maybe in the future we introduce savepoints in the run call, I think that's probably an interesting space for bugs.


pkg/workload/schemachange/schemachange.go, line 272 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

What do you mean? Use a named return value? I thought that was discouraged unless it's really necessary or you are asking for a comment for the function?

It's encouraged to use the named return values, it makes reading signatures easier. It's highly discouraged to do naked returns (which are enabled by using return values).

Copy link
Copy Markdown
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

Previously, ajwerner wrote…

do we want to abort txn on first failure or continue to understand failure scenarios under cascaded errors?

In SQL once you get an error, unless you ROLLBACK TO SAVEPOINT your transaction is toast and all future statements are invalid. Given this fact, I'm not sure the rest of your points make as much sense. If you want the higher level to be able to cope with the error better, we should wrap the surrounding context and pass it up the stack. Dealing with the error here makes the code hard to follow. It seems to me that this lower level should inform the high level of the relevant context of the error but should leave things like rollback decisions to the higher level based on the context communicated from the lower level.

Maybe in the future we introduce savepoints in the run call, I think that's probably an interesting space for bugs.

I agree about terminating the txn on first error - you are right. As for handling the errors in the caller, here is my point this function has 2 type of errors:

  • ones that force terminating the whole workload:
    • unable to create a new op
    • having encountered a serious error (this one is debatable as we may still want to proceed, save the serious error and report to the user in the end)
  • ones that just force rollback but the workload continues.

The caller will needs to distinguish b/w the two types and also create the log of operations which kinda depends on the guts of this function. So I am leaning towards rolling back in side this function if we have to and the caller just dealing with questions around should we terminate the whole workload.

WDYT?

@ajwerner
Copy link
Copy Markdown
Contributor

So I am leaning towards rolling back in side this function if we have to and the caller just dealing with questions around should we terminate the whole workload.

That sounds good to me.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from f6ddf0d to c442309 Compare May 27, 2020 14:12
@spaskob
Copy link
Copy Markdown
Contributor Author

spaskob commented May 27, 2020

PTAL

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r6, 1 of 2 files at r7.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @spaskob)


pkg/workload/schemachange/schemachange.go, line 247 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I agree about terminating the txn on first error - you are right. As for handling the errors in the caller, here is my point this function has 2 type of errors:

  • ones that force terminating the whole workload:
    • unable to create a new op
    • having encountered a serious error (this one is debatable as we may still want to proceed, save the serious error and report to the user in the end)
  • ones that just force rollback but the workload continues.

The caller will needs to distinguish b/w the two types and also create the log of operations which kinda depends on the guts of this function. So I am leaning towards rolling back in side this function if we have to and the caller just dealing with questions around should we terminate the whole workload.

WDYT?

👍


pkg/workload/schemachange/schemachange.go, line 293 at r5 (raw file):

			if _, err = tx.Exec(op); err != nil {
				histBin = "txnRbk"
				log.WriteString(fmt.Sprintf("***FAIL: %v\n", err))

Should this ***FAIL be so loud here? Don't we expect these pretty frequently? It is good to log the error, just seems to indicate that

Also, maybe it makes sense to append the ROLLBACK to the logs in run and similarly add COMMIT upon success there as well.

Copy link
Copy Markdown
Contributor Author

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/workload/schemachange/schemachange.go, line 293 at r5 (raw file):

Previously, ajwerner wrote…

Should this ***FAIL be so loud here? Don't we expect these pretty frequently? It is good to log the error, just seems to indicate that

Also, maybe it makes sense to append the ROLLBACK to the logs in run and similarly add COMMIT upon success there as well.

This is at verbosity level 1 and above so only loud if you ask for it - I find it helpful for grepping and classifying the errors. At verbosity 0 only serious errors will be printed.
As for the sites where to ROLLBACK; I've tried both more than once, there's always pros and cons. I'd rather not touch it again.

@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch 4 times, most recently from 5969961 to 415e3b3 Compare May 29, 2020 19:55
@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from 415e3b3 to 168b9b0 Compare June 1, 2020 19:51
Spas Bojanov added 4 commits June 2, 2020 12:08
Previously this workload will not fail even when a schema change
operation returns an error. Now we extract the SQLSTATE code from
the error and if it is of class `09`, i.e. Triggered Action Exception
or `XX`, i.e. Internal Error we report the error to the workload runner
which will fail if `tolerate-errors=false`. This PR also refactors the
`runInTxn` and `run` functions. `runInTxn` now handles txn rollback upon
first error encountered and marks the return errors as fatal, meaning
the workload should be aborted or rollback when rollback was performed.
Additionally schema change operations are times in `runInTxn` without
counting the time for producing a random operation since random operation
creation may require a few failed attempts.

Release note: none.
This is a small refactor so that the roachtest can specify
the concurrency in addition to the max_ops for the
schemachange workload

Release note: none.
This PR adds a simple roachtest that runs the schemachange
workload on a 3-node cluster. The purpose is to detect any
regression in the error reporting of schema changes. Since
the workload still discovers failing errors that will take
some time to be fixed the workload can be run with option
`tolerate-errors=true` to prevent aborting on these errors.

Release note: none.
Release note (bug fix):
If `ctx.err()` is nil the cli will panic when a workload
returns an error.
@spaskob spaskob force-pushed the roachtest-sc-no-uuxxx branch from 168b9b0 to 02aa0c9 Compare June 2, 2020 16:09
@spaskob
Copy link
Copy Markdown
Contributor Author

spaskob commented Jun 2, 2020

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 2, 2020

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: table creation stress test

3 participants