workload/schemachange: concurrent randomized schema change workload #46632
craig[bot] merged 2 commits into cockroachdb:master
Conversation
petermattis
left a comment
Thanks for picking this up, @spaskob.
You should remove the [WIP] from the commit messages.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @spaskob)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
}
// TODO(spaskob): figure out what to do with the error returned.
_ = w.runOps(tx, ops)
Generating the ops to run and then running is a bit different than the previous behavior. With the previous code, the second op could discover a table/column/index that the first op had created. What motivated this part of the change?
pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):
) (name string, err error) {
	r := w.rng.Intn(100)
	if r < pctExisting {
There is a general preference to avoid named return values except when they are necessary. I have a personal preference to early return when possible. I think we could achieve both by doing:
if w.rng.Intn(100) >= pctExisting {
return fmt.Sprintf(...), err
}
// Here goes the previous code.
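The suggested early-return shape can be sketched end to end. This is a hypothetical, self-contained version: the random roll is passed in explicitly instead of coming from `w.rng` so the sketch is deterministic, and the existing-name branch is stubbed.

```go
package main

import "fmt"

// randName is a hypothetical early-return version of the name generator
// from the review comment: no named return values, new-name case first.
func randName(roll, pctExisting, seq int) (string, error) {
	if roll >= pctExisting {
		// Generate a brand-new name.
		return fmt.Sprintf("table%d", seq), nil
	}
	// Here goes the previous code: pick an existing name (stubbed out).
	return "existing_table", nil
}

func main() {
	name, err := randName(80, 50, 7)
	fmt.Println(name, err)
}
```

The same shape transfers to the other rand* methods: test the "generate new" condition up front, return immediately, and leave the existing-object lookup as the unindented tail of the function.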
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @petermattis)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Generating the ops to run and then running is a bit different than the previous behavior. With the previous code, the second op could discover a table/column/index that the first op had created. What motivated this part of the change?
I missed that! The error handling in run is subtle and that was a way to wrap my head around it.
I briefly chatted to @ajwerner yesterday about this workload and he had some requests around better reporting of errors. Here are some different types of errors that can occur:
- from executing an operation.
- from `BEGIN`, `COMMIT` and `ROLLBACK`.
- from `randOp`.
Additionally, the logic around what is printed and in what order was also a bit hard to follow.
I've tried a different refactoring approach (see new commit) - let me know how you like it. I hope it also makes it easier to write a test if we decided one is needed.
We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner, could you please chime in as well?
pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
There is a general preference to avoid named return values except when they are necessary. I have a personal preference to early return when possible. I think we could achieve both by doing:
if w.rng.Intn(100) >= pctExisting {
	return fmt.Sprintf(...), err
}
// Here goes the previous code.
Done.
petermattis
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
The error handling in run is subtle and that was a way to wrap my head around it.
Heh. It sort of grew organically as the PR evolved. I agree it had reached a somewhat convoluted state.
pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Done.
Looks good. Let's follow this pattern for the other rand* methods.
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
// used to generate new unique names. Note that this assumes that no
// other workload is being run at the same time.
// TODO(spaskob): Enable workloads running concurrently.
Can you elaborate on what you're thinking of here? I suspect one workload instance can drive enough schema change concurrency for even a very large cluster.
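The single-instance assumption under discussion can be illustrated with a rough sketch (a hypothetical simplification, not the workload's actual code): if unique names come from an in-process counter, two workload binaries pointed at the same cluster would each start counting from zero and could generate colliding names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// seq is process-local state. A second workload instance against the same
// cluster has its own seq, so both can produce "table1", "table2", ...
var seq atomic.Int64

// newName generates a name that is unique only within this process.
func newName(prefix string) string {
	return fmt.Sprintf("%s%d", prefix, seq.Add(1))
}

func main() {
	fmt.Println(newName("table"))
	fmt.Println(newName("table"))
}
```

Making this safe across instances would mean deriving uniqueness from cluster state (e.g. a sequence in the database) rather than from process memory, which is roughly what the TODO gestures at.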
pkg/workload/schemachange/schemachange.go, line 239 at r4 (raw file):
}

func (w *schemaChangeWorker) runOps(tx *pgx.Tx, ops []string) (err error) {
I believe this method is now unused and can be deleted.
pkg/workload/schemachange/schemachange.go, line 306 at r4 (raw file):
}

func (w *schemaChangeWorker) wrapInTxn(_ context.Context) error {
I'm ok with this structure, though this name doesn't look quite right to me. We're not actually wrapping any of the arguments. How about naming this method `run`, and renaming the existing `run` method to `runInTxn`?
Force-pushed 7661b9e to 4d2b077.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @petermattis)
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Can you elaborate on what you're thinking of here? I suspect one workload instance can drive enough schema change concurrency for even a very large cluster.
Sorry, bad choice of words. I am worried about the workload semantics if, for whatever reason, two workloads are run on the same cluster. It's possible that I am seeing ghosts.
pkg/workload/schemachange/schemachange.go, line 239 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I believe this method is now unused and can be deleted.
Done.
pkg/workload/schemachange/schemachange.go, line 306 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm ok with this structure, though this name doesn't look quite right to me. We're not actually wrapping any of the arguments. How about naming this method `run`, and renaming the existing `run` method to `runInTxn`?
Done.
petermattis
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @spaskob)
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Sorry, bad choice of words. I am worried about the workload semantics if, for whatever reason, two workloads are run on the same cluster. It's possible that I am seeing ghosts.
Got it. Running multiple instances of the same workload on a cluster is pretty rare. We do it for geo-partitioned tpcc and perhaps one or two other tests. Most of the time we run a single workload instance. I think that will be true of this workload. That said, I'm fine with leaving this TODO here.
pkg/workload/schemachange/schemachange.go, line 72 at r5 (raw file):
s.flags.StringVar(&s.dbOverride, `db`, ``,
	`Override for the SQL database to use. If empty, defaults to the generator name`)
s.flags.IntVar(&s.concurrency, `concurrency`, 2*runtime.NumCPU(), /* TODO(spaskob): sensible default? */
I think the sensible default will change over time. `2` seems too low. `2*num-cpu` seems high. Fine with leaving this as-is for now.
pkg/workload/schemachange/schemachange.go, line 171 at r5 (raw file):
}
cfg := workload.MultiConnPoolCfg{
	MaxTotalConnections: s.concurrency * 2, //TODO
Nit: s/TODO/TODO(spaskob)
pkg/workload/schemachange/schemachange.go, line 289 at r5 (raw file):
	}
}
// TODO(spaskob): what to do with the error instead of dropping it?
Some workloads have a --tolerate-errors flag. I didn't add one to this workload because errors are an essential part of the workload. I think logging them is fine.
ajwerner
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @spaskob)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner, could you please chime in as well?
Yes indeed. The long tail work on this workload is to introspect errors and determine whether they're expected or unexpected. Can we add a big TODO about that before merging this? Eventually --tolerate-errors should be added to tolerate unexpected errors, and then unexpected errors should fail the workload.
For example, an attempt to do something we don't support should be swallowed (though if we can detect that, maybe we should just not do it). It will be hard to use this test for anything more than liveness detection until we go through the tedious process of classifying errors.
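One possible shape for that classification is a table keyed on PostgreSQL SQLSTATE codes. Which codes count as "expected" is exactly the tedious part being discussed; the three codes below are illustrative examples, not the real list.

```go
package main

import "fmt"

// expected maps SQLSTATE codes that this hypothetical workload treats as
// an anticipated consequence of concurrent randomized schema changes.
var expected = map[string]bool{
	"42P01": true, // undefined_table: op raced with a concurrent DROP
	"42P07": true, // duplicate_table: op raced with a concurrent CREATE
	"42701": true, // duplicate_column
}

// isExpected reports whether an error with this SQLSTATE should be
// swallowed (logged) rather than failing the workload.
func isExpected(sqlstate string) bool { return expected[sqlstate] }

func main() {
	fmt.Println(isExpected("42P01")) // expected: log and continue
	fmt.Println(isExpected("XX000")) // unexpected: should fail the workload
}
```

Under this scheme, --tolerate-errors would only widen the unexpected bucket's handling, while expected errors are always logged and ignored.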
Force-pushed c0c41de to 88325e1.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
Previously, ajwerner wrote…
We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner, could you please chime in as well?
Yes indeed. The long tail work on this workload is to introspect errors and determine whether they're expected or unexpected. Can we add a big TODO about that before merging this? Eventually --tolerate-errors should be added to tolerate unexpected errors, and then unexpected errors should fail the workload.
For example, an attempt to do something we don't support should be swallowed (though if we can detect that, maybe we should just not do it). It will be hard to use this test for anything more than liveness detection until we go through the tedious process of classifying errors.
Done.
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Got it. Running multiple instances of the same workload on a cluster is pretty rare. We do it for geo-partitioned tpcc and perhaps one or two other tests. Most of the time we run a single workload instance. I think that will be true of this workload. That said, I'm fine with leaving this TODO here.
Done.
pkg/workload/schemachange/schemachange.go, line 72 at r5 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think the sensible default will change over time. `2` seems too low. `2*num-cpu` seems high. Fine with leaving this as-is for now.
Done.
pkg/workload/schemachange/schemachange.go, line 289 at r5 (raw file):
Previously, petermattis (Peter Mattis) wrote…
…I think logging them is fine.
Added a TODO.
petermattis
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.

Release note: None

Release justification: non-production code changes. This is only test
code.
This commit refactors some of the common random choice logic in the
workload and adds more flags to replace most of the hard-coded
constants.

Release justification: non-production code changes

Release note: None
bors r+
Build failed (retrying...)
Build succeeded |
…nced

This commit is to help @spaskob avoid a hang observed while working on turning cockroachdb#46632 (workload/schemachange) into a roachtest. The idea is that as a stopgap the roachtest can issue:

```
SET CLUSTER SETTING sql.lease_manager.remove_lease_once_deferenced = true;
```

The hang was discovered while looking at cockroachdb#46402.

Release note: None
Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.
Touches #23286.
Release note: None
Release justification: non-production code changes. This is only test
code.