
workload/schemachange: concurrent randomized schema change workload #46632

Merged
craig[bot] merged 2 commits into cockroachdb:master from spaskob:rand-sc-wkld on Mar 31, 2020
Conversation

@spaskob
Contributor

@spaskob spaskob commented Mar 26, 2020

Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.

Touches #23286.

Release note: None

Release justification: non-production code changes. This is only test
code.

@cockroach-teamcity
Member

This change is Reviewable

@spaskob changed the title from "Rand sc wkld" to "workload/schemachange: concurrent randomized schema change workload" on Mar 26, 2020
@spaskob changed the title from "workload/schemachange: concurrent randomized schema change workload" to "[WIP] workload/schemachange: concurrent randomized schema change workload" on Mar 26, 2020
Collaborator

@petermattis petermattis left a comment


Thanks for picking this up, @spaskob.

You should remove the [WIP] from the commit messages.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spaskob)


pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):

```go
	}
	// TODO(spaskob): figure out what to do with the error returned.
	_ = w.runOps(tx, ops)
```

Generating the ops to run and then running is a bit different than the previous behavior. With the previous code, the second op could discover a table/column/index that the first op had created. What motivated this part of the change?
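The distinction Peter raises can be sketched with a toy in-memory schema: with interleaved generation, each op is generated against the state left by the previous op, while upfront generation sees only the initial state. All names here (`genOp`, `applyOp`, the `t0` table) are illustrative, not the workload's actual code:

```go
package main

import "fmt"

// schema is a toy stand-in for the catalog: a set of table names.
type schema map[string]bool

// genOp generates a CREATE for a table that doesn't exist yet,
// or an ALTER against one that does.
func genOp(s schema, i int) string {
	name := fmt.Sprintf("t%d", i)
	if s[name] {
		return "ALTER TABLE " + name
	}
	return "CREATE TABLE " + name
}

// applyOp updates the toy schema for CREATE ops.
func applyOp(s schema, op string) {
	var name string
	if _, err := fmt.Sscanf(op, "CREATE TABLE %s", &name); err == nil {
		s[name] = true
	}
}

// interleaved: each op is generated after the previous op was applied,
// so the second op can discover the table the first one created.
func interleaved(n int) []string {
	s := schema{}
	var ops []string
	for i := 0; i < n; i++ {
		op := genOp(s, 0) // always target t0 to make the difference visible
		applyOp(s, op)
		ops = append(ops, op)
	}
	return ops
}

// upfront: all ops are generated against the initial (empty) schema.
func upfront(n int) []string {
	s := schema{}
	var ops []string
	for i := 0; i < n; i++ {
		ops = append(ops, genOp(s, 0))
	}
	return ops
}

func main() {
	fmt.Println(interleaved(2)) // [CREATE TABLE t0 ALTER TABLE t0]
	fmt.Println(upfront(2))     // [CREATE TABLE t0 CREATE TABLE t0]
}
```

With upfront generation, later ops can only reference objects that existed before the batch started, which is the behavioral change being questioned.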


pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):

```go
) (name string, err error) {
	r := w.rng.Intn(100)
	if r < pctExisting {
```

There is a general preference to avoid named return values except when they are necessary. I have a personal preference to early return when possible. I think we could achieve both by doing:

```go
if w.rng.Intn(100) >= pctExisting {
  return fmt.Sprintf(...), err
}

// Here goes the previous code.
```

Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @petermattis)


pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Generating the ops to run and then running is a bit different than the previous behavior. With the previous code, the second op could discover a table/column/index that the first op had created. What motivated this part of the change?

I missed that! The error handling in run is subtle and that was a way to wrap my head around it.
I briefly chatted with @ajwerner yesterday about this workload and he had some requests around better reporting of errors. Here are some different types of errors that can occur:

  1. from executing an operation.
  2. from BEGIN, COMMIT and ROLLBACK.
  3. from randOp.

Additionally, the logic around what is printed and in what order was also a bit hard to follow.
I've tried a different refactoring approach (see the new commit) - let me know how you like it. I hope it also makes it easier to write a test if we decide one is needed.

We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner could you please chime in as well.


pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

There is a general preference to avoid named return values except when they are necessary. I have a personal preference to early return when possible. I think we could achieve both by doing:

```go
if w.rng.Intn(100) >= pctExisting {
  return fmt.Sprintf(...), err
}

// Here goes the previous code.
```

Done.

@spaskob changed the title from "[WIP] workload/schemachange: concurrent randomized schema change workload" to "workload/schemachange: concurrent randomized schema change workload" on Mar 28, 2020
Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)


pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):

The error handling in run is subtle and that was a way to wrap my head around it.

Heh. It sort of grew organically as the PR evolved. I agree it had reached a somewhat convoluted state.


pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Done.

Looks good. Let's follow this pattern for the other rand* methods.


pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):

```go
// used to generate new unique names. Note that this assumes that no
// other workload is being run at the same time.
// TODO(spaskob): Enable workloads running concurrently.
```

Can you elaborate on what you're thinking of here? I suspect one workload instance can drive enough schema change concurrency for even a very large cluster.


pkg/workload/schemachange/schemachange.go, line 239 at r4 (raw file):

```go
}

func (w *schemaChangeWorker) runOps(tx *pgx.Tx, ops []string) (err error) {
```

I believe this method is now unused and can be deleted.


pkg/workload/schemachange/schemachange.go, line 306 at r4 (raw file):

```go
}

func (w *schemaChangeWorker) wrapInTxn(_ context.Context) error {
```

I'm ok with this structure, though this name doesn't look quite right to me. We're not actually wrapping any of the arguments. How about naming this method run, and renaming the existing run method to runInTxn?
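The suggested split gives run the transaction lifecycle and runInTxn the op execution. A hedged sketch of that shape against a minimal transaction interface and an in-memory fake (not the PR's pgx-based code; `txn`, `worker`, and `fakeTx` are all illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// txn abstracts just what this sketch needs from a SQL transaction.
type txn interface {
	Exec(sql string) error
	Commit() error
	Rollback() error
}

type worker struct {
	begin func() (txn, error)
	ops   func() []string
}

// runInTxn executes a batch of generated ops inside an open transaction.
func (w *worker) runInTxn(tx txn) error {
	for _, op := range w.ops() {
		if err := tx.Exec(op); err != nil {
			return err
		}
	}
	return nil
}

// run owns the transaction lifecycle: begin, delegate, commit or roll back.
func (w *worker) run() error {
	tx, err := w.begin()
	if err != nil {
		return err
	}
	if err := w.runInTxn(tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}

// fakeTx records calls so the sketch runs without a database.
type fakeTx struct{ log []string }

func (t *fakeTx) Exec(sql string) error {
	t.log = append(t.log, sql)
	if sql == "BOOM" {
		return errors.New("injected failure")
	}
	return nil
}
func (t *fakeTx) Commit() error   { t.log = append(t.log, "COMMIT"); return nil }
func (t *fakeTx) Rollback() error { t.log = append(t.log, "ROLLBACK"); return nil }

func main() {
	tx := &fakeTx{}
	w := &worker{
		begin: func() (txn, error) { return tx, nil },
		ops:   func() []string { return []string{"CREATE TABLE t", "BOOM"} },
	}
	fmt.Println(w.run(), tx.log)
}
```

The naming makes the layering explicit: callers see only run, and everything below it already holds an open transaction.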

@spaskob force-pushed the rand-sc-wkld branch 2 times, most recently from 7661b9e to 4d2b077 on March 28, 2020 17:28
Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @petermattis)


pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you elaborate on what you're thinking of here? I suspect one workload instance can drive enough schema change concurrency for even a very large cluster.

Sorry, bad choice of words: I am worried about the workload semantics if, for whatever reason, two workloads are run on the same cluster. It's possible that I am seeing ghosts.


pkg/workload/schemachange/schemachange.go, line 239 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I believe this method is now unused and can be deleted.

Done.


pkg/workload/schemachange/schemachange.go, line 306 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm ok with this structure, though this name doesn't look quite right to me. We're not actually wrapping any of the arguments. How about naming this method run, and renaming the existing run method to runInTxn?

Done.

Collaborator

@petermattis petermattis left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @spaskob)


pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

Sorry, bad choice of words: I am worried about the workload semantics if, for whatever reason, two workloads are run on the same cluster. It's possible that I am seeing ghosts.

Got it. Running multiple instances of the same workload on a cluster is pretty rare. We do it for geo-partitioned tpcc and perhaps one or two other tests. Most of the time we run a single workload instance. I think that will be true of this workload. That said, I'm fine with leaving this TODO here.


pkg/workload/schemachange/schemachange.go, line 72 at r5 (raw file):

```go
		s.flags.StringVar(&s.dbOverride, `db`, ``,
			`Override for the SQL database to use. If empty, defaults to the generator name`)
		s.flags.IntVar(&s.concurrency, `concurrency`, 2*runtime.NumCPU(), /* TODO(spaskob): sensible default? */
```

I think the sensible default will change over time. 2 seems too low. 2*num-cpu seems high. Fine with leaving this as-is for now.


pkg/workload/schemachange/schemachange.go, line 171 at r5 (raw file):

```go
	}
	cfg := workload.MultiConnPoolCfg{
		MaxTotalConnections: s.concurrency * 2, //TODO
```

Nit: s/TODO/TODO(spaskob)


pkg/workload/schemachange/schemachange.go, line 289 at r5 (raw file):

```go
		}
	}
	// TODO(spaskob): what to do with the error instead of dropping it?
```

Some workloads have a --tolerate-errors flag. I didn't add one to this workload because errors are an essential part of the workload. I think logging them is fine.

Contributor

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @spaskob)


pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):

We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner could you please chime in as well.

Yes indeed. The long-tail work on this workload is to introspect errors and determine whether they're expected or unexpected. Can we add a big TODO about that before merging this? Eventually a --tolerate-errors flag should be added to tolerate unexpected errors, and then unexpected errors should fail the workload.

For example, an attempt to do something we don't support should be swallowed (though if we can detect that, maybe we should just not do it). It will be hard to use this test for anything more than liveness detection until we go through the tedious process of classifying errors.
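The classification ajwerner describes could start from SQLSTATE codes on the returned errors. A hypothetical sketch; the code list and the `classify` helper are assumptions for illustration, not what the workload ended up implementing:

```go
package main

import "fmt"

// expectedCodes lists SQLSTATE codes the workload provokes on purpose:
// 42P01 undefined_table, 42P07 duplicate_table, 42701 duplicate_column.
var expectedCodes = map[string]bool{
	"42P01": true,
	"42P07": true,
	"42701": true,
}

// classify buckets a SQLSTATE code: expected errors get logged and the
// workload continues; unexpected ones would eventually fail the run.
func classify(sqlstate string) string {
	if expectedCodes[sqlstate] {
		return "expected"
	}
	return "unexpected"
}

func main() {
	fmt.Println(classify("42P01")) // undefined_table: expected
	fmt.Println(classify("XX000")) // internal_error: unexpected
}
```

A real classifier would also need to consider the operation that produced the error, since the same code can be expected for one op and a bug for another.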

@spaskob force-pushed the rand-sc-wkld branch 2 times, most recently from c0c41de to 88325e1 on March 30, 2020 17:13
Contributor Author

@spaskob spaskob left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)


pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):

Previously, ajwerner wrote…

We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner could you please chime in as well.

Yes indeed. The long-tail work on this workload is to introspect errors and determine whether they're expected or unexpected. Can we add a big TODO about that before merging this? Eventually a --tolerate-errors flag should be added to tolerate unexpected errors, and then unexpected errors should fail the workload.

For example, an attempt to do something we don't support should be swallowed (though if we can detect that, maybe we should just not do it). It will be hard to use this test for anything more than liveness detection until we go through the tedious process of classifying errors.

Done.


pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Got it. Running multiple instances of the same workload on a cluster is pretty rare. We do it for geo-partitioned tpcc and perhaps one or two other tests. Most of the time we run a single workload instance. I think that will be true of this workload. That said, I'm fine with leaving this TODO here.

Done.


pkg/workload/schemachange/schemachange.go, line 72 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think the sensible default will change over time. 2 seems too low. 2*num-cpu seems high. Fine with leaving this as-is for now.

Done.


pkg/workload/schemachange/schemachange.go, line 289 at r5 (raw file):

Fine, added a TODO.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

petermattis and others added 2 commits March 30, 2020 13:55
Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.

Release note: None

Release justification: non-production code changes. This is only test
code.
This commit refactors some of the common random choice
logic in the workload and adds more flags to replace
most of the hard-coded constants.

Release justification: non-production code changes

Release note: None
@spaskob
Contributor Author

spaskob commented Mar 30, 2020

bors r+

@craig
Contributor

craig bot commented Mar 30, 2020

Build failed (retrying...)

@craig
Contributor

craig bot commented Mar 31, 2020

Build succeeded

@craig craig bot merged commit 6cecad5 into cockroachdb:master Mar 31, 2020
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Apr 9, 2020
…nced

This commit is to help @spaskob avoid a hang observed while working on
turning cockroachdb#46632 (workload/schemachange) into a roachtest. The idea is
that as a stopgap the roachtest can issue:

```
SET CLUSTER SETTING sql.lease_manager.remove_lease_once_deferenced = true;
```

The hang was discovered while looking at cockroachdb#46402.

Release note: None