workload/schemachange: concurrent randomized schema change workload #46632
craig[bot] merged 2 commits into cockroachdb:master
Conversation
petermattis
left a comment
Thanks for picking this up, @spaskob.
You should remove the [WIP] from the commit messages.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @spaskob)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
}
// TODO(spaskob): figure out what to do with the error returned.
_ = w.runOps(tx, ops)
Generating the ops to run and then running is a bit different than the previous behavior. With the previous code, the second op could discover a table/column/index that the first op had created. What motivated this part of the change?
pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):
) (name string, err error) {
	r := w.rng.Intn(100)
	if r < pctExisting {
There is a general preference to avoid named return values except when they are necessary. I have a personal preference to early return when possible. I think we could achieve both by doing:
if w.rng.Intn(100) >= pctExisting {
return fmt.Sprintf(...), err
}
// Here goes the previous code.
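The suggested early-return shape can be sketched end to end. This is a hypothetical, self-contained version: the random roll is passed in explicitly instead of coming from `w.rng` so the sketch is deterministic, and the existing-name branch is stubbed.

```go
package main

import "fmt"

// randName is a hypothetical early-return version of the name generator
// from the review comment: no named return values, new-name case first.
func randName(roll, pctExisting, seq int) (string, error) {
	if roll >= pctExisting {
		// Generate a brand-new name.
		return fmt.Sprintf("table%d", seq), nil
	}
	// Here goes the previous code: pick an existing name (stubbed out).
	return "existing_table", nil
}

func main() {
	name, err := randName(80, 50, 7)
	fmt.Println(name, err)
}
```

The same shape transfers to the other rand* methods: test the "generate new" condition up front, return immediately, and leave the existing-object lookup as the unindented tail of the function.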
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @petermattis)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Generating the ops to run and then running is a bit different than the previous behavior. With the previous code, the second op could discover a table/column/index that the first op had created. What motivated this part of the change?
I missed that! The error handling in run is subtle and that was a way to wrap my head around it.
I briefly chatted to @ajwerner yesterday about this workload and he had some requests around better reporting of errors. Here are some different types of errors that can occur:
- from executing an operation.
- from `BEGIN`, `COMMIT` and `ROLLBACK`.
- from `randOp`.
Additionally, the logic around what is printed and in what order was also a bit hard to follow.
I've tried a different refactoring approach (see new commit) - let me know how you like it. I hope it also makes it easier to write a test if we decided one is needed.
We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner, could you please chime in as well?
pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
There is a general preference to avoid named return values except when they are necessary. I have a personal preference to early return when possible. I think we could achieve both by doing:
if w.rng.Intn(100) >= pctExisting {
	return fmt.Sprintf(...), err
}
// Here goes the previous code.
Done.
petermattis
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @spaskob)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
The error handling in run is subtle and that was a way to wrap my head around it.
Heh. It sort of grew organically as the PR evolved. I agree it had reached a somewhat convoluted state.
pkg/workload/schemachange/schemachange.go, line 741 at r2 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Done.
Looks good. Let's follow this pattern for the other rand* methods.
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
// used to generate new unique names. Note that this assumes that no
// other workload is being run at the same time.
// TODO(spaskob): Enable workloads running concurrently.
Can you elaborate on what you're thinking of here? I suspect one workload instance can drive enough schema change concurrency for even a very large cluster.
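The single-instance assumption under discussion can be illustrated with a rough sketch (a hypothetical simplification, not the workload's actual code): if unique names come from an in-process counter, two workload binaries pointed at the same cluster would each start counting from zero and could generate colliding names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// seq is process-local state. A second workload instance against the same
// cluster has its own seq, so both can produce "table1", "table2", ...
var seq atomic.Int64

// newName generates a name that is unique only within this process.
func newName(prefix string) string {
	return fmt.Sprintf("%s%d", prefix, seq.Add(1))
}

func main() {
	fmt.Println(newName("table"))
	fmt.Println(newName("table"))
}
```

Making this safe across instances would mean deriving uniqueness from cluster state (e.g. a sequence in the database) rather than from process memory, which is roughly what the TODO gestures at.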
pkg/workload/schemachange/schemachange.go, line 239 at r4 (raw file):
}

func (w *schemaChangeWorker) runOps(tx *pgx.Tx, ops []string) (err error) {
I believe this method is now unused and can be deleted.
pkg/workload/schemachange/schemachange.go, line 306 at r4 (raw file):
}

func (w *schemaChangeWorker) wrapInTxn(_ context.Context) error {
I'm ok with this structure, though this name doesn't look quite right to me. We're not actually wrapping any of the arguments. How about naming this method `run`, and renaming the existing `run` method to `runInTxn`?
Force-pushed 7661b9e to 4d2b077.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @petermattis)
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Can you elaborate on what you're thinking of here? I suspect one workload instance can drive enough schema change concurrency for even a very large cluster.
Sorry, bad choice of words. I am worried about the workload semantics if, for whatever reason, two workloads are run on the same cluster. It's possible that I am seeing ghosts.
pkg/workload/schemachange/schemachange.go, line 239 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I believe this method is now unused and can be deleted.
Done.
pkg/workload/schemachange/schemachange.go, line 306 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm ok with this structure, though this name doesn't look quite right to me. We're not actually wrapping any of the arguments. How about naming this method `run`, and renaming the existing `run` method to `runInTxn`?
Done.
petermattis
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @spaskob)
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
Sorry, bad choice of words. I am worried about the workload semantics if, for whatever reason, two workloads are run on the same cluster. It's possible that I am seeing ghosts.
Got it. Running multiple instances of the same workload on a cluster is pretty rare. We do it for geo-partitioned tpcc and perhaps one or two other tests. Most of the time we run a single workload instance. I think that will be true of this workload. That said, I'm fine with leaving this TODO here.
pkg/workload/schemachange/schemachange.go, line 72 at r5 (raw file):
s.flags.StringVar(&s.dbOverride, `db`, ``,
	`Override for the SQL database to use. If empty, defaults to the generator name`)
s.flags.IntVar(&s.concurrency, `concurrency`, 2*runtime.NumCPU(), /* TODO(spaskob): sensible default? */
I think the sensible default will change over time. `2` seems too low. `2*num-cpu` seems high. Fine with leaving this as-is for now.
pkg/workload/schemachange/schemachange.go, line 171 at r5 (raw file):
}
cfg := workload.MultiConnPoolCfg{
	MaxTotalConnections: s.concurrency * 2, //TODO
Nit: s/TODO/TODO(spaskob)
pkg/workload/schemachange/schemachange.go, line 289 at r5 (raw file):
	}
}
// TODO(spaskob): what to do with the error instead of dropping it?
Some workloads have a --tolerate-errors flag. I didn't add one to this workload because errors are an essential part of the workload. I think logging them is fine.
ajwerner
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @spaskob)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner, could you please chime in as well?
Yes indeed. The long tail work on this workload is to introspect errors and determine whether they're expected or unexpected. Can we add a big TODO about that before merging this? Eventually --tolerate-errors should be added to tolerate unexpected errors, and then unexpected errors should fail the workload.
For example, an attempt to do something we don't support should be swallowed (though if we can detect that, maybe we should just not do it). It will be hard to use this test for anything more than liveness detection until we go through the tedious process of classifying errors.
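One possible shape for that classification is a table keyed on PostgreSQL SQLSTATE codes. Which codes count as "expected" is exactly the tedious part being discussed; the three codes below are illustrative examples, not the real list.

```go
package main

import "fmt"

// expected maps SQLSTATE codes that this hypothetical workload treats as
// an anticipated consequence of concurrent randomized schema changes.
var expected = map[string]bool{
	"42P01": true, // undefined_table: op raced with a concurrent DROP
	"42P07": true, // duplicate_table: op raced with a concurrent CREATE
	"42701": true, // duplicate_column
}

// isExpected reports whether an error with this SQLSTATE should be
// swallowed (logged) rather than failing the workload.
func isExpected(sqlstate string) bool { return expected[sqlstate] }

func main() {
	fmt.Println(isExpected("42P01")) // expected: log and continue
	fmt.Println(isExpected("XX000")) // unexpected: should fail the workload
}
```

Under this scheme, --tolerate-errors would only widen the unexpected bucket's handling, while expected errors are always logged and ignored.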
Force-pushed c0c41de to 88325e1.
spaskob
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)
pkg/workload/schemachange/schemachange.go, line 303 at r2 (raw file):
Previously, ajwerner wrote…
We need to decide, from the workload users' point of view, how to communicate the different types of errors and when an error should abort the workload. @ajwerner, could you please chime in as well?
Yes indeed. The long tail work on this workload is to introspect errors and determine whether they're expected or unexpected. Can we add a big TODO about that before merging this? Eventually --tolerate-errors should be added to tolerate unexpected errors, and then unexpected errors should fail the workload.
For example, an attempt to do something we don't support should be swallowed (though if we can detect that, maybe we should just not do it). It will be hard to use this test for anything more than liveness detection until we go through the tedious process of classifying errors.
Done.
pkg/workload/schemachange/schemachange.go, line 207 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Got it. Running multiple instances of the same workload on a cluster is pretty rare. We do it for geo-partitioned tpcc and perhaps one or two other tests. Most of the time we run a single workload instance. I think that will be true of this workload. That said, I'm fine with leaving this TODO here.
Done.
pkg/workload/schemachange/schemachange.go, line 72 at r5 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think the sensible default will change over time. `2` seems too low. `2*num-cpu` seems high. Fine with leaving this as-is for now.
Done.
pkg/workload/schemachange/schemachange.go, line 289 at r5 (raw file):
Previously, petermattis (Peter Mattis) wrote…
…I think logging them is fine.
Added a TODO.
petermattis
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.

Release note: None

Release justification: non-production code changes. This is only test
code.
This commit refactors some of the common random choice logic in the
workload and adds more flags to replace most of the hard-coded
constants.

Release justification: non-production code changes

Release note: None
bors r+
Build failed (retrying...)
Build succeeded |
…nced

This commit is to help @spaskob avoid a hang observed while working on turning cockroachdb#46632 (workload/schemachange) into a roachtest. The idea is that as a stopgap the roachtest can issue:

```
SET CLUSTER SETTING sql.lease_manager.remove_lease_once_deferenced = true;
```

The hang was discovered while looking at cockroachdb#46402.

Release note: None
Randomly generate (concurrent) schema changes. The tables intentionally
contain no actual data as the focus here is on stressing the machinery
around schema changes, not the machinery around backfills.
Touches #23286.
Release note: None
Release justification: non-production code changes. This is only test
code.