changefeedccl: roachtest refactor and initial-scan-only#91394
changefeedccl: roachtest refactor and initial-scan-only#91394craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
pkg/cmd/roachtest/tests/cdc.go
Outdated
| WithPrometheusNode(t.workloadNode.InstallNodes()[0]). | ||
| WithCluster(t.crdbNodes.InstallNodes()). | ||
| WithNodeExporter(t.crdbNodes.InstallNodes()). | ||
| WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json") |
There was a problem hiding this comment.
This is just a simple one I threw together right now, will be tweaking it prior to merge and then I'll put it in as a proper file in this repository and link to that, just wanted to get this up for review.
b7d169b to
310f441
Compare
pkg/cmd/roachtest/tests/cdc.go
Outdated
| WithPrometheusNode(t.workloadNode.InstallNodes()[0]). | ||
| WithCluster(t.crdbNodes.InstallNodes()). | ||
| WithNodeExporter(t.crdbNodes.InstallNodes()). | ||
| WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json") |
310f441 to
9ca758b
Compare
renatolabs
left a comment
There was a problem hiding this comment.
🎉
Thanks for refactoring, this looks great! Minor comments and questions.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, and @samiskin)
pkg/cmd/roachtest/tests/cdc.go line 114 at r3 (raw file):
Timer: Periodic{Period: 2 * time.Minute, DownTime: 20 * time.Second}, Target: t.crdbNodes.RandNode, Stopper: chaosStopper,
Idea: how about randomly setting the DrainAndQuit option here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).
pkg/cmd/roachtest/tests/cdc.go line 190 at r3 (raw file):
// TolerateErrors if crdbChaos is true; otherwise, the workload will fail // if it attempts to use the node which was brought down by chaos. tolerateErrors: t.tolerateNodeErrors,
Comment on API: the relationship between startCRDBChaos and this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way, tolerateErrors as a parameter to here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.
pkg/cmd/roachtest/tests/cdc.go line 200 at r3 (raw file):
} if args.duration != "" {
Is this what you intended? Only run the workload if a duration is passed and print a skip message if not passed? Looks confusing.
pkg/cmd/roachtest/tests/cdc.go line 242 at r3 (raw file):
func (t *cdcTester) DB() *gosql.DB { return t.cluster.Conn(t.ctx, t.rt.L(), 1)
Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.
pkg/cmd/roachtest/tests/cdc.go line 304 at r3 (raw file):
feedOptions := make(map[string]string) feedOptions["min_checkpoint_frequency"] = "'10s'" if args.sinkType == cloudStorageSink || args.sinkType == webhookSink {
I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why 10s is a good setting for resolved for cloud/webhook sinks, but an empty resolved option is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.
pkg/cmd/roachtest/tests/cdc.go line 751 at r3 (raw file):
} func runCDCKafkaAuth(ctx context.Context, rt test.Test, c cluster.Cluster) {
Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses t for the test.Test struct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what the r stands for in rt
9ca758b to
c9ad50e
Compare
samiskin
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, @miretskiy, and @renatolabs)
pkg/cmd/roachtest/tests/cdc.go line 438 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
ack.
Updated this with a go.crdb.dev url that we can freely change.
pkg/cmd/roachtest/tests/cdc.go line 121 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
can we switch to
switchinstead?
Done.
pkg/cmd/roachtest/tests/cdc.go line 204 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
wdyt about moving these if/else branches to their own helpers (e.g. runLedgerWorkload, runTPCCWorkload)?
Done.
pkg/cmd/roachtest/tests/cdc.go line 114 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Idea: how about randomly setting the
DrainAndQuitoption here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).
I thought about it but I wanted to leave this PR as refactor-only-no-functionality-change as possible at first so that if we do start seeing failures I know its a roachtest problem rather than a changefeed problem.
pkg/cmd/roachtest/tests/cdc.go line 190 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Comment on API: the relationship between
startCRDBChaosand this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way,tolerateErrorsas a parameter to here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.
Good point 👍, I made tolerateErrors just be an argument for the feed job / workload
pkg/cmd/roachtest/tests/cdc.go line 200 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Is this what you intended? Only run the workload if a duration is passed and print a
skipmessage if not passed? Looks confusing.
Yep, since for initial scan tests we don't necessarily care about a workload running, only the initial fixture.
pkg/cmd/roachtest/tests/cdc.go line 242 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.
Same as above I want to leave functionality unchanged for this pr to make it easier to isolate the reason for a test failure.
pkg/cmd/roachtest/tests/cdc.go line 304 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why
10sis a good setting forresolvedfor cloud/webhook sinks, but an emptyresolvedoption is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.
Not exactly sure why the 10s was picked, but I added a comment about envelope=wrapped. We could rethink these but for now leaving all functionality unchanged.
pkg/cmd/roachtest/tests/cdc.go line 751 at r3 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses
tfor thetest.Teststruct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what therstands for inrt
I had rt for roachtest since its from the roachtest package. I changed it to ct for cdcTester and t for roachtest 👍, forgot to consider folks like test eng that would be dealing with multiple team's roachtests.
|
there ended up being a few failures after running all the roachtests manually that I couldn't fully resolve in time so this'll have to wait until I return from PTO next tuesday |
jayshrivastava
left a comment
There was a problem hiding this comment.
I don't think the debug flag works as intended. I tried roachtest run cdc/initial-scan-only --debug --cockroach=artifacts/cockroach and the grafana dashboard dies after the test. I think this is because the nodes shut down.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @miretskiy, @renatolabs, and @samiskin)
|
@jayshrivastava try naming the cluster |
4b883c6 to
79c9ab7
Compare
|
bors r+ |
|
🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set. |
376f780 to
9d84e25
Compare
Changefeed roachtests were setup focused on running a workload for a specific duration and then quitting, making it difficult to run an `initial_scan_only` test that terminated upon Job success. We as a team have also noticed a greater need to test and observe changefeeds running in production against real sinks to catch issues we are unable to mock or observe from simple unit tests. This is currently a notable hassle as one has to set up each individual sink and run them, ensure the changefeed is pointing to the right URI, and then be able to monitor the metrics of this long running process. This change refactors the cdcBasicTest into distinct pieces that are then put together in a test. This allows for easier experimentation with live tests, allowing us to spin up a cluster and a workload, run one or more changefeeds on it, set up a poller to print out job details,have an accessible grafana URL to view metrics, and wait for some completion condition. Changing the specialized `runCDCKafkaAuth`, `runCDCBank`, and `runCDCSchemaRegistry` functions were left out of scope for this first big change. The main APIs involved in basic roachtests are now: - `newCDCTester`: This creates a tester struct to run the rest of the APIs and initializes the database - `tester.runTPCCWorkload(tpccArgs)`: Starts a TPCC workload from the last node in the cluster - `tester.runLedgerWorkload(ledgerArgs)`: Starts a Ledger workload from the last node in the cluster - `tester.newChangefeed(feedArgs)`: starts a new changefeed on the cluster and returns `changefeedJob` object - `tester.runFeedLatencyVerifier(changefeedJob, latencyTargets)`: starts a routine that monitors the changefeed latency until the tester is `Close`'d - `tester.waitForWorkload`: waits for a workload started by `setupAndRunWorkload` to complete its duration - `changefeedJob.waitForCompletion`: waits for a changefeed to complete (either success or failure) - `tester.startCRDBChaos`: This starts a Chaos routine that periodically shuts nodes down and brings them back up APIs that are going to be more useful for experimentation are: - `tester.startGrafana`: Sets up a grafana instance on the last node of the cluster and prints out a link to a grafana dashboard with some basic changefeed metrics - `changefeedJob.runFeedPoller(ctx, stopper, onInfo)`: runs a given callback every second with the changefeed info Roachtests can be ran locally with the `--local` flag or on an existing cluster without destroying it afterwards with `--cluster="my-cluter" --debug` Ex: After adding a new test (lets say "cdc/my-test") to the registerCDC function you can keep running ``` ./dev build cockroach --cross # if changes made to crdb ./dev build roachtest # if changes made to the test ./bin/roachtest run cdc/my-test --cluster="my-cluster" --debug ``` as you try out different changes or options. If you want to try a set of steps against different versions of the app you could download those binaries and use the --cockroach="path-to-binary" flag to test against those instead. If you want to set up a large TPCC database on a cluster and reuse it for tests this can be done with roachtests's --wipe and --skip-init flags. Release note: None
9d84e25 to
46c75a5
Compare
|
bors r+ |
|
Build succeeded: |
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-19057
Changefeed roachtests were setup focused on running a workload for a specific duration and then quitting, making it difficult to run an
initial_scan_onlytest that terminated upon Job success.We as a team have also noticed a greater need to test and observe changefeeds running in production against real sinks to catch issues we are unable to mock or observe from simple unit tests. This is currently a notable hassle as one has to set up each individual sink and run them, ensure the changefeed is pointing to the right URI, and then be able to monitor the metrics of this long running process.
This change refactors the cdcBasicTest into distinct pieces that are then put together in a test. This allows for easier experimentation with live tests, allowing us to spin up a cluster and a workload, run one or more changefeeds on it, set up a poller to print out job details,have an accessible grafana URL to view metrics, and wait for some completion condition.
Changing the specialized
runCDCKafkaAuth,runCDCBank, andrunCDCSchemaRegistryfunctions were left out of scope for this first big change.The main APIs involved in basic roachtests are now:
newCDCTester: This creates a tester struct to run the rest of the APIs and initializes the databasetester.runTPCCWorkload(tpccArgs): Starts a TPCC workload from the last node in the clustertester.runLedgerWorkload(ledgerArgs): Starts a Ledger workload from the last node in the clustertester.runFeedLatencyVerifier(changefeedJob, latencyTargets): starts a routine that monitors the changefeed latency until the tester isClose'dtester.waitForWorkload: waits for a workload started bysetupAndRunWorkloadto complete its durationtester.startCRDBChaos: This starts a Chaos routine that periodically shuts nodes down and brings them back uptester.newChangefeed(feedArgs): starts a new changefeed on the cluster and returnschangefeedJobobjectchangefeedJob.waitForCompletion: waits for a changefeed to complete (either success or failure)tester.startGrafana: Sets up a grafana instance on the last node of the cluster and prints out a link to a grafana, this automatically runs unless--skip-initis provided. If--debugis not used,StopGrafanawill be called on test teardown to publish prometheus metrics to the artifacts directory.An API that is going to be more useful for experimentation are:
changefeedJob.runFeedPoller(ctx, stopper, onInfo): runs a given callback every second with the changefeed infoRoachtests can be ran locally with the
--localflag or on an existing cluster without destroying it afterwards with--cluster="my-cluster" --debugEx: After adding a new test (lets say
"cdc/my-test") to theregisterCDCfunction you can keep runningas you try out different changes or options. If you want to try a set of steps against different versions of the app you could download those binaries and use the
--cockroach="path-to-binary"flag to test against those instead.If you want to set up a large TPCC database on a cluster and reuse it for tests this can be done with roachtests's
--wipeand--skip-initflags.Release note: None