changefeedccl: roachtest refactor and initial-scan-only by samiskin · Pull Request #91394 · cockroachdb/cockroach

samiskin · 2022-11-07T12:46:58Z

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-19057

Changefeed roachtests were setup focused on running a workload for a specific duration and then quitting, making it difficult to run an initial_scan_only test that terminated upon Job success.

We as a team have also noticed a greater need to test and observe changefeeds running in production against real sinks to catch issues we are unable to mock or observe from simple unit tests. This is currently a notable hassle as one has to set up each individual sink and run them, ensure the changefeed is pointing to the right URI, and then be able to monitor the metrics of this long running process.

This change refactors the cdcBasicTest into distinct pieces that are then put together in a test. This allows for easier experimentation with live tests, allowing us to spin up a cluster and a workload, run one or more changefeeds on it, set up a poller to print out job details,have an accessible grafana URL to view metrics, and wait for some completion condition.

Changing the specialized runCDCKafkaAuth, runCDCBank, and runCDCSchemaRegistry functions were left out of scope for this first big change.

The main APIs involved in basic roachtests are now:

newCDCTester: This creates a tester struct to run the rest of the APIs and initializes the database
tester.runTPCCWorkload(tpccArgs): Starts a TPCC workload from the last node in the cluster
tester.runLedgerWorkload(ledgerArgs): Starts a Ledger workload from the last node in the cluster
tester.runFeedLatencyVerifier(changefeedJob, latencyTargets): starts a routine that monitors the changefeed latency until the tester is Close'd
tester.waitForWorkload: waits for a workload started by setupAndRunWorkload to complete its duration
tester.startCRDBChaos: This starts a Chaos routine that periodically shuts nodes down and brings them back up
tester.newChangefeed(feedArgs): starts a new changefeed on the cluster and returns changefeedJob object
changefeedJob.waitForCompletion: waits for a changefeed to complete (either success or failure)
tester.startGrafana: Sets up a grafana instance on the last node of the cluster and prints out a link to a grafana, this automatically runs unless --skip-init is provided. If --debug is not used, StopGrafana will be called on test teardown to publish prometheus metrics to the artifacts directory.

An API that is going to be more useful for experimentation are:

changefeedJob.runFeedPoller(ctx, stopper, onInfo): runs a given callback every second with the changefeed info

Roachtests can be ran locally with the --local flag or on an existing cluster without destroying it afterwards with --cluster="my-cluster" --debug

Ex: After adding a new test (lets say "cdc/my-test") to the registerCDC function you can keep running

./dev build cockroach --cross # if changes made to crdb
./dev build roachtest         # if changes made to the test

./bin/roachtest run cdc/my-test --cluster="my-cluster" --debug

as you try out different changes or options. If you want to try a set of steps against different versions of the app you could download those binaries and use the --cockroach="path-to-binary" flag to test against those instead.

If you want to set up a large TPCC database on a cluster and reuse it for tests this can be done with roachtests's --wipe and --skip-init flags.

Release note: None

cockroach-teamcity · 2022-11-07T12:47:07Z

This change is

samiskin · 2022-11-07T13:06:10Z

pkg/cmd/roachtest/tests/cdc.go

+		WithPrometheusNode(t.workloadNode.InstallNodes()[0]).
+		WithCluster(t.crdbNodes.InstallNodes()).
+		WithNodeExporter(t.crdbNodes.InstallNodes()).
+		WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json")


This is just a simple one I threw together right now, will be tweaking it prior to merge and then I'll put it in as a proper file in this repository and link to that, just wanted to get this up for review.

miretskiy

all nits

pkg/cmd/roachtest/tests/cdc.go

pkg/cmd/roachtest/tests/mixed_version_cdc.go

pkg/cmd/roachtest/tests/cdc.go

miretskiy · 2022-11-07T14:15:32Z

pkg/cmd/roachtest/tests/cdc.go

+		WithPrometheusNode(t.workloadNode.InstallNodes()[0]).
+		WithCluster(t.crdbNodes.InstallNodes()).
+		WithNodeExporter(t.crdbNodes.InstallNodes()).
+		WithGrafanaDashboard("https://gist.githubusercontent.com/samiskin/75b10ccd4a08280cf1f5d20e916188ae/raw/a5556a5c0c3d82b3449f3140107366995fa16de3/dashboard.json")


renatolabs

🎉

Thanks for refactoring, this looks great! Minor comments and questions.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, and @samiskin)

pkg/cmd/roachtest/tests/cdc.go line 114 at r3 (raw file):

		Timer:   Periodic{Period: 2 * time.Minute, DownTime: 20 * time.Second},
		Target:  t.crdbNodes.RandNode,
		Stopper: chaosStopper,

Idea: how about randomly setting the DrainAndQuit option here? Right now the chaos loop will always force-kill nodes; randomly letting them drain could expose bugs (as has been the case in the past in the cdc/mixed-versions test).

pkg/cmd/roachtest/tests/cdc.go line 190 at r3 (raw file):

		// TolerateErrors if crdbChaos is true; otherwise, the workload will fail
		// if it attempts to use the node which was brought down by chaos.
		tolerateErrors: t.tolerateNodeErrors,

Comment on API: the relationship between startCRDBChaos and this function is subtle. It's not clear (especially to the caller) that you are supposed to call the chaos function first so that this tpcc function will know that it should tolerate errors. IMO, a better API would be to pass, in some way, tolerateErrors as a parameter to here. This makes things more explicit and makes it possible to run the workload before starting the chaos loop.

pkg/cmd/roachtest/tests/cdc.go line 200 at r3 (raw file):

	}

	if args.duration != "" {

Is this what you intended? Only run the workload if a duration is passed and print a skip message if not passed? Looks confusing.

pkg/cmd/roachtest/tests/cdc.go line 242 at r3 (raw file):

func (t *cdcTester) DB() *gosql.DB {
	return t.cluster.Conn(t.ctx, t.rt.L(), 1)

Since, IMO, one of the goals here is to provide a struct that automatically gives you good defaults, we could take the chance to select a random node when a DB connection is needed.

pkg/cmd/roachtest/tests/cdc.go line 304 at r3 (raw file):

	feedOptions := make(map[string]string)
	feedOptions["min_checkpoint_frequency"] = "'10s'"
	if args.sinkType == cloudStorageSink || args.sinkType == webhookSink {

I'm curious if we still think these defaults make sense. Should we really be setting these options based on the sink? I'm not from the CDC team, but I wouldn't know why 10s is a good setting for resolved for cloud/webhook sinks, but an empty resolved option is a good default otherwise. Should we be explicit? If this is indeed better, maybe we could at least have a comment explaining why the distinction exists and why the defaults are sensible.

pkg/cmd/roachtest/tests/cdc.go line 751 at r3 (raw file):

}

func runCDCKafkaAuth(ctx context.Context, rt test.Test, c cluster.Cluster) {

Nit: why did we change the parameter name here and in other tests? I believe pretty much every other roachtest uses t for the test.Test struct. Keeping it the same, in my opinion, reduces the cognitive burden. I'm also not sure what the r stands for in rt

samiskin

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB, @jayshrivastava, @miretskiy, and @renatolabs)

pkg/cmd/roachtest/tests/cdc.go line 438 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

ack.

Updated this with a go.crdb.dev url that we can freely change.

pkg/cmd/roachtest/tests/cdc.go line 121 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

can we switch to switch instead?

Done.

pkg/cmd/roachtest/tests/cdc.go line 204 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wdyt about moving these if/else branches to their own helpers (e.g. runLedgerWorkload, runTPCCWorkload)?

Done.

pkg/cmd/roachtest/tests/cdc.go line 114 at r3 (raw file):