kv: Test to measure slowdown after a node restart #95161
craig[bot] merged 1 commit into cockroachdb:master
Conversation
Force-pushed from 55bf2ad to c099442
@irfansharif When you get a chance, can you review this PR? It is a fairly simple test but will be useful to try with the new follower admission control changes. Basically, it shows a 5-10 minute outage after a node goes down and then comes back online. The test is skipped for now, but it can be run on your branch to see if your changes resolve the issue. Let me know if you have any questions about it. Thanks!
LGTM.
[tiny nit] Re: PR+commit title, change it to roachtest: add test to measure throughput impact during node restart. I've historically pointed to the imperative tense we mention over at https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages.
func registerKVRestartImpact(r registry.Registry) {
	r.Add(registry.TestSpec{
		Skip: "#95159",
		Name: "kv/restart/nodes=12",
This is a relatively large ($$$) test with 12 nodes running for 30m. Could we make it smaller? We're using 8KiB writes below with only 50% of the total ops being just writes, but we could use larger writes or more blocks written per write, or a higher percentage of reads.
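One way to act on this suggestion is to tune the kv workload's flags: a heavier write payload or a higher read percentage shifts the load profile without needing 12 nodes. The values below are illustrative only, not the test's actual settings:

```shell
# Hypothetical tuning: heavier writes per op on a smaller cluster.
# --read-percent sets the read/write mix; the block-bytes flags set the
# payload size of each write.
./cockroach workload run kv \
  --read-percent=25 \
  --min-block-bytes=65536 --max-block-bytes=65536 \
  --concurrency=64 \
  --duration=10m \
  'postgresql://root@localhost:26257?sslmode=disable'
```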
pkg/cmd/roachtest/tests/kv.go (Outdated)

	db := c.Conn(ctx, t.L(), 1)
	defer db.Close()

	t.Status("initializing kv dataset <", 3*time.Minute)
[tiny nit] Instead of all the string concatenation here and below in t.Status, use t.Status(fmt.Sprintf("...", ...)).
	m.Go(func(ctx context.Context) error {
		testDurationStr := " --duration=" + testDuration.String()
		concurrency := ifLocal(c, " --concurrency=8", " --concurrency=64")
		// Don't include the last node when starting the workload since it will
Does the last node end up meaningfully acquiring leases for long since no traffic is ever routed to it? What exactly happens if we point to it in the workload generator - I thought it's supposed to be robust enough to the node being shut down.
	setReplicateQueueEnabled(true)
	t.Status("waiting ", duration, " for the workload to finish and measuring the impact of the outage")

	// Wait for IO overload and enough leases to be transferred back.
Perhaps include a stock grafana dashboard so you also get graphs + a prometheus dump for posterity.
	if !c.IsLocal() {
		time.Sleep(3 * time.Minute)
	}
	qpsFinal := measureQPS(ctx, t, db, 5*time.Second)
Here and above, maybe measure it over 30s. This is an unscientific opinion.
After a node is down for a few minutes and then starts up again, there is a slowdown related to it catching up on Raft messages it missed while down. This can cause an IO overload scenario and greatly impact performance on the cluster.

This adds a test for the issue; a separate PR will be created to enable this test and fix the issue.

Informs: cockroachdb#95159
Epic: none
Release note: None
Force-pushed from c099442 to b3ba4a3
andrewbaptist left a comment
TFTR
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/kv.go line 931 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This is a relatively large ($$$) test with 12 nodes running for 30m. Could we make it smaller? We're using 8KiB writes below with only 50% of the total ops being just writes, but we could use larger writes or more blocks written per write, or a higher percentage of reads.
I left this as-is for now since the test is skipped. The problem is that it requires a lot of Raft groups writing at once to overcome the L0 limit. I could lower it to 6 nodes, but then I had to increase the duration of the test. The alternative is to lower the L0 limit (it gets to about 3000 files in this test), but I figured you could tune this once you start working on it.
pkg/cmd/roachtest/tests/kv.go line 949 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[tiny nit] Instead of all the string concatenation here and below in t.Status, use t.Status(fmt.Sprintf("...", ...)).
Done
pkg/cmd/roachtest/tests/kv.go line 959 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Does the last node end up meaningfully acquiring leases for long since no traffic is ever routed to it? What exactly happens if we point to it in the workload generator - I thought it's supposed to be robust enough to the node being shut down.
It does end up acquiring leases today; I think the store rebalancer doesn't take into account where the traffic is coming from. This is a good point to revisit if that assumption changes in the future. The reason I left it this way is that I didn't want the amount of traffic to change because the node was down, or to otherwise affect the results. This could definitely be changed in the future if it makes sense, but it seems less confusing this way.
pkg/cmd/roachtest/tests/kv.go line 1009 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Perhaps include a stock grafana dashboard so you also get graphs + a prometheus dump for posterity.
I did this originally, but have since stopped. All the Grafana data is available from the run at https://grafana.testeng.crdb.io/ even without a per-test dashboard. I think we should change tests to not start Grafana, as it adds time to the run and doesn't add value (and the dashboard goes away after the run unless it is saved). This might be a good topic to discuss at a KV meeting some time, as it seems pretty inconsistent how we are doing it.
pkg/cmd/roachtest/tests/kv.go line 1013 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Here and above, maybe measure it over 30s. This is an unscientific opinion.
The qpsFinal value is always 0 today, so it doesn't really matter; it stays 0 for about 3-4 minutes straight. It would be worth trying some different values, but it is very consistent today (even at 1s there wasn't any variation in the "initial" number).
bors r=irfansharif
Build succeeded: |