kv: Test to measure slowdown after a node restart#95161

Merged
craig[bot] merged 1 commit into cockroachdb:master from
andrewbaptist:230111.repro-startup-load
Jan 19, 2023

Conversation

@andrewbaptist

After a node is down for a few minutes and then starts up again, there is a slowdown related to it catching up on Raft messages it missed while down. This can cause an IO Overload scenario and greatly impact performance on the cluster.

This adds a test for the issue; a separate PR will be created to enable this test and fix the issue.

Informs: #95159

Epic: none
Release note: None

@andrewbaptist andrewbaptist requested a review from a team as a code owner January 12, 2023 17:48
@andrewbaptist andrewbaptist requested review from renatolabs and srosenberg and removed request for a team January 12, 2023 17:49
@cockroach-teamcity
Member

This change is Reviewable

@andrewbaptist andrewbaptist force-pushed the 230111.repro-startup-load branch from 55bf2ad to c099442 Compare January 12, 2023 17:59
@andrewbaptist
Author

@irfansharif When you get a chance, can you review this PR? It is a fairly simple test but will be useful for trying out the new follower admission control changes. Basically, it shows a 5-10 minute outage after a node goes down and then comes back online. The test is skipped for now, but can be run on your branch to see if it resolves the issue.

Let me know if you have any questions about it. Thanks!

Contributor

@irfansharif irfansharif left a comment


LGTM.

[tiny nit] Re: PR+commit title, change it to "roachtest: add test to measure throughput impact during node restart". I've historically pointed to the imperative tense we mention over at https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages.

func registerKVRestartImpact(r registry.Registry) {
	r.Add(registry.TestSpec{
		Skip: "#95159",
		Name: "kv/restart/nodes=12",
Contributor


This is a relatively large ($$$) test with 12 nodes running for 30m. Could we make it smaller? We're using 8KiB writes below with only 50% of the total ops being just writes, but we could use larger writes or more blocks written per write, or a higher percentage of reads.
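To make the suggestion concrete: the `workload run kv` tool accepts flags controlling block size and the read/write mix, so heavier per-op writes (or a higher read percentage) could let the test reach the same IO overload with fewer nodes or less time. A minimal sketch of building such an invocation string follows; `buildKVWorkloadArgs` is a hypothetical helper and the flag values are illustrative, not tuned.

```go
package main

import "fmt"

// buildKVWorkloadArgs sketches how the kv workload invocation could be made
// heavier per op. Flag values here are illustrative, not the ones in the test.
func buildKVWorkloadArgs(concurrency, readPercent, blockBytes int) string {
	return fmt.Sprintf(
		"./workload run kv --read-percent=%d --min-block-bytes=%d --max-block-bytes=%d --concurrency=%d",
		readPercent, blockBytes, blockBytes, concurrency,
	)
}

func main() {
	// 64KiB blocks instead of the test's 8KiB, same 50% read mix.
	fmt.Println(buildKVWorkloadArgs(64, 50, 64<<10))
}
```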

	db := c.Conn(ctx, t.L(), 1)
	defer db.Close()

	t.Status("initializing kv dataset <", 3*time.Minute)
Contributor


[tiny nit] Instead of all the string concatenation here and below in t.Status, use t.Status(fmt.Sprintf("...", ...)).

	m.Go(func(ctx context.Context) error {
		testDurationStr := " --duration=" + testDuration.String()
		concurrency := ifLocal(c, " --concurrency=8", " --concurrency=64")
		// Don't include the last node when starting the workload since it will
Contributor


Does the last node end up meaningfully acquiring leases, given no traffic is ever routed to it? What exactly happens if we point the workload generator at it? I thought it was supposed to be robust to the node being shut down.

	setReplicateQueueEnabled(true)
	t.Status("waiting ", duration, " for the workload to finish and measuring the impact of the outage")

// Wait for IO overload and enough leases to be transferred back.
Contributor


Perhaps include a stock grafana dashboard so you also get graphs + a prometheus dump for posterity.

	if !c.IsLocal() {
		time.Sleep(3 * time.Minute)
	}
	qpsFinal := measureQPS(ctx, t, db, 5*time.Second)
Contributor


Here and above, maybe measure it over 30s. This is an unscientific opinion.

@andrewbaptist andrewbaptist force-pushed the 230111.repro-startup-load branch from c099442 to b3ba4a3 Compare January 18, 2023 23:17
Author

@andrewbaptist andrewbaptist left a comment


TFTR

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/tests/kv.go line 931 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This is a relatively large ($$$) test with 12 nodes running for 30m. Could we make it smaller? We're using 8KiB writes below with only 50% of the total ops being just writes, but we could use larger writes or more blocks written per write, or a higher percentage of reads.

I left this for now as the test is skipped. The problem is that it requires a lot of Raft groups writing at once to overcome the L0 limit. I could lower it to 6 nodes, but then I had to increase the duration of the test. The alternative is to lower the L0 limit (it reaches about 3000 files in this test), but I figured you could experiment with this once you start working on it.


pkg/cmd/roachtest/tests/kv.go line 949 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[tiny nit] Instead of all the string concatenation here and below in t.Status, use t.Status(fmt.Sprintf("...", ...)).

Done


pkg/cmd/roachtest/tests/kv.go line 959 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Does the last node end up meaningfully acquiring leases for long since no traffic is ever routed to it? What exactly happens if we point to it in the workload generator - I thought it's supposed to be robust enough to the node being shut down.

It does end up acquiring leases today; I think the store rebalancer doesn't take into account where the traffic is coming from. This is a good point to revisit if that assumption changes in the future. I left it this way because I didn't want the amount of traffic to change just because the node was down. This could definitely be changed in the future if it makes sense, but it seems less confusing this way.


pkg/cmd/roachtest/tests/kv.go line 1009 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Perhaps include a stock grafana dashboard so you also get graphs + a prometheus dump for posterity.

I did this originally, but have since stopped. All the Grafana data is available from the run at https://grafana.testeng.crdb.io/ even without a Grafana dashboard. I think we should change tests to not start Grafana, as it adds time to the run and doesn't add value (and it goes away after the run unless it is saved). This might be a good topic to discuss at a KV meeting sometime, as it seems pretty inconsistent how we are doing it.


pkg/cmd/roachtest/tests/kv.go line 1013 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Here and above, maybe measure it over 30s. This is an unscientific opinion.

The qpsFinal value is always 0 today, so it doesn't really matter. It stays 0 for about 3-4 minutes straight. It would be worth trying some different values, but it is very consistent today (even at 1s there was no variation in the "initial" number).

@andrewbaptist
Author

bors r=irfansharif

@craig
Contributor

craig bot commented Jan 19, 2023

Build succeeded:
