kv: Test to measure slowdown after a node restart #95161
craig[bot] merged 1 commit into cockroachdb:master
Conversation
Force-pushed from 55bf2ad to c099442
@irfansharif When you get a chance, can you review this PR? It is a fairly simple test but will be useful to try with the new follower admission control changes. Basically, it shows a 5-10 minute outage after a node goes down and then comes back online. The test is skipped for now, but it can be run on your branch to see if your changes resolve the issue. Let me know if you have any questions about it. Thanks!
LGTM.
[tiny nit] Re: PR+commit title, change it to roachtest: add test to measure throughput impact during node restart. I've historically pointed to the imperative tense we mention over at https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages.
func registerKVRestartImpact(r registry.Registry) {
	r.Add(registry.TestSpec{
		Skip: "#95159",
		Name: "kv/restart/nodes=12",
This is a relatively large ($$$) test with 12 nodes running for 30m. Could we make it smaller? We're using 8KiB writes below with only 50% of the total ops being just writes, but we could use larger writes or more blocks written per write, or a higher percentage of reads.
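One way to act on this suggestion is to tune the kv workload's flags: a heavier write payload or a higher read percentage shifts the load profile without needing 12 nodes. The values below are illustrative only, not the test's actual settings:

```shell
# Hypothetical tuning: heavier writes per op on a smaller cluster.
# --read-percent sets the read/write mix; the block-bytes flags set the
# payload size of each write.
./cockroach workload run kv \
  --read-percent=25 \
  --min-block-bytes=65536 --max-block-bytes=65536 \
  --concurrency=64 \
  --duration=10m \
  'postgresql://root@localhost:26257?sslmode=disable'
```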
pkg/cmd/roachtest/tests/kv.go (Outdated)

	db := c.Conn(ctx, t.L(), 1)
	defer db.Close()

	t.Status("initializing kv dataset <", 3*time.Minute)
[tiny nit] Instead of all the string concatenation here and below in t.Status, use t.Status(fmt.Sprintf("...", ...)).
	m.Go(func(ctx context.Context) error {
		testDurationStr := " --duration=" + testDuration.String()
		concurrency := ifLocal(c, " --concurrency=8", " --concurrency=64")
		// Don't include the last node when starting the workload since it will
Does the last node end up meaningfully acquiring leases for long since no traffic is ever routed to it? What exactly happens if we point to it in the workload generator - I thought it's supposed to be robust enough to the node being shut down.
	setReplicateQueueEnabled(true)
	t.Status("waiting ", duration, " for the workload to finish and measuring the impact of the outage")

	// Wait for IO overload and enough leases to be transferred back.
Perhaps include a stock grafana dashboard so you also get graphs + a prometheus dump for posterity.
	if !c.IsLocal() {
		time.Sleep(3 * time.Minute)
	}
	qpsFinal := measureQPS(ctx, t, db, 5*time.Second)
Here and above, maybe measure it over 30s. This is an unscientific opinion.
After a node is down for a few minutes and then starts up again, there is a slowdown related to it catching up on Raft messages it missed while down. This can cause an IO overload scenario and greatly impact performance on the cluster.

This adds a test for the issue; a separate PR will be created to enable this test and fix the issue.

Informs: cockroachdb#95159
Epic: none
Release note: None
Force-pushed from c099442 to b3ba4a3
andrewbaptist left a comment
TFTR
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/kv.go line 931 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
This is a relatively large ($$$) test with 12 nodes running for 30m. Could we make it smaller? We're using 8KiB writes below with only 50% of the total ops being just writes, but we could use larger writes or more blocks written per write, or a higher percentage of reads.
I left this as-is for now since the test is skipped. The problem is that it requires a lot of Raft groups writing at once to overcome the L0 limit. I could lower it to 6 nodes, but then I had to increase the duration of the test. The alternative is to lower the L0 limit (it gets to about 3000 files in this test), but I figured you could tune this once you start working on it.
pkg/cmd/roachtest/tests/kv.go line 949 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[tiny nit] Instead of all the string concatenation here and below in t.Status, use t.Status(fmt.Sprintf("...", ...)).
Done
pkg/cmd/roachtest/tests/kv.go line 959 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Does the last node end up meaningfully acquiring leases for long since no traffic is ever routed to it? What exactly happens if we point to it in the workload generator - I thought it's supposed to be robust enough to the node being shut down.
It does end up acquiring leases today; I think the store rebalancer doesn't take into account where the traffic is coming from. This is a good point to revisit if that assumption changes in the future. The reason I left it this way is that I didn't want the amount of traffic to change because the node was down, or to otherwise affect the results. This could definitely be changed in the future if it makes sense, but it seems less confusing this way.
pkg/cmd/roachtest/tests/kv.go line 1009 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Perhaps include a stock grafana dashboard so you also get graphs + a prometheus dump for posterity.
I did this originally, but have since stopped. All the Grafana data is available from the run at https://grafana.testeng.crdb.io/ even without a per-test dashboard. I think we should change tests to not start Grafana, as it adds time to the run and doesn't add value (and the dashboard goes away after the run unless it is saved). This might be a good topic to discuss at a KV meeting some time, as it seems pretty inconsistent how we are doing it.
pkg/cmd/roachtest/tests/kv.go line 1013 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Here and above, maybe measure it over 30s. This is an unscientific opinion.
The qpsFinal value is always 0 today, so it doesn't really matter; it stays 0 for about 3-4 minutes straight. It would be worth trying some different values, but it is very consistent today (even at 1s there wasn't any variation in the "initial" number).
bors r=irfansharif
Build succeeded: |