c2c: deflake c2c/shutdown roachtests#102382
Conversation
fc91a0c to
81a07f9
Compare
08dd02c to
ad9c15d
Compare
lidorcarmel
left a comment
There was a problem hiding this comment.
looks good, a couple of small things..
| // Every C2C phase should last at least 30 seconds, so introduce a little | ||
| // bit of random waiting before node shutdown to ensure the shutdown occurs | ||
| // once we're settled into the target phase. | ||
| randomSleep := time.Duration(rand.Intn(6)) | ||
| // | ||
| // Sleep for a minimum of 5 seconds to ensure the stream processors have | ||
| // observed the cutover signal. | ||
| randomSleep := time.Duration(5 + rand.Intn(6)) | ||
| rrd.t.L().Printf("In target phase! Take a %d second power nap", randomSleep) | ||
| time.Sleep(randomSleep * time.Second) |
There was a problem hiding this comment.
I know my comment isn't really about this pr but should waitForTargetPhase do what it says, and just wait for the target phase without any sleep? and then the caller (node shutdown or something else) do the sleep instead?
when reading this code it's not so clear why node shutdown is mentioned in the comment.
WDYT?
There was a problem hiding this comment.
Makes sense, see the change i made.
| removeTenantRateLimiters(rd.t, rd.setup.dst.sysSQL, rd.setup.dst.name) | ||
|
|
||
| lv := makeLatencyVerifier("stream-ingestion", 0, 2*time.Minute, rd.t.L(), getStreamIngestionJobInfo, rd.t.Status, false) | ||
| lv := makeLatencyVerifier("stream-ingestion", 0, 2*time.Minute, rd.t.L(), getStreamIngestionJobInfo, rd.t.Status, true) |
There was a problem hiding this comment.
why do we want to ignore errors here?
don't we want the test to fail if we can't read from crdb_internal.system_jobs?
There was a problem hiding this comment.
good q: any query can fail if a node shuts down around the time of query execution. To ensure the latencyVerifier stays alive during a node shutdown event, we tolerate any of its failed queries. I made the code clearer here.
9f5357e to
60c7fe4
Compare
stevendanna
left a comment
There was a problem hiding this comment.
I wonder if it would also help to explicitly scatter the ranges.
62d5a16 to
40e4a0e
Compare
40e4a0e to
949e12e
Compare
The user can now run `./workload init kv --scatter ....` which scatters the kv table across the cluster after the initial data load. This flag is best used with `--splits`, `--max-block-bytes`, and `--insert-count`. Epic: none Release note: none
Release note: none
949e12e to
bedd7fc
Compare
renatolabs
left a comment
There was a problem hiding this comment.
All ignorable nits, don't mind me.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel and @msbutler)
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 243 at r5 (raw file):
func (kv replicateKV) sourceInitCmd(tenantName string, nodes option.NodeListOption) string { cmd := roachtestutil.NewCommand(`./workload init kv`)
Nit: you can chain all of these calls which arguably makes it nicer to read:
return roachtestutil.NewCommand("./workload init kv").
Flag("foo", 1).
Flag("bar", 2).
String()pkg/cmd/roachtest/tests/cluster_to_cluster.go line 245 at r5 (raw file):
cmd := roachtestutil.NewCommand(`./workload init kv`) cmd.MaybeFlag(kv.initRows > 0, "insert-count", kv.initRows) cmd.MaybeFlag(kv.initRows > 0, "max-block-bytes", kv.maxBlockBytes)
Did you want to check for maxBlockBytes here?
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 248 at r5 (raw file):
cmd.Flag("splits", 100) cmd.Option("scatter") cmd.Arg(fmt.Sprintf("{pgurl%s:%s}", nodes, tenantName))
Nit: Arg is for single argument, and it already understands Printf-style formatting, so you can just do Arg("{pgurl%s:%s}", nodes, tenantName)
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 1006 at r5 (raw file):
// once we're fully settled into the target phase (e.g. the stream ingestion // processors have observed the cutover signal). randomSleep := time.Duration(5 + rrd.rng.Intn(6))
Nit: making randomSleep a time.Duration (i.e., number of nanoseconds), even though the actual intended meaning is for it to be a number of seconds is slightly confusing. Suggestion:
dur := (5 + rrd.rng.Intn(6)) * time.Second
t.L().Printf("taking a %s power nap", dur)
time.Sleep(dur)This patch addresses to roachtest failure modes: - Prevents roachtest failure if a query fails during a node shutdown. - Prevents the src cluster from returning a single node topology, which could cause the stream ingestion job to hang if the participating src node gets shut down. Longer term, automatic replanning will prevent this. Specifically, this patch changes the kv workload to split and scatter the kv table across the cluster before the c2c job begins. Fixes cockroachdb#101898 Fixes cockroachdb#102111 This patch also makes it easier to reproduce c2c roachtest failures by plumbing a random seed to several components of the roachtest driver. Release note: None
bedd7fc to
093e2dd
Compare
msbutler
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lidorcarmel and @renatolabs)
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 243 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: you can chain all of these calls which arguably makes it nicer to read:
return roachtestutil.NewCommand("./workload init kv"). Flag("foo", 1). Flag("bar", 2). String()
TIL. Done.
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 245 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Did you want to check for
maxBlockByteshere?
Yeah I did this bc the parameter only seems relevant if i actually insert data, and bc I distrust the workload package's parameter handling logic :D
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 248 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit:
Argis for single argument, and it already understandsPrintf-style formatting, so you can just doArg("{pgurl%s:%s}", nodes, tenantName)
Done
pkg/cmd/roachtest/tests/cluster_to_cluster.go line 1006 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: making
randomSleepatime.Duration(i.e., number of nanoseconds), even though the actual intended meaning is for it to be a number of seconds is slightly confusing. Suggestion:dur := (5 + rrd.rng.Intn(6)) * time.Second t.L().Printf("taking a %s power nap", dur) time.Sleep(dur)
Done
|
thanks for the nits @renatolabs ! |
|
TFTRs!! bors r=stevendanna |
|
Build failed (retrying...): |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from d16c011 to blathers/backport-release-23.1-102382: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.1.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |

c2c: deflake c2c/shutdown roachtests