testutils: Improve TestCluster to support server restarts similar to multiTestContext #57123
Conversation
Force-pushed from 86b4717 to fd48f8a.
Force-pushed from c5d283f to 5cc6293.
knz
left a comment
Very nice, @irfansharif and I were talking about this just yesterday.
Reviewed 10 of 10 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @lunevalex)
pkg/server/config.go, line 502 at r1 (raw file):
if spec.StickyInMemoryEngineID != "" {
    if cfg.TestingKnobs.Server == nil {
        return Engines{}, errors.Errorf("Could not create a sticky " +
I'd suggest errors.AssertionFailedf so that if an end-user sees this (via cockroach demo), they don't get the mistaken understanding that they can do something about it themselves.
pkg/testutils/testcluster/testcluster.go, line 1102 at r1 (raw file):
func (tc *TestCluster) RestartServer(idx int) error {
    if !tc.serverStopped(idx) {
        return errors.Errorf("Server %d must be stopped before attempting to restart", idx)
nit: small s at the start of the message
pkg/testutils/testcluster/testcluster.go, line 1128 at r1 (raw file):
for _, specs := range serverArgs.StoreSpecs {
    if specs.StickyInMemoryEngineID == "" {
        return errors.Errorf(" Failed to restart Server %d. Restart can only be used when the servers were started with a sticky engine")
nit: no space and small case f
irfansharif
left a comment
Thanks! Yea we were discussing how this would be a useful API to have, one that would let us stop leaning on roachtests for large swaths of "integration" tests where we desire real server life cycles.
re: naming, I remember being bitten by this in #54072 when I had tried something like:
tc.Start(t)
tc.StopServer(i)
tc.StartServer(i) // This is not a thing on master; I simply exported tc.startServer.
Which is to say I have a mild preference for overloading Start (for both individual servers and the entire cluster) instead of introducing the Restart symbol. It looks doable. Feel free to ignore though, or defer to a future PR.
Reviewed 1 of 10 files at r1.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @knz and @lunevalex)
pkg/server/sticky_engine.go, line 80 at r1 (raw file):
}

// NewStickyInMemEnginesRegistry creates a new StickyInMemEnginesRegistry
nit: period at the end of the sentence.
pkg/server/sticky_engine.go, line 88 at r1 (raw file):
// GetOrCreateStickyInMemEngine implements a method from
// StickyInMemEnginesRegistry.
minor nit: we typically use the "X implements the Y interface" structure for these comments.
pkg/server/testing_knobs.go, line 81 at r1 (raw file):
OnDecommissionedCallback func(livenesspb.Liveness)

// StickyEngineRegistry manages the lifecycle of sticky in memory engines,
// which can be enabled via base.StoreSpec.StickyInMemoryEngineID
nit: period at the end of the sentence.
pkg/testutils/testcluster/testcluster.go, line 1108 at r1 (raw file):
if idx == 0 {
    // If it's the first server, then we need to restart the RPC listener by hand.
    // Look at #NewTestCluster method for more details.
nit: Drop the "#" and "method"
pkg/testutils/testcluster/testcluster.go, line 1128 at r1 (raw file):
Previously, knz (kena) wrote…
nit: no space and small case f
Lower case "server" as well; we generally use a single sentence with no periods for error messages. This helps them compose better when one wraps the other (same reason for the lower case).
pkg/testutils/testcluster/testcluster_test.go, line 317 at r1 (raw file):
tc.WaitForValues(t, roachpb.Key("b"), []int64{9, 9, 9})
require.NoError(t, tc.Restart())
nit: would be nice to first restart a single server in isolation, test that, and only then restart the entire cluster (I can imagine tc.Restart accreting code in addition to simply restarting each server independently).
Force-pushed from 5cc6293 to 71f6d2d.
lunevalex
left a comment
Hmm, I am not sure that's possible, as it was tricky to get this to work, specifically around having the servers be able to connect back to the cluster correctly. startServer also just starts a newly created server, while restart recreates the server and places it back into the array, so it actually ends up being a mix of what AddServer does and what startServer does.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/server/config.go, line 502 at r1 (raw file):
Previously, knz (kena) wrote…
I'd suggest errors.AssertionFailedf so that if an end-user sees this (via cockroach demo), they don't get the mistaken understanding that they can do something about it themselves.
👍
pkg/testutils/testcluster/testcluster_test.go, line 317 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
nit: would be nice to first restart a single server in isolation, test that, and only then restart the entire cluster (I can imagine tc.Restart accreting code in addition to simply restarting each server independently).
Good idea, I'll add that to the test
Force-pushed from ae11a06 to cd5d8de.
I looked at your CI failure. There's just one test that fails: the multi-node … test. To run this test on your machine you can run the following commands:

# once
$ make buildshort

# every time you want to run the test
$ killall -9 cockroach cockroachshort
$ rm -rf logs .s.PGSQL.26257
$ expect -d -f pkg/cli/interactive_tests/test_demo_node_cmds.tcl ./cockroach

Beware that the version of …; and if that fails, run the commands on your GCE worker.
The failure here is this: the input expects … but finds …, i.e. the restarted node's "draining" status as exposed by …
You can also probably test this in your own …; this way you can get your test iterations faster without the hassle of running …
Force-pushed from cd5d8de to 998dfc7.
Force-pushed from 998dfc7 to a378a02.
bors r+

Build succeeded:
Fixes cockroachdb#58959

A change made in cockroachdb#57123 introduced a race condition into the tear down of the demo cluster. On shutdown, the storage engines for some servers could get closed before the server itself is stopped, which leads to panics in Pebble. This change makes sure that all the servers are stopped before we close the in-memory sticky storage engines.

Release note: None
58903: kv: update replica-level tscache tests for synthetic timestamps r=nvanbenschoten a=nvanbenschoten

This commit updates the following three tests to confirm that synthetic timestamp information is propagated from read/ranged-write requests, through the timestamp cache, to conflicting write requests.

- TestReplicaUpdateTSCache
- TestReplicaUseTSCache
- TestReplicaTSCacheForwardsIntentTS

58972: cli: Fix for a race condition in the teardown of demo cluster r=lunevalex a=lunevalex

Fixes #58959

A change made in #57123 introduced a race condition into the tear down of the demo cluster. On shutdown, the storage engines for some servers could get closed before the server itself is stopped, which leads to panics in Pebble. This change makes sure that all the servers are stopped before we close the in-memory sticky storage engines.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Makes progress on #8299
TestCluster does not have an easy way to restart a server. Today, to make
that work, one has to use a sticky engine with TestCluster.StopServer and
TestCluster.AddAndStartServer. While this works, it quickly gets complicated
to use and to track where the newly restarted server(s) are running, as they
just get appended to the list of running servers. This change introduces two
new methods, TestCluster.Restart and TestCluster.RestartServer, which restart
the server with a sticky engine in place, keeping its position in the list.
This matches the behavior of mtc and is easier to use in tests.

Sticky engines work well for this use case, but they are global. This means
that tests run in parallel could leak engines to each other. This change makes
the StickyEngineRegistry scoped by TestCluster, so that one TestCluster cannot
accidentally use an engine from another cluster.
Release note: None