testutils: Improve TestCluster to support server restarts similar to multiTestContext by lunevalex · Pull Request #57123 · cockroachdb/cockroach

lunevalex · 2020-11-25T16:02:45Z

Makes progress on #8299

TestCluster does not have an easy way to restart a Server. Today to make
that work one has to use a sticky_engine with TestCluster.StopServer and
TestCluster.AddAndStartServer. While this works it quickly gets complicated
to use and track where the newly restarted server(s) are running, as they
just get appended to list of running servers. This change introduces two new
methods TestCluster.Restart and TestCluster.RestartServer, which will
restart the server with a sticky engine in place, keeping its position in the
list. This matches the behavior of mtc and is easier to use in tests.

Sticky engines work well for this use case, but they are global. This means
if we run multiple tests in parallel they could have a potential to leak
engines to each other. This change makes the StickEngineRegistry scoped by
TestCluster, so that one TestCluster would not be able to accidentally use an
engine from another cluster.

Release note: None

cockroach-teamcity · 2020-11-25T16:02:52Z

This change is

knz

with nits

Very nice, @irfansharif and I were talking about this just yesterday.

Reviewed 10 of 10 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @lunevalex)

pkg/server/config.go, line 502 at r1 (raw file):

			if spec.StickyInMemoryEngineID != "" {
				if cfg.TestingKnobs.Server == nil {
					return Engines{}, errors.Errorf("Could not create a sticky " +

I'd suggest errors.AssertionFailedf so that if an end-user sees this (via cockroach demo), they don't get the mistaken understanding that they can do something about it themselves.

pkg/testutils/testcluster/testcluster.go, line 1102 at r1 (raw file):

func (tc *TestCluster) RestartServer(idx int) error {
	if !tc.serverStopped(idx) {
		return errors.Errorf("Server %d must be stopped before attempting to restart", idx)

nit : small s at the start of the message

pkg/testutils/testcluster/testcluster.go, line 1128 at r1 (raw file):

	for _, specs := range serverArgs.StoreSpecs {
		if specs.StickyInMemoryEngineID == "" {
			return errors.Errorf(" Failed to restart Server %d. Restart can only be used when the servers were started with a sticky engine")

nit: no space and small case f

irfansharif

Thanks! Yea we were discussing how this would be a useful API to have, one that would let us stop leaning on roachtests for large swaths of "integration" tests where we desire real server life cycles.

re: naming, I remember being bitten by this in #54072 when I had tried something like:

tc.Start(t)
tc.StopServer(i)
tc.StartServer(i) // This is not a thing on master; I simply exported tc.startServer.

Which is to say I have a mild preference for overloading Start (for both individual servers and the entire cluster) instead of introducing the Restart symbol. It looks doable. Feel free to ignore though, or defer to a future PR.

Reviewed 1 of 10 files at r1.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @knz and @lunevalex)

pkg/server/sticky_engine.go, line 80 at r1 (raw file):

}

// NewStickyInMemEnginesRegistry creates a new StickyInMemEnginesRegistry

nit: Periods after the sentence.

pkg/server/sticky_engine.go, line 88 at r1 (raw file):

// GetOrCreateStickyInMemEngine implements a method from
// StickyInMemEnginesRegistry.

minor nit: we typically use the "X implements the Y interface" structure for these comments.

pkg/server/testing_knobs.go, line 81 at r1 (raw file):

	OnDecommissionedCallback func(livenesspb.Liveness)
	// StickyEngineRegistry manages the lifecycle of sticky in memory engines,
	// which can be enabled via base.StoreSpec.StickyInMemoryEngineID

nit: Period after the sentence.

pkg/testutils/testcluster/testcluster.go, line 1108 at r1 (raw file):

	if idx == 0 {
		// If it's the first server, then we need to restart the RPC listener by hand.
		// Look at #NewTestCluster method for more details.

nit: Drop the "#" and "method"

pkg/testutils/testcluster/testcluster.go, line 1128 at r1 (raw file):

Previously, knz (kena) wrote…

nit: no space and small case f

Lower case "server" as well, we generally use a single sentence with no periods for error messages. This helps them compose better when one wraps the other (same reason for the lower case).

pkg/testutils/testcluster/testcluster_test.go, line 317 at r1 (raw file):

	tc.WaitForValues(t, roachpb.Key("b"), []int64{9, 9, 9})

	require.NoError(t, tc.Restart())

nit: would be nice to first restart a single server in isolation, test that, and only then restart the entire cluster (I can imagine tc.Restart accreting code in addition to simply restarting each server independently).

lunevalex

Hmm, I am not sure thats possible as it was tricky to get this to work, specifically around having the servers being able to connect back to the cluster correctly. startServer also just starts a newly created server, restart recreates the server and places it back into the array, so it actually ends up being a mix of what AddServer does and startServer does.

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @irfansharif and @knz)

pkg/server/config.go, line 502 at r1 (raw file):

Previously, knz (kena) wrote…

I'd suggest errors.AssertionFailedf so that if an end-user sees this (via cockroach demo), they don't get the mistaken understanding that they can do something about it themselves.

👍

pkg/testutils/testcluster/testcluster_test.go, line 317 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

nit: would be nice to first restart a single server in isolation, test that, and only then restart the entire cluster (I can imagine tc.Restart accreting code in addition to simply restarting each server independently).

Good idea, I'll add that to the test

knz · 2021-01-12T17:10:30Z

I looked at your CI failure.

There's just one test that fails - the multi-node cockroach demo test that stops and restarts nodes. No surprise this may be affected by this PR.

To run this test on your machine you can run the following command:

# once
$ make buildshort

# every time you want to run the test 
$ killall -9 cockroach cockroachshort
$ rm -rf logs .s.PGSQL.26257 
$ expect -d -f pkg/cli/interactive_tests/test_demo_node_cmds.tcl ./cockroach

beware that the version of expect included in macos may not work. You might be successful with build/builder.sh expect ...

and if that fails, run the commands on your GCE worker.

knz · 2021-01-12T17:13:14Z

The failure here is this:

input:

\demo restart 3

select node_id, draining, decommissioning, membership from crdb_internal.gossip_liveness ORDER BY node_id;

expects:

1 |  false   |      false      | active
2 |  false   |      false      | active
3 |  false   |      false      | active

but finds:

        1 |  false   |      false      | active
        2 |  false   |      false      | active
        3 |   true   |      false      | active

i.e. the restarted node "draining" status as exposed by gossip_liveness remains true even after the restart.

knz · 2021-01-12T17:15:36Z

you can also probably test this in your own cockroach demo shell with:

--log-dir to ensure cockroach demo writes log files to a directory (log files are disabled by default)
run the \demo restart command interactively then the select above

this way you can get your test iterations faster without the hassle of running expect.

…multiTestContext Makes progress on cockroachdb#8299 TestCluster does not have an easy way to restart a Server. Today to make that work one has to use a sticky_engine with TestCluster.StopServer and TestCluster.AddAndStartServer. While this works it quickly gets complicated to use and track where the newly restarted server(s) are running, as they just get appended to list of running servers. This change introduces two new methods TestCluster.Restart and TestCluster.RestartServer, which will restart the server with a sticky engine in place, keeping its position in the list. This matches the behavior of mtc and is easier to use in tests. Sticky engines work well for this use case, but they are global. This means if we run multiple tests in parallel they could have a potential to leak engines to each other. This change makes the StickEngineRegistry scoped by TestCluster, so that one TestCluster would not be able to accidently use an engine from another cluster. Release note: None

lunevalex · 2021-01-13T04:56:09Z

bors r+

craig · 2021-01-13T05:38:22Z

Build succeeded:

GitHub CI (Cockroach)

Fixes cockroachdb#58959 A change made in cockroachdb#57123 introduced a race condition into the tear down of the demo cluster. On shutdown the storage engines for some servers could get closed, before the server itself is stopped, which leads to panics in Pebble. This change makes sure that all the servers are stopped before we close the in memory sticky storage engines. Release note: None

58903: kv: update replica-level tscache tests for synthetic timestamps r=nvanbenschoten a=nvanbenschoten This commit updates the following three tests to confirm that synthetic timestamp information is propagated from read/ranged-write requests, through the timestamp cache, to conflicting write requests. - TestReplicaUpdateTSCache - TestReplicaUseTSCache - TestReplicaTSCacheForwardsIntentTS 58972: cli: Fix for a race condition in the teardown of demo cluster r=lunevalex a=lunevalex Fixes #58959 A change made in #57123 introduced a race condition into the tear down of the demo cluster. On shutdown the storage engines for some servers could get closed, before the server itself is stopped, which leads to panics in Pebble. This change makes sure that all the servers are stopped before we close the in memory sticky storage engines. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>

lunevalex requested a review from a team as a code owner November 25, 2020 16:02

lunevalex force-pushed the alex/test-cluster-restarts branch from 86b4717 to fd48f8a Compare November 25, 2020 17:44

lunevalex force-pushed the alex/test-cluster-restarts branch 2 times, most recently from c5d283f to 5cc6293 Compare January 8, 2021 19:27

knz approved these changes Jan 8, 2021

View reviewed changes

irfansharif reviewed Jan 8, 2021

View reviewed changes

lunevalex force-pushed the alex/test-cluster-restarts branch from 5cc6293 to 71f6d2d Compare January 8, 2021 21:42

blathers-crl bot requested a review from irfansharif January 8, 2021 21:42

lunevalex commented Jan 8, 2021

View reviewed changes

lunevalex force-pushed the alex/test-cluster-restarts branch 2 times, most recently from ae11a06 to cd5d8de Compare January 11, 2021 15:56

lunevalex force-pushed the alex/test-cluster-restarts branch from cd5d8de to 998dfc7 Compare January 12, 2021 21:27

lunevalex force-pushed the alex/test-cluster-restarts branch from 998dfc7 to a378a02 Compare January 12, 2021 22:54

craig bot merged commit 03797b1 into cockroachdb:master Jan 13, 2021

lunevalex deleted the alex/test-cluster-restarts branch January 13, 2021 17:02

RaduBerinde mentioned this pull request Jan 13, 2021

cli: Example_demo_locality fails with "pebble: closed" panic #58959

Closed

lunevalex mentioned this pull request Jan 13, 2021

cli: Fix for a race condition in the teardown of demo cluster #58972

Merged

lunevalex mentioned this pull request Jan 14, 2021

kvserver: replace multiTestContext with TestCluster in client_raft_test #56812

Merged

lunevalex added the a-multi-test-context label Mar 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

testutils: Improve TestCluster to support server restarts similar to multiTestContext#57123

testutils: Improve TestCluster to support server restarts similar to multiTestContext#57123
craig[bot] merged 1 commit intocockroachdb:masterfrom
lunevalex:alex/test-cluster-restarts

lunevalex commented Nov 25, 2020

Uh oh!

cockroach-teamcity commented Nov 25, 2020

Uh oh!

knz left a comment

Uh oh!

irfansharif left a comment

Uh oh!

lunevalex left a comment

Uh oh!

knz commented Jan 12, 2021

Uh oh!

knz commented Jan 12, 2021

Uh oh!

knz commented Jan 12, 2021

Uh oh!

lunevalex commented Jan 13, 2021

Uh oh!

craig bot commented Jan 13, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lunevalex commented Nov 25, 2020

Uh oh!

cockroach-teamcity commented Nov 25, 2020

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

irfansharif left a comment

Choose a reason for hiding this comment

Uh oh!

lunevalex left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Jan 12, 2021

Uh oh!

knz commented Jan 12, 2021

Uh oh!

knz commented Jan 12, 2021

Uh oh!

lunevalex commented Jan 13, 2021

Uh oh!

craig bot commented Jan 13, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants