Skip to content

testutils: Improve TestCluster to support server restarts similar to multiTestContext#57123

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
lunevalex:alex/test-cluster-restarts
Jan 13, 2021
Merged

testutils: Improve TestCluster to support server restarts similar to multiTestContext#57123
craig[bot] merged 1 commit intocockroachdb:masterfrom
lunevalex:alex/test-cluster-restarts

Conversation

@lunevalex
Copy link
Copy Markdown

Makes progress on #8299

TestCluster does not have an easy way to restart a Server. Today to make
that work one has to use a sticky_engine with TestCluster.StopServer and
TestCluster.AddAndStartServer. While this works it quickly gets complicated
to use and track where the newly restarted server(s) are running, as they
just get appended to list of running servers. This change introduces two new
methods TestCluster.Restart and TestCluster.RestartServer, which will
restart the server with a sticky engine in place, keeping its position in the
list. This matches the behavior of mtc and is easier to use in tests.

Sticky engines work well for this use case, but they are global. This means
if we run multiple tests in parallel they could have a potential to leak
engines to each other. This change makes the StickEngineRegistry scoped by
TestCluster, so that one TestCluster would not be able to accidentally use an
engine from another cluster.

Release note: None

@lunevalex lunevalex requested a review from a team as a code owner November 25, 2020 16:02
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@lunevalex lunevalex force-pushed the alex/test-cluster-restarts branch from 86b4717 to fd48f8a Compare November 25, 2020 17:44
@lunevalex lunevalex force-pushed the alex/test-cluster-restarts branch 2 times, most recently from c5d283f to 5cc6293 Compare January 8, 2021 19:27
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with nits

Very nice, @irfansharif and I were talking about this just yesterday.

Reviewed 10 of 10 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lunevalex)


pkg/server/config.go, line 502 at r1 (raw file):

			if spec.StickyInMemoryEngineID != "" {
				if cfg.TestingKnobs.Server == nil {
					return Engines{}, errors.Errorf("Could not create a sticky " +

I'd suggest errors.AssertionFailedf so that if an end-user sees this (via cockroach demo), they don't get the mistaken understanding that they can do something about it themselves.


pkg/testutils/testcluster/testcluster.go, line 1102 at r1 (raw file):

func (tc *TestCluster) RestartServer(idx int) error {
	if !tc.serverStopped(idx) {
		return errors.Errorf("Server %d must be stopped before attempting to restart", idx)

nit : small s at the start of the message


pkg/testutils/testcluster/testcluster.go, line 1128 at r1 (raw file):

	for _, specs := range serverArgs.StoreSpecs {
		if specs.StickyInMemoryEngineID == "" {
			return errors.Errorf(" Failed to restart Server %d. Restart can only be used when the servers were started with a sticky engine")

nit: no space and small case f

Copy link
Copy Markdown
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks! Yea we were discussing how this would be a useful API to have, one that would let us stop leaning on roachtests for large swaths of "integration" tests where we desire real server life cycles.

re: naming, I remember being bitten by this in #54072 when I had tried something like:

tc.Start(t)
tc.StopServer(i)
tc.StartServer(i) // This is not a thing on master; I simply exported tc.startServer.

Which is to say I have a mild preference for overloading Start (for both individual servers and the entire cluster) instead of introducing the Restart symbol. It looks doable. Feel free to ignore though, or defer to a future PR.

Reviewed 1 of 10 files at r1.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @knz and @lunevalex)


pkg/server/sticky_engine.go, line 80 at r1 (raw file):

}

// NewStickyInMemEnginesRegistry creates a new StickyInMemEnginesRegistry

nit: Periods after the sentence.


pkg/server/sticky_engine.go, line 88 at r1 (raw file):

// GetOrCreateStickyInMemEngine implements a method from
// StickyInMemEnginesRegistry.

minor nit: we typically use the "X implements the Y interface" structure for these comments.


pkg/server/testing_knobs.go, line 81 at r1 (raw file):

	OnDecommissionedCallback func(livenesspb.Liveness)
	// StickyEngineRegistry manages the lifecycle of sticky in memory engines,
	// which can be enabled via base.StoreSpec.StickyInMemoryEngineID

nit: Period after the sentence.


pkg/testutils/testcluster/testcluster.go, line 1108 at r1 (raw file):

	if idx == 0 {
		// If it's the first server, then we need to restart the RPC listener by hand.
		// Look at #NewTestCluster method for more details.

nit: Drop the "#" and "method"


pkg/testutils/testcluster/testcluster.go, line 1128 at r1 (raw file):

Previously, knz (kena) wrote…

nit: no space and small case f

Lower case "server" as well, we generally use a single sentence with no periods for error messages. This helps them compose better when one wraps the other (same reason for the lower case).


pkg/testutils/testcluster/testcluster_test.go, line 317 at r1 (raw file):

	tc.WaitForValues(t, roachpb.Key("b"), []int64{9, 9, 9})

	require.NoError(t, tc.Restart())

nit: would be nice to first restart a single server in isolation, test that, and only then restart the entire cluster (I can imagine tc.Restart accreting code in addition to simply restarting each server independently).

@lunevalex lunevalex force-pushed the alex/test-cluster-restarts branch from 5cc6293 to 71f6d2d Compare January 8, 2021 21:42
@blathers-crl blathers-crl bot requested a review from irfansharif January 8, 2021 21:42
Copy link
Copy Markdown
Author

@lunevalex lunevalex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I am not sure thats possible as it was tricky to get this to work, specifically around having the servers being able to connect back to the cluster correctly. startServer also just starts a newly created server, restart recreates the server and places it back into the array, so it actually ends up being a mix of what AddServer does and startServer does.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @irfansharif and @knz)


pkg/server/config.go, line 502 at r1 (raw file):

Previously, knz (kena) wrote…

I'd suggest errors.AssertionFailedf so that if an end-user sees this (via cockroach demo), they don't get the mistaken understanding that they can do something about it themselves.

👍


pkg/testutils/testcluster/testcluster_test.go, line 317 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

nit: would be nice to first restart a single server in isolation, test that, and only then restart the entire cluster (I can imagine tc.Restart accreting code in addition to simply restarting each server independently).

Good idea, I'll add that to the test

@lunevalex lunevalex force-pushed the alex/test-cluster-restarts branch 2 times, most recently from ae11a06 to cd5d8de Compare January 11, 2021 15:56
@knz
Copy link
Copy Markdown
Contributor

knz commented Jan 12, 2021

I looked at your CI failure.

There's just one test that fails - the multi-node cockroach demo test that stops and restarts nodes. No surprise this may be affected by this PR.

To run this test on your machine you can run the following command:

# once
$ make buildshort

# every time you want to run the test 
$ killall -9 cockroach cockroachshort
$ rm -rf logs .s.PGSQL.26257 
$ expect -d -f pkg/cli/interactive_tests/test_demo_node_cmds.tcl ./cockroach

beware that the version of expect included in macos may not work. You might be successful with build/builder.sh expect ...

and if that fails, run the commands on your GCE worker.

@knz
Copy link
Copy Markdown
Contributor

knz commented Jan 12, 2021

The failure here is this:

input:

\demo restart 3

select node_id, draining, decommissioning, membership from crdb_internal.gossip_liveness ORDER BY node_id;

expects:

1 |  false   |      false      | active
2 |  false   |      false      | active
3 |  false   |      false      | active

but finds:

        1 |  false   |      false      | active
        2 |  false   |      false      | active
        3 |   true   |      false      | active

i.e. the restarted node "draining" status as exposed by gossip_liveness remains true even after the restart.

@knz
Copy link
Copy Markdown
Contributor

knz commented Jan 12, 2021

you can also probably test this in your own cockroach demo shell with:

  • --log-dir to ensure cockroach demo writes log files to a directory (log files are disabled by default)
  • run the \demo restart command interactively then the select above

this way you can get your test iterations faster without the hassle of running expect.

@lunevalex lunevalex force-pushed the alex/test-cluster-restarts branch from cd5d8de to 998dfc7 Compare January 12, 2021 21:27
…multiTestContext

Makes progress on cockroachdb#8299

TestCluster does not have an easy way to restart a Server. Today to make
that work one has to use a sticky_engine with TestCluster.StopServer and
TestCluster.AddAndStartServer. While this works it quickly gets complicated
to use and track where the newly restarted server(s) are running, as they
just get appended to list of running servers. This change introduces two new
methods TestCluster.Restart and TestCluster.RestartServer, which will
restart the server with a sticky engine in place, keeping its position in the
list. This matches the behavior of mtc and is easier to use in tests.

Sticky engines work well for this use case, but they are global. This means
if we run multiple tests in parallel they could have a potential to leak
engines to each other. This change makes the StickEngineRegistry scoped by
TestCluster, so that one TestCluster would not be able to accidently use an
engine from another cluster.

Release note: None
@lunevalex lunevalex force-pushed the alex/test-cluster-restarts branch from 998dfc7 to a378a02 Compare January 12, 2021 22:54
@lunevalex
Copy link
Copy Markdown
Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 13, 2021

Build succeeded:

@craig craig bot merged commit 03797b1 into cockroachdb:master Jan 13, 2021
@lunevalex lunevalex deleted the alex/test-cluster-restarts branch January 13, 2021 17:02
lunevalex pushed a commit to lunevalex/cockroach that referenced this pull request Jan 13, 2021
Fixes cockroachdb#58959
A change made in cockroachdb#57123 introduced a race condition into the tear
down of the demo cluster. On shutdown the storage engines for
some servers could get closed, before the server itself is stopped,
which leads to panics in Pebble. This change makes sure that all the
servers are stopped before we close the in memory sticky storage engines.

Release note: None
craig bot pushed a commit that referenced this pull request Jan 13, 2021
58903: kv: update replica-level tscache tests for synthetic timestamps r=nvanbenschoten a=nvanbenschoten

This commit updates the following three tests to confirm that synthetic
timestamp information is propagated from read/ranged-write requests,
through the timestamp cache, to conflicting write requests.

- TestReplicaUpdateTSCache
- TestReplicaUseTSCache
- TestReplicaTSCacheForwardsIntentTS

58972: cli: Fix for a race condition in the teardown of demo cluster r=lunevalex a=lunevalex

Fixes #58959
A change made in #57123 introduced a race condition into the tear
down of the demo cluster. On shutdown the storage engines for
some servers could get closed, before the server itself is stopped,
which leads to panics in Pebble. This change makes sure that all the
servers are stopped before we close the in memory sticky storage engines.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants