kvserver: replace multiTestContext with TestCluster in client_raft_test#56812

Merged
craig[bot] merged 1 commit into cockroachdb:master from lunevalex:alex/remove-mtc-client-raft-test
Jan 19, 2021

Conversation


@lunevalex lunevalex commented Nov 17, 2020

Makes progress on #8299
Fixes #40351
Fixes #57560
Fixes #57537

multiTestContext is a legacy construct that is deprecated in favor of running
tests via TestCluster. This is one PR out of many to remove the usage of
multiTestContext in the client_raft test cases. This does not remove all the
uses of mtc, just the simple ones, leaving the more complex use cases for a later PR.

With this switch we can also clean up some TestingKnobs and TestServer interfaces:
- The DisablePeriodicGossips flag is removed; it does not work with TestCluster
and is no longer used.
- The DontPreventUseOfOldLeaseOnStart flag is removed; it did not work consistently
in TestCluster. This flag tried to leave the lease on the same node after a
restart, but CRDB makes no such guarantee in the real world, and artificially
testing it does not prove anything. The affected tests were reworked to
not rely on this condition and can deal with the lease holder moving on a restart.
- GetRaftLeader is ported from multiTestContext to TestCluster.

Release note: None
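
As a rough illustration of the conversion this PR performs, here is a Go-flavored sketch of a client_raft test written against TestCluster instead of multiTestContext. It is not compilable on its own (it assumes the cockroachdb/cockroach test packages), and helper names other than GetRaftLeader, which this PR adds, should be treated as approximate:

```go
// Sketch only: depends on cockroachdb/cockroach internals; helper names
// other than GetRaftLeader are approximations.
func TestReplicateRangeSketch(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	ctx := context.Background()

	// TestCluster starts real TestServers instead of the partially
	// mocked stores that multiTestContext wired together by hand.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{
		ReplicationMode: base.ReplicationManual,
	})
	defer tc.Stopper().Stop(ctx)

	// Work on a scratch range so the test does not touch system data.
	key := tc.ScratchRange(t)
	tc.AddVotersOrFatal(t, key, tc.Target(1), tc.Target(2))

	// GetRaftLeader is the helper this PR ports from multiTestContext.
	leader := tc.GetRaftLeader(t, roachpb.RKey(key))
	_ = leader // assertions on the replicated data would go here
}
```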

@lunevalex lunevalex requested a review from a team as a code owner November 17, 2020 17:42
@cockroach-teamcity
Member

This change is Reviewable

@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch 6 times, most recently from 46dda5c to aab99c9 Compare November 18, 2020 03:46
@lunevalex lunevalex requested review from andreimatei and tbg November 18, 2020 06:26
@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch from aab99c9 to bbb7f2e Compare November 18, 2020 21:52
tbg added a commit to tbg/cockroach that referenced this pull request Nov 19, 2020
This was caught by our grpc race provoking transport while stressing
PR cockroachdb#56812.

Release note: None
Member

@tbg tbg left a comment


:lgtm:

It's really hard to spot anything in these diffs (not your fault, it's just mind numbing and repetitive and it would take a lot of time to actually verify that the intent of each test is still there and no chance of flakes was introduced), so instead of a very thorough review I stressed the tests you touched via

make stressrace TESTFLAGS='-p 1' PKG=./pkg/kv/kvserver/ TESTS="$(git diff HEAD^ | grep -oE 'func Test[A-Za-z0-9]+' | awk '{ print $2 }' | sort | uniq | tr '\n' '|' | rev | cut -c 2- | rev)"

It did find this data race which isn't new here, fixing in #56899.

=== RUN   TestSnapshotAfterTruncation/differentTerm
==================
WARNING: DATA RACE
Write at 0x00c00edfd843 by goroutine 869:
  github.com/cockroachdb/cockroach/pkg/util/encoding.EncodeUint32Ascending()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/encoding/encoding.go:170 +0x369
  github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).setChecksum()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/data.go:260 +0x2d0
  github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).ensureRawBytes()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/data.go:374 +0x2af
  github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).SetProto()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/data.go:475 +0x2f
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1260 +0x158
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707 +0x5e
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xeb
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xeb
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x124
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1257 +0x20a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1216 +0x233
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:888 +0x9a7
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:703 +0x2b2
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).getLivenessFromKV()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1062 +0x51a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:696 +0x4d1
  github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:140 +0xe2
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:689 +0x539
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:221 +0x1de

Previous read at 0x00c00edfd843 by goroutine 985:
  encoding/base64.(*Encoding).Encode()
      /usr/local/go/src/encoding/base64/base64.go:138 +0x11e
  encoding/json.encodeByteSlice()
      /usr/local/go/src/encoding/json/encode.go:832 +0x1de
  encoding/json.structEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:759 +0x3c4
  encoding/json.structEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:730 +0xa4
  encoding/json.structEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:759 +0x3c4
  encoding/json.structEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:730 +0xa4
  encoding/json.ptrEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:915 +0x1c4
  encoding/json.ptrEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:900 +0x84
  encoding/json.structEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:759 +0x3c4
  encoding/json.structEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:730 +0xa4
  encoding/json.ptrEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:915 +0x1c4
  encoding/json.ptrEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:900 +0x84
  encoding/json.(*encodeState).reflectValue()
      /usr/local/go/src/encoding/json/encode.go:358 +0x93
  encoding/json.interfaceEncoder()
      /usr/local/go/src/encoding/json/encode.go:714 +0xfb
  encoding/json.structEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:759 +0x3c4
  encoding/json.structEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:730 +0xa4
  encoding/json.arrayEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:886 +0xe4
  encoding/json.arrayEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:879 +0x84
  encoding/json.sliceEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:860 +0xde
  encoding/json.sliceEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:855 +0x84
  encoding/json.structEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:759 +0x3c4
  encoding/json.structEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:730 +0xa4
  encoding/json.ptrEncoder.encode()
      /usr/local/go/src/encoding/json/encode.go:915 +0x1c4
  encoding/json.ptrEncoder.encode-fm()
      /usr/local/go/src/encoding/json/encode.go:900 +0x84
  encoding/json.(*encodeState).reflectValue()
      /usr/local/go/src/encoding/json/encode.go:358 +0x93
  encoding/json.(*encodeState).marshal()
      /usr/local/go/src/encoding/json/encode.go:330 +0xd2
  encoding/json.(*Encoder).Encode()
      /usr/local/go/src/encoding/json/stream.go:206 +0xdd
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.GRPCTransportFactory.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport_race.go:124 +0x1e4
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:346 +0x169

Goroutine 869 (running) created at:
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:214 +0xc4
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:670 +0x39d
  github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1588 +0x432d
  github.com/cockroachdb/cockroach/pkg/server.(*Server).Start()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1036 +0x50
  github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:418 +0x12c
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:428 +0x140
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).AddAndStartServer()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:359 +0x157
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestSnapshotAfterTruncation.func1.1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_raft_test.go:716 +0x184
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestSnapshotAfterTruncation.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_raft_test.go:725 +0xcd7
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:259 +0x2f7
  github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:140 +0x10a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestSnapshotAfterTruncation.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_raft_test.go:657 +0x359
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1123 +0x202

Goroutine 985 (running) created at:
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 +0x138
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.GRPCTransportFactory()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport_race.go:102 +0x2c7
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1808 +0x521
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1495 +0x3e9
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1133 +0x1c3e
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:772 +0x73d
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:86 +0x1c1
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x121
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:195 +0x214
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:267 +0x134
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:202 +0x464
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:252 +0x281
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x341
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:171 +0x1f4
  github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:499 +0x6c1
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:744 +0x1d4
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:919 +0x207
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send-fm()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:903 +0xe4
  github.com/cockroachdb/cockroach/pkg/kv.sendAndFill()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:654 +0x1a5
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:590 +0x148
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1279 +0x64f
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707 +0x5e
  github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xeb
  github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x124
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1257 +0x20a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1216 +0x233
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:888 +0x9a7
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:703 +0x2b2
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).getLivenessFromKV()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1062 +0x51a
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:696 +0x4d1
  github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:140 +0xe2
  github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:689 +0x539
  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1()
      /home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:221 +0x1de
==================

I'll stress some more.

Reviewed 12 of 12 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @lunevalex)


pkg/kv/kvserver/client_raft_test.go, line 158 at r1 (raw file):

	// Now create a new store with the same engine and make sure the expected data is present.
	// We must use the same clock because a newly-created manual clock will be behind the one

stale comment.


pkg/kv/kvserver/client_raft_test.go, line 186 at r1 (raw file):

	numIncrements := 0
	keyA := roachpb.Key("a")

You could export the key that is used for ScratchRange and then use that instead of a which may be placed somewhere in the cluster that has real data. This would introduce a useful pattern for other tests.

scratchKey := keys.TableDataMax


pkg/kv/kvserver/client_raft_test.go, line 256 at r1 (raw file):

// and a range, replicating the range to the second store, and reading its data there.
func TestReplicateRange(t *testing.T) {
	defer leaktest.AfterTest(t)()

should leaktest.AfterTest(t)() also verify that the number of sticky engines before and after hasn't changed? I'm noticing that we do use sticky engines more and more and there's a good chance we're leaking some of them.

It also feels like the sticky stuff can use a general overhaul for usability, but not for this PR.


pkg/kv/kvserver/client_raft_test.go, line 348 at r1 (raw file):

	defer log.Scope(t).Close(t)

	skip.WithIssue(t, 40351)

Make sure #40351 is closed in this PR.


pkg/kv/kvserver/client_raft_test.go, line 1648 at r1 (raw file):

	tc := testcluster.StartTestCluster(t, numServers,
		base.TestClusterArgs{
			// Set to auto since we want the replication queue on

.


pkg/kv/kvserver/client_raft_test.go, line 1718 at r1 (raw file):

// tests, as can a similar situation where the first store is no longer the lease holder of
// the first range; this verifies that those tests will not be affected.
// TODO(lunevalex): Remove this test when removing MTC, no need to convert it

Why not remove it now then? And why is there no need to convert it?


pkg/testutils/testcluster/testcluster.go, line 1070 at r1 (raw file):

// GetRaftLeader returns the replica that is the current raft leader for the
// specified key.
func (tc *TestCluster) GetRaftLeader(t testing.TB, key roachpb.RKey) *kvserver.Replica {

It's been hard historically to pin raft leaders due to flakiness (i.e. where the leader changed right after this method returns). This is probably not new code (?) so hopefully the uses here don't have that problem, but I wanted to point it out.

Author

@lunevalex lunevalex left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lunevalex, and @tbg)


pkg/kv/kvserver/client_raft_test.go, line 158 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

stale comment.

Removed


pkg/kv/kvserver/client_raft_test.go, line 186 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You could export the key that is used for ScratchRange and then use that instead of a which may be placed somewhere in the cluster that has real data. This would introduce a useful pattern for other tests.

scratchKey := keys.TableDataMax

Ideally everyone should just use the scratch range in new tests, since that does the split underneath the covers. I wanted to minimize changes to these tests, which is why I left everything as roachpb.Key("a")


pkg/kv/kvserver/client_raft_test.go, line 256 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

should leaktest.AfterTest(t)() also verify that the number of sticky engines before and after hasn't changed? I'm noticing that we do use sticky engines more and more and there's a good chance we're leaking some of them.

It also feels like the sticky stuff can use a general overhaul for usability, but not for this PR.

Not sure I follow the intention here; the test will fail if you forget to close all the sticky engines via server.CloseAllStickyInMemEngines(), as they will leak Stoppers, and internally you will not be able to reuse an engine without closing it first.


pkg/kv/kvserver/client_raft_test.go, line 348 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Make sure #40351 is closed in this PR.

Good catch.


pkg/kv/kvserver/client_raft_test.go, line 1718 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why not remove it now then? And why is there no need to convert it?

This test tests multiTestContext itself, so it seemed prudent to leave it as is until we are ready to delete mtc. Since it does not add anything to the test coverage of raft aside from testing a peculiarity of mtc, I did not feel the need to convert it.


pkg/testutils/testcluster/testcluster.go, line 1070 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's been hard historically to pin raft leaders due to flakiness (i.e. where the leader changed right after this method returns). This is probably not new code (?) so hopefully the uses here don't have that problem, but I wanted to point it out.

This was copy/paste from mtc, so the semantics of the tests should have remained the same; I just needed this function on TestCluster to replicate the behavior.

@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch 2 times, most recently from 9488014 to 30f3d50 Compare November 19, 2020 18:26
Contributor

@andreimatei andreimatei left a comment


I just did a first pass for now over TestCluster.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lunevalex and @tbg)


pkg/kv/kvserver/client_raft_test.go, line 98 at r2 (raw file):

			ServerArgs:      stickySpecTestServerArgs,
		})
	defer tc.Stopper().Stop(context.Background())

nit: try not to use context.Background() in new code, it's too much noise. Declare a ctx := context.Background() at the beginning.


pkg/testutils/testcluster/testcluster.go, line 128 at r2 (raw file):

// ServerStopped determines if a server has been explicitly
// stopped by stopServer(s).

You mean StopServer() ?


pkg/testutils/testcluster/testcluster.go, line 352 at r2 (raw file):

func (tc *TestCluster) AddAndStartServer(t testing.TB, serverArgs base.TestServerArgs) {
	if serverArgs.JoinAddr == "" && len(tc.Servers) > 0 {
		serverArgs.JoinAddr = tc.GetFirstLiveServer().ServingRPCAddr()

nit: for future times consider isolating changes like these in their own commits, so they can carry an explanation.


pkg/testutils/testcluster/testcluster.go, line 1002 at r2 (raw file):

// from all configured engines, filling in zeros when the value is not
// found.
func (tc *TestCluster) readIntFromStores(key roachpb.Key) []int64 {

nit: comment on this method too that stopped servers don't count


pkg/testutils/testcluster/testcluster.go, line 1003 at r2 (raw file):

// found.
func (tc *TestCluster) readIntFromStores(key roachpb.Key) []int64 {
	results := make([]int64, 0)

A make with length 0 and no explicit capacity is unusual, so just do var results []int64 if that's what you want. But the make() was right the way it was, and it expressed the right thing - that results will usually have len(servers). This is different from expressing that results will always have len(servers), in which case the best style would have been to declare it as results := make([]int64, len(servers)), and the loop below would have assigned directly to the slice instead of appending.


pkg/testutils/testcluster/testcluster.go, line 1056 at r2 (raw file):

}

// GetFirstLiveServer Returns the first server in the list that has not been

s/Returns/returns


pkg/testutils/testcluster/testcluster.go, line 1056 at r2 (raw file):

}

// GetFirstLiveServer Returns the first server in the list that has not been

As far as I can tell, nobody outside of TestCluster is calling this. Consider making it private, as I'm not very convinced of the need for it.


pkg/testutils/testcluster/testcluster.go, line 1057 at r2 (raw file):

// GetFirstLiveServer Returns the first server in the list that has not been
// explicitly stopped. If all the servers are stopped, it returns the first one.

The semantics of returning the first if they're all stopped seems bizarre to me. Who needs that? Whoever it is, I think it should probably do something else, and this function should return nil.


pkg/testutils/testcluster/testcluster.go, line 1060 at r2 (raw file):

func (tc *TestCluster) GetFirstLiveServer() *server.TestServer {
	for i := range tc.Servers {
		// Use the first non-stopped server to avoid a deadlock

What's up with this comment? Doesn't seem to belong here.


pkg/testutils/testcluster/testcluster.go, line 1107 at r2 (raw file):

}

// Restart stops and then starts all the servers in the cluster

nit: period at the end of the sentence


pkg/testutils/testcluster/testcluster.go, line 1108 at r2 (raw file):

// Restart stops and then starts all the servers in the cluster
func (tc *TestCluster) Restart(t testing.TB, serverArgsPerNode map[int]base.TestServerArgs) {

The mechanism used by this Restart seems clunky to me; I think we should try to improve it. If I understand AddAndStartServer correctly, it'll leave the old servers in place, and appends new ones to the array. That's quite confusing; we shouldn't treat the servers in the cluster as "immutable" in this way; instead we should be mutating them in place - otherwise index management becomes really tedious.
This has ramifications for ServerStopped(), which should reflect a server having been restarted.


pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):

		for _, specs := range args.StoreSpecs {
			if specs.StickyInMemoryEngineID == "" {
				t.Fatalf("Restart can only be used when the servers were started with a sticky engine")

I think it'd be better to panic here and not take the t as an argument. This failure seems like a static condition that either happens during development, or never, and so we don't need any care in handling it - and so it's not worth plumbing the t param.

Author

@lunevalex lunevalex left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)


pkg/kv/kvserver/client_raft_test.go, line 98 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: try not to use context.Background() in new code, it's too much noise. Declare a ctx := context.Background() at the beginning.

ack


pkg/testutils/testcluster/testcluster.go, line 128 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

You mean StopServer() ?

yes


pkg/testutils/testcluster/testcluster.go, line 1003 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

A make with length 0 and no explicit capacity is unusual, so just do var results []int64 if that's what you want. But the make() was right the way it was, and it expressed the right thing - that results will usually have len(servers). This is different from expressing that results will always have len(servers), in which case the best style would have been to declare it as results := make([]int64, len(servers)), and the loop below would have assigned directly to the slice instead of appending.

So it will not have len(servers) since we ignore stopped ones; will remove the make, thanks for the tip.


pkg/testutils/testcluster/testcluster.go, line 1056 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

As far as I can tell, nobody outside of TestCluster is calling this. Consider making it private, as I'm not very convinced of the need for it.

ack


pkg/testutils/testcluster/testcluster.go, line 1060 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

What's up with this comment? Doesn't seem to belong here.

oops


pkg/testutils/testcluster/testcluster.go, line 1108 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

The mechanism used by this Restart seems clunky to me; I think we should try to improve it. If I understand AddAndStartServer correctly, it'll leave the old servers in place, and appends new ones to the array. That's quite confusing; we shouldn't treat the servers in the cluster as "immutable" in this way; instead we should be mutating them in place - otherwise index management becomes really tedious.
This has ramifications for ServerStopped(), which should reflect a server having been restarted.

Yes, it is tedious and I don't love it. I did not want to modify the array in place for this either, as that could also be confusing. What I am thinking of doing in a future iteration is to add methods to TestCluster that address servers by their (NodeID, StoreID) rather than index, which should remove a lot of clunkiness here. Then we can just make those methods look up the first non-stopped server with that (NodeID, StoreID).


pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it'd be better to panic here and not take the t as an argument. This failure seems like a static condition that either happens during development, or never, and so we don't need any care in handling it - and so it's not worth plumbing the t param.

No strong feelings on this one, I was following the general pattern for how we deal with errors in testing.

@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch from 30f3d50 to 7fda76d Compare November 19, 2020 21:41
Author

@lunevalex lunevalex left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)


pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):

Previously, lunevalex wrote…

No strong feelings on this one, I was following the general pattern for how we deal with errors in testing.

Actually, AddAndStartServer already takes t, so doing what you suggest would be more of a PITA, since we already have to pipe t through

@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch 2 times, most recently from fb9f583 to 748ee9a Compare November 20, 2020 17:15
Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)


pkg/kv/kvserver/client_raft_test.go, line 82 at r3 (raw file):

	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	defer server.CloseAllStickyInMemEngines()

As far as I can tell, before this PR these "sticky in-mem engines" were used for cockroach demo, and for a single other test. In their current form, I don't think we should allow them to proliferate because of their global scope. Any use of globals in tests has eventually proven to be a pain in the ass in the past. At the very least, these CloseAllStickyInMemEngines() calls forever preclude us from running tests in parallel.
I think we should have tests create a MemorySpace or such, pass it into the StoreSpec, and have engines be associated with that space.

For TestClusters (as opposed to TestServers), I think StartTestCluster can allocate the MemorySpace, and we can have a new tc.Destroy() which cleans it up and does tc.Stopper().Stop() (btw, I generally think that tc.Stopper().Stop() should have a short-hand).
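The test-scoped registry being proposed above can be sketched as follows. This is a minimal sketch of the shape only; the constructor name, fields, and methods here are illustrative, not the real server package API:

```go
package main

import (
	"fmt"
	"sync"
)

// stickyEngine is a placeholder for an in-memory storage engine.
type stickyEngine struct{ id string }

// stickyEngineRegistry scopes sticky in-memory engines to a single test
// instead of a process-wide global map, so parallel tests cannot collide
// and cleanup is a method call rather than a global CloseAll.
type stickyEngineRegistry struct {
	mu      sync.Mutex
	engines map[string]*stickyEngine
}

func newStickyEngineRegistry() *stickyEngineRegistry {
	return &stickyEngineRegistry{engines: make(map[string]*stickyEngine)}
}

// GetOrCreate returns the engine for id, reusing it across a simulated
// server restart so the engine's state survives.
func (r *stickyEngineRegistry) GetOrCreate(id string) *stickyEngine {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e, ok := r.engines[id]; ok {
		return e
	}
	e := &stickyEngine{id: id}
	r.engines[id] = e
	return e
}

// CloseAll releases every engine owned by this registry; deferred by the
// test that created it.
func (r *stickyEngineRegistry) CloseAll() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.engines = make(map[string]*stickyEngine)
}

func main() {
	reg := newStickyEngineRegistry()
	defer reg.CloseAll()
	a := reg.GetOrCreate("TestFoo-store1")
	b := reg.GetOrCreate("TestFoo-store1") // same engine after a "restart"
	fmt.Println(a == b)                    // true
}
```

Because each test owns its registry, two tests using the same engine ID in parallel no longer step on each other.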


pkg/kv/kvserver/client_raft_test.go, line 88 at r3 (raw file):

			{
				InMemory:               true,
				StickyInMemoryEngineID: "1",

put the name of the test in this id so that we can track it down in case it leaks


pkg/kv/kvserver/client_raft_test.go, line 94 at r3 (raw file):

	ctx := context.Background()
	tc := testcluster.StartTestCluster(t, 1,

This test only uses one server at a time, so the fact that we have to use a TestCluster suggests that something is funky; I think we should do what it takes so that it can use a TestServer. Which brings me to the question I was asking elsewhere - why can't we just server.Start() after having stopped it? Even if we can't get that to work (although it would improve the state of our testing a lot if it did), I'd rather we build a second TestServer by hand than use any TestCluster methods.


pkg/kv/kvserver/node_liveness_test.go, line 71 at r3 (raw file):

func pauseNodeLivenessHeartbeatLoopsTC(tc *testcluster.TestCluster) func() {
	var enableFns []func()

I think this should be a method on TestCluster, and be available outside of the kvserver package.
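The pause-all-heartbeats helper quoted above collects one re-enable closure per node and returns a single combined closure. A standalone sketch of that pattern (the nodeLiveness type and method names are stand-ins for the kvserver API, which the real helper calls):

```go
package main

import "fmt"

// nodeLiveness is a stand-in for kvserver's liveness record keeper. The
// pause method returns a func that re-enables the heartbeat loop,
// mirroring the shape of the real testing hook.
type nodeLiveness struct{ paused bool }

func (nl *nodeLiveness) pauseHeartbeatLoopForTest() func() {
	nl.paused = true
	return func() { nl.paused = false }
}

// pauseAllHeartbeats pauses the liveness heartbeat loop on every node and
// returns a single closure that re-enables all of them, matching the
// shape of the proposed TestCluster method.
func pauseAllHeartbeats(nodes []*nodeLiveness) func() {
	var enableFns []func()
	for _, nl := range nodes {
		enableFns = append(enableFns, nl.pauseHeartbeatLoopForTest())
	}
	return func() {
		for _, fn := range enableFns {
			fn()
		}
	}
}

func main() {
	nodes := []*nodeLiveness{{}, {}, {}}
	resume := pauseAllHeartbeats(nodes)
	fmt.Println(nodes[0].paused, nodes[1].paused, nodes[2].paused) // true true true
	resume()
	fmt.Println(nodes[0].paused, nodes[1].paused, nodes[2].paused) // false false false
}
```

Hoisting this onto TestCluster would let any package's tests do `defer tc.PauseAllHeartbeats()()`-style pausing without copying the loop.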


pkg/testutils/testcluster/testcluster.go, line 1003 at r2 (raw file):

Previously, lunevalex wrote…

So it will not have len(servers) elements since we ignore stopped ones; will remove the make, thanks for the tip.

To clarify, I was saying that results := make([]int64, 0, len(tc.Servers)) would have been appropriate here. Whatever you want though.


pkg/testutils/testcluster/testcluster.go, line 1108 at r2 (raw file):

I did not want to modify the array in place for this either, as that could also be confusing.

Why would it be confusing? You're worried about people holding references to stores and such taken before the Restart? Cause if that's the worry, such a test would be broken with the current impl too.
What I actually would like to know is why we need to assign or append to the array at all. Why can't we just stop and then Start() a server after having previously stopped it? Why do we need to construct a new server (with an old storage engine)?

At the very least, this method needs a comment explaining how the TestCluster will look afterwards.


pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):

Previously, lunevalex wrote…

Actually, AddAndStartServer already takes t, so doing what you suggest would be more of a PITA, since we already have to pipe t through

ok

@lunevalex
Author


pkg/kv/kvserver/client_raft_test.go, line 82 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

As far as I can tell, before this PR these "sticky in-mem engines" were used for cockroach demo, and for a single other test. In their current form, I don't think we should allow them to proliferate because of their global scope. Any use of globals in tests has eventually proven to be a pain in the ass in the past. At the very least, these CloseAllStickyInMemEngines() calls forever preclude us from running tests in parallel.
I think we should have tests create a MemorySpace or such, pass it into the StoreSpec, and have engines be associated with that space.

For TestClusters (as opposed to TestServers), I think StartTestCluster can allocate the MemorySpace, and we can have a new tc.Destroy() which cleans it up and does tc.Stopper().Stop() (btw, I generally think that tc.Stopper().Stop() should have a short-hand).

Discussed with @andreimatei offline; going to split this change into two: one to enhance TestCluster to better support in-memory engines and restarts, and one with the actual testing changes in client_raft.

craig bot pushed a commit that referenced this pull request Nov 25, 2020
56899: liveness: remove a possible data race r=lunevalex a=tbg

This was caught by our grpc race provoking transport while stressing
PR #56812.

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch from 748ee9a to fe55907 Compare January 14, 2021 22:19
@lunevalex
Author

@tbg @andreimatei TestCluster.Restart was pulled into #57123 and merged. I rebased this on top of that work and cleaned it up. Let me know if you want to look at the changes again or if they are good to go.

Makes progress on cockroachdb#8299
Fixes cockroachdb#40351
Fixes cockroachdb#57560
Fixes cockroachdb#57537

multiTestContext is a legacy construct that is deprecated in favor of running
tests via TestCluster. This is one PR out of many to remove the usage of
multiTestContext in the client_raft test cases. This does not remove all the
uses of mtc, just the simple ones. Leaving the more complex use cases for a later PR.

With this switch we can also clean up some TestingKnobs and TestServer interfaces.
    - DisablePeriodicGossips flag is removed, it does not work with TestCluster
      and is no longer used
    - DontPreventUseOfOldLeaseOnStart flag is removed, it did not work consistently
      in TestCluster. This flag tries to leave the Lease on the same node after a
      restart, but CRDB makes no such guarantees in the real world and artificially
      testing it does not prove anything. The affected tests were re-worked to
      not rely on this condition and can deal with a lease holder moving on a restart.
    - GetRaftLeader is ported from multiTestContext to TestCluster

Release note: None
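The GetRaftLeader helper mentioned in the commit message finds the replica that is currently raft leader at the highest observed term. A simplified standalone sketch of that selection logic (the replicaStatus type is an illustrative stand-in for what RaftStatus() exposes, and the real helper retries under testutils.SucceedsSoon until a leader emerges; retry logic is omitted here):

```go
package main

import "fmt"

// replicaStatus is a hypothetical stand-in for the per-replica raft state
// that the real helper inspects on each store.
type replicaStatus struct {
	storeID  int
	term     uint64
	isLeader bool
}

// getRaftLeader returns the store holding the raft-leader replica at the
// highest observed term; a replica still claiming leadership at an older
// term is ignored in favor of the newer leader.
func getRaftLeader(replicas []replicaStatus) (int, bool) {
	var best replicaStatus
	found := false
	for _, r := range replicas {
		if r.isLeader && (!found || r.term > best.term) {
			best, found = r, true
		}
	}
	return best.storeID, found
}

func main() {
	replicas := []replicaStatus{
		{storeID: 1, term: 5, isLeader: false},
		{storeID: 2, term: 6, isLeader: true},
		{storeID: 3, term: 5, isLeader: true}, // stale leader from an older term
	}
	leader, ok := getRaftLeader(replicas)
	fmt.Println(leader, ok) // 2 true
}
```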
@lunevalex lunevalex force-pushed the alex/remove-mtc-client-raft-test branch from fe55907 to 3876b44 Compare January 15, 2021 15:39
@tbg tbg requested a review from andreimatei January 18, 2021 09:20
Member

@tbg tbg left a comment


I didn't review again in detail, but what I saw looks good. Could you stress the pkg for an hour or so before merging?

Reviewed 7 of 10 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)


pkg/kv/kvserver/client_raft_test.go, line 88 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

put the name of the test in this id so that we can track it down in case it leaks

Ping


pkg/kv/kvserver/client_raft_test.go, line 94 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This test only uses one server at a time, so the fact that we have to use a TestCluster suggests that something is funky; I think we should do what it takes so that it can use a TestServer. Which brings me to the question I was asking elsewhere - why can't we just server.Start() after having stopped it? Even if we can't get that to work (although it would improve the state of our testing a lot if it did), I'd rather we build a second TestServer by hand than use any TestCluster methods.

I'd almost make the counter-argument here that if we have the ability to start a one-node cluster, this should obviate maintaining a separate test server. I think in practice we do at least internally want to maintain a clean separation between the test server and a surrounding cluster (we don't want another mtc where things are randomly shared between nodes). I don't think Alex can reasonably brush this up now, but I will keep this in mind.


pkg/kv/kvserver/client_raft_test.go, line 83 at r5 (raw file):

	defer log.Scope(t).Close(t)

	stickyEngineRegistry := server.NewStickyInMemEnginesRegistry()

👏

@tbg tbg self-requested a review January 18, 2021 09:20
@lunevalex
Author


pkg/kv/kvserver/client_raft_test.go, line 88 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ping

Now that the registry is scoped to the test, it felt like a moot point.

@lunevalex
Author

Ran `make stress PKG=./pkg/kv/kvserver STRESSFLAGS='-p 2'` on my laptop, no issues.

@lunevalex
Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Jan 19, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 19, 2021

Build failed:

@lunevalex
Author

bors r+

@craig
Contributor

craig bot commented Jan 19, 2021

Build succeeded:

@craig craig bot merged commit 47eb9f3 into cockroachdb:master Jan 19, 2021
@lunevalex lunevalex deleted the alex/remove-mtc-client-raft-test branch January 20, 2021 05:05

Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestReplicateRange failed
kv: TestReplicateAfterTruncation is flaky
teamcity: failed test: TestRestoreReplicas [skipped]
