kvserver: replace multiTestContext with TestCluster in client_raft_test#56812
Conversation
46dda5c to aab99c9
aab99c9 to bbb7f2e
This was caught by our grpc race provoking transport while stressing PR cockroachdb#56812. Release note: None
tbg
left a comment
It's really hard to spot anything in these diffs (not your fault, it's just mind numbing and repetitive and it would take a lot of time to actually verify that the intent of each test is still there and no chance of flakes was introduced), so instead of a very thorough review I stressed the tests you touched via
make stressrace TESTFLAGS='-p 1' PKG=./pkg/kv/kvserver/ TESTS="$(git diff HEAD^ | grep -oE 'func Test[A-Za-z0-9]+' | awk '{ print $2 }' | sort | uniq | tr '\n' '|' | rev | cut -c 2- | rev)"
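For illustration, the pipeline portion of that command extracts the names of the touched tests from the diff and joins them into a regex alternation for TESTS; here is a minimal sketch using a hypothetical diff hunk in place of the real `git diff HEAD^` output:

```shell
# Hypothetical sample of what `git diff HEAD^` might emit; names are
# illustrative only.
diff='func TestReplicateRange(t *testing.T) {
func TestSnapshotAfterTruncation(t *testing.T) {
func TestReplicateRange(t *testing.T) {'

# Extract unique test names and join them with "|"; the final
# rev | cut -c 2- | rev strips the trailing "|" left by tr.
tests=$(printf '%s\n' "$diff" \
  | grep -oE 'func Test[A-Za-z0-9]+' \
  | awk '{ print $2 }' \
  | sort | uniq \
  | tr '\n' '|' | rev | cut -c 2- | rev)

echo "$tests"  # TestReplicateRange|TestSnapshotAfterTruncation
```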
It did find this data race which isn't new here, fixing in #56899.
=== RUN TestSnapshotAfterTruncation/differentTerm
==================
WARNING: DATA RACE
Write at 0x00c00edfd843 by goroutine 869:
github.com/cockroachdb/cockroach/pkg/util/encoding.EncodeUint32Ascending()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/encoding/encoding.go:170 +0x369
github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).setChecksum()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/data.go:260 +0x2d0
github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).ensureRawBytes()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/data.go:374 +0x2af
github.com/cockroachdb/cockroach/pkg/roachpb.(*Value).SetProto()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/roachpb/data.go:475 +0x2f
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1260 +0x158
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707 +0x5e
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xeb
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xeb
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x124
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1257 +0x20a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1216 +0x233
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:888 +0x9a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:703 +0x2b2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).getLivenessFromKV()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1062 +0x51a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:696 +0x4d1
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:140 +0xe2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:689 +0x539
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:221 +0x1de
Previous read at 0x00c00edfd843 by goroutine 985:
encoding/base64.(*Encoding).Encode()
/usr/local/go/src/encoding/base64/base64.go:138 +0x11e
encoding/json.encodeByteSlice()
/usr/local/go/src/encoding/json/encode.go:832 +0x1de
encoding/json.structEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:759 +0x3c4
encoding/json.structEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:730 +0xa4
encoding/json.structEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:759 +0x3c4
encoding/json.structEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:730 +0xa4
encoding/json.ptrEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:915 +0x1c4
encoding/json.ptrEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:900 +0x84
encoding/json.structEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:759 +0x3c4
encoding/json.structEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:730 +0xa4
encoding/json.ptrEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:915 +0x1c4
encoding/json.ptrEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:900 +0x84
encoding/json.(*encodeState).reflectValue()
/usr/local/go/src/encoding/json/encode.go:358 +0x93
encoding/json.interfaceEncoder()
/usr/local/go/src/encoding/json/encode.go:714 +0xfb
encoding/json.structEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:759 +0x3c4
encoding/json.structEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:730 +0xa4
encoding/json.arrayEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:886 +0xe4
encoding/json.arrayEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:879 +0x84
encoding/json.sliceEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:860 +0xde
encoding/json.sliceEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:855 +0x84
encoding/json.structEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:759 +0x3c4
encoding/json.structEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:730 +0xa4
encoding/json.ptrEncoder.encode()
/usr/local/go/src/encoding/json/encode.go:915 +0x1c4
encoding/json.ptrEncoder.encode-fm()
/usr/local/go/src/encoding/json/encode.go:900 +0x84
encoding/json.(*encodeState).reflectValue()
/usr/local/go/src/encoding/json/encode.go:358 +0x93
encoding/json.(*encodeState).marshal()
/usr/local/go/src/encoding/json/encode.go:330 +0xd2
encoding/json.(*Encoder).Encode()
/usr/local/go/src/encoding/json/stream.go:206 +0xdd
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.GRPCTransportFactory.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport_race.go:124 +0x1e4
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:346 +0x169
Goroutine 869 (running) created at:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:214 +0xc4
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:670 +0x39d
github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1588 +0x432d
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1036 +0x50
github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:418 +0x12c
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).startServer()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:428 +0x140
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).AddAndStartServer()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:359 +0x157
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestSnapshotAfterTruncation.func1.1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_raft_test.go:716 +0x184
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestSnapshotAfterTruncation.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_raft_test.go:725 +0xcd7
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:259 +0x2f7
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:140 +0x10a
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestSnapshotAfterTruncation.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/client_raft_test.go:657 +0x359
testing.tRunner()
/usr/local/go/src/testing/testing.go:1123 +0x202
Goroutine 985 (running) created at:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:341 +0x138
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.GRPCTransportFactory()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport_race.go:102 +0x2c7
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1808 +0x521
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1495 +0x3e9
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1133 +0x1c3e
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:772 +0x73d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:86 +0x1c1
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x121
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:195 +0x214
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:267 +0x134
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:202 +0x464
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:252 +0x281
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x341
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:171 +0x1f4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:499 +0x6c1
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:744 +0x1d4
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:919 +0x207
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send-fm()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:903 +0xe4
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:654 +0x1a5
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:590 +0x148
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1279 +0x64f
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:707 +0x5e
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:811 +0xeb
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:706 +0x124
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLivenessAttempt()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1257 +0x20a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).updateLiveness()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1216 +0x233
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).heartbeatInternal()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:888 +0x9a7
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:703 +0x2b2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).getLivenessFromKV()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:1062 +0x51a
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1.1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:696 +0x4d1
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:140 +0xe2
github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.(*NodeLiveness).Start.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:689 +0x539
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:221 +0x1de
==================
I'll stress some more.
Reviewed 12 of 12 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @lunevalex)
pkg/kv/kvserver/client_raft_test.go, line 158 at r1 (raw file):
// Now create a new store with the same engine and make sure the expected data is present.
// We must use the same clock because a newly-created manual clock will be behind the one
stale comment.
pkg/kv/kvserver/client_raft_test.go, line 186 at r1 (raw file):
numIncrements := 0
keyA := roachpb.Key("a")
You could export the key that is used for ScratchRange and then use that instead of a which may be placed somewhere in the cluster that has real data. This would introduce a useful pattern for other tests.
cockroach/pkg/server/testserver.go
Line 1270 in a88e8e7
pkg/kv/kvserver/client_raft_test.go, line 256 at r1 (raw file):
// and a range, replicating the range to the second store, and reading its data there.
func TestReplicateRange(t *testing.T) {
	defer leaktest.AfterTest(t)()
should leaktest.AfterTest(t)() also verify that the number of sticky engines before and after hasn't changed? I'm noticing that we do use sticky engines more and more and there's a good chance we're leaking some of them.
It also feels like the sticky stuff can use a general overhaul for usability, but not for this PR.
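A sketch of what such a leak check could look like, using a hypothetical global registry of sticky engines; the names here are illustrative stand-ins, not the actual server package API:

```go
package main

import "fmt"

// stickyEngines stands in for a process-wide registry of sticky
// in-memory engines; the names in this sketch are hypothetical.
var stickyEngines = map[string]struct{}{}

func openStickyEngine(id string)  { stickyEngines[id] = struct{}{} }
func closeStickyEngine(id string) { delete(stickyEngines, id) }

// assertNoLeakedEngines compares the registry size before and after a
// test body, in the spirit of leaktest.AfterTest.
func assertNoLeakedEngines(body func()) error {
	before := len(stickyEngines)
	body()
	if after := len(stickyEngines); after != before {
		return fmt.Errorf("leaked %d sticky engine(s)", after-before)
	}
	return nil
}

func main() {
	// A well-behaved test closes what it opens.
	err := assertNoLeakedEngines(func() {
		openStickyEngine("1")
		closeStickyEngine("1")
	})
	fmt.Println(err) // <nil>

	// A leaky test is caught.
	err = assertNoLeakedEngines(func() { openStickyEngine("2") })
	fmt.Println(err)
}
```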
pkg/kv/kvserver/client_raft_test.go, line 348 at r1 (raw file):
defer log.Scope(t).Close(t)
skip.WithIssue(t, 40351)
Make sure #40351 is closed in this PR.
pkg/kv/kvserver/client_raft_test.go, line 1648 at r1 (raw file):
tc := testcluster.StartTestCluster(t, numServers, base.TestClusterArgs{
	// Set to auto since we want the replication queue on
.
pkg/kv/kvserver/client_raft_test.go, line 1718 at r1 (raw file):
// tests, as can a similar situation where the first store is no longer the lease holder of
// the first range; this verifies that those tests will not be affected.
// TODO(lunevalex): Remove this test when removing MTC, no need to convert it
Why not remove it now then? And why is there no need to convert it?
pkg/testutils/testcluster/testcluster.go, line 1070 at r1 (raw file):
// GetRaftLeader returns the replica that is the current raft leader for the
// specified key.
func (tc *TestCluster) GetRaftLeader(t testing.TB, key roachpb.RKey) *kvserver.Replica {
It's been hard historically to pin raft leaders due to flakiness (i.e. where the leader changed right after this method returns). This is probably not new code (?) so hopefully the uses here don't have that problem, but I wanted to point it out.
lunevalex
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 158 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
stale comment.
Removed
pkg/kv/kvserver/client_raft_test.go, line 186 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You could export the key that is used for ScratchRange and then use that instead of a, which may be placed somewhere in the cluster that has real data. This would introduce a useful pattern for other tests.
cockroach/pkg/server/testserver.go
Line 1270 in a88e8e7
Ideally everyone should just use the scratch range in new tests, since that does the split underneath the covers. I wanted to minimize changes to these tests, which is why I left everything as roachpb.Key("a").
pkg/kv/kvserver/client_raft_test.go, line 256 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
should leaktest.AfterTest(t)() also verify that the number of sticky engines before and after hasn't changed? I'm noticing that we do use sticky engines more and more and there's a good chance we're leaking some of them. It also feels like the sticky stuff can use a general overhaul for usability, but not for this PR.
Not sure I follow the intention here; the test will fail if you forget to close all the sticky engines via server.CloseAllStickyInMemEngines(), as they will leak Stoppers, and internally you will not be able to reuse an engine without closing one first.
pkg/kv/kvserver/client_raft_test.go, line 348 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Make sure #40351 is closed in this PR.
Good catch.
pkg/kv/kvserver/client_raft_test.go, line 1718 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why not remove it now then? And why is there no need to convert it?
This test tests multiTestContext itself, so it seemed prudent to leave it as is until we are ready to delete mtc. Since it does not add anything to the test coverage of raft aside from testing a peculiarity of mtc, I did not feel the need to convert it.
pkg/testutils/testcluster/testcluster.go, line 1070 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's been hard historically to pin raft leaders due to flakiness (i.e. where the leader changed right after this method returns). This is probably not new code (?) so hopefully the uses here don't have that problem, but I wanted to point it out.
This was copy/paste from mtc, so the semantics of the tests should have remained the same, just needed this function on TestCluster to replicate the behavior.
9488014 to 30f3d50
andreimatei
left a comment
I just did a first pass for now over TestCluster.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lunevalex and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 98 at r2 (raw file):
ServerArgs: stickySpecTestServerArgs,
})
defer tc.Stopper().Stop(context.Background())
nit: try not to use context.Background() in new code, it's too much noise. Declare a ctx := context.Background() at the beginning.
pkg/testutils/testcluster/testcluster.go, line 128 at r2 (raw file):
// ServerStopped determines if a server has been explicitly
// stopped by stopServer(s).
You mean StopServer()?
pkg/testutils/testcluster/testcluster.go, line 352 at r2 (raw file):
func (tc *TestCluster) AddAndStartServer(t testing.TB, serverArgs base.TestServerArgs) {
	if serverArgs.JoinAddr == "" && len(tc.Servers) > 0 {
		serverArgs.JoinAddr = tc.GetFirstLiveServer().ServingRPCAddr()
nit: for future times consider isolating changes like these in their own commits, so they can carry an explanation.
pkg/testutils/testcluster/testcluster.go, line 1002 at r2 (raw file):
// from all configured engines, filling in zeros when the value is not
// found.
func (tc *TestCluster) readIntFromStores(key roachpb.Key) []int64 {
nit: comment on this method too that stopped servers don't count
pkg/testutils/testcluster/testcluster.go, line 1003 at r2 (raw file):
// found.
func (tc *TestCluster) readIntFromStores(key roachpb.Key) []int64 {
	results := make([]int64, 0)
A make with length 0 and no explicit capacity is unusual, so just do var result []int64 if that's what you want. But the make() was right the way it was, and it expressed the right thing - that results will usually have len(servers). This is different from expressing that results will always have len(servers), in which case the best style would have been to declare it as results := make([]int64, len(servers)), and the loop below would have assigned directly to the array and not done appends.
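The three idioms being contrasted can be shown side by side; this standalone sketch uses illustrative helper names (readInts0 and friends are not the actual TestCluster methods):

```go
package main

import "fmt"

// readInts0 uses a nil slice: appropriate when the final length is unknown
// and append does all the work.
func readInts0(n int) []int64 {
	var results []int64
	for i := 0; i < n; i++ {
		results = append(results, int64(i))
	}
	return results
}

// readIntsCap preallocates capacity but starts at length 0: it expresses
// that the result will usually, but not always, reach n elements
// (e.g. stopped servers may be skipped).
func readIntsCap(n int) []int64 {
	results := make([]int64, 0, n)
	for i := 0; i < n; i++ {
		results = append(results, int64(i))
	}
	return results
}

// readIntsLen allocates the full length up front: it expresses that the
// result always has exactly n elements, so assign by index, don't append.
func readIntsLen(n int) []int64 {
	results := make([]int64, n)
	for i := range results {
		results[i] = int64(i)
	}
	return results
}

func main() {
	fmt.Println(len(readInts0(3)), cap(readIntsCap(3)), len(readIntsLen(3))) // 3 3 3
}
```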
pkg/testutils/testcluster/testcluster.go, line 1056 at r2 (raw file):
}

// GetFirstLiveServer Returns the first server in the list that has not been
s/Returns/returns
pkg/testutils/testcluster/testcluster.go, line 1056 at r2 (raw file):
}

// GetFirstLiveServer Returns the first server in the list that has not been
As far as I can tell, nobody outside of TestCluster is calling this. Consider making it private, as I'm not very convinced of the need for it.
pkg/testutils/testcluster/testcluster.go, line 1057 at r2 (raw file):
// GetFirstLiveServer Returns the first server in the list that has not been
// explicitly stopped. If all the servers are stopped, it returns the first one.
The semantics of returning the first if they're all stopped seem bizarre to me. Who needs that? Whoever it is, I think it should probably do something else, and this function should return nil.
pkg/testutils/testcluster/testcluster.go, line 1060 at r2 (raw file):
func (tc *TestCluster) GetFirstLiveServer() *server.TestServer {
	for i := range tc.Servers {
		// Use the first non-stopped server to avoid a deadlock
What's up with this comment? Doesn't seem to belong here.
pkg/testutils/testcluster/testcluster.go, line 1107 at r2 (raw file):
}

// Restart stops and then starts all the servers in the cluster
nit: period at the end of the sentence
pkg/testutils/testcluster/testcluster.go, line 1108 at r2 (raw file):
// Restart stops and then starts all the servers in the cluster
func (tc *TestCluster) Restart(t testing.TB, serverArgsPerNode map[int]base.TestServerArgs) {
The mechanism used by this Restart seems clunky to me; I think we should try to improve it. If I understand AddAndStartServer correctly, it'll leave the old servers in place and append new ones to the array. That's quite confusing; we shouldn't treat the servers in the cluster as "immutable" in this way; instead we should be mutating them in place - otherwise index management becomes really tedious.
This has ramifications for ServerStopped(), which should reflect a server having been restarted.
pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):
for _, specs := range args.StoreSpecs {
	if specs.StickyInMemoryEngineID == "" {
		t.Fatalf("Restart can only be used when the servers were started with a sticky engine")
I think it'd be better to panic here and not take the t as an argument. This failure seems like a static condition that either happens during development, or never, and so we don't need any care in handling it - and so it's not worth plumbing the t param.
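The pattern being suggested, sketched with toy types (StoreSpec and validateRestartable here are illustrative stand-ins, not the real base package API): misuse is a static programming error hit during development, so a panic avoids plumbing t through.

```go
package main

import "fmt"

// StoreSpec is a toy stand-in for a store configuration.
type StoreSpec struct{ StickyInMemoryEngineID string }

// validateRestartable panics instead of taking a testing.TB: the failure
// is a static misuse that either happens during development or never,
// so no graceful handling (and no t plumbing) is needed.
func validateRestartable(specs []StoreSpec) {
	for _, spec := range specs {
		if spec.StickyInMemoryEngineID == "" {
			panic("Restart can only be used when the servers were started with a sticky engine")
		}
	}
}

func main() {
	validateRestartable([]StoreSpec{{StickyInMemoryEngineID: "1"}}) // ok
	fmt.Println("validated")
}
```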
lunevalex
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 98 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: try not to use context.Background() in new code, it's too much noise. Declare a ctx := context.Background() at the beginning.
ack
pkg/testutils/testcluster/testcluster.go, line 128 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
You mean StopServer()?
yes
pkg/testutils/testcluster/testcluster.go, line 1003 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
A make with length 0 and no explicit capacity is unusual, so just do var result []int64 if that's what you want. But the make() was right the way it was, and it expressed the right thing - that results will usually have len(servers). This is different from expressing that results will always have len(servers), in which case the best style would have been to declare it as results := make([]int64, len(servers)), and the loop below would have assigned directly to the array and not done appends.
So it will not have len(servers), since we ignore stopped ones; will remove the make. Thanks for the tip.
pkg/testutils/testcluster/testcluster.go, line 1056 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
As far as I can tell, nobody outside of TestCluster is calling this. Consider making it private, as I'm not very convinced of the need for it.
ack
pkg/testutils/testcluster/testcluster.go, line 1060 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
What's up with this comment? Doesn't seem to belong here.
oops
pkg/testutils/testcluster/testcluster.go, line 1108 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
The mechanism used by this Restart seems clunky to me; I think we should try to improve it. If I understand AddAndStartServer correctly, it'll leave the old servers in place and append new ones to the array. That's quite confusing; we shouldn't treat the servers in the cluster as "immutable" in this way; instead we should be mutating them in place - otherwise index management becomes really tedious.
This has ramifications for ServerStopped(), which should reflect a server having been restarted.
Yes, it is tedious and I don't love it. I did not want to modify the array in place for this either, as that could also be confusing. What I am thinking of doing in a future iteration is to add methods to TestCluster that address servers by their (NodeID, StoreID) rather than index, which should remove a lot of the clunkiness here. Then we can just make those methods look up the first non-stopped server with that (NodeID, StoreID).
pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think it'd be better to panic here and not take the t as an argument. This failure seems like a static condition that either happens during development, or never, and so we don't need any care in handling it - and so it's not worth plumbing the t param.
No strong feelings on this one, I was following the general pattern for how we deal with errors in testing.
30f3d50 to 7fda76d
lunevalex
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)
pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):
Previously, lunevalex wrote…
No strong feelings on this one, I was following the general pattern for how we deal with errors in testing.
Actually, AddAndStartServer already takes t, so doing what you suggest would be more of a PITA, since we already have to pipe t through.
fb9f583 to 748ee9a
andreimatei
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 82 at r3 (raw file):
defer leaktest.AfterTest(t)()
defer log.Scope(t).Close(t)
defer server.CloseAllStickyInMemEngines()
As far as I can tell, before this PR these "sticky in-mem engines" were used for cockroach demo, and for a single other test. In their current form, I don't think we should allow them to proliferate, because of their global scope. Any use of globals in tests has eventually proven a pain in the ass in the past. At the very least, this CloseAllStickyInMemEngines() forever precludes us from running tests in parallel.
I think we should have tests create a MemorySpace or such, pass it into the StoreSpec, and have engines be associated with that space.
For TestClusters (as opposed to TestServers), I think StartTestCluster can allocate the MemorySpace, and we can have a new tc.Destroy() which cleans it up and does tc.Stopper().Stop() (btw, I generally think that tc.Stopper().Stop() should have a short-hand).
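A toy sketch of the non-global shape being proposed; MemorySpace, Open, and Close are illustrative names for the idea, not existing cockroach APIs:

```go
package main

import "fmt"

// engine stands in for a sticky in-memory storage engine.
type engine struct {
	id     string
	closed bool
}

// MemorySpace scopes a set of sticky engines to one test cluster,
// instead of registering them in a process-wide global registry.
type MemorySpace struct{ engines []*engine }

// Open creates an engine owned by this space.
func (m *MemorySpace) Open(id string) *engine {
	e := &engine{id: id}
	m.engines = append(m.engines, e)
	return e
}

// Close releases every engine in this space; other tests' spaces are
// untouched, so tests could in principle run in parallel.
func (m *MemorySpace) Close() {
	for _, e := range m.engines {
		e.closed = true
	}
	m.engines = nil
}

func main() {
	space := &MemorySpace{} // would be allocated by StartTestCluster
	defer space.Close()     // would be part of a hypothetical tc.Destroy()
	e := space.Open("test-1")
	fmt.Println(e.id, e.closed) // test-1 false
}
```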
pkg/kv/kvserver/client_raft_test.go, line 88 at r3 (raw file):
{
	InMemory: true,
	StickyInMemoryEngineID: "1",
put the name of the test in this id so that we can track it down in case it leaks
pkg/kv/kvserver/client_raft_test.go, line 94 at r3 (raw file):
ctx := context.Background()
tc := testcluster.StartTestCluster(t, 1,
This test only uses one server at a time, so the fact that we have to use a TestCluster shows that something is funky; I think we should do what it takes so that it can use a TestServer. Which brings me to the question I was asking elsewhere - why can't we just server.Start() after having stopped it? Even if we can't get that to work (although it would improve the state of our testing a lot if it did), I'd rather we build a second TestServer by hand than use any TestCluster methods.
pkg/kv/kvserver/node_liveness_test.go, line 71 at r3 (raw file):
func pauseNodeLivenessHeartbeatLoopsTC(tc *testcluster.TestCluster) func() {
	var enableFns []func()
I think this should be a method on TestCluster, and be available outside of the kvserver package.
pkg/testutils/testcluster/testcluster.go, line 1003 at r2 (raw file):
Previously, lunevalex wrote…
So it will not have len(servers) since we ignore stopped ones, will remove make thanks for the tip.
To clarify, I was saying that results := make([]int64, 0, len(tc.Servers)) would have been appropriate here. Whatever you want though.
pkg/testutils/testcluster/testcluster.go, line 1108 at r2 (raw file):
I did not want to modify the array in place for this either, as that could also be confusing.
Why would it be confusing? You're worried about people holding references to stores and such taken before the Restart? Cause if that's the worry, such a test would be broken with the current impl too.
What I actually would like to know is why we need to assign or append to the array at all. Why can't we just stop and then Start() a server after having previously stopped it? Why do we need to construct a new server (with an old storage engine)?
At the very least, this method needs a comment explaining how the TestCluster will look afterwards.
pkg/testutils/testcluster/testcluster.go, line 1115 at r2 (raw file):
Previously, lunevalex wrote…
Actually AddAndStartServer already takes t, so doing what you suggest would actually be more of a PITA, since we already have to pipe through t
ok
pkg/kv/kvserver/client_raft_test.go, line 82 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Discussed with @andreimatei offline; going to split this change into two: one to enhance TestCluster to better support in-memory engines and restarts, and one with the actual testing changes in client_raft.
@tbg @andreimatei TestCluster.Restart was pulled into #57123 and merged. I rebased this on top of that work and cleaned it up. Let me know if you want to look at the changes again or if they are good to go.
Makes progress on cockroachdb#8299
Fixes cockroachdb#40351
Fixes cockroachdb#57560
Fixes cockroachdb#57537

multiTestContext is a legacy construct that is deprecated in favor of running tests via TestCluster. This is one PR out of many to remove the usage of multiTestContext in the client_raft test cases. This does not remove all the uses of mtc, just the simple ones, leaving the more complex use cases for a later PR.

With this switch we can also clean up some TestingKnobs and TestServer interfaces:
- DisablePeriodicGossips flag is removed; it does not work with TestCluster and is no longer used.
- DontPreventUseOfOldLeaseOnStart flag is removed; it did not work consistently in TestCluster. This flag tries to leave the lease on the same node after a restart, but CRDB makes no such guarantee in the real world, and artificially testing it does not prove anything. The affected tests were reworked to not rely on this condition and can deal with a leaseholder moving on a restart.
- GetRaftLeader is ported from multiTestContext to TestCluster.

Release note: None
tbg
left a comment
There was a problem hiding this comment.
I didn't review again in detail, but what I saw looks good. Could you stress the pkg for an hour or so before merging?
Reviewed 7 of 10 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lunevalex, and @tbg)
pkg/kv/kvserver/client_raft_test.go, line 88 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
put the name of the test in this id so that we can track it down in case it leaks
Ping
pkg/kv/kvserver/client_raft_test.go, line 94 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This test only uses one server at a time, so I think the fact that we have to use a `TestCluster` shows that something is funky; I think we should do what it takes so that it can use a `TestServer`. Which brings me to the question I was asking elsewhere - why can't we just `server.Start()` after having stopped it? Even if we can't get that to work (although it would improve the state of our testing a lot if it did work), I'd rather we build a second `TestServer` by hand rather than use any `TestCluster` methods.
I'd almost make the counter-argument here, that if we have the ability to start a one-node cluster, that this should obviate maintaining a separate test server. I think in practice we do at least internally want to maintain a clean separation between the test server and a surrounding cluster (don't want another mtc where things are randomly shared between nodes). I don't think Alex can reasonably brush this up now, but I will keep this in mind.
pkg/kv/kvserver/client_raft_test.go, line 83 at r5 (raw file):
defer log.Scope(t).Close(t)
stickyEngineRegistry := server.NewStickyInMemEnginesRegistry()
👏
pkg/kv/kvserver/client_raft_test.go, line 88 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Now that the registry is scoped to the test, it felt like a moot point.
TFTR! bors r+

Build failed (retrying...)

Build failed

bors r+

Build succeeded

Makes progress on #8299
Fixes #40351
Fixes #57560
Fixes #57537
multiTestContext is a legacy construct that is deprecated in favor of running
tests via TestCluster. This is one PR out of many to remove the usage of
multiTestContext in the client_raft test cases. This does not remove all the
uses of mtc, just the simple ones. Leaving the more complex uses cases for a later PR.
With this switch we can also clean up some TestingKnobs and TestServer interfaces.
- DisablePeriodicGossips flag is removed, it does not work with TestCluster
and is no longer used
- DontPreventUseOfOldLeaseOnStart flag is removed, it did not work consistently
in TestCluster. This flag tries to leave the Lease on the same node after a
restart, but CRDB makes no such guarantees in the real world and artificially
testing it does not prove anything. The affected tests were re-worked to
not rely on this condition and can deal with a lease holder moving on a restart.
- GetRaftLeader is ported from multiTestContext to TestCluster
Release note: None