Skip to content

multitenant: randomly run tests with tenants#76582

Merged
craig[bot] merged 9 commits intocockroachdb:masterfrom
ajstorm:ajstorm-second-rework-testserver
Jun 17, 2022
Merged

multitenant: randomly run tests with tenants#76582
craig[bot] merged 9 commits intocockroachdb:masterfrom
ajstorm:ajstorm-second-rework-testserver

Conversation

@ajstorm
Copy link
Copy Markdown
Collaborator

@ajstorm ajstorm commented Feb 15, 2022

Previously, testServer would run by default in single-tenant mode. This PR
changes testServer to probabilistically run in multi-tenant mode, where it
runs all statements through the default test tenant. Since we want to test in
both single-tenant and multi-tenant mode, we only probabilistically run with
tenants (with probability 0.5). We avoid running all tests on each CI run through
both the single tenant and multi-tenant paths to save on resources. This
decision may be revisited down the road.

This PR also contains a few pieces of cleanup which were necessary to make
the testing changes:

  • Changed the multi-region backup tests to use node 0 instead of node 1, to
    allow them to work within a tenant. Tenants aren't able to access remote
    nodes of nodelocal stoarge.
  • Cleaned up a couple of cases where our error handling was passing nil
    structs to be printed out. These were only causing problems in some
    intermediate failure scenarios which have since been resolved. Still, it
    made sense to make these changes in case the code evolves down the
    road which makes it possible to hit these cases again.
  • Had to special case the setting of kv.range_merge.queue_enabled because
    it is only available to the system SQL server. As a result, all tests which
    check for its setting had to be modified to no longer check for it (since
    it will only be set in the case where we have no SQL server).

Additionally, we've created a tracking issue for all tests which currently
don't work under the default test tenant (#76378).

See individual commits for a detailed breakdown of the above.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@ajstorm ajstorm force-pushed the ajstorm-second-rework-testserver branch 3 times, most recently from 93a2ced to 26d4861 Compare February 15, 2022 17:28
@ajstorm ajstorm force-pushed the ajstorm-second-rework-testserver branch 6 times, most recently from ac9dd49 to 112e3e5 Compare April 28, 2022 15:26
@ajstorm ajstorm force-pushed the ajstorm-second-rework-testserver branch 8 times, most recently from 26ad0d7 to b921775 Compare May 5, 2022 21:15
@ajstorm ajstorm force-pushed the ajstorm-second-rework-testserver branch 3 times, most recently from be291ff to 7d7ccc0 Compare May 6, 2022 20:55
rharding6373 added a commit to rharding6373/cockroach that referenced this pull request May 7, 2022
@ajstorm ajstorm force-pushed the ajstorm-second-rework-testserver branch 5 times, most recently from 8183061 to 671f8a1 Compare May 16, 2022 19:56
@ajstorm ajstorm force-pushed the ajstorm-second-rework-testserver branch 2 times, most recently from 7e8e93f to ddad2de Compare May 17, 2022 03:33
@knz
Copy link
Copy Markdown
Contributor

knz commented May 31, 2022

Could we stress test these in CI?

Alternatively, what about we make the random determination in CI, instead of within the go code?
i.e. have TC choose either one CI target or the other depending on a % random.

This way we would always see in the CI output what has been run and how

Copy link
Copy Markdown
Collaborator Author

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we stress test these in CI? E.g., we could trigger "Stress Trigger (Bazel)" against the PR branch.

What would this do under the covers? Are you thinking that we'd run all of the test explicitly in both forceTenant and forceNoTenant modes? Or that by stressing it we'd get that for free because after N iterations, we'd be guaranteed that we'd hit both modes? Or something else?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @RichardJCai, @stevendanna, and @tbg)

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would this do under the covers?

It would add non-determinism to all the test runs. The nightly stress trigger [1] invokes each package with -maxruns 100 and -maxtime 1h0m0s using two separate builds, one with -race and one without; i.e., each package is stressed for a total of 2h during which the binary (with or without thread sanitizer) is forked up to 100 times.

[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_StressTriggerBazel/5312206

Or that by stressing it we'd get that for free because after N iterations, we'd be guaranteed that we'd hit both modes?

Exactly. The default configuration should yield enough executions of each mode; on average, each package is executed at least a couple of dozen of times, e.g., [1]. We could also raise -maxruns and maxtime if these tests are taking long.

[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_StressBazel/5340928

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @RichardJCai, @stevendanna, and @tbg)

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave another round of review to this.

There are two general issues remaining:

  1. the current idea that the caller of the test "start" functions is responsible for cleaning up / shutting down things if the start function returns an error. This is not idiomatic go. The start functions that return an error are responsible for cleaning up after themselves.

  2. terminology. We've talked about "SQL server" but it's not just that that's problematic.
    In a couple of places you've introduced uses of the term "host cluster" where in this case "storage cluster" (or maybe even "nearest kv node") would be appropriate. I had comments about that already.

Reviewed 5 of 37 files at r16, 6 of 9 files at r17, 1 of 30 files at r20, 2 of 2 files at r21, 27 of 27 files at r22, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @RichardJCai, @stevendanna, and @tbg)


pkg/cli/testutils.go line 108 at r8 (raw file):

With the code which calls skip(), it's now possible that the test continues after fail() is called

skip also stops the test immediately (internally by emitting a panic and catching it in the go test code). What's an example of something bad that happens without it?


pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go line 494 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Good catch. Reverted.

I don't see it reverted. In fact, I see backupMsg being computed above the conditional with the .String() method. That's ... sub-optimal. The Sprintf call will already call the .String() method.
Is it possible to remove this file entirely from the PR?


pkg/server/testserver.go line 559 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

What's the "that" you're referring to here?

Your statement that "Trying to create a second test SQL server for the same test server" is "not currently supported"

It looks to me like the demo code does this, so it's in fact supported.

image.png


pkg/server/testserver.go line 560 at r22 (raw file):

		// Trying to create a second test SQL server for the same test server
		// is not currently supported.
		panic(fmt.Sprintf("invalid number of test SQL servers %d", len(ts.SQLServers)))

errors.AssertionFailedf and no need to panic.


pkg/sql/pgwire/pgerror/errors.go line 40 at r6 (raw file):

Previously, ajstorm (Adam Storm) wrote…

It's not strictly required now, but the way the code is written, it can't accommodate nil input. The problem is that if err comes in as nil, the errors.As() call will return false, putting us into the else branch. In that else branch, Flatten() will also return nil, and then we get a nil pointer dereference when we try to call pg.Severity (and also err.Error()). Through massaging this PR over time, I think I've worked around all of the cases I introduced which ended up passing in nil here, but I figured it was probably worth making this change anyway.

Are you thinking that we should treat nil differently in here?

FullError's API contract is that the input must be non-nil.
If given a nil error, it's OK for it to crash. If you want to make it more obvious, feel free to do if err == nil { panic(...) }.

In the end we want to ensure that callers provide non-nil errors to it; your change here removes our ability to detect when it's not the case.


pkg/testutils/serverutils/test_server_shim.go line 293 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Yes, there is something to stop. With the changes I've made, server.Start() now works in two steps. First it starts the server, then it probabilistically starts a test tenant. If the first part succeeds and the second part fails (say, because we don't have a CCL binary), then there's something to stop (the server).

That doesn't sound like the proper API design. If the Start() method fails in the 2nd step, then it should be responsible for stopping the things it has started so far. We use this pattern in many places. It's just too cumbersome (let alone not idiomatic go) to require the caller to do this conditional cleanup.


pkg/testutils/testcluster/testcluster.go line 361 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

In this case I think we need it because we've already started some servers and we're about to skip the test. I've added a comment to that effect.

But that's still not right I think. The call to startServer() could have failed also before you started this PR. The argument "we need to clean up the server" was equally true before. So why is it important to do this in this PR?


pkg/sql/logictest/testdata/logic_test/system line 1073 at r6 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Done.

what's done? Rachael was suggesting not having the condition anymore?


pkg/ccl/multitenantccl/tenantcostclient/tenant_side_test.go line 645 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

If we don't disable the extra SQL server here, the test will probabilistically run all operations which were intended for the system tenant, in an encapsulated (secondary) tenant. I can't remember what actually broke in this test when we did that, but perhaps it was caused by the fact that we're doing a low-level KV scan?

The thing is, I believe this test really should work on any tenant. So maybe worth investigating further?


pkg/server/server_test.go line 142 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Isn't it the case that if we get an error out of StartServerRaw() (which was triggered after the server was started), and then we don't add the stopper before calling t.Fatal(), that we'll leak the server? I thought I was seeing that in some of my tests, but perhaps I was mistaken.

I would hope the API of StartServerRaw is that the caller has nothing to do if it returns an error. That's the same idea as my other earlier comment.

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one failure during the stress job [1]. MT-mode for the unit tests in pkg/ccl appears to be chosen with p=0.5. I'll trigger another run; otherwise, the mechanism looks good to me.

   === RUN   TestTelemetry/multiregion/server
13:14:02         telemetry.go:110:
13:14:02             testdata/telemetry/multiregion:1:
13:14:02             feature-allowlist [0 args]
13:14:02             sql.multiregion.*
13:14:02             ----
13:14:02         telemetry.go:110:
13:14:02             testdata/telemetry/multiregion:5: CREATE DATABASE survive_zone PRIMARY REGION "us-east-1" SURVIVE ZONE FAILURE
13:14:02             expected:
13:14:02             sql.multiregion.create_database
13:14:02             sql.multiregion.create_database.survival_goal.survive_zone_failure
13:14:02     
13:14:02             found:
13:14:02             error: pq: setting sql.multi_region.allow_abstractions_for_secondary_tenants.enabled disallows use of multi-region abstractions
13:14:02             sql.multiregion.create_database
13:14:02             sql.multiregion.create_database.survival_goal.survive_zone_failure

[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_ScratchProjectPutTcExperimentsInHere_StressBazel/5360495

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @RichardJCai, @stevendanna, and @tbg)

Copy link
Copy Markdown
Collaborator Author

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the second pass @knz . Your second issue is something I plan to address with one final commit once all of the other comments are addressed. I want to get everything else out of the way so that I have something stable on which to lay down the naming changes.

I've also addressed all of your new comments. Thanks, and RFAL!

Nice catch @srosenberg! Not sure why I wasn't seeing this failure in my runs, but I've added multiregion to the skiplist for the tenant-level telemetry tests. RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @RichardJCai, @stevendanna, and @tbg)


pkg/cli/testutils.go line 108 at r8 (raw file):

Previously, knz (kena) wrote…

With the code which calls skip(), it's now possible that the test continues after fail() is called

skip also stops the test immediately (internally by emitting a panic and catching it in the go test code). What's an example of something bad that happens without it?

Without the stopper in here I was seeing instances where leak detection was catching a leaked server.


pkg/server/testserver.go line 559 at r8 (raw file):

Previously, knz (kena) wrote…

Your statement that "Trying to create a second test SQL server for the same test server" is "not currently supported"

It looks to me like the demo code does this, so it's in fact supported.

image.png

Ahh, that :-)

My comment may not have been precise enough. You're right that creating a second test SQL server for the same test server is supported. What's not fully supported is creating multiple default SQL servers for a default test tenant. The reason why is because we currently store the default test SQL server in an array (SQLServers) and unconditionally return, in ServingSQLAddr(), the 0th entry in that array as the server's SQL address. I did it this way to abstract away the fact that tests may now be running within tenants, as otherwise I might have to change all of the callers to ServingSQLAddr().

I'm currently blocking creating multiple default test SQL servers to avoid the case where multiple are created but we're only ever using the first one in ServingSQLAddr(). If someone wants to open up the option to create multiple default test SQL servers at some point down the road, I also wanted to alert them to this issue through these panics.

This is explained more fully in the comment in ServingSQLAddr(), but I've clarified it here too.


pkg/server/testserver.go line 560 at r22 (raw file):

Previously, knz (kena) wrote…

errors.AssertionFailedf and no need to panic.

Cool, thanks!


pkg/sql/pgwire/pgerror/errors.go line 40 at r6 (raw file):

Previously, knz (kena) wrote…

FullError's API contract is that the input must be non-nil.
If given a nil error, it's OK for it to crash. If you want to make it more obvious, feel free to do if err == nil { panic(...) }.

In the end we want to ensure that callers provide non-nil errors to it; your change here removes our ability to detect when it's not the case.

SGTM. Done.


pkg/testutils/serverutils/test_server_shim.go line 293 at r8 (raw file):

Previously, knz (kena) wrote…

That doesn't sound like the proper API design. If the Start() method fails in the 2nd step, then it should be responsible for stopping the things it has started so far. We use this pattern in many places. It's just too cumbersome (let alone not idiomatic go) to require the caller to do this conditional cleanup.

Excellent point. I've made the change. It's much cleaner now.


pkg/testutils/testcluster/testcluster.go line 361 at r8 (raw file):

Previously, knz (kena) wrote…

But that's still not right I think. The call to startServer() could have failed also before you started this PR. The argument "we need to clean up the server" was equally true before. So why is it important to do this in this PR?

I just tested again with removing this code and it definitely fails. I think the difference between between before and this PR is that we're skipping instead of failing the test. Is it possible that when you fail, the leak detector doesn't run? That would seem to make sense, as in many failures I'd imagine we're leaking things.

I've added a comment to make this more clear.


pkg/sql/logictest/testdata/logic_test/system line 1073 at r6 (raw file):

Previously, knz (kena) wrote…

what's done? Rachael was suggesting not having the condition anymore?

She was suggesting not having the skipif now that we're filtering out the kv.range_merge stuff. I added kv.range_merge filtering and removed the skipif.


pkg/ccl/multitenantccl/tenantcostclient/tenant_side_test.go line 645 at r8 (raw file):

Previously, knz (kena) wrote…

The thing is, I believe this test really should work on any tenant. So maybe worth investigating further?

Sure, I've changed the comment and have added it to the tracking issue.


pkg/server/server_test.go line 142 at r8 (raw file):

Previously, knz (kena) wrote…

I would hope the API of StartServerRaw is that the caller has nothing to do if it returns an error. That's the same idea as my other earlier comment.

With my previous change, this is now revertable. Thanks.


pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go line 494 at r8 (raw file):

Previously, knz (kena) wrote…

I don't see it reverted. In fact, I see backupMsg being computed above the conditional with the .String() method. That's ... sub-optimal. The Sprintf call will already call the .String() method.
Is it possible to remove this file entirely from the PR?

Fully reverted now.

// We're starting the default SQL server (i.e. we're running this test in a
// tenant). Log this for easier debugging.
log.Shout(context.Background(), severity.INFO,
"NOTE: Running test with the default SQL Server (i.e. within a tenant). "+
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This is useful for debugging; e.g., grep + wc to get a rough idea of the number of unit tests which ran in MT-mode.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 28 files at r19, 1 of 30 files at r23, 1 of 27 files at r25, 1 of 29 files at r26, 2 of 2 files at r27, 26 of 26 files at r28, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @RichardJCai, @stevendanna, and @tbg)


pkg/cli/testutils.go line 108 at r8 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Without the stopper in here I was seeing instances where leak detection was catching a leaked server.

I think the correct way forward here is not to add complexity in the tests themselves, and instead add the suitable condition to util/leaktest/leaktest.go:

                // If the test already failed, we don't pile on any more errors but we check
                // to see if the leak detector should be disabled for future tests.
                if t.Failed() {

maybe add || t.Skipped() in there.

Then check if the code here can be simplified accordingly.


pkg/server/admin_test.go line 139 at r4 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Done - #81590.

Maybe add a reference to the issue in the comment, also below.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have more comments about code organization. looking forward to the terminology fixes now.

Reviewed 1 of 55 files at r29, 19 of 19 files at r30, 7 of 8 files at r31, 2 of 2 files at r32, 26 of 26 files at r33, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @RichardJCai, @stevendanna, and @tbg)

Copy link
Copy Markdown
Collaborator Author

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed all outstanding comments (including the naming with a 9th commit), and am hoping to merge this sometime later this week (or during breather week).

@knz can you take a look at the 9th commit, which should address all of your comments?
@stevendanna if you want to have a look, can you please do so sometime early next week?
@srosenberg if there's anything else you'd like to look at, can you also do it early next week? If not, can I get an LGTM for the first and 8th commits?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @RichardJCai, @srosenberg, @stevendanna, and @tbg)


pkg/cli/testutils.go line 108 at r8 (raw file):

Previously, knz (kena) wrote…

I think the correct way forward here is not to add complexity in the tests themselves, and instead add the suitable condition to util/leaktest/leaktest.go:

                // If the test already failed, we don't pile on any more errors but we check
                // to see if the leak detector should be disabled for future tests.
                if t.Failed() {

maybe add || t.Skipped() in there.

Then check if the code here can be simplified accordingly.

As mentioned over Slack, I'm going to stick with the stopper code here, for the following reasons:

  • There’s only a couple of places in the code where they’re needed
  • We know that the test is going to be skipped in those places, so doing a bit of cleanup doesn’t seem unprecedented (we do similar cleanup when we create a server and encounter an error).
  • It ensures that we’re not exposed to stopper leaks building up in tests and exhausting memory.
  • It’s not as simple as you described to get the leaktest disabling to work on skip. I tried it out just now as you suggested and it still detects leaks.

If we view this as a problem long term, we can investigate more fully down the road.


pkg/server/config.go line 182 at r8 (raw file):

Previously, knz (kena) wrote…

See previous comments about the naming of this attribute.

Renamed with 9th commit.


pkg/server/testserver.go line 318 at r8 (raw file):

Previously, knz (kena) wrote…

see previous comments about the naming of this attribute (and the explanatory comment).

Renamed to testTenants in 9th commit.


pkg/sql/logictest/logic.go line 560 at r8 (raw file):

Previously, knz (kena) wrote…

see previous discussion about the naming.

Renamed to disableDefaultTestTenant in the 9th commit.


pkg/testutils/serverutils/test_cluster_shim.go line 209 at r8 (raw file):

Previously, knz (kena) wrote…

s/Host/Storage

Done with 9th commit.


pkg/testutils/serverutils/test_server_shim.go line 93 at r28 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

👍 This is useful for debugging; e.g., grep + wc to get a rough idea of the number of unit tests which ran in MT-mode.

Done.


pkg/testutils/testcluster/testcluster.go line 60 at r8 (raw file):

Previously, knz (kena) wrote…

s/host/storage throughout.

Renamed in the 9th commit.


pkg/ccl/cliccl/debug_backup_test.go line 547 at r2 (raw file):

Previously, knz (kena) wrote…

Here and below, I think it would be more enlightening to say "for encapsulated tenants"

Done.


pkg/server/admin_test.go line 139 at r4 (raw file):

Previously, knz (kena) wrote…

Maybe add a reference to the issue in the comment, also below.

Done.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 19 files at r37, 16 of 20 files at r38, 4 of 8 files at r39, 12 of 26 files at r41, 66 of 66 files at r42, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @RichardJCai, @srosenberg, @stevendanna, and @tbg)


pkg/server/config.go line 183 at r42 (raw file):

	SpanConfigsDisabled bool

	// Disables the default SQL server.

this one still needs an update.

Also the name of the configuration knob I think?


pkg/server/testserver.go line 314 at r42 (raw file):

	// httpTestServer provides the HTTP APIs of TestTenantInterface.
	*httpTestServer
	// The SQL servers associated with this server, and used for probabilistic

this comment needs an update too?


pkg/server/testserver.go line 500 at r42 (raw file):

}

// maybeStartTestSQLServer might start a test SQL server. This can then be used

ditto here and below


pkg/testutils/serverutils/test_cluster_shim.go line 206 at r42 (raw file):

	// StartedDefaultSQLServer returns whether this cluster started a
	// default test SQL server.

test tenant? and maybe update the method name accordingly.


pkg/testutils/serverutils/test_server_shim.go line 62 at r42 (raw file):

)

// ShouldStartDefaultSQLServer determines whether a default SQL server (aka

Maybe ShouldStartTestTenant? and document as "... determines whether a test tenant should be started for test servers and clusters to serve SQL traffic by default."


pkg/testutils/serverutils/test_server_shim.go line 90 at r42 (raw file):

	}

	// We're starting the default SQL server (i.e. we're running this test in a

maybe update this accordingly to the above.


pkg/sql/logictest/testdata/logic_test/system line 927 at r42 (raw file):

# NB: the "order by" is necessary or this test is flaky under DistSQL.
# This is somewhat surprising.
# With probabilistic SQL server creation, we have to filter out

test tenant?


pkg/sql/logictest/testdata/logic_test/system line 929 at r42 (raw file):

# With probabilistic SQL server creation, we have to filter out
# kv.range_merge.queue_enabled, since it will only be set in cases
# where a SQL server is not allocated.

ditto


pkg/sql/logictest/testdata/logic_test/system line 964 at r42 (raw file):


# Have to exclude kv.range_merge.queue_enabled as it is not accessible
# to SQL servers.

ditto


pkg/ccl/changefeedccl/helpers_test.go line 345 at r42 (raw file):

		Knobs: knobs,
		// This test suite is already probabilistically running with
		// tenants. No need for the SQL server.

test tenant


pkg/cli/flags_test.go line 1278 at r42 (raw file):

                                  passing the content of the file via the --log flag.
                                  (default <unset>)
      --tenantMode string        tenantMode in which to run tests. Options are forceTenant, forceNoTenant, and default which alternates between tenant and no-tenant mode probabilistically. Note that the two force modes are ignored if the test is already forced to run in one of the two modes. (default "default")

This is a regression. You need to mark the new flag as hidden.

See the call to flag.VisitAll() in cli/flags.go, functioninit()

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an epic PR! I've reviewed the 1st and the 8th commits. A few minor comments are below; LGTM for the rest.

  • could you lift the literal "skipping due to lack of CCL binary" into a const; I counted 5 occurrences across different test files
  • I now wonder if we should revert the default mode to tenantModeForceNoTenant so that we don't end up breaking the build for external users without the CCL license? Instead, we could override it via the env. var in TeamCity.
  • Latest stress run [1] seems to have picked new test failures,
    • FAIL: TestConsumptionExternalStorage/export_in_a_tenant_increments_egress_bytes (0.06s)
    • FAIL: TestChangefeedAuthorization/cloud (1.28s)
    • FAIL: TestConsumptionExternalStorage/export_in_a_tenant_increments_egress_bytes
    • FAIL: TestCCLLogic/multiregion-9node-3region-3azs-no-los/multi_region_backup (4.53s)
    • FAIL: TestCCLLogic/multiregion-9node-3region-3azs-no-los/alter_table_locality (26.69s)
    • FAIL: //pkg/ccl/multitenantccl/tenantcostclient:tenantcostclient_test

[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_ScratchProjectPutTcExperimentsInHere_StressBazel/5453749

Reviewed 1 of 1 files at r1, 98 of 98 files at r34, 22 of 22 files at r35, 3 of 3 files at r36, 19 of 19 files at r37.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @knz, @RichardJCai, @srosenberg, @stevendanna, and @tbg)

Copy link
Copy Markdown
Collaborator Author

@ajstorm ajstorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @knz for your continuous reviews. I've made all of the changes you've requested. Somehow in all of the rebasing I've accidentally squashed the 8th and 9th commits together. Seeing as how 9 was the rename, I think it's probably OK, and have amended the 8th commit to include a comment around the rename. I've also added a new 9th commit (see below). Apologies for making this more complicated that it needs to be, but this is my first really large and complex PR, and I clearly need to evolve my rebasing technique.

@srosenberg thanks for having another look! I should have most of those test cases addressed now, but will also do one final run right after my last rebase to make sure nothing else broke in the interim.

As for the default mode, I'm wondering what guarantees we make to external users running our testing? Is there some assumption that the CCL tests will work in that environment? Or do we only run some subset of tests with BSL-based customers? I'm fine flipping the default and handling things later, but it implies that until we turn this on, there's the potential that this PR can rot. As long as you're OK with that, I'm good with it too.

New 9th commit has the literal for "skipping due to lack of CCL binary".

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @RichardJCai, @srosenberg, @stevendanna, and @tbg)


pkg/server/config.go line 183 at r42 (raw file):

Previously, knz (kena) wrote…

this one still needs an update.

Also the name of the configuration knob I think?

Done. Configuration knobs should be OK. It was COCKROACH_TEST_TENANT_MODE and tenantMode from the beginning.


pkg/server/testserver.go line 314 at r42 (raw file):

Previously, knz (kena) wrote…

this comment needs an update too?

Done.


pkg/server/testserver.go line 500 at r42 (raw file):

Previously, knz (kena) wrote…

ditto here and below

Done.


pkg/testutils/serverutils/test_cluster_shim.go line 206 at r42 (raw file):

Previously, knz (kena) wrote…

test tenant? and maybe update the method name accordingly.

Done.


pkg/testutils/serverutils/test_server_shim.go line 62 at r42 (raw file):

Previously, knz (kena) wrote…

Maybe ShouldStartTestTenant? and document as "... determines whether a test tenant should be started for test servers and clusters to serve SQL traffic by default."

Done.


pkg/testutils/serverutils/test_server_shim.go line 90 at r42 (raw file):

Previously, knz (kena) wrote…

maybe update this accordingly to the above.

Done.


pkg/sql/logictest/testdata/logic_test/system line 927 at r42 (raw file):

Previously, knz (kena) wrote…

test tenant?

Done


pkg/sql/logictest/testdata/logic_test/system line 929 at r42 (raw file):

Previously, knz (kena) wrote…

ditto

Done


pkg/sql/logictest/testdata/logic_test/system line 964 at r42 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


pkg/ccl/changefeedccl/helpers_test.go line 345 at r42 (raw file):

Previously, knz (kena) wrote…

test tenant

Done.


pkg/cli/flags_test.go line 1278 at r42 (raw file):

Previously, knz (kena) wrote…

This is a regression. You need to mark the new flag as hidden.

See the call to flag.VisitAll() in cli/flags.go, functioninit()

Great catch! I made a change in cli/flags.go but I'm not sure it's idiomatic. Can you have a look and let me know?

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: thanks

Apologies for making this more complicated that it needs to be, but this is my first really large and complex PR, and I clearly need to evolve my rebasing technique.

For this type of work, I would recommend the technique we've applied for #74301 and #82020 - split the commits into a sequence of branch tips in the cockroach repository, each associated with a PR against the previous tip (see e.g. #82069, #82070). This gives you a reviewable experience with just 1-2 commits per PR, with the benefit of CI on each intermediate step.

Reviewed 1 of 81 files at r46, 2 of 19 files at r50, 16 of 20 files at r51, 2 of 8 files at r52, 77 of 80 files at r54, 3 of 3 files at r55, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @RichardJCai, @srosenberg, @stevendanna, and @tbg)

ajstorm added 7 commits June 16, 2022 09:20
This commit introduces some infrastructure which will eventually be used to
probabilistically run tests in a tenant. The DisableDefaultSQLServer flag is
used by a test case to indicate that it can not be run in a default SQL Server
(i.e. test tenant). The DisableCreateTenant flag will eventually be a more
accurately named replacement for the Existing flag.

Release note: None
Disable bulk tests which fail when run under the default SQL server.  More
investigation is required for some of these, and is tracked with cockroachdb#76378.

Release note: None
Disable tests in CDC which fail when run probabilistically under the default
SQL server. Re-enabling these tests (if possible) is tracked with cockroachdb#76378.

Release note: None
…rver

Disable the KV/server tests which fail or hang when running probabilistically
with a default SQL server (aka tenant). Re-enabling the hanging/failing test
cases is tracked with cockroachdb#76378.

Release note: None
This commit lays some groundowrk for enabling probabilistic testing of the
default SQL server (aka tenant). This groundwork consists of the following:

- Allow for zone config manipulation and multi-region abstraction usage by
  secondary tenants in several of the MR test cases. This is required since
  once we enable probabilistic testing within SQL servers, the tests will run
  some portion of the time from the default SQL server, which is a secondary
  tenant.
- Change the backup target to nodelocal://0 as nodelocal://1 doesn't work from
  within the default SQL server.

Release note: None
Lay the SQL groundwork to probabilistically testing in a default SQL server
(aka a tenant).

Release note: None
Lay the groundwork in workload to probabilistically test with the default SQL
server (aka tenant).

Release note: None
Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, let's roll with the 0.5 default. This way, we'll get more coverage from local dev. environments.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @knz, @RichardJCai, @srosenberg, @stevendanna, and @tbg)

ajstorm added 2 commits June 16, 2022 15:19
Previously we had a suite of tests which were designed to run within tenants.
This provided some test coverage of tenants, but there was a desire for broader
coverage. This commit adds the notion of a default SQL server, which can be
used in any test which leverages testServer/testCluster to run that test within
a tenant environment.

By default, the default SQL server runs in all tests which are able to run in a
tenanted environment with probability 0.5. A given test can be forced to run
without the default SQL server by setting the DisableDefaultSQLServer parameter
of the base.TestServerArgs structure.

Additionally, this commit ads a tenantMode flag to the test infrastructure
which can force a test to run in tenant mode or force it to not run in tenant
mode.

The commit also contains test changes which rework the Existing flag. The
Existing flag used to be used by test cases to signal when a test SQL server
already existed, thus preventing it from being created a second time (which
would have resulted in a failure). This was problematic behavior for the
probabilistic testing change described above, because now a test SQL server
is being created far more frequently (and non-deterministically). As a result,
we've changed the Existing flag to DisableCreateTenant. This new flag should
be used only in cases where we want to test SQL server creation failure, or in
cases where multiple SQL servers are being created for the same tenant.
For all other cases, the code now first checks if the SQL server exists before
trying to create it. If the SQL server exists, the creation check is skipped.

As a final note, this commit also renames the earlier "SQL server" flags to be
"test tenant" flags, for clarity.

Release note: None
Small change to add constants for a couple of commonly used strings.

Release note: None
@ajstorm
Copy link
Copy Markdown
Collaborator Author

ajstorm commented Jun 17, 2022

TFTR all!

bors r=srosenberg,knz

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 17, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants