multitenant: cannot query after stopping a server in a unit test #107499

@lidorcarmel

In a unit test, we stop a server along with its shared-process tenant, then restart the shared-process tenant on a different server. Querying the restarted tenant then fails intermittently.

The test that originally attempted this is the following c2c (cluster-to-cluster replication) test: https://github.com/cockroachdb/cockroach/blob/bcf1f9c17b38ea1bc3c2996bffdbc715980e48c9/pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go#L463

Slack thread: https://cockroachlabs.slack.com/archives/C02HWA24541/p1690223205316979

Example:

func TestMultiTenantStopServer(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)
	ctx := context.Background()

	// 1. Start a test cluster:
	serverArgs := base.TestServerArgs{
		DefaultTestTenant: base.TODOTestTenantDisabled,
	}
	c := testcluster.StartTestCluster(t, 4, base.TestClusterArgs{ServerArgs: serverArgs})
	defer c.Stopper().Stop(ctx)

	// 2. Start a tenant:
	tenantArgs := base.TestSharedProcessTenantArgs{
		TenantName: "mytenant",
		TenantID:   roachpb.MustMakeTenantID(2),
	}
	tenantServer, tenantConn := serverutils.StartSharedProcessTenant(t, c.Server(0), tenantArgs)
	testutils.SucceedsSoon(t, func() error {
		return tenantConn.Ping()
	})
	sysSQL := sqlutils.MakeSQLRunner(c.ServerConn(0))
	tenantSQL := sqlutils.MakeSQLRunner(tenantConn)

	// 3. Write stuff:
	numRanges := 50
	rowsPerRange := 20
	sysSQL.Exec(t, `ALTER TENANT mytenant SET CLUSTER SETTING sql.split_at.allow_for_secondary_tenant.enabled=true`)
	sysSQL.Exec(t, `ALTER TENANT mytenant SET CLUSTER SETTING sql.scatter.allow_for_secondary_tenant.enabled=true`)
	tenantSQL.Exec(t, "CREATE DATABASE d")
	tenantSQL.Exec(t, "CREATE TABLE d.scattered (key INT PRIMARY KEY)")
	tenantSQL.Exec(t, "INSERT INTO d.scattered (key) SELECT * FROM generate_series(1, $1)",
		numRanges*rowsPerRange)
	tenantSQL.Exec(t, "ALTER TABLE d.scattered SPLIT AT (SELECT * FROM generate_series($1::INT, $2::INT, $3::INT))",
		rowsPerRange, (numRanges-1)*rowsPerRange, rowsPerRange)
	tenantSQL.Exec(t, "ALTER TABLE d.scattered SCATTER")

	// 4. Verify we can read it:
	_ = tenantSQL.QueryStr(t, "SELECT * FROM d.scattered")

	// 5. Stop the tenant service and its server, then stop node 1 (server index 0).
	sysSQL.Exec(t, `ALTER TENANT mytenant STOP SERVICE`)
	tenantServer.Stopper().Stop(ctx)
	c.StopServer(0)

	// 6. Start the shared process tenant again:
	_, alternateTenantConn := serverutils.StartSharedProcessTenant(t, c.Server(1),
		base.TestSharedProcessTenantArgs{
			TenantName: "mytenant",
			TenantID:   roachpb.MustMakeTenantID(2),
		})
	defer alternateTenantConn.Close()
	alternateTenantSQL := sqlutils.MakeSQLRunner(alternateTenantConn)

	// 7. Try to read again, which sometimes fails with:
	// replication_stream_e2e_test.go:518: error executing 'SELECT * FROM d.scattered': pq: failed to connect to n1 at
	// 127.0.0.1:52291: grpc: connection error: desc = "transport: error while dialing: connection interrupted (did the
	// remote node shut down or are there networking issues?)" [code 14/Unavailable]
	_ = alternateTenantSQL.QueryStr(t, "SELECT * FROM d.scattered")
}

Jira issue: CRDB-30082

Labels

A-multitenancy: Related to multi-tenancy
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-multitenant: Issues owned by the multi-tenant virtual team
