
multitenant: implement tenant upgrade interlock #94998

Merged
craig[bot] merged 1 commit into cockroachdb:master from ajstorm:ajstorm-version-interlock
Mar 19, 2023

Conversation

@ajstorm
Collaborator

@ajstorm ajstorm commented Jan 10, 2023

CRDB has an upgrade interlock between nodes to ensure that as a cluster is upgrading, the node that is driving the upgrade keeps all other nodes in sync (and prevents nodes with a down-level binary from joining the cluster). This interlock mechanism hasn't existed for tenant pods during a tenant upgrade until this commit.

The commit adds a similar interlock to the one used for nodes. When a tenant pod begins upgrading it will first confirm that all other running tenant pods are on an acceptable binary version and that the version for the attempted upgrade is less than the binary version of all tenant pods as well as greater than (or equal to) the minimum supported binary version. Then, it will begin to run migrations (upgrades). After each migration, it will push out the intermediate cluster version to all running tenant pods and revalidate that the upgrade can continue.
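The version checks described above can be sketched as a small validation function. This is a minimal, self-contained illustration, not CRDB's actual code; the `Version` type and `validateTargetVersion` are hypothetical stand-ins for the real `roachpb.Version` plumbing.

```go
package main

import "fmt"

// Version is a simplified stand-in for CRDB's roachpb.Version
// (major.minor only; the real type carries more components).
type Version struct {
	Major, Minor int
}

func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// validateTargetVersion mirrors the check described above: the target
// cluster version must be at least the minimum supported binary
// version, and must not exceed the binary version of any running pod.
func validateTargetVersion(target, minSupported Version, podBinaryVersions []Version) error {
	if target.Less(minSupported) {
		return fmt.Errorf("target version %v is below the minimum supported version %v",
			target, minSupported)
	}
	for _, bv := range podBinaryVersions {
		if bv.Less(target) {
			return fmt.Errorf("pod binary version %v is less than target version %v",
				bv, target)
		}
	}
	return nil
}

func main() {
	pods := []Version{{23, 1}, {23, 1}, {22, 2}} // one down-level pod
	err := validateTargetVersion(Version{23, 1}, Version{22, 2}, pods)
	fmt.Println(err) // the 22.2 pod blocks the upgrade to 23.1
}
```

In the real interlock this check runs before the first migration and again after each intermediate version bump.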

Epic: CRDB-18735

Release note: None

@ajstorm ajstorm requested review from a team, ajwerner, healthy-pod and knz January 10, 2023 14:50
@ajstorm ajstorm requested review from a team as code owners January 10, 2023 14:50
@ajstorm ajstorm requested a review from a team January 10, 2023 14:50
@ajstorm ajstorm requested a review from a team as a code owner January 10, 2023 14:50
@ajstorm ajstorm requested review from herkolategan and srosenberg and removed request for a team January 10, 2023 14:50

@ajstorm ajstorm removed request for a team, herkolategan and srosenberg January 10, 2023 14:50
@ajstorm
Collaborator Author

ajstorm commented Jan 10, 2023

Still have to rebase, but wanted to get this on people's radars as it could take a bit of time to review.

I tried breaking it up into smaller commits, but there didn't seem to be an easy way to do so. The bulk of the changes, however, are in testing, so at first it may look like a larger change than it actually is. The interlock logic is relatively small, and is mostly copied/massaged from the node interlock code (in pkg/upgrade/upgrademanager/manager.go).

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch from 1978b99 to 286e55e Compare January 10, 2023 14:59
@ajstorm
Collaborator Author

ajstorm commented Jan 10, 2023

Nevermind - rebase was straightforward. This is ready for review now.

@ajstorm ajstorm force-pushed the ajstorm-version-interlock branch 2 times, most recently from 9c72940 to d1264fb Compare January 16, 2023 22:11
Contributor

@ajwerner ajwerner left a comment


My concerns are around edge cases, but if we want to really rely on this, we should sort them out.

Reviewed 5 of 18 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @knz)


pkg/server/tenant_migration.go line 62 at r2 (raw file):

			"binary's minimum supported version %s",
			targetCV, tenantVersion.BinaryMinSupportedVersion())
		log.Warningf(ctx, "%s", msg)

I think the better thing to do here would be to construct the error using our error constructors which should do a good job with the redaction of the arguments and then pass the error to the logging function because transitively the redaction will be handled. Same for the next stanza.


pkg/server/tenant_migration.go line 69 at r2 (raw file):

		msg := fmt.Sprintf("sql pod %s is running a binary version %s which is "+
			"less than the attempted upgrade version %s",
			m.sqlServer.sqlIDContainer.SQLInstanceID().String(),

nit: don't call String here, it'll work against what I think you want to do regarding redaction.


pkg/upgrade/system_upgrade.go line 32 at r2 (raw file):

	// cluster. This is merely a convenience method and is not meant to be used
	// to infer cluster stability; for that, use UntilClusterStable.
	NumNodesOrTenantPods(ctx context.Context) (int, error)

I'm not up to date on the latest terminology. This is fine with me, but a bit of a mouthful. What about NumServers and ForEveryServer?

Code quote:

	// NumNodesOrTenantPods returns the number of nodes or tenant pods in the
	// cluster. This is merely a convenience method and is not meant to be used
	// to infer cluster stability; for that, use UntilClusterStable.
	NumNodesOrTenantPods(ctx context.Context) (int, error)

pkg/upgrade/upgradecluster/tenant_cluster.go line 132 at r2 (raw file):

	// Dialer allows for the construction of connections to other SQL pods.
	Dialer         NodeDialer
	InstanceReader *instancestorage.Reader

A delicate thing that I'm not so certain this PR deals with is the semantics of GetAllInstances. In particular, GetAllInstances uses a cache and can return stale results. What I think we want is to know the set of instances which might be live. I think this means we need a transactional mechanism to determine the set of instances. Even this isn't fundamentally safe.

0: podA starts up
1: podA begins migration protocol
2: podA determines it is the only pod, so it decides to bump the version gate
3: podB starts up and is using an old version (v1)
4: podB reads the cluster setting at v1
5: podA bumps the cluster version to v2
6: podA reads all the instances after the bump (assuming you've fixed the caching thing) and finds only itself
7: podB writes its instance entry and proceeds to serve traffic -- violating the invariant

I think that we need some way to synchronize the writing of the instance entry with the validation of the current version during startup.

One way we could deal with this would be to read the version from KV again after writing the instance row and making sure it's kosher. It's an extra RPC, but with the MR plans, it shouldn't be too painful.
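The startup-side re-check described here can be modeled in a few lines. This is a toy in-memory sketch under stated assumptions, not CRDB code: `cluster`, `startPod`, and the integer versions are hypothetical stand-ins for the instances table and the KV-stored cluster version.

```go
package main

import (
	"fmt"
	"sync"
)

type cluster struct {
	mu             sync.Mutex
	clusterVersion int
	instances      map[string]int // instance ID -> binary version
}

func (c *cluster) startPod(id string, binaryVersion int) error {
	c.mu.Lock()
	c.instances[id] = binaryVersion // write the instance row
	c.mu.Unlock()

	// Re-read the version *after* the instance row is visible (the
	// "extra RPC" above). If an upgrade raced past us, remove our row
	// and die instead of serving traffic.
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.clusterVersion > binaryVersion {
		delete(c.instances, id)
		return fmt.Errorf("pod %s: cluster at v%d, binary only supports v%d",
			id, c.clusterVersion, binaryVersion)
	}
	return nil
}

func main() {
	c := &cluster{instances: map[string]int{}}
	c.clusterVersion = 2 // podA already bumped the version
	fmt.Println(c.startPod("podB", 1)) // the re-check catches the race
}
```

The key ordering property is that the version read happens after the instance write is visible, so the upgrading pod either sees podB's row or podB sees the bumped version.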


pkg/upgrade/upgradecluster/tenant_cluster.go line 190 at r2 (raw file):

			conn, err := t.Dialer.Dial(ctx, roachpb.NodeID(id), rpc.DefaultClass)
			if err != nil {
				if strings.Contains(err.Error(), "failed to connect") {

this is suspicious to me, is there no more structured way to get at this condition? Do you have a test which exercises this path so we can go and poke at the error structure?

@ajwerner
Contributor

I was speaking with @jeffswenson about a similar set of tenant upgrades. Synchronizing pods and versions was relevant. He had the good idea that one way we can massively simplify this interlock is if we read the version in the code that writes the instance row and we read the instances in the transaction which writes the new version to the setting. It's much simpler.

@jeffswenson is going to be writing a helper to read the version with a *kv.Txn.

Concretely:

  1. Confirm that the version is new enough in the same transaction that writes the instance table.
    • This will require using the above mentioned helper.
    • If the version is newer than your binary version, die.
  2. Run the bumpClusterVersion as you would for the fence version, but have it keep track of the set of instances it communicated with.
  3. Write the fence version to the cluster setting (at least for the first fence version of a migration, there's some nuance with version skipping). In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same, go to step 2.
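The step 2/3 loop above can be modeled as a toy simulation: keep bumping instances until the set read in the fence-writing "transaction" matches the set that was bumped. `registry` and `writeFence` are hypothetical names; the real implementation would do the re-read inside the transaction that writes the fence version.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

type registry struct{ instances map[string]bool }

func (r *registry) list() []string {
	out := make([]string, 0, len(r.instances))
	for id := range r.instances {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

// writeFence retries until no new instance joined between the bump
// (step 2) and the fence write (step 3).
func writeFence(r *registry, bump func(ids []string)) []string {
	for {
		bumped := r.list()
		bump(bumped) // step 2: push the fence version to these instances
		// step 3: in the txn that writes the fence, re-read the set.
		if reflect.DeepEqual(r.list(), bumped) {
			return bumped // stable set: the fence lands safely
		}
		// a new instance appeared in between; go back to step 2
	}
}

func main() {
	r := &registry{instances: map[string]bool{"podA": true}}
	joined := false
	final := writeFence(r, func(ids []string) {
		if !joined { // simulate podB joining after the first bump
			r.instances["podB"] = true
			joined = true
		}
	})
	fmt.Println(final) // both pods were bumped before the fence landed
}
```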

@ajstorm
Collaborator Author

ajstorm commented Jan 23, 2023

Clever idea. For this part:

In this transaction which writes that fence, read the set of instances which you bumped in the previous step. If the set of instances is the same, you're done. If it's not the same, go to step 2.

Instead of re-reading, could we not rely on read-set validation to do this for us? Revised protocol would look like this:

  1. Confirm that the version is new enough in the same transaction that writes the instance table.
    • This will require using the above mentioned helper.
    • If the version is newer than your binary version, die.
  2. Run the bumpClusterVersion as you would for the fence version , but have it keep track of the set of instances it communicated with.
  3. Re-read all instances of this tenant from the instances table. In the same transaction, write the fence version to the cluster setting (at least for the first fence version of a migration; there's some nuance with version skipping). Commit the transaction. If it succeeds, you're done. If you get a serializability error and the transaction aborts, go to step 2.

@ajwerner
Contributor

What if between step 2 and step 3, a new instance writes to the instances table?

@ajwerner
Contributor

To clarify the question given the async nature of this exchange. If a new instance writes itself to the table, it will see the previous version and it will succeed (we haven't bumped the stored version to the fence version until step 3). Yes, in step 3 we'll see this new instance, but we won't know that it's new (right?) and so we'll just say that everything is fine. It was to address this that I suggested that we remember the set of instances which we had bumped.

Contributor

@knz knz left a comment


Reviewed 8 of 8 files at r13, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/cmd/roachtest/tests/multitenant_upgrade.go line 214 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Hmm, interesting thought. Last I checked, multitenant-upgrade was failing (this PR is out to address it: #98198). While I'd like to make the change you're suggesting in this PR, I don't think that'll be possible without first merging 98198. After that merges, I'm happy to revisit this. Does that work for you?

Yes, but let's have an issue open to track this followup work.


pkg/sql/sqlinstance/instancestorage/instancereader.go line 193 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

It's a good point, but it may be a bit more complex that what you're suggesting. In getInstanceRows we don't query via SQL, but rather, go directly to the KV batch and read from there. I'd either have to do the manual filtering in that function, or do it in the caller. Since there was only one caller that had to do filtering at the time, I decided to do it in the caller. If you're insistent upon the new function I can build it, but if not, I'd prefer to leave this as-is for now.

The instance ID is the primary key. This means that you can narrow down the scan in the kv batch by encoding the ID at the end of the prefix (And turn the scan into a get).

The problem I'm trying to avoid is that if we have lots of entries in the table (or, stale entries) this is going to waste time and memory.

I'm not considering this blocking here, but I do recommend this to be optimized in a later stage so you can file an issue and refer to it here.
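The suggested optimization can be sketched with toy key encoding: appending the instance ID's encoding to the table prefix yields a single key, so the KV read becomes a point Get rather than a prefix Scan. The fixed-width big-endian encoding below is a stand-in for CRDB's ordered key encoding; the function name is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// instanceKey narrows a table-wide prefix down to the single key for
// one instance ID, turning a Scan into a Get.
func instanceKey(tablePrefix []byte, instanceID uint32) []byte {
	k := append([]byte{}, tablePrefix...)
	return append(k,
		byte(instanceID>>24), byte(instanceID>>16),
		byte(instanceID>>8), byte(instanceID))
}

func main() {
	prefix := []byte("/Tenant/10/Table/sql_instances/")
	k := instanceKey(prefix, 3)
	fmt.Println(bytes.HasPrefix(k, prefix), len(k)-len(prefix)) // true 4
}
```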


pkg/sql/sqlinstance/instancestorage/instancereader.go line 237 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Yes. That would be a good optimization. Any objections to this being done in a follow-on PR?

No objection but let's have an issue to track this.


pkg/upgrade/upgradecluster/tenant_cluster.go line 197 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Modified.

ok, this raises a followup question: what happens when an instance terminates but does not clean up its instance record? Do version upgrades become completely blocked after that point? What would be the recommended action to repair from that point onwards?


pkg/upgrade/upgradecluster/tenant_cluster.go line 207 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

If a new instance is added in-between, we need to retry. See the comment above:

... If new SQL servers have joined in the
// interim, we must revalidate that their binary versions are sufficiently
// up-to-date to continue with the upgrade process.

I see.


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Yes, this should be covered in the call to StartWithCtx

// StartWithCtx returns a new Retry initialized to some default values. The
// Retry can then be used in an exponential-backoff retry loop. If the provided
// context is canceled (see Context.Done), the retry loop ends early, but will
// always run at least once.

I know this about retry but this is not my question.

My question is, does this loop properly terminate when the server is requested to shut down via the stopper. Depending on which ctx is used here, it may or may not be cancelled upon quiescing the stopper. This is what we should check.


pkg/upgrade/upgrademanager/manager.go line 621 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Done.

It's the other way around.

   err = errors.HandledWithMessage(err, "validate cluster version failed: some tenant pods running on binary less than 23.1")

pkg/upgrade/upgrademanager/manager.go line 409 at r13 (raw file):

	// to every server (SQL server in the case of secondary tenants, or
	// combined KV/Storage server in the case of the storage layer) in the
	// cluster. Each node (or pod) will persist the version, bump the local

The point of my previous request was to enable you to replace all instances of "node (or pod)" with just "server" throughout the rest of this explanation.


pkg/ccl/kvccl/kvtenantccl/tenant_upgrade_test.go line 535 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

FWIW, Andrew has already reviewed this test in its entirety. I'm a bit reluctant to refactor it for a second review, especially since I'm hoping to get this in before Tuesday's branch cut. Are you of the opinion that enough has changed in this commit to necessitate a full rereview?

I am not able to review this test in its current form. But I trust Andrew's review so you can go by that.


pkg/ccl/kvccl/kvtenantccl/tenant_upgrade_test.go line 548 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

I agree, it's not great. There is a flag which allows you to bypass the circuit breaker, but we'd have to create a new testing knob for the upgrade path and plumb things through to all the RPC calls it makes (unless I'm missing a more obvious path).

why not disable the circuit breaker throughout the entire test?

On the one hand i agree this is more work so it shouldn't block the PR.
On the other hand I am expecting the choice made here to cause hard-to-troubleshoot flakiness in the future. At the very least we should have a follow-up issue and invest into this to strengthen the test.


pkg/ccl/serverccl/server_startup_guardrails_test.go line 103 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

In my testing, the predefined stopper is only populated on success. On failure, there could still be work to do to stop the tenant, and this stopper allows us to perform that cleanup.

That seems like a bug. If you can confirm it, it's serious as it would affect our UA story.
So if confirmed please file an issue that refers to this test and calls for an investigation.
(And add a link to your issue in a comment here, so that folk do not get surprised when they see your choice)


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

There are a bunch of ways the upgrade can fail. Is there a prescribed method to test for all of them?

require.Error would replace the code you've written already, by checking that the error message contains the string you expect.

Collaborator Author

@ajstorm ajstorm left a comment


Addressed this round. RFAL. Still working on @healthy-pod's addition.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/cmd/roachtest/tests/multitenant_upgrade.go line 214 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Yes, but let's have an issue open to track this followup work.

#98869


pkg/sql/sqlinstance/instancestorage/instancereader.go line 193 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

The instance ID is the primary key. This means that you can narrow down the scan in the kv batch by encoding the ID at the end of the prefix (And turn the scan into a get).

The problem I'm trying to avoid is that if we have lots of entries in the table (or, stale entries) this is going to waste time and memory.

I'm not considering this blocking here, but I do recommend this to be optimized in a later stage so you can file an issue and refer to it here.

#98869


pkg/sql/sqlinstance/instancestorage/instancereader.go line 237 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

No objection but let's have an issue to track this.

#98869


pkg/upgrade/upgradecluster/tenant_cluster.go line 197 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ok, this raises a followup question: what happens when an instance terminates but does not clean up its instance record? Do version upgrades become completely blocked after that point? What would be the recommended action to repair from that point onwards?

Yeah, it was something I was going to raise after this PR got merged. They're not completely blocked, but they'll take 10 minutes to remove failed SQL servers before they can be reattempted. This may lengthen the time required to perform upgrades in Serverless, where I'm assuming a retry loop is present. The recommended action at this point is to wait. In theory, customers could also set the cluster settings to reduce the session timeout, but that may have other complications.


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I know this about retry but this is not my question.

My question is, does this loop properly terminate when the server is requested to shut down via the stopper. Depending on which ctx is used here, it may or may not be cancelled upon quiescing the stopper. This is what we should check.

Are you suggesting that we validate this with a new test? If so, are you thinking that this needs validation before this PR merges?


pkg/upgrade/upgrademanager/manager.go line 621 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

It's the other way around.

   err = errors.HandledWithMessage(err, "validate cluster version failed: some tenant pods running on binary less than 23.1")

Done.


pkg/upgrade/upgrademanager/manager.go line 409 at r13 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

The point of my previous request was to enable you to replace all instances of "node (or pod)" by just "server' throughout the rest of this explanation.

Done


pkg/ccl/kvccl/kvtenantccl/tenant_upgrade_test.go line 535 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I am not able to review this test in its current form. But I trust Andrew's review so you can go by that.

Thanks.


pkg/ccl/kvccl/kvtenantccl/tenant_upgrade_test.go line 548 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

why not disable the circuit breaker throughout the entire test?

On the one hand i agree this is more work so it shouldn't block the PR.
On the other hand I am expecting the choice made here to cause hard-to-troubleshoot flakiness in the future. At the very least we should have a follow-up issue and invest into this to strengthen the test.

#98869


pkg/ccl/serverccl/server_startup_guardrails_test.go line 103 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

That seems like a bug. If you can confirm it, it's serious as it would affect our UA story.
So if confirmed please file an issue that refers to this test and calls for an investigation.
(And add a link to your issue in a comment here, so that folk do not get surprised when they see your choice)

It's definitely an issue - I tested it yesterday. Annotated with the issue: #98868


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

require.Error would replace the code you've written already, by checking that the error message contains the string you expect.

Right, but in this case we want to accept all errors except this one. Is there a require function for that?

Collaborator Author

@ajstorm ajstorm left a comment


Third commit is now there for @healthy-pod's changes. multitenant-upgrade should now be clean 🎉.

Ready for another (hopefully final) look. 🤞

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)

Contributor

@knz knz left a comment


Reviewed 2 of 2 files at r14.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/upgrade/upgradecluster/tenant_cluster.go line 197 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Yeah, it was something I was going to raise after this PR got merged. They're not completely blocked, but they'll take 10 minutes to remove failed SQL servers before they can be reattempted. This may lengthen the time required to perform upgrades in Serverless, where I'm assuming a retry loop is present. The recommended action at this point is to wait. In theory, customers could also set the cluster settings to reduce the session timeout, but that may have other complications.

Could you propagate this explanation in the commit message. I think it's an important callout.


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Are you suggesting that we validate this with a new test? If so, are you thinking that this needs validation before this PR merges?

I do not think we need an automated test before it merges, but i'd like us to do a manual test.

This is similar to a change David made yesterday in a related area (retrying the allocation of instance ID forever until it succeeds). My recommended strategy for the test was to:

  1. modify the code here with something inside the loop that does "if current instance ID == 3, then print a dummy log message and loop forever"
  2. start a 3 or 5-node cluster with roachprod create local
  3. use the create tenant / alter tenant start service shared
  4. run roachprod sql -d cluster:tenantname to connect to a specific tenant and issue the version upgrade manually
  5. verify visually in logs that the dummy message is produced and the loop is indeed stuck
  6. run roachprod stop --sig=15 on the problem node to request a graceful shutdown
  7. verify visually that the graceful shutdown completes

pkg/ccl/serverccl/server_startup_guardrails_test.go line 103 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

It's definitely an issue - I tested it yesterday. Annotated with the issue: #98868

Thanks


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Right, but in this case we want to accept all errors except this one. Is there a require function for that?

if that is what you want the code to do, then i believe there's a mistake in the code. That if !contains(...) will fail the test in the opposite condition. Or are my eyes failing me?

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I do not think we need an automated test before it merges, but i'd like us to do a manual test.

This is similar to a change David made yesterday in a related area (retrying the allocation of instance ID forever until it succeeds). My recommended strategy for the test was to:

  1. modify the code here with something inside the loop that does "if current instance ID == 3, then print a dummy log message and loop forever"
  2. start a 3 or 5-node cluster with roachprod create local
  3. use the create tenant / alter tenant start service shared
  4. run roachprod sql -d cluster:tenantname to connect to a specific tenant and issue the version upgrade manually
  5. verify visually in logs that the dummy message is produced and the loop is indeed stuck
  6. run roachprod stop --sig=15 on the problem node to request a graceful shutdown
  7. verify visually that the graceful shutdown completes

(The alternative is to path through the chain of calls and build confidence that this ctx gets, indeed, canceled on server shutdown. I tried and it wasn't obvious)

@knz
Contributor

knz commented Mar 17, 2023

fwiw david did add the ShouldQuiesce check in his loop to make it work https://github.com/cockroachdb/cockroach/pull/98716/files

Collaborator Author

@ajstorm ajstorm left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/upgrade/upgradecluster/tenant_cluster.go line 197 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Could you propagate this explanation in the commit message. I think it's an important callout.

Done


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

(The alternative is to path through the chain of calls and build confidence that this ctx gets, indeed, canceled on server shutdown. I tried and it wasn't obvious)

To be clear, are you suggesting that I should check to see if the SQL server's instance is quiesced? Or is it sufficient to see if the ctx has been marked Done? If it's the former, I'll need to do a bit more plumbing, as I don't have access to the SQL server's instance from in here.


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

if that is what you want the code to do, then i believe there's a mistake in the code. That if !contains(...) will fail the test in the opposite condition. Or are my eyes failing me?

Sorry, my earlier comment was incorrect. We want to accept this error and not all others. In that light wouldn't require.Error do the wrong thing here?

Contributor

@knz knz left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

To be clear, are you suggesting that I should check to see if the SQL server's instance is quiesced? Or is it sufficient to see if the ctx has been marked Done? If it's the former, I'll need to do a bit more plumbing, as I don't have access to the SQL server's instance from in here.

I am asking to check that this loop indeed exits when the process that runs it is asked to shut down gracefully. Both standalone SQL pods and embedded secondary tenant SQL servers are subject to the same shutdown / quiescence logic, so if it works for one it will work for the other.

Right now, I do not have confidence that this loop won't keep forever running and prevent a graceful shutdown. Because i don't see logic through which the ctx would get canceled in that case.


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Sorry, my earlier comment was incorrect. We want to accept this error and not all others. In that light wouldn't require.Error do the wrong thing here?

ok then you can use require.ErrorContains https://github.com/stretchr/testify/blob/master/require/require.go#L328

Collaborator Author

@ajstorm ajstorm left a comment


All comments addressed. RFAL.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I am asking to check that this loop indeed exits when the process that runs it is asked to shut down gracefully. Both standalone SQL pods and embedded secondary tenant SQL servers are subject to the same shutdown / quiescence logic, so if it works for one it will work for the other.

Right now, I do not have confidence that this loop won't keep forever running and prevent a graceful shutdown. Because i don't see logic through which the ctx would get canceled in that case.

OK, testing here seems to be OK. I did a modified version of what you suggested above:

  1. Modified the loop so that it just retried forever and every time through the loop printed "retrying forever" to the logs.
  2. Modified TestTenantUpgrade so that it created a 3 node instead of a 1 node cluster.
  3. Added a goroutine in TestTenantUpgrade right before performing the upgrade which logs its presence, sleeps for 15 seconds, and then performs one of a few actions {stop the whole cluster, stop the node on which the tenant is running, stop the tenant (using a custom stopper), close the DB connection in which the upgrade is running}

The findings are interesting. If we stop the cluster, node or the tenant, the loop exits. I think this is the behaviour that you were worried about, so I think the code is good as-is.

In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure if this is bad or good, but it does indicate that my earlier comment is true - that without that custom stopper on SQL server creation, the SQL server instance won't get shut down.

Either way, I think as far as this PR is concerned, we don't have an issue here.


pkg/ccl/serverccl/tenant_migration_test.go line 239 at r12 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ok then you can use require.ErrorContains https://github.com/stretchr/testify/blob/master/require/require.go#L328

Done.

Collaborator Author

@ajstorm ajstorm left a comment


Last force-push was just a rebase.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)

Contributor

@knz knz left a comment


Reviewed 7 of 20 files at r21, 10 of 12 files at r22, 2 of 2 files at r23, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @healthy-pod)


pkg/upgrade/upgradecluster/tenant_cluster.go line 276 at r12 (raw file):

In the case where we just stop the connection that the upgrade is running in, we loop forever. I'm not sure if this is bad or good

I believe this is expected -- once the order is given to start upgrading, the cluster will continue to attempt to perform the upgrade until it can. This has always been the case. We could build a mechanism to interrupt that process if we find it important.

CRDB has an upgrade interlock between nodes to ensure that as a cluster is
upgrading, the node that is driving the upgrade keeps all other nodes in sync
(and prevents nodes with a down-level binary from joining the cluster). This
interlock mechanism hasn't existed for tenant pods during a tenant upgrade
until this commit.

The commit adds, to SQL servers, a similar interlock to the one used for nodes.
When a tenant pod begins upgrading it will first confirm that all other running
tenant pods are on an acceptable binary version and that the version for the
attempted upgrade is less than the binary version of all tenant pods, as well
as greater than (or equal to) the minimum supported binary version. Then, it
will begin to run migrations (upgrades). After each migration, it will push out
the intermediate cluster version to all running tenant pods and revalidate that
the upgrade can continue.

It's worth noting that if a SQL server fails while we're in the process of
upgrading (or shortly before), the upgrade process will see the failed SQL
server instance, be unable to contact it via RPC, and fail the tenant upgrade.
The workaround for this problem is to wait until the SQL server is cleaned up
(by default, 10 minutes after it fails) and retry the tenant upgrade.

Epic: CRDB-18735

Release note: None
@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

TFTR all!

bors r=knz,ajwerner

@knz
Contributor

knz commented Mar 19, 2023

i believe you meant

bors r=knz,ajwerner

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@healthy-pod
Contributor

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

Should be good now that #98894 is in the queue.

bors retry

@craig
Contributor

craig bot commented Mar 19, 2023

Already running a review

@craig
Contributor

craig bot commented Mar 19, 2023

Build failed:

@ajstorm
Collaborator Author

ajstorm commented Mar 19, 2023

Gonna give it one last try now that #98894 made it in.

bors retry

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 19, 2023

Build succeeded:
