roachprod: provide workaround for long-running AWS clusters #98682
Merged
craig[bot] merged 1 commit into cockroachdb:master on Mar 16, 2023
In cockroachdb#98076, we started validating hostnames before running any commands, to avoid situations where a stale cache could lead to unintended interference with other clusters due to public IP reuse. The check relies on the VM's `hostname` matching the expected cluster name in the cache. GCP and Azure clusters set the hostname to the instance name by default, but AWS does not; the aforementioned PR explicitly sets the hostname when the instance is created.

However, for long-running AWS clusters (created before host validation was introduced) or clusters created with an outdated version of `roachprod`, the hostname will still be the default AWS hostname, and any interaction with that cluster will fail when using a recent `roachprod` version.

To remedy this situation, this commit includes:

* better error reporting. When we attempt to run a command on an AWS cluster and host validation fails, we display a message to the user explaining that their hostnames may need fixing.
* a new `fix-long-running-aws-hostnames` command. If the user confirms that the cluster still exists (by running `roachprod list`), they can run this command to automatically fix the hostnames to the expected value.

This is a temporary workaround: the commit will be reverted once we no longer have clusters created with the default hostnames. That will be easier to achieve once we have an easy way for everyone to upgrade their `roachprod` (see cockroachdb#97311).

Epic: none
Release note: None
srosenberg (Member) approved these changes on Mar 16, 2023
Author:
bors r=srosenberg
TFTR!

Contributor:
Build succeeded.