roachprod: support multiple local clusters#71970
craig[bot] merged 10 commits into cockroachdb:master from …
Conversation
Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, 5 of 5 files at r4, 5 of 5 files at r5, 8 of 8 files at r6, 14 of 14 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @RaduBerinde, and @tbg)
a discussion (no related file):
The ability to have multiple local clusters seems very useful, thanks! I like the cleanups and think that a one-time manual cleanup of local clusters isn't too high a cost.
I left some nit-pick comments, but feel free to ignore them.
I'd like to give it a spin locally before approving.
pkg/cmd/roachprod/clusters_cache.go, line 56 at r4 (raw file):
filename := clusterFilename(c.Name)
return os.WriteFile(filename, []byte(b.String()), 0666)
[nit] `bytes.Buffer` has a `Bytes()` method you can use instead of `[]byte(b.String())`.
Any reason we need 0666 vs 0660 here?
pkg/cmd/roachprod/clusters_cache.go, line 71 at r4 (raw file):
if err := dec.Decode(c); err != nil { return nil, err }
[nit] since you are already reading in the entire file, you could do (warning, code written in comment box):
data, err := os.ReadFile(filename)
if err != nil {
return nil, err
}
c := &cloud.Cluster{}
if err := json.Unmarshal(data, c); err != nil {
return nil, err
}
pkg/cmd/roachprod/clusters_cache.go, line 101 at r4 (raw file):
if len(c.VMs) == 0 {
	return errors.Errorf("found no VMs in %s", clusterFilename(name))
}
[nit] Maybe move this up above the initialisation of sc
pkg/cmd/roachprod/clusters_cache.go, line 128 at r4 (raw file):
// lock acquired by syncAll, but other processes may be reading the host
// files concurrently so we need to write the files atomically by writing to
// a temporary file and renaming.
I'm not seeing where we do this atomic rename. Seems like it would make sense for saveCluster to just always use an atomic writing strategy.
pkg/cmd/roachprod/clusters_cache.go, line 164 at r4 (raw file):
var result []string
cd := os.ExpandEnv(config.ClustersDir)
files, err := ioutil.ReadDir(cd)
[nit] You can probably get away with os.ReadDir here. It almost surely doesn't matter here, but for larger directories it saves on some lstat calls. If you do convert it, I believe file.Mode().IsRegular() below would need to change to file.Type().IsRegular().
pkg/cmd/roachprod/config/config.go, line 78 at r7 (raw file):
	return true
}
return len(clusterName) >= 2 && strings.HasPrefix(clusterName, "-")
[nit] Code is fine as-is, but perhaps something like this would be slightly clearer:
if clusterName == Local {
return true
}
if !strings.HasPrefix(clusterName, LocalPrefix) {
return false
}
clusterName = strings.TrimPrefix(clusterName, LocalPrefix)
return clusterName != ""
pkg/cmd/roachprod/install/cluster_synced.go, line 246 at r7 (raw file):
}
for _, dir := range dirs {
	cmd += fmt.Sprintf(`rm -fr %s/%s`, c.localVMDir(c.Nodes[i]), dir)
I'm a scaredy cat, so I would maybe add a comment to the VMDir function to remind us that its output is used in `rm -rf` statements, so it should be careful not to return "" or "/".
pkg/cmd/roachprod/install/download.go, line 79 at r7 (raw file):
// the download node to the other nodes.
if c.IsLocal() && !filepath.IsAbs(dest) {
	// Eg ~/local/1/./bar.txt or ~/local/local-foo/1/./bar.txt
Is this comment correct? I think the second example should be ~/local/foo-1/./bar.txt or have I misread?
a discussion (no related file):
TFTR! I appreciate you giving it a spin! Hold off on it, I found a problem - we are no longer wiping the local cluster before destroying, will update when it's fixed.
Force-pushed from 791b583 to 4dac71f
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @RaduBerinde, @stevendanna, and @tbg)
a discussion (no related file):
Previously, RaduBerinde wrote…
TFTR! I appreciate you giving it a spin! Hold off on it, I found a problem - we are no longer wiping the local cluster before destroying, will update when it's fixed.
Ok, I fixed that issue (in the "clean up local cluster deletion" commit).
pkg/cmd/roachprod/clusters_cache.go, line 56 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] `bytes.Buffer` has a `Bytes()` method you can use instead of `[]byte(b.String())`.
Any reason we need 0666 vs 0660 here?
Just went with what the old code did, changed to 0660 now.
pkg/cmd/roachprod/clusters_cache.go, line 101 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] Maybe move this up above the initialisation of `sc`
Done.
pkg/cmd/roachprod/clusters_cache.go, line 128 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I'm not seeing where we do this atomic rename. Seems like it would make sense for `saveCluster` to just always use an atomic writing strategy.
Done.
pkg/cmd/roachprod/clusters_cache.go, line 164 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] You can probably get away with `os.ReadDir` here. It almost surely doesn't matter here, but for larger directories it saves on some lstat calls. If you do convert it, I believe `file.Mode().IsRegular()` below would need to change to `file.Type().IsRegular()`.
Done.
pkg/cmd/roachprod/config/config.go, line 78 at r7 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] Code is fine as-is, but perhaps something like this would be slightly clearer:
if clusterName == Local {
	return true
}
if !strings.HasPrefix(clusterName, LocalPrefix) {
	return false
}
clusterName = strings.TrimPrefix(clusterName, LocalPrefix)
return clusterName != ""
That code is not quite right (it's not dealing with the dash). I switched to a regex and added a test.
pkg/cmd/roachprod/install/cluster_synced.go, line 246 at r7 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I'm a scaredy cat so I would maybe add a comment to the VMDir function to remind us that its output is used in `rm -rf` statements so it should be careful to not return "" or "/".
Done.
pkg/cmd/roachprod/install/download.go, line 79 at r7 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Is this comment correct? I think the second example should be `~/local/foo-1/./bar.txt` or have I misread?
Yeah, it was stale, I removed it altogether (not very helpful, and not sure what the . in there was about).
Force-pushed from 4dac71f to dadb30e
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @stevendanna, and @tbg)
pkg/cmd/roachprod/clusters_cache.go, line 71 at r4 (raw file):
Previously, stevendanna (Steven Danna) wrote…
[nit] since you are already reading in the entire file, you could do (warning, code written in comment box):
data, err := os.ReadFile(filename)
if err != nil {
	return nil, err
}
c := &cloud.Cluster{}
if err := json.Unmarshal(data, c); err != nil {
	return nil, err
}
Done.
Force-pushed from dadb30e to 74405db
This still has problems. When stopping a local cluster, we kill processes for all local clusters. There were also some acceptance test failures, so something else must be off on top of that.
stevendanna
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @stevendanna, and @tbg)
pkg/cmd/roachprod/config/config.go, line 78 at r7 (raw file):
Previously, RaduBerinde wrote…
That code is not quite right (it's not dealing with the dash). I switched to a regex and added a test.
Ah, I dropped the line where I defined `LocalPrefix`, which would have dealt with the dash - sorry for the confusion.
Force-pushed from 74405db to 49b6f3e
Force-pushed from 49b6f3e to 5f4e516
Updated. I had to add the cluster name to the `ROACHPROD` variable.
I'm slightly worried about the impact of changing the `ROACHPROD=` convention - if you have a running (non-local) cluster, after you update roachprod, the stop command will no longer work. I could probably make it so that only local clusters use the new convention, but it would make things even more messy.
Just a drive by, but I wouldn't let that stand in the way of progress. But if you wanted to, you could make sure that "new roachprod" will simply refuse to do anything on clusters created by "old roachprod" (except …)
Interesting thought, I'll have to think about how we could detect that. Might need to add some piece of metadata (with support for all cloud providers).
Apologies for throwing something else on your plate, but I have also noticed that roachprod does not tolerate mixing projects. If in one shell you use GCE_PROJECT=a and in another one GCE_PROJECT=b, then roachprod sync on one shell will wipe out the clusters of the other shell (and this causes problems in practice when you're repro'ing a roachtest in one shell where you likely use …)
pkg/cmd/roachprod/install/cluster_synced.go, line 204 at r10 (raw file):
Random question, but what do these parens around the value do?
Hm.. Interesting. This is a great time to bring this up (we can minimize disruption if we take care of it now). But let me make sure I understand exactly what you are seeing. So the way it works is that each cluster is cached in …
The problem is that … Is there a similar concept in AWS or Azure? (`vm.Project` is only used for GCE)
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @RaduBerinde, @stevendanna, and @tbg)
pkg/cmd/roachprod/install/cluster_synced.go, line 204 at r10 (raw file):
Previously, RaduBerinde wrote…
Random question but what do these parens around the value do?
Inside `/.../` it's a regexp, so the parens are a group in the pattern. I don't think we're making any use of that fact. I support removing it to avoid confusion.
pkg/cmd/roachprod/install/cluster_synced.go, line 204 at r10 (raw file):
Previously, ajwerner wrote…
Thanks, yeah I didn't see the purpose of the group.
stevendanna
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @petermattis, @RaduBerinde, @stevendanna, and @tbg)
a discussion (no related file):
I tried this out locally and things seem to be working well.
I'm slightly worried about the impact of changing the ROACHPROD= convention - if you have a running (non-local) cluster, after you update roachprod, the stop command will no longer work. I could probably make it so that only local clusters use the new convention, but it would make things even more messy.
🤔 I wonder how many running, non-teamcity roachprod clusters there are. At the time of this writing there were 12 in the GCE projects I checked, but there are many GCE projects.
If we kept the old stop implementation around and provided it under a legacy-stop utility command so that those with old clusters have a tool to use until they can recreate their cluster (or cycle through legacy-stop and start on their nodes), would that be enough to allow us to handle the transition period without too much disruption? We could remove legacy-stop once we've given those with long-running clusters some time.
I could probably make it so that only local clusters use the new convention, but it would make things even more messy.
It might not be too messy code-wise. A method on SyncedCluster could construct the right value for ROACHPROD based on the cluster type and I don't think other functions would need to know about the difference. That said, if we can help manage the disruption enough, it seems like we should err on the side of not having another minor detail to keep in mind.
pkg/cmd/roachprod/clusters_cache.go, line 63 at r10 (raw file):
	return err
}
return os.Rename(tmpFilename, filename)
We can fix in a follow-up if you'd like, but a couple of other things we might want to consider
- Sync the temporary file before the rename to ensure no one can end up seeing a zero-byte file; and,
- Attempt to clean up the temporary file in the case of WriteFile or Rename failing.
I suppose we could also use `ioutil.TempFile` to create the temporary file, but the higher level file lock we take before writing into this directory makes it unnecessary.
@RaduBerinde re: #71970 (comment), I think the problem is that when you sync in another project, it removes the hosts from the original project. In my case, what I ran into a few times until my muscle memory adapted is this:
shell1: GCE_PROJECT=andrei-jepsen roachstress.sh ... // this will run ~forever
sometime later in shell2: roachprod sync (on the default project)
Whatever roachtest is running in shell1 right now is likely going to fail because the clusters are no longer represented in the local state. (We don't run "roachprod sync" before every invocation of roachprod.) So what should happen instead is that roachprod sync on project1 is completely oblivious to any clusters tracked for project2. They just shouldn't share any state, period.
a discussion (no related file): Previously, stevendanna (Steven Danna) wrote…
Yes I agree - I went down the path of extracting the code around the ROACHPROD value and it should be easy now.
That's the thing - they should share state wrt the local cluster (and arguably AWS, Azure). Treating GCE separately is a bit tricky. But I'll come up with something.
This change adds `SQLPort` and `AdminUIPort` fields to `vm.VM`. This allows us to remove the special hardcoded values for the local cluster.
Having these fields stored in the clusters cache will allow having multiple local clusters, each with their own set of ports.
Release note: None
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo".
When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`.
For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable.
Fixes cockroachdb#71945.
Release note: None
meh
This commit speeds up the slowest step of roachprod: listing VMs from all providers. We now list the VMs in parallel across all providers instead of doing it serially. Release note: None
Currently roachprod has very poor behavior when used with different projects on the same host. For example:
```
shell1: GCE_PROJECT=andrei-jepsen roachstress.sh ... // this will run ~forever
sometime later in shell2: roachprod sync (on the default project)
```
The sync on the default project removes the cached information for the cluster on `andrei-jepsen`, which causes `roachprod` commands against that cluster (from within the `roachstress.sh` script) to fail.
We fix this by ignoring any cached clusters that reference a project that the provider was not configured for - both when loading clusters into memory and when deleting stale cluster files during `sync`.
As part of the change, we also improve the output of `list` to remove the colon after the cluster name and to include the GCE project:
```
$ roachprod list --gce-project cockroach-ephemeral,andrei-jepsen
Syncing...
Refreshing DNS entries...
glenn-anarest [aws] 9 (142h41m39s)
glenn-drive [aws] 1 (141h41m39s)
jane-1635868819-01-n1cpu4 [gce:cockroach-ephemeral] 1 (10h41m39s)
lin-ana [aws] 9 (178h41m39s)
local-foo [local] 4 (-)
radu-foo [gce:andrei-jepsen] 4 (12h41m39s)
radu-test [gce:cockroach-ephemeral] 4 (12h41m39s)
```
Release note: None
We use a LOCK file during sync. We create the file, acquire an exclusive lock and at the end remove the file. The removal of the file will fail if another process was waiting for the lock. Also, there is a race where we could be deleting the file that is in use by another process, and that would allow a third process to create the file again. To fix these issues, we let the LOCK file be; there is no need to remove it - we are relying on `flock`, not on exclusive file creation. Release note: None
Force-pushed from 0ed03de to 3e3433c
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, @rail, and @stevendanna)
pkg/cmd/roachprod/clusters_cache.go, line 63 at r10 (raw file):
Previously, stevendanna (Steven Danna) wrote…
We can fix in a follow-up if you'd like, but a couple of other things we might want to consider
- Sync the temporary file before the rename to ensure no one can end up seeing a zero-byte file; and,
- Attempt to clean up the temporary file in the case of WriteFile or Rename failing.
I suppose we could also use
ioutil.TempFileto create the temporary file, but the higher level file lock we take before writing into this directory makes it unnecessary.
Done.
rail
left a comment
Love it! The code is much easier to navigate and read now.
Sorry for the previous merge conflicts :/
Reviewed 47 of 52 files at r13, 5 of 5 files at r14, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @stevendanna)
pkg/roachprod/install/cockroach.go, line 60 at r14 (raw file):
}
path := filepath.Join(c.localVMDir(node), config.Binary)
So much easier to read 👍
pkg/roachprod/config/config_test.go, line 15 at r14 (raw file):
import "testing"

func TestIsLocalClusterName(t *testing.T) {
Yay tests!
pkg/roachprod/install/cluster_synced_test.go, line 19 at r14 (raw file):
// TestRoachprodEnv tests the roachprodEnvRegex and roachprodEnvValue methods.
func TestRoachprodEnv(t *testing.T) {
yay, moar tests!
TFTRs!
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @stevendanna)
bors r+

Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
- error creating merge commit from fb61ceb to blathers/backport-release-21.1-71970: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.1.x failed. See errors above.
- error creating merge commit from fb61ceb to blathers/backport-release-21.2-71970: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
The behavior changed in cockroachdb#71970. Release note: None
The behavior changed in cockroachdb#71970. Release note: None
72641: release-21.2: roachprod: backport changes from master as of 2021-11-11 r=RaduBerinde a=RaduBerinde
This PR backports all changes involving roachprod as of 2021-11-11. There have been large refactorings which we want to backport, or it will make backporting any future necessary roachtest fixes much harder. We also want new upcoming features around multi-tenancy available for 21.2.
CC @cockroachdb/release
#### roachprod/vm/aws: improve help text for multiple stores
```bash
roachprod create ajwerner-test -n1 --clouds aws \
--aws-ebs-volume='{"VolumeType": "io2", "VolumeSize": 213, "Iops": 321}' \
--aws-ebs-volume='{"VolumeType": "io2", "VolumeSize": 213, "Iops": 321}' \
--aws-enable-multiple-stores=true
roachprod stage ajwerner-test cockroach
roachprod start ajwerner-test --store-count 2
```
The above commands will create a node with multiple stores and start cockroach
on them. Hopefully these minor help changes make that clearer.
Release note: None
#### roachprod: add stageurl command
Sometimes it is useful to be able to download these artifacts
directly. For example, when trying to bisect a problem. But, the URL
can take a second to remember the format of.
The stageurl command prints the staging URL of the given application.
I've reorganized some of the code to reduce duplication between the
stage and stageurl command. There is still more duplication than I
would like. But I figured I would see if this seems useful to others
before further refactoring.
Release note: None
#### roachprod: clean up roachprod ssh keys in aws
Many SSH keys created by roachprod are no longer used, and some were created by former employees.
This needed to change because stale keys belonging to former employees are a security risk.
This patch adds another step to roachprod-gc cronjob to tag any untagged keys created by roachprod in AWS and delete them if they are unused.
Release note: None
#### roachprod: upgrade Azure Ubuntu image to 20.04
Previously, the Ubuntu 18.04 image in use didn't support `systemd-run
--same-dir`, which is used by some roachprod scripts. Additionally, GCE
and AWS already use Ubuntu 20.04 based images for roachprod.
Updating the base image to Ubuntu 20.04 fixes the issue above and aligns
the version with other cloud providers.
Release note: None
#### roachprod: update azure SDK
This is a partial backport of the commit below (only the part that
affects roachprod).
metric: Add Alert and Aggregation Rule interface
In this commit, the interfaces for Alert and Aggregation rule
interfaces are outlined. These interfaces will be used
by a new endpoint which will expose these rules in a YAML
format. This endpoint can be used by our end users to
configure alerts/monitoring for CockroachDB clusters.
This commit also updates the prometheus dependency in the
vendor submodule.
Release note: None
#### roachprod: fix roachprod gc docker build
Previously, the roachprod garbage collector docker image build process
was using the `go get` approach to build roachprod.
Currently, this method doesn't work, because it doesn't use any pinning,
so the build ends up with all kind of deprecation warnings and failures.
* Use multi-stage docker build in order to separate build and runtime.
It also reduces the image size from 1.9G to 700M.
* Build roachprod using the checked out commit SHA.
* Use the Bazel build image we use in CI to build roachprod.
* Use Bazel to build roachprod.
* Added `cloudbuild.yaml` to publish the docker image to GCR and use a
beefier instance type.
* Modify the entrypoint script to set the default region, required by
the AWS Go SDK library.
* Add `push.sh` to script deployment.
Release note: None
#### roachprod: correct spelling mistakes
Release note: None
#### roachprod: install AWS CLI v2 for GC images
Previously, after regenerating the GC docker images, roachprod stopped
listing AWS as an available provider, because Debian ships with AWS CLI
v1, but roachprod doesn't support it.
This patch installs AWS CLI v2.
Release note: None
#### roachprod: making roachprod subcommands point to a new library
Previously, the roachprod binary interfaced directly with roachprod's functionality
and there was no way for another tool to make use of that functionality.
This needed to change to create a library that can be used by the roachprod binary
and also other tools.
This patch migrates the subcommands functionality to a new library and makes
the binary point to the new library.
Release note: None
#### roachprod: avoid flaky test due to unused functions
Merging #71660 triggered a flaky test due to unused functions.
This patch fixes that test by making use of (or commenting out) the unused functions.
Release note: None
#### roachprod: minor cleanup for cloud.Cloud
This change fills in some missing comments from `cloud.Cloud` and
improves the interface a bit. Some of the related roachprod code is
cleaned up as well.
Release note: None
#### roachprod: clean up local cluster metadata
The logic around how the local cluster metadata is loaded and saved is
very convoluted. The local provider is using `install.Clusters` and is
writing directly to the `.hosts/local` file.
This commit disentangles this logic: it is now up to the main program
to call `local.AddCluster()` to inject local cluster information. The
main program also provides the implementation for a new
`local.VMStorage` interface, allowing the code for saving the hosts
file to live where it belongs.
Release note: None
#### roachprod: clean up local cluster deletion
This change moves the code to destroy the local cluster to the local
provider. The hosts file is deleted through LocalVMStorage.
Release note: None
#### roachprod: rework clusters cache
This commit changes roachprod from using `hosts`-style files in
`~/.roachprod/hosts` for caching clusters to using json files in
`~/.roachprod/clusters`.
Like before, each cluster has its own file. The main advantage is
that we can now store the entire cluster metadata instead of
manufacturing it based on one-off parsing.
WARNING: after this change, the information in `~/.roachprod/hosts`
will no longer be used. If a local cluster exists, the new `roachprod`
version will not know about it. It is recommended to destroy any local
cluster before using the new version. A local cluster can also be
cleaned up manually using:
```
killall -9 cockroach
rm -rf ~/.roachprod/local
```
Release note: None
#### roachprod: use cloud.Cluster in SyncedCluster
This change stores Cluster metadata directly in SyncedCluster, instead
of making copies of various fields.
#### roachprod: store ports in vm.VM
This change adds `SQLPort` and `AdminUIPort` fields to `vm.VM`. This
allows us to remove the special hardcoded values for the local
cluster.
Having these fields stored in the clusters cache will allow having
multiple local clusters, each with their own set of ports.
Release note: None
#### roachprod: support multiple local clusters
This change adds support for multiple local clusters. Local cluster
names must be either "local" or of the form "local-foo".
When the cluster is named "local", the node directories stay in the
same place, e.g. `~/local/1`. If the cluster is named "local-foo",
node directories are like `~/local/foo-1`.
For local clusters we include the cluster name in the ROACHPROD
variable; this is necessary to distinguish between processes of
different local clusters. The relevant code is cleaned up to
centralize the logic related to the ROACHPROD variable.
Fixes #71945.
Release note: None
meh
#### roachprod: list VMs in parallel
This commit speeds up the slowest step of roachprod: listing VMs from
all providers. We now list the VMs in parallel across all providers
instead of doing it serially.
Release note: None
#### roachprod: fix behavior when mixing GCE projects
Currently roachprod has very poor behavior when used with different
projects on the same host. For example:
```
shell1: GCE_PROJECT=andrei-jepsen roachstress.sh ... // this will run ~forever
sometime later in shell2: roachprod sync (on the default project)
```
The sync on the default project removes the cached information for the
cluster on `andrei-jepsen`, which causes `roachprod` commands against
that cluster (from within the `roachstress.sh` script) to fail.
We fix this by ignoring any cached clusters that reference a project
that the provider was not configured for - both when loading clusters
into memory and when deleting stale cluster files during `sync`.
As part of the change, we also improve the output of `list` to remove
the colon after the cluster name and to include the GCE project:
```
$ roachprod list --gce-project cockroach-ephemeral,andrei-jepsen
Syncing...
Refreshing DNS entries...
glenn-anarest [aws] 9 (142h41m39s)
glenn-drive [aws] 1 (141h41m39s)
jane-1635868819-01-n1cpu4 [gce:cockroach-ephemeral] 1 (10h41m39s)
lin-ana [aws] 9 (178h41m39s)
local-foo [local] 4 (-)
radu-foo [gce:andrei-jepsen] 4 (12h41m39s)
radu-test [gce:cockroach-ephemeral] 4 (12h41m39s)
```
Release note: None
#### roachprod: don't remove LOCK file
We use a LOCK file during sync. We create the file, acquire an
exclusive lock and at the end remove the file. The removal of the file
will fail if another process was waiting for the lock. Also, there is
a race where we could be deleting the file that is in use by another
process, and that would allow a third process to create the file
again.
To fix these issues, we let the LOCK file be; there is no need to
remove it - we are relying on `flock`, not on exclusive file creation.
Release note: None
#### roachprod: fix improperly wrapped errors
Partial backport of this commit:
*: fix improperly wrapped errors
I'm working on a linter that detects errors that are not wrapped
correctly, and it discovered these.
Release note: None
#### roachprod: fix `roachprod start` ignoring --binary flag
Merging #71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.
This patch reverts to the old way of setting config.Binary.
Release note: None
Fixes #72425 #72420 #72373 #72372
#### roachprod: update doc on local clusters
The behavior changed in
#71970.
Release note: None
#### pkg/roachprod: allow multiple-stores to be created on GCP
Port an existing flag from the AWS roachprod flags that allows multiple
stores to be created. When this flag is enabled, multiple data
directories are created and mounted as `/mnt/data{1..N}`.
Standardize the existing ext4 disk creation logic in the GCE setup
script to match the AWS functionality. Interleave the existing ZFS setup
commands based on the `--filesystem` flag.
Fix a bug introduced in #54986 that will always create multiple data
disks, ignoring the value of the flag. This has the effect of never
creating a RAID 0 array, which is the intended default behavior.
The ability to create a RAID 0 array on GCE VMs is required for the
Pebble write-throughput benchmarks.
Release note: None
#### roachprod: move quiet determination out of the library
Moving the logic of automatically enabling Quiet in non-terminal
output.
Release note: None
#### roachprod: clean up use of SyncedCluster
`SyncedCluster` is currently used to pass the cluster name (with
optional node selector) and the settings. This is a misuse of the type
and complicates things conceptually.
This change separates out the relevant settings into a new struct
`ClusterSettings`. All commands now pass the cluster name and the
`ClusterSettings` instead of passing a `SyncedCluster`.
Release note: None
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Ahmad Abedalqader <ahmad.abedalqader@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: rimadeodhar <rima@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
Many of these commits are cleanup and reorganization; the functional changes are described in the commit messages above.
When this PR merges, roachprod will lose track of any existing local cluster; it will need to be cleaned up and recreated.