roachprod: making roachprod subcommands point to a new library #71660

Merged
craig[bot] merged 1 commit into cockroachdb:master from healthy-pod:migrate-roachprod-binary-to-library
Nov 2, 2021

Conversation

@healthy-pod
Contributor

Previously, the roachprod binary interfaced directly with roachprod's functionality
and there was no way for another tool to make use of that functionality.

This needed to change to create a library that can be used by roachprod binary
and also other tools.

This patch migrates the subcommands functionality to a new library and makes
the binary point to the new library.

Release note: None

@healthy-pod healthy-pod added the do-not-merge label (bors won't merge a PR with this label) Oct 18, 2021
@healthy-pod healthy-pod requested a review from rail October 18, 2021 14:41
@healthy-pod healthy-pod requested a review from a team as a code owner October 18, 2021 14:41
@cockroach-teamcity
Member

This change is Reviewable

@healthy-pod healthy-pod force-pushed the migrate-roachprod-binary-to-library branch 3 times, most recently from 9d985ec to 40c2def Compare October 19, 2021 02:06
Member

@rail rail left a comment

Wooo! Sounds very exciting. This is my first pass.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod)


pkg/cmd/roachprod/main.go, line 628 at r1 (raw file):

	Args:  cobra.NoArgs,
	Run: wrap(func(cmd *cobra.Command, args []string) error {
		cachedHosts, err := roachprod.CachedHosts(cachedHostsCluster)

Looks like you moved the filtering logic from the CLI to the library. I like the idea, and think it would be even better if you can make the filtering part more flexible. In other words, let's not hardcode the filtering logic and limit it to "teamcity" - we may change the CI or the naming scheme at some point. Maybe add another parameter (optional?) where you pass either the prefix you want to exclude, or a function that would decide whether a hostname should be accepted or not. What do you think?
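
To make the suggestion concrete, here is a minimal sketch of both options from the comment: a prefix to exclude, and a caller-supplied function that accepts or rejects a hostname. `HostFilter`, `ExcludePrefix`, and `FilterHosts` are hypothetical names for illustration only and are not part of the roachprod library.

```go
package main

import (
	"fmt"
	"strings"
)

// HostFilter reports whether a hostname should be kept.
// Hypothetical type for illustration; not a roachprod API.
type HostFilter func(host string) bool

// ExcludePrefix returns a filter that rejects hosts starting with prefix.
func ExcludePrefix(prefix string) HostFilter {
	return func(host string) bool {
		return !strings.HasPrefix(host, prefix)
	}
}

// FilterHosts returns the hosts accepted by accept.
// A nil filter keeps every host, which makes the parameter optional.
func FilterHosts(hosts []string, accept HostFilter) []string {
	var kept []string
	for _, h := range hosts {
		if accept == nil || accept(h) {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	// Made-up hostnames, used only to exercise the filter.
	hosts := []string{"teamcity-123-n1", "alice-test-n1", "bob-bench-n2"}
	for _, h := range FilterHosts(hosts, ExcludePrefix("teamcity")) {
		fmt.Println(h)
	}
}
```

Passing a function keeps the CI naming convention out of the library: the CLI can supply `ExcludePrefix("teamcity")`, and another tool can supply its own rule.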


pkg/cmd/roachprod/main.go, line 686 at r1 (raw file):

	Args: cobra.RangeArgs(0, 1),
	Run: wrap(func(cmd *cobra.Command, args []string) error {
		return roachprod.List(quiet, listMine, listDetails, listJSON, args)

A nit: roachprod.SetupSSH() uses quiet as its last argument. I'd keep it consistent. Also it carries the least amount of mental load, so let's move it to the end. :)

Also, it feels like List() should return something that you print here, not only an error. Otherwise the library consumers would need to parse the output. In other words, the library shouldn't print anything, just let the consumers do that.
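
A rough sketch of that split, assuming a hypothetical `ClusterInfo` record and a stand-in `ListClusters` function (neither is the real roachprod API): the library side returns structured data, and only `main()` decides how to print it.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// ClusterInfo is a hypothetical summary record; the real library's types may differ.
type ClusterInfo struct {
	Name  string
	Nodes int
}

// ListClusters stands in for a library call: it returns data and never prints.
// The input map (cluster name to node count) replaces the real cloud lookup.
func ListClusters(all map[string]int) ([]ClusterInfo, error) {
	infos := make([]ClusterInfo, 0, len(all))
	for name, nodes := range all {
		infos = append(infos, ClusterInfo{Name: name, Nodes: nodes})
	}
	// Sort by name so callers get a deterministic order.
	sort.Slice(infos, func(i, j int) bool { return infos[i].Name < infos[j].Name })
	return infos, nil
}

func main() {
	infos, err := ListClusters(map[string]int{"radu-test": 4, "local": 1})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// All formatting lives in the binary, so other consumers can render
	// the same data as JSON, a table, or not at all.
	for _, c := range infos {
		fmt.Printf("%s: %d nodes\n", c.Name, c.Nodes)
	}
}
```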


pkg/cmd/roachprod/main.go, line 851 at r1 (raw file):

	Args: cobra.ExactArgs(1),
	Run: wrap(func(cmd *cobra.Command, args []string) error {
		newClusterOpts := roachprod.NewClusterOpts{

Hmm, newClusterOpts sounds like something for a new cluster. In other words, I was expecting to have something different for existing clusters. Maybe remove the "New" prefix? Or am I missing something?


pkg/cmd/roachprod/main.go, line 856 at r1 (raw file):

			NodeEnv: nodeEnv, NumRacks: numRacks, MaxConcurrency: maxConcurrency,
		}
		return roachprod.Extend(newClusterOpts, extendLifetime, username)

Same here. The printing part should be done in main().


pkg/cmd/roachprod/main.go, line 1038 at r1 (raw file):

			NodeEnv: nodeEnv, NumRacks: numRacks, MaxConcurrency: maxConcurrency,
		}
		return roachprod.Monitor(newClusterOpts, monitorIgnoreEmptyNodes, monitorOneShot)

Same here. Print stuff here.


pkg/cmd/roachprod/main.go, line 1277 at r1 (raw file):

			NodeEnv: nodeEnv, NumRacks: numRacks, MaxConcurrency: maxConcurrency,
		}
		return roachprod.SQL(newClusterOpts, args[1:])

Feels like this operation should return something, not quite sure what format we should use though...


pkg/cmd/roachprod/main.go, line 1293 at r1 (raw file):

			NodeEnv: nodeEnv, NumRacks: numRacks, MaxConcurrency: maxConcurrency,
		}
		return roachprod.PgURL(newClusterOpts, external)

Something similar here. The library prints things instead of main().


pkg/cmd/roachprod/main.go, line 1454 at r1 (raw file):

			NodeEnv: nodeEnv, NumRacks: numRacks, MaxConcurrency: maxConcurrency,
		}
		return roachprod.AdminUrl(newClusterOpts, adminurlIPs, adminurlOpen, adminurlPath)

Same same. Print stuff here :)


pkg/roachprod/roachprod.go, line 89 at r1 (raw file):

	// Acquire a filesystem lock so that two concurrent synchronizations of
	// roachprod state don't clobber each other.
	f, err := os.Create(lockFile)

Hmm, maybe we should also clean up this file after the operation is done. Apparently I have it :)


pkg/roachprod/roachprod.go, line 138 at r1 (raw file):

	if refreshDNS {
		if !quiet {
			fmt.Println("Refreshing DNS entries...")

I'd probably search for every fmt in this file and either replace it with logging or make sure that the printed value is returned to the caller.


pkg/roachprod/roachprod.go, line 207 at r1 (raw file):

	if listJSON {
		if listDetails {
			return errors.New("--json cannot be combined with --detail")

Can you move this check before you call Sync(). It'd be better to fail faster, without any network operations.
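
Something along these lines, as a sketch only: `ListOptions`, `Validate`, and the `sync` callback are hypothetical stand-ins, and the standard `errors` package is used here in place of whatever errors package the file imports.

```go
package main

import (
	"errors"
	"fmt"
)

// ListOptions is a hypothetical option struct for illustration.
type ListOptions struct {
	JSON    bool
	Details bool
}

// Validate rejects flag combinations that can never succeed.
func (o ListOptions) Validate() error {
	if o.JSON && o.Details {
		return errors.New("--json cannot be combined with --detail")
	}
	return nil
}

// List validates first and only then calls sync, which stands in for
// the network-bound synchronization step.
func List(opts ListOptions, sync func() error) error {
	if err := opts.Validate(); err != nil {
		return err
	}
	return sync()
}

func main() {
	synced := false
	err := List(ListOptions{JSON: true, Details: true}, func() error {
		synced = true
		return nil
	})
	fmt.Println("error:", err, "synced:", synced)
}
```

With the check first, an invalid flag combination returns immediately and the sync callback never runs.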


pkg/roachprod/utils.go, line 23 at r1 (raw file):

	"github.com/cockroachdb/cockroach/pkg/cmd/roachprod/config"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachprod/vm"

I think this is OK for now, but it feels that a library shouldn't use anything from pkg/cmd in its final version.

Also, I'm not a big fan of file names like "util", "common", "misc", etc. :) Not sure how to name this one, but we can probably move some parts of it to other files.

@healthy-pod healthy-pod force-pushed the migrate-roachprod-binary-to-library branch 20 times, most recently from 56272ae to ed5ed54 Compare October 31, 2021 23:09
@healthy-pod healthy-pod removed the do-not-merge label (bors won't merge a PR with this label) Nov 1, 2021
Contributor Author

@healthy-pod healthy-pod left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @rail)


pkg/cmd/roachprod/main.go, line 628 at r1 (raw file):

Previously, rail (Rail Aliiev) wrote…

Looks like you moved the filtering logic from the CLI to the library. I like the idea, and think it would be even better if you can make the filtering part more flexible. In other words, let's not hardcode the filtering logic and limit it to "teamcity" - we may change the CI or the naming scheme at some point. Maybe add another parameter (optional?) where you pass either the prefix you want to exclude, or a function that would decide whether a hostname should be accepted or not. What do you think?

Changed the library implementation to return unfiltered hosts to give users more flexibility with their filtering logic.


pkg/cmd/roachprod/main.go, line 686 at r1 (raw file):

Previously, rail (Rail Aliiev) wrote…

A nit: roachprod.SetupSSH() uses quiet as its last argument. I'd keep it consistent. Also it carries the least amount of mental load, so let's move it to the end. :)

Also, it feels like List() should return something that you print here, not only an error. Otherwise the library consumers would need to parse the output. In other words, the library shouldn't print anything, just let the consumers do that.

quiet is no longer passed as a parameter to roachprod.SetupSSH() as I realized it's part of ClusterOpts struct (or install.SyncedCluster).


pkg/cmd/roachprod/main.go, line 851 at r1 (raw file):

Previously, rail (Rail Aliiev) wrote…

Hmm, newClusterOpts sounds like something for a new cluster. In other words, I was expecting to have something different for existing clusters. Maybe remove the "New" prefix? Or am I missing something?

Realized that NewClusterOpts type is just a subset of install.SyncedCluster so used that instead.

healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Nov 4, 2021
Merging cockroachdb#71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.

This patch reverts to the old way of setting config.Binary.

Release note: None
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Nov 4, 2021
Merging cockroachdb#71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.

This patch reverts to the old way of setting config.Binary.

Release note: None

Fixes cockroachdb#72425
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Nov 4, 2021
Merging cockroachdb#71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.

This patch reverts to the old way of setting config.Binary.

Release note: None

Fixes cockroachdb#72425 cockroachdb#72420 cockroachdb#72373 cockroachdb#72372
craig bot pushed a commit that referenced this pull request Nov 4, 2021
70330: util/log: add buffer sink decorator r=knz a=rauchenstein

Previously, only the file sink had buffering, and in that case it is
built into the sink.  It's important to add buffering to network sinks
for various reasons -- reducing network chatter, and making the
networking call itself asynchronous so the log call returns with very
low latency.

This change adds a buffering decorator so that buffering can be added to
any log sink with little or no development effort, and allowing
buffering to be configured in a uniform way.

Release note (cli change): Add buffering to log sinks. This can be
configured with the new "buffering" field on any log sink provided via
the "--log" or "--log-config-file" flags.

Release justification: This change is safe because it is a no-op without
a configuration change specifically enabling it.

72353: *: fix improperly wrapped errors r=otan,RaduBerinde,stevendanna a=rafiss

refs #42510

I'm working on a linter that detects errors that are not wrapped
correctly, and it discovered these.

Release note: None

72417: sql: add unit tests for creating default privileges r=ajwerner a=RichardJCai

Adding some unit test coverage so we don't hit bugs like this again.
#72322

Release note: None

72430: kvserver: use wrapper type for Store.mu.replicas r=erikgrinaker a=tbg

This simplifies lots of callers and it will also make it easier to work
on #72374, where this map will start containing more than one type as
value.

Release note: None


72432: roachprod: fix `roachprod start` ignoring --binary flag r=[rail,tbg] a=healthy-pod

Merging #71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.

This patch reverts to the old way of setting config.Binary as a quick solution to the bug.

Release note: None

Fixes #72425 #72420 #72373 #72372

Co-authored-by: Jay Rauchenstein <rauchenstein@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Richard Cai <RichardJCai@users.noreply.github.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Ahmad Abedalqader <ahmad.abedalqader@cockroachlabs.com>
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Nov 5, 2021
In cockroachdb#71660, roachprod library was created under pkg/roachprod
by moving the logic under pkg/cmd/roachprod to pkg/roachprod and
pointing the binary subcommands to the library.

This patch updates the logic to make it more suitable for a library
than a binary and integrates those changes into related tools such
as pkg/cmd/roachprod and pkg/cmd/roachtest.

Release note: None
RaduBerinde pushed a commit to RaduBerinde/cockroach that referenced this pull request Nov 15, 2021
Merging cockroachdb#71660 triggered a flaky test due to unused functions.

This patch avoids that test by making use of / commenting unused functions.

Release note: None
RaduBerinde pushed a commit to RaduBerinde/cockroach that referenced this pull request Nov 15, 2021
Merging cockroachdb#71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.

This patch reverts to the old way of setting config.Binary.

Release note: None

Fixes cockroachdb#72425 cockroachdb#72420 cockroachdb#72373 cockroachdb#72372
craig bot pushed a commit that referenced this pull request Nov 17, 2021
72641: release-21.2: roachprod: backport changes from master as of 2021-11-11 r=RaduBerinde a=RaduBerinde

This PR backports all changes involving roachprod as of 2021-11-11. There have been large refactorings which we want to backport, or it will make backporting any future necessary roachtest fixes much harder. We also want new upcoming features around multi-tenancy available for 21.2.

CC @cockroachdb/release 

#### roachprod/vm/aws: improve help text for multiple stores

```bash
roachprod create ajwerner-test -n1 --clouds aws \
--aws-ebs-volume='{"VolumeType": "io2", "VolumeSize": 213, "Iops": 321}' \
--aws-ebs-volume='{"VolumeType": "io2", "VolumeSize": 213, "Iops": 321}' \
--aws-enable-multiple-stores=true
roachprod stage ajwerner-test cockroach
roachprod start ajwerner-test --store-count 2
```

The above commands will create a node with multiple stores and start cockroach
on them. Hopefully these minor help changes make that clearer.

Release note: None

#### roachprod: add stageurl command

Sometimes it is useful to be able to download these artifacts
directly. For example, when trying to bisect a problem. But, the URL
format can take a moment to recall.

The stageurl command prints the staging URL of the given application.

I've reorganized some of the code to reduce duplication between the
stage and stageurl command. There is still more duplication than I
would like. But I figured I would see if this seems useful to others
before further refactoring.

Release note: None

#### roachprod: clean up roachprod ssh keys in aws

Many SSH keys created by roachprod are no longer used, and some were created by former employees.

This needed to change because it's a security issue that former employees may exploit.

This patch adds another step to roachprod-gc cronjob to tag any untagged keys created by roachprod in AWS and delete them if they are unused.

Release note: None

#### roachprod: upgrade Azure Ubuntu image to 20.04

Previously, the Ubuntu 18.04 image in use didn't support `systemd-run
--same-dir`, which is used by some roachprod scripts. Additionally, GCE
and AWS already use Ubuntu 20.04 based images for roachprod.

Updating the base image to Ubuntu 20.04 fixes the issue above and aligns
the version with other cloud providers.

Release note: None

#### roachprod: update azure SDK

This is a partial backport of the commit below (only the part that
affects roachprod).

  metric: Add Alert and Aggregation Rule interface

  In this commit, the interfaces for Alert and Aggregation rule
  interfaces are outlined. These interfaces will be used
  by a new endpoint which will expose these rules in a YAML
  format. This endpoint can be used by our end users to
  configure alerts/monitoring for CockroachDB clusters.
  This commit also updates the prometheus dependency in the
  vendor submodule.

Release note: None

#### roachprod: fix roachprod gc docker build

Previously, the roachprod garbage collector docker image build process
was using the `go get` approach to build roachprod.

Currently, this method doesn't work because it doesn't pin any dependency
versions, so the build ends up with all kinds of deprecation warnings and failures.

* Use multi-stage docker build in order to separate build and runtime.
  It also reduces the image size from 1.9G to 700M.
* Build roachprod using the checked out commit SHA.
* Use the Bazel build image we use in CI to build roachprod.
* Use Bazel to build roachprod.
* Added `cloudbuild.yaml` to publish the docker image to GCR and use a
  beefier instance type.
* Modify the entrypoint script to set the default region, required by
  the AWS Go SDK library.
* Add `push.sh` to script deployment.

Release note: None

#### roachprod: correct spelling mistakes

Release note: None

#### roachprod: install AWS CLI v2 for GC images

Previously, after regenerating the GC docker images, roachprod stopped
listing AWS as an available provider, because Debian ships with AWS CLI
v1, but roachprod doesn't support it.

This patch installs AWS CLI v2.

Release note: None

#### roachprod: making roachprod subcommands point to a new library

Previously, the roachprod binary interfaced directly with roachprod's functionality
and there was no way for another tool to make use of that functionality.

This needed to change to create a library that can be used by roachprod binary
and also other tools.

This patch migrates the subcommands functionality to a new library and makes
the binary point to the new library.

Release note: None

#### roachprod: avoid flaky test due to unused functions

Merging #71660 triggered a flaky test due to unused functions.

This patch avoids that test by making use of / commenting unused functions.

Release note: None

#### roachprod: minor cleanup for cloud.Cloud

This change fills in some missing comments from `cloud.Cloud` and
improves the interface a bit. Some of the related roachprod code is
cleaned up as well.

Release note: None

#### roachprod: clean up local cluster metadata

The logic around how the local cluster metadata is loaded and saved is
very convoluted. The local provider is using `install.Clusters` and is
writing directly to the `.hosts/local` file.

This commit disentangles this logic: it is now up to the main program
to call `local.AddCluster()` to inject local cluster information. The
main program also provides the implementation for a new
`local.VMStorage` interface, allowing the code for saving the hosts
file to live where it belongs.

Release note: None

#### roachprod: clean up local cluster deletion

This change moves the code to destroy the local cluster to the local
provider. The hosts file is deleted through LocalVMStorage.

Release note: None

#### roachprod: rework clusters cache

This commit changes roachprod from using `hosts`-style files in
`~/.roachprod/hosts` for caching clusters to using json files in
`~/.roachprod/clusters`.

Like before, each cluster has its own file. The main advantage is
that we can now store the entire cluster metadata instead of
manufacturing it based on one-off parsing.

WARNING: after this change, the information in `~/.roachprod/hosts`
will no longer be used. If a local cluster exists, the new `roachprod`
version will not know about it. It is recommended to destroy any local
cluster before using the new version. A local cluster can also be
cleaned up manually using:
```
killall -9 cockroach
rm -rf ~/.roachprod/local
```

Release note: None

#### roachprod: use cloud.Cluster in SyncedCluster

This change stores Cluster metadata directly in SyncedCluster, instead
of making copies of various fields.

#### roachprod: store ports in vm.VM

This change adds `SQLPort` and `AdminUIPort` fields to `vm.VM`. This
allows us to remove the special hardcoded values for the local
cluster.

Having these fields stored in the clusters cache will allow having
multiple local clusters, each with their own set of ports.

Release note: None

#### roachprod: support multiple local clusters

This change adds support for multiple local clusters. Local cluster
names must be either "local" or of the form "local-foo".

When the cluster is named "local", the node directories stay in the
same place, e.g. `~/local/1`. If the cluster is named "local-foo",
node directories are like `~/local/foo-1`.
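
The naming rule and directory layout above can be sketched as follows. `IsLocalClusterName` and `LocalNodeDir` are hypothetical helpers written for illustration; the actual validation in roachprod may be stricter about which suffixes are allowed.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// IsLocalClusterName reports whether name is "local" or of the form "local-foo".
// Hypothetical helper; not the roachprod implementation.
func IsLocalClusterName(name string) bool {
	if name == "local" {
		return true
	}
	return strings.HasPrefix(name, "local-") && len(name) > len("local-")
}

// LocalNodeDir returns the directory for a node of a local cluster,
// following the layout described above: <home>/local/1 for "local"
// and <home>/local/foo-1 for "local-foo".
func LocalNodeDir(home, cluster string, node int) (string, error) {
	if !IsLocalClusterName(cluster) {
		return "", fmt.Errorf("%q is not a local cluster name", cluster)
	}
	if cluster == "local" {
		return filepath.Join(home, "local", fmt.Sprint(node)), nil
	}
	suffix := strings.TrimPrefix(cluster, "local-")
	return filepath.Join(home, "local", fmt.Sprintf("%s-%d", suffix, node)), nil
}

func main() {
	for _, name := range []string{"local", "local-foo", "radu-test"} {
		dir, err := LocalNodeDir("/home/user", name, 1)
		fmt.Println(name, dir, err)
	}
}
```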

For local clusters we include the cluster name in the ROACHPROD
variable; this is necessary to distinguish between processes of
different local clusters. The relevant code is cleaned up to
centralize the logic related to the ROACHPROD variable.

Fixes #71945.

Release note: None

meh

#### roachprod: list VMs in parallel

This commit speeds up the slowest step of roachprod: listing VMs from
all providers. We now list the VMs in parallel across all providers
instead of doing it serially.
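
As an illustration of the pattern (not the code in this commit), the sketch below fans out one goroutine per provider and collects results under a mutex. The `Provider` interface and `fakeProvider` are hypothetical simplifications of the real provider types.

```go
package main

import (
	"fmt"
	"sync"
)

// Provider is a hypothetical minimal interface for illustration.
type Provider interface {
	Name() string
	List() ([]string, error)
}

// fakeProvider returns a fixed list of VM names.
type fakeProvider struct {
	name string
	vms  []string
}

func (p fakeProvider) Name() string            { return p.name }
func (p fakeProvider) List() ([]string, error) { return p.vms, nil }

// ListAll queries every provider concurrently and returns the VMs keyed
// by provider name. The first error encountered is returned.
func ListAll(providers []Provider) (map[string][]string, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
		result   = make(map[string][]string, len(providers))
	)
	for _, p := range providers {
		wg.Add(1)
		go func(p Provider) {
			defer wg.Done()
			vms, err := p.List() // slow call runs outside the lock
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("%s: %w", p.Name(), err)
				}
				return
			}
			result[p.Name()] = vms
		}(p)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return result, nil
}

func main() {
	providers := []Provider{
		fakeProvider{name: "aws", vms: []string{"glenn-drive-0001"}},
		fakeProvider{name: "gce", vms: []string{"radu-test-0001", "radu-test-0002"}},
	}
	vms, err := ListAll(providers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(vms["aws"]), len(vms["gce"]))
}
```

With this shape, total latency is bounded by the slowest provider rather than by the sum of all providers.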

Release note: None

#### roachprod: fix behavior when mixing GCE projects

Currently roachprod has very poor behavior when used with different
projects on the same host. For example:
```
shell1: GCE_PROJECT=andrei-jepsen roachstress.sh ... // this will run ~forever
sometime later in shell2: roachprod sync (on the default project)
```

The sync on the default project removes the cached information for the
cluster on `andrei-jepsen`, which causes `roachprod` commands against
that cluster (from within the `roachstress.sh` script) to fail.

We fix this by ignoring any cached clusters that reference a project
that the provider was not configured for - both when loading clusters
into memory and when deleting stale cluster files during `sync`.

As part of the change, we also improve the output of `list` to remove
the colon after the cluster name and to include the GCE project:

```
$ roachprod list --gce-project cockroach-ephemeral,andrei-jepsen
Syncing...
Refreshing DNS entries...
glenn-anarest                  [aws]                      9  (142h41m39s)
glenn-drive                    [aws]                      1  (141h41m39s)
jane-1635868819-01-n1cpu4      [gce:cockroach-ephemeral]  1  (10h41m39s)
lin-ana                        [aws]                      9  (178h41m39s)
local-foo                      [local]                    4  (-)
radu-foo                       [gce:andrei-jepsen]        4  (12h41m39s)
radu-test                      [gce:cockroach-ephemeral]  4  (12h41m39s)
```

Release note: None

#### roachprod: don't remove LOCK file

We use a LOCK file during sync. We create the file, acquire an
exclusive lock and at the end remove the file. The removal of the file
will fail if another process was waiting for the lock. Also, there is
a race where we could be deleting the file that is in use by another
process, and that would allow a third process to create the file
again.

To fix these issues, we let the LOCK file be; there is no need to
remove it - we are relying on `flock`, not on exclusive file creation.
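
A minimal sketch of this locking scheme, assuming a Unix-like system where `syscall.Flock` is available. `withFileLock`, `fileExists`, and `exampleLockPath` are hypothetical helpers and the lock path is made up; this is not the roachprod implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// withFileLock runs fn while holding an exclusive flock on path.
// The lock file is deliberately left in place afterwards: mutual
// exclusion comes from flock, not from the file's existence.
func withFileLock(path string, fn func() error) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	// Blocks until no other process holds the lock.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	return fn()
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// exampleLockPath returns a made-up lock location in the temp directory.
func exampleLockPath() string {
	return filepath.Join(os.TempDir(), "roachprod-example.LOCK")
}

func main() {
	lock := exampleLockPath()
	err := withFileLock(lock, func() error {
		fmt.Println("syncing while holding the lock")
		return nil
	})
	fmt.Println("err:", err, "lock file still present:", fileExists(lock))
}
```

Because every process opens the same long-lived file, a waiter never ends up holding a lock on a file that has since been unlinked and recreated by someone else.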

Release note: None

#### roachprod: fix improperly wrapped errors

Partial backport of this commit:
  *: fix improperly wrapped errors

  I'm working on a linter that detects errors that are not wrapped
  correctly, and it discovered these.

Release note: None

#### roachprod: fix `roachprod start` ignoring --binary flag

Merging #71660 introduced a bug where roachprod ignores --binary
flag when running `roachprod start`.

This patch reverts to the old way of setting config.Binary.

Release note: None

Fixes #72425 #72420 #72373 #72372

#### roachprod: update doc on local clusters

The behavior changed in
#71970.

Release note: None

#### pkg/roachprod: allow multiple-stores to be created on GCP

Port an existing flag from the AWS roachprod flags that allows multiple
stores to be created. When this flag is enabled, multiple data
directories are created and mounted as `/mnt/data{1..N}`.

Standardize the existing ext4 disk creation logic in the GCE setup
script to match the AWS functionality. Interleave the existing ZFS setup
commands based on the `--filesystem` flag.

Fix a bug introduced in #54986 that will always create multiple data
disks, ignoring the value of the flag. This has the effect of never
creating a RAID 0 array, which is the intended default behavior.

The ability to create a RAID 0 array on GCE VMs is required for the
Pebble write-throughput benchmarks.

Release note: None

#### roachprod: move quiet determination out of the library

Move the logic that automatically enables Quiet for non-terminal
output out of the library.

Release note: None

#### roachprod: clean up use of SyncedCluster

`SyncedCluster` is currently used to pass the cluster name (with
optional node selector) and the settings. This is a misuse of the type
and complicates things conceptually.

This change separates out the relevant settings into a new struct
`ClusterSettings`. All commands now pass the cluster name and the
`ClusterSettings` instead of passing a `SyncedCluster`.

Release note: None


Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Ahmad Abedalqader <ahmad.abedalqader@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: rimadeodhar <rima@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Nick Travers <travers@cockroachlabs.com>
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this pull request Nov 18, 2021
In cockroachdb#71660, roachprod library was created under pkg/roachprod
by moving the logic under pkg/cmd/roachprod to pkg/roachprod and
pointing the binary subcommands to the library.

This patch updates the logic to make it more suitable for a library
than a binary and integrates those changes into related tools such
as pkg/cmd/roachprod and pkg/cmd/roachtest.

Release note: None

Labels

A-roachprod, C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-dev-inf

3 participants