Skip to content

roachprod: default to secure clusters unless GCE cockroach-ephemeral#151230

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
golgeek:ludo/secure-clusters
Aug 14, 2025
Merged

roachprod: default to secure clusters unless GCE cockroach-ephemeral#151230
craig[bot] merged 1 commit intocockroachdb:masterfrom
golgeek:ludo/secure-clusters

Conversation

@golgeek
Copy link
Copy Markdown
Contributor

@golgeek golgeek commented Aug 1, 2025

Since #147737, to ease testing for engineering teams, roachprod defaults to insecure clusters. This change has led to security impact for other entities at CRL.

This patch brings smart selection of secure vs. insecure cluster depending on the Cloud project the cluster is provisionned in. If the cluster is provisionned in the engineering's ephemeral GCP project, it will default to insecure, while all other configurations will default to secure.

In order to achieve that, and because this depends on the cluster's configuration that is not available at the CLI package level, the CLI now keeps track of the user's arguments (--secure, --insecure, default) and passes the information to the roachprod library.

Epic: none
Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@golgeek
Copy link
Copy Markdown
Contributor Author

golgeek commented Aug 1, 2025

No flag, cockroach-ephemeral cluster, -> insecure:

➜  cockroach git:(ludo/secure-clusters) ✗ ./bin/roachprod start "${CLUSTER}"
WARN: cluster ludoleroux-test defaults to insecure, because it is in project cockroach-ephemeral
2025/08/01 21:07:07 cockroach.go:620: ludoleroux-test (system): starting cockroach processes
2025/08/01 21:07:07 cockroach.go:936: starting node 1

Secure flag, cockroach-ephemeral cluster, -> secure:

➜  cockroach git:(ludo/secure-clusters) ✗ ./bin/roachprod start "${CLUSTER}" --secure
2025/08/01 21:07:37 cluster_synced.go:1605: ludoleroux-test: checking certs.tar
ludoleroux-test: initializing certs 1/1 
ludoleroux-test: distributing certs 0/0 
2025/08/01 21:07:43 cockroach.go:620: ludoleroux-test (system): starting cockroach processes
2025/08/01 21:07:43 cockroach.go:936: starting node 1

No flag, non cockroach-ephemeral cluster (AWS), -> secure:

➜  cockroach git:(ludo/secure-clusters) ✗ ./bin/roachprod start "${CLUSTER}-aws"
2025/08/01 21:14:14 cockroach.go:574: WARNING: Service registration and custom ports are not supported for this cluster.
Setting ports to default SQL Port: 26257, and Admin UI Port: 26258.
Attempting to start any additional external SQL processes will fail.
2025/08/01 21:14:14 cluster_synced.go:1605: ludoleroux-test-aws: checking certs.tar
ludoleroux-test-aws: initializing certs 1/1 
ludoleroux-test-aws: distributing certs 0/0 
2025/08/01 21:14:19 cockroach.go:620: ludoleroux-test-aws (system): starting cockroach processes
2025/08/01 21:14:19 cockroach.go:936: starting node 1

Insecure flag, non-cockroach-ephemeral cluster (AWS), -> insecure:

➜  cockroach git:(ludo/secure-clusters) ✗ ./bin/roachprod start "${CLUSTER}-aws" --insecure
2025/08/01 21:15:51 cockroach.go:574: WARNING: Service registration and custom ports are not supported for this cluster.
Setting ports to default SQL Port: 26257, and Admin UI Port: 26258.
Attempting to start any additional external SQL processes will fail.
2025/08/01 21:15:51 cockroach.go:620: ludoleroux-test-aws (system): starting cockroach processes
2025/08/01 21:15:51 cockroach.go:936: starting node 1

@golgeek golgeek force-pushed the ludo/secure-clusters branch from 62bad8b to 7eba42f Compare August 1, 2025 21:23
@golgeek
Copy link
Copy Markdown
Contributor Author

golgeek commented Aug 1, 2025

TC runs:

@golgeek golgeek marked this pull request as ready for review August 1, 2025 22:01
@golgeek golgeek requested review from a team as code owners August 1, 2025 22:01
@golgeek golgeek requested review from a team, herkolategan, nameisbhaskar, shailendra-patel and srosenberg and removed request for a team August 1, 2025 22:01
Copy link
Copy Markdown
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand passing a tri-state value to convey if the user explicitly intended to set secure, otherwise we infer it based on context. But since the only context that we really need to infer it is at CLI level, could we not use this https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachprod/cli/util.go#L140, and determine if someone is passing a project or set the env var for a project that we change the default behaviour. Then it will automatically trickle through to the roachprod library, which then doesn't have to care about intent?

Since they won't be able to access their cluster without setting the correct project in any case? And if the project is set we can infer the default at CLI level?

ClusterSettings map[string]string

// This is used to pass the CLI flag
secureFlagsOpt SecureOption
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Having a hidden flag hitching a ride on ClusterSettings here has me a bit worried. I kind of liked the fact that ClusterSettings was a pure exported struct without any hidden tricks.

@golgeek
Copy link
Copy Markdown
Contributor Author

golgeek commented Aug 7, 2025

I understand passing a tri-state value to convey if the user explicitly intended to set secure, otherwise we infer it based on context. But since the only context that we really need to infer it is at CLI level, could we not use this https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachprod/cli/util.go#L140, and determine if someone is passing a project or set the env var for a project that we change the default behaviour. Then it will automatically trickle through to the roachprod library, which then doesn't have to care about intent?

Since they won't be able to access their cluster without setting the correct project in any case? And if the project is set we can infer the default at CLI level?

Not sure I understand your point.

The issue that forced me to pass a tri-state value is that the cluster start operation applies to an already created cluster, and is 100% decorrelated from the cluster creation.

In a nutshel:

  1. you create a cluster in a project (we have the info)
  2. [you stage]
  3. you start your cluster, potentially specifying if you want to force secure or insecure, but without specifying the project the cluster resides in

At step 3, the cli doesn't know about the project the cluster lives in and cannot pre-compute a default applied to the project. So it could pass a general secure, but the roachprod libray then has no way of determining if it's a default secure or a CLI forced secure, which is important in cockroach-ephemeral as default would be insecure unless --secure was forced.

@herkolategan
Copy link
Copy Markdown
Collaborator

Not sure I understand your point.
The issue that forced me to pass a tri-state value is that the cluster start operation applies to an already created cluster, and is 100% decorrelated from the cluster creation.

You're right, forgot about that aspect. In my mind to run "start" (or any other command) on the cluster in question you have to have to pass the project as well, but forgot the cluster cache keeps that info, so you don't have to.

I'll take another pass in a minute :)

@herkolategan
Copy link
Copy Markdown
Collaborator

herkolategan commented Aug 7, 2025

I might be nitpicking now, and I may have some ulterior motives to avoid a conflict on ClusterSettingOption with one of my PRs... but hear me out:

Could roachprod.go not provide an exported function call that retrieves a cluster's info. And then the CLI can in Wrap(...) [1] make that call? Would also be a good way to fail early on, on a non-existent cluster?

But it may still be a significant change since any command that uses "secure" would need to be updated to pass the cluster name, usually arg 0, to Wrap [1] - so that it can do the default secure logic.

I could be wrong, but I think most or all of the operations would use the cache, so it wouldn't be an expensive call, and then the decision can be made on the CLI?

[1] master/pkg/cmd/roachprod/cli/util.go#L140 / https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachprod/cli/util.go#L106

@golgeek
Copy link
Copy Markdown
Contributor Author

golgeek commented Aug 8, 2025

Could roachprod.go not provide an exported function call that retrieves a cluster's info. And then the CLI can in Wrap(...) [1] make that call? Would also be a good way to fail early on, on a non-existent cluster?

Took the night to reflect on it, and I do feel like it's putting a lot of (too much?) intelligence into the CLI.

Making the decision on a default behavior depending on a cluster state is a business rule and not user-defined.
The day we expose roachprod with a different entrypoint (as an API --or a desktop app!--), the same rule has to apply across all entrypoints. On a much more realistic outcome, if the CLI becomes a mere HTTP wrapper that queries a central service having this kind of rules in the central service prevents outdated clients to still spawn default insecure clusters when we switch the rule.

If what really bothers you is the addition of this dual-state or tri-state boolean in the install.ClusterSettingOption interface, I guess I could try and refactor to extract it from the ClusterSettings. Though it would require adding an extra argument to a ton of funcs.

}

type SecureOption interface {
apply(settings *ClusterSettings)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can replace apply(settings *ClusterSettings) with ClusterSettingOption inline, to make the composition clear that SecureOption is a ClusterSettingOption

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, or it wouldn't satisfy the ClusterSettingOptions interface anymore:

// ClusterSettingOption is the interface satisfied by options to MakeClusterSettings.
type ClusterSettingOption interface {
	apply(settings *ClusterSettings)
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this works:

type SecureOption interface {
	ClusterSettingOption
	compute(c *SyncedCluster) error
}

Checked out the PR to confirm and run tests.

Copy link
Copy Markdown
Collaborator

@herkolategan herkolategan Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will ensure if someone updates ClusterSettingOption, that your option inherits the changes (i.e., it would provide better hints that the implementation needs to be updated).

Copy link
Copy Markdown
Contributor Author

@golgeek golgeek Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Smart! Thanks!
I thought you wanted to change the signature of the apply() method on the SecureOption interface :)

Edit: updated

@herkolategan
Copy link
Copy Markdown
Collaborator

I guess this all boils down to the fact that we don't store any state, especially cockroach related state. It's the same issue we had with managing tenants / virtual clusters, and opted to store the state in DNS records. It tends to create a mess with defaults regarding different scenarios. The onus is placed on the user to remember details about the cluster, and if not we need to infer what we think the properties of the cluster is based on some context.

The 3 states of Secure is actually: (true, false, guess based on context). The last entry is still a best guess since an older version of roachprod could have been used to create the cluster. When we move to some kind of central state I reckon there would be no need for any of this hopefully.

I'm not opposed to the solution in this PR. It just felt a bit heavy to support a very specific use case, but given the current state of things there's likely no squeaky clean way around it.

@golgeek
Copy link
Copy Markdown
Contributor Author

golgeek commented Aug 8, 2025

I guess this all boils down to the fact that we don't store any state, especially cockroach related state. It's the same issue we had with managing tenants / virtual clusters, and opted to store the state in DNS records. It tends to create a mess with defaults regarding different scenarios. The onus is placed on the user to remember details about the cluster, and if not we need to infer what we think the properties of the cluster is based on some context.

The 3 states of Secure is actually: (true, false, guess based on context). The last entry is still a best guess since an older version of roachprod could have been used to create the cluster. When we move to some kind of central state I reckon there would be no need for any of this hopefully.

I'm not opposed to the solution in this PR. It just felt a bit heavy to support a very specific use case, but given the current state of things there's likely no squeaky clean way around it.

Yep, I've restarted the work on centralized (very close to have an actual demo working), and hopefully keeping a state will become easier from there as we could migrate the state stored in DNS to the service (as well as compute the DNS zone for A records in the service).

This is intended as a stop gap PR, though we all know what stop gap usually means.......

@golgeek golgeek force-pushed the ludo/secure-clusters branch from 7eba42f to fd6c483 Compare August 8, 2025 10:19
Copy link
Copy Markdown
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I tried to think of a different approach, but nothing especially cleaner comes to mind.

I did wonder for a second if we should make the "main/exported" option (ClusterSettings.Secure) on the struct itself have multiple states {True, False, Default}, and then compute at the last minute where we need to convert Default -> True or False. But this would probably result in a much larger foot print to keep track of.

settings.Secure = bool(o)
type SecureOption interface {
ClusterSettingOption
compute(c *SyncedCluster) error
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: compute is somewhat ambiguous relative to apply. It also doesn't actually compute much. How about making it into,

overrideForCluster(c *SyncedCluster) error)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with the verbose overrideBasedOnClusterSettings(c *SyncedCluster) error).

}
c.Nodes = nodes

if c.ClusterSettings.secureFlagsOpt != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: for brevity, could fold the nullity check into the method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is an interface, this method is bound to have multiple implementation, so this check being implemented multiple times vs. the single call. It feels safer to implement it there once and for all.

// we make it insecure by default. This is to avoid dealing with certificates
// for ephemeral engineering test clusters.
if len(c.Clouds()) == 1 && c.Clouds()[0] == fmt.Sprintf("%s:%s", gce.ProviderName, gce.DefaultProjectID) {
fmt.Printf("WARN: cluster %s defaults to insecure, because it is in project %s\n",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this could spam the log multiple times. E.g., I see it twice during roachprod start; the second time is during updatePrometheusTargets. Maybe we could suppress it by persisting the override into the cluster metadata (i.e., *cloud.Cluster)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like a much more involved project. Let's revisit and store the secure state of a cluster when we implement a service registry? Or in a subsequent PR if logging verbosity is getting out of hand.

Since cockroachdb#147737, to ease testing for engineering teams, roachprod defaults
to insecure clusters. This change has led to security impact for other
entities at CRL.

This patch brings smart selection of secure vs. insecure cluster
depending on the Cloud project the cluster is provisionned in. If the
cluster is provisionned in the engineering's ephemeral GCP project, it
will default to insecure, while all other configurations will default to
secure.

In order to achieve that, and because this depends on the cluster's
configuration that is not available at the CLI package level, the CLI
now keeps track of the user's arguments (--secure, --insecure, default)
and passes the information to the roachprod library.

Epic: none
Release note: None
@golgeek golgeek force-pushed the ludo/secure-clusters branch from fd6c483 to c48f303 Compare August 13, 2025 08:49
@golgeek
Copy link
Copy Markdown
Contributor Author

golgeek commented Aug 14, 2025

TFTRs!

bors r=herkolategan,srosenberg

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 14, 2025

@craig craig bot merged commit 54bcb71 into cockroachdb:master Aug 14, 2025
23 checks passed
golgeek added a commit to golgeek/cockroach that referenced this pull request Aug 19, 2025
As of cockroachdb#151230, roachprod's default behavior is to provision secure
clusters unless provisioning is in the cockroach-ephemeral GCP project.
To warn the user about the fact that the cluster is insecure in this
configuration, a log was added to stdout.

The addition of this log message broke some K/V performance scripts,
so this patch moves the warning to stderr instead.

Epic: none
Release note: None
craig bot pushed a commit that referenced this pull request Aug 20, 2025
151737: sql: add support for automatically repairing dangling comments r=fqazi a=fqazi

Previously, we added logic to detect dangling comments on descriptors but did not include a mechanism to clean them up. This could block initial upgrades due to dangling entries in the system.comments table. This patch adds support for automatically cleaning up these dangling comments.

Fixes: #151497

Release note (bug fix): Added an automatic repair for dangling or invalid entries in the system.comment table.

152079: roachprod: insecure cluster warn log r=herkolategan a=golgeek

As of #151230, roachprod's default behavior is to provision secure clusters unless provisioning is in the cockroach-ephemeral GCP project. To warn the user about the fact that the cluster is insecure in this configuration, a log was added to stdout.

The addition of this log message broke some K/V performance scripts, so this patch moves the warning to stderr instead.

Epic: none
Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Ludovic Leroux <ludo.leroux@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Aug 20, 2025
152079: roachprod: insecure cluster warn log r=herkolategan a=golgeek

As of #151230, roachprod's default behavior is to provision secure clusters unless provisioning is in the cockroach-ephemeral GCP project. To warn the user about the fact that the cluster is insecure in this configuration, a log was added to stdout.

The addition of this log message broke some K/V performance scripts, so this patch moves the warning to stderr instead.

Epic: none
Release note: None

152170: workload/schemachange: handle LTREE in mixed version CREATE TYPE r=rafiss a=rafiss

Using this type in a composite type in a mixed version cluster will cause UndefinedObject errors, so we need to make the workload aware of that.

informs #150547
Release note: None

Co-authored-by: Ludovic Leroux <ludo.leroux@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants