server, security: Token-based add/join TLS functionality #63492
server, security: Token-based add/join TLS functionality #63492craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
This PR also "takes over" #61265. @aaron-crl happy to listen to your feedback about anything! |
|
Thank you for picking this up @itsbilal ! I'll pick this up after breather week. |
845c1e5 to
7fffe92
Compare
1a26529 to
3d01f80
Compare
|
I've added a test and rebased / fixed up some issues. This is ready for a look! |
6df2b8a to
5e6327c
Compare
knz
left a comment
There was a problem hiding this comment.
Discussed separately:
Overall design seems copacetic. Good confidence about the approach.
-> We'll want to circle back on the RFC and align the RFC to the approaches taken in code.
(Who: knz with support from bilal & aaron)
Code review: knz primary, Ben secondary.
Code duplication from the security package: this is creating tech debt.
-> File an issue to reduce that duplication (bilal)
The code is now using both CertLocator and CertManager: this creating another layer of tech debt.
-> File an issue to coalesce them into one (and perhaps mention reuse what Darin built in the sub-package) (bilal)
Testing confidence: 3/5
- need to add a version number to the join token (bilal: best effort, otherwise file follow-up issue, link it to the meta-issue about integration)
- testing work is end-to-end; want some additional failure tests (bilal: this PR).
- some of the sub-components do not have unit tests. We may want to punt on this (maybe file a followup issue)
- want unit tests where the join token is consumed (bilal: this PR)
These aspects would raise test coverage from 3/5 to 4/5.
Reviewed 5 of 13 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl and @knz)
|
Thanks for that summary, @knz ! I've addressed the parts of testing that I mentioned I'd address in this PR, in the latest push. I'll start to file the follow-up issues soon. |
22bba4a to
9725eef
Compare
knz
left a comment
There was a problem hiding this comment.
Ok I went through a more in-depth review of this PR,
taking over Aaron's role in that regard.
I think the work here is indeed quite well polished already.
I have a few suggestions below in addition to the summary above. Mostly the error checks need to be enhanced for correctness.
A lot of the follow up work can be discretized in follow-up issues, so they are not blocking this PR from merging.
Reviewed 5 of 13 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 15 of 15 files at r5, 7 of 9 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, and @knz)
pkg/cli/addjoin.go, line 30 at r2 (raw file):
var nodeJoinCmd = &cobra.Command{ Use: "node-join <remote-addr>",
(1st commit) Could we make this cockroach node join, i.e. a subcommand to node, instead of cockroach node-join
Or alternatively cockroach connect join
pkg/cli/addjoin.go, line 31 at r3 (raw file):
Use: "node-join <remote-addr>", Short: "request the TLS certs for a new node from an existing node", Args: cobra.MinimumNArgs(1),
r2: this needs to be bumped to 2, as per usage of the args below.
Also update the Use string above to mention the token.
(I see you change this in r5. Do you plan to squash? If yes, you can dismiss the above; if no, we'll want each individual commit to work)
pkg/cli/addjoin.go, line 34 at r6 (raw file):
) const nodeJoinTimeout = 1 * time.Minute
This needs to be configurable IMHO. If not here, please file a followup issue and add a link here.
You can perhaps reuse cliCtx.cmdTimeout which is already configurable, and add a reference to the --timeout flags for the new join cmd in flags.go.
pkg/cli/addjoin_test.go, line 117 at r6 (raw file):
err = runNodeJoin(nil, []string{token}) require.Error(t, err)
can you assert a more specific error, so we get more confidence that the server is in fact rejecting the join token (and not failing for an unrelated reason) thanks
pkg/server/addjoin.go, line 67 at r2 (raw file):
ctx context.Context, req *serverpb.BundleRequest, ) (*serverpb.BundleResponse, error) { // Validate Add/Join token sharedSecret.
I would recommend a cluster setting to entirely disable these two API endpoints, in case we discover after the fact that they had a security issue.
If not here, please file an issue.
pkg/server/addjoin.go, line 57 at r6 (raw file):
if err != nil { return err } else if row == nil {
len(row) != 1?
pkg/server/addjoin_test.go, line 87 at r6 (raw file):
// Try consuming the token a second time. This should fail. _, err = adminClient.RequestCertBundle(ctx, &cbReq) require.Error(t, err)
ditto, please check the specific error is indeed an invalid token error.
|
(there's also a CI failure that's trying to tell you something) |
itsbilal
left a comment
There was a problem hiding this comment.
TFTR! I made my changes as an additional commit. Also helps that way as the most recent commit changes some CLI commands around (connect is now connect init, node join is now connect join). If it's good and tests pass, I'll squash everything then re-push it here. Will also add a release note for the CLI changes.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl and @knz)
pkg/server/addjoin.go, line 67 at r2 (raw file):
Previously, knz (kena) wrote…
I would recommend a cluster setting to entirely disable these two API endpoints, in case we discover after the fact that they had a security issue.
If not here, please file an issue.
Done. I'm just piggy-backing on the feature flag at the moment.
pkg/server/addjoin.go, line 57 at r6 (raw file):
Previously, knz (kena) wrote…
len(row) != 1?
Done.
pkg/server/addjoin_test.go, line 87 at r6 (raw file):
Previously, knz (kena) wrote…
ditto, please check the specific error is indeed an invalid token error.
The next line does this!
pkg/cli/addjoin.go, line 30 at r2 (raw file):
Previously, knz (kena) wrote…
(1st commit) Could we make this
cockroach node join, i.e. a subcommand tonode, instead ofcockroach node-join
Or alternativelycockroach connect join
Done. I also moved cockroach connect to cockroach connect init.
pkg/cli/addjoin.go, line 31 at r3 (raw file):
Previously, knz (kena) wrote…
r2: this needs to be bumped to 2, as per usage of the args below.
Also update the
Usestring above to mention the token.(I see you change this in r5. Do you plan to squash? If yes, you can dismiss the above; if no, we'll want each individual commit to work)
Yes, I'll squash them all.
pkg/cli/addjoin.go, line 34 at r6 (raw file):
Previously, knz (kena) wrote…
This needs to be configurable IMHO. If not here, please file a followup issue and add a link here.
You can perhaps reuse
cliCtx.cmdTimeoutwhich is already configurable, and add a reference to the--timeoutflags for the new join cmd in flags.go.
Sure, will file a follow-up issue. Looks like cockroach connect also has the same issue.
pkg/cli/addjoin_test.go, line 117 at r6 (raw file):
Previously, knz (kena) wrote…
can you assert a more specific error, so we get more confidence that the server is in fact rejecting the join token (and not failing for an unrelated reason) thanks
Done.
knz
left a comment
There was a problem hiding this comment.
modulo squash and release note(s)
when you file the followup issues, can you ping this PR from each issue's description, and also link these issues to the meta-issue #60632
Reviewed 8 of 15 files at r10, 4 of 9 files at r11, 9 of 9 files at r12.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aaron-crl and @itsbilal)
pkg/server/addjoin.go, line 40 at r12 (raw file):
settings := s.server.ClusterSettings() if settings == nil { return nil, errors.New("could not look up cluster settings")
errors.AssertionFailedf I think?
pkg/server/addjoin.go, line 96 at r12 (raw file):
settings := s.server.ClusterSettings() if settings == nil { return nil, errors.New("could not look up cluster settings")
ditto
itsbilal
left a comment
There was a problem hiding this comment.
TFTR! Finished the squash and added release note.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aaron-crl and @knz)
pkg/server/addjoin.go, line 40 at r12 (raw file):
Previously, knz (kena) wrote…
errors.AssertionFailedfI think?
Done.
pkg/server/addjoin.go, line 96 at r12 (raw file):
Previously, knz (kena) wrote…
ditto
Done.
This change adds a new CLI command, `connect join`, that lets a new node retrieve CA certificates off of an existing secure cluster and bootstrap its own TLS certificates for joining that cluster. This is achieved through the consumption of a join token stored in the join_tokens table. Join tokens can be created using the `create_join_tokens()` sql builtin function, added in a previous change. The previous `connect` command has now been renamed to `connect init`. A new set of GRPC endpoints have been created to handle the server side of this change; to support retrieval of CAs as well as entire initialization cert bundles. The cert bundles with CAs and private keys aren't sent over the wire until a node join token has been verified and consumed. The receiving node (the one running `connect join`) then stores them in its SSLCertsDir and bootstraps its own certificates off of it. This feature is hidden behind a feature flag. A couple other changes in this commit to clean up related code in the `security` package: - Capitalize CA consistently in protobuf structs and method names - Write all CAs/certs from the right places in InitializeFromConfig instead of writing the InterNode one everywhere - Pass around raw byte certificates in auto_tls_init.go instead of doing excess PEM encodings and decodings. - Create a `client.node.crt` signed by `ca-client.crt`, otherwise nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624. Release note (cli change): Rename `connect` to `connect init`, and add `connect join` command to retrieve certificates from an existing secure cluster and setup a new node to connect with it. Co-authored-by: Aaron Blum <aaron@cockroachlabs.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
|
bors r=knz |
|
Build succeeded: |
|
Follow-up issues:
|
Informs #60632.
Epic: CRDB-6663
This change adds a new CLI command,
connect join, that lets a newnode retrieve CA certificates off of an existing secure cluster
and bootstrap its own TLS certificates for joining that cluster.
This is achieved through the consumption of a join token
stored in the join_tokens table. Join tokens can be created
using the
create_join_tokens()sql builtin function, addedin a previous change.
The previous
connectcommand has now been renamed toconnect init.A new set of GRPC endpoints have been created to handle the
server side of this change; to support retrieval of CAs as well
as entire initialization cert bundles. The cert bundles with
CAs and private keys aren't sent over the wire until a
node join token has been verified and consumed. The receiving
node (the one running
connect join) then stores them in itsSSLCertsDir and bootstraps its own certificates off of it.
This feature is hidden behind a feature flag.
A couple other changes in this commit to clean up related code
in the
securitypackage:instead of writing the InterNode one everywhere
doing excess PEM encodings and decodings.
client.node.crtsigned byca-client.crt, otherwisenodes wouldn't be able to connect to each other. Fixes cli: TLS certs generated via
connectcause one-way connectivity problem #61624.Release note (cli change): Rename
connecttoconnect init,and add
connect joincommand to retrieve certificates froman existing secure cluster and setup a new node to connect with
it.
Co-authored-by: Aaron Blum aaron@cockroachlabs.com
Co-authored-by: Bilal Akhtar bilal@cockroachlabs.com