Skip to content

server, security: Token-based add/join TLS functionality #63492

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
itsbilal:aaron-add-join
May 6, 2021
Merged

server, security: Token-based add/join TLS functionality #63492
craig[bot] merged 1 commit intocockroachdb:masterfrom
itsbilal:aaron-add-join

Conversation

@itsbilal
Copy link
Copy Markdown
Contributor

@itsbilal itsbilal commented Apr 12, 2021

Informs #60632.
Epic: CRDB-6663

This change adds a new CLI command, connect join, that lets a new
node retrieve CA certificates off of an existing secure cluster
and bootstrap its own TLS certificates for joining that cluster.
This is achieved through the consumption of a join token
stored in the join_tokens table. Join tokens can be created
using the create_join_tokens() sql builtin function, added
in a previous change.

The previous connect command has now been renamed to
connect init.

A new set of GRPC endpoints have been created to handle the
server side of this change; to support retrieval of CAs as well
as entire initialization cert bundles. The cert bundles with
CAs and private keys aren't sent over the wire until a
node join token has been verified and consumed. The receiving
node (the one running connect join) then stores them in its
SSLCertsDir and bootstraps its own certificates off of it.

This feature is hidden behind a feature flag.

A couple other changes in this commit to clean up related code
in the security package:

  • Capitalize CA consistently in protobuf structs and method names
  • Write all CAs/certs from the right places in InitializeFromConfig
    instead of writing the InterNode one everywhere
  • Pass around raw byte certificates in auto_tls_init.go instead of
    doing excess PEM encodings and decodings.
  • Create a client.node.crt signed by ca-client.crt, otherwise
    nodes wouldn't be able to connect to each other. Fixes cli: TLS certs generated via connect cause one-way connectivity problem #61624.

Release note (cli change): Rename connect to connect init,
and add connect join command to retrieve certificates from
an existing secure cluster and setup a new node to connect with
it.

Co-authored-by: Aaron Blum aaron@cockroachlabs.com
Co-authored-by: Bilal Akhtar bilal@cockroachlabs.com

@itsbilal itsbilal requested review from aaron-crl and knz April 12, 2021 20:53
@itsbilal itsbilal requested a review from a team as a code owner April 12, 2021 20:53
@itsbilal itsbilal self-assigned this Apr 12, 2021
@itsbilal itsbilal requested a review from a team as a code owner April 12, 2021 20:53
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@itsbilal
Copy link
Copy Markdown
Contributor Author

This PR also "takes over" #61265. @aaron-crl happy to listen to your feedback about anything!

@aaron-crl
Copy link
Copy Markdown

Thank you for picking this up @itsbilal ! I'll pick this up after breather week.

@itsbilal
Copy link
Copy Markdown
Contributor Author

I've added a test and rebased / fixed up some issues. This is ready for a look!

@itsbilal itsbilal force-pushed the aaron-add-join branch 2 times, most recently from 6df2b8a to 5e6327c Compare April 22, 2021 18:51
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed separately:

Overall design seems copacetic. Good confidence about the approach.
-> We'll want to circle back on the RFC and align the RFC to the approaches taken in code.
(Who: knz with support from bilal & aaron)

Code review: knz primary, Ben secondary.

Code duplication from the security package: this is creating tech debt.
-> File an issue to reduce that duplication (bilal)

The code is now using both CertLocator and CertManager: this creating another layer of tech debt.
-> File an issue to coalesce them into one (and perhaps mention reuse what Darin built in the sub-package) (bilal)

Testing confidence: 3/5

  • need to add a version number to the join token (bilal: best effort, otherwise file follow-up issue, link it to the meta-issue about integration)
  • testing work is end-to-end; want some additional failure tests (bilal: this PR).
  • some of the sub-components do not have unit tests. We may want to punt on this (maybe file a followup issue)
  • want unit tests where the join token is consumed (bilal: this PR)
    These aspects would raise test coverage from 3/5 to 4/5.

Reviewed 5 of 13 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl and @knz)

@itsbilal
Copy link
Copy Markdown
Contributor Author

itsbilal commented May 4, 2021

Thanks for that summary, @knz ! I've addressed the parts of testing that I mentioned I'd address in this PR, in the latest push. I'll start to file the follow-up issues soon.

@itsbilal itsbilal force-pushed the aaron-add-join branch 2 times, most recently from 22bba4a to 9725eef Compare May 5, 2021 19:59
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I went through a more in-depth review of this PR,
taking over Aaron's role in that regard.

I think the work here is indeed quite well polished already.
I have a few suggestions below in addition to the summary above. Mostly the error checks need to be enhanced for correctness.

A lot of the follow up work can be discretized in follow-up issues, so they are not blocking this PR from merging.

Reviewed 5 of 13 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 15 of 15 files at r5, 7 of 9 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl, @itsbilal, and @knz)


pkg/cli/addjoin.go, line 30 at r2 (raw file):

var nodeJoinCmd = &cobra.Command{
	Use:   "node-join <remote-addr>",

(1st commit) Could we make this cockroach node join, i.e. a subcommand to node, instead of cockroach node-join
Or alternatively cockroach connect join


pkg/cli/addjoin.go, line 31 at r3 (raw file):

	Use:   "node-join <remote-addr>",
	Short: "request the TLS certs for a new node from an existing node",
	Args:  cobra.MinimumNArgs(1),

r2: this needs to be bumped to 2, as per usage of the args below.

Also update the Use string above to mention the token.

(I see you change this in r5. Do you plan to squash? If yes, you can dismiss the above; if no, we'll want each individual commit to work)


pkg/cli/addjoin.go, line 34 at r6 (raw file):

)

const nodeJoinTimeout = 1 * time.Minute

This needs to be configurable IMHO. If not here, please file a followup issue and add a link here.

You can perhaps reuse cliCtx.cmdTimeout which is already configurable, and add a reference to the --timeout flags for the new join cmd in flags.go.


pkg/cli/addjoin_test.go, line 117 at r6 (raw file):

	err = runNodeJoin(nil, []string{token})
	require.Error(t, err)

can you assert a more specific error, so we get more confidence that the server is in fact rejecting the join token (and not failing for an unrelated reason) thanks


pkg/server/addjoin.go, line 67 at r2 (raw file):

	ctx context.Context, req *serverpb.BundleRequest,
) (*serverpb.BundleResponse, error) {
	// Validate Add/Join token sharedSecret.

I would recommend a cluster setting to entirely disable these two API endpoints, in case we discover after the fact that they had a security issue.
If not here, please file an issue.


pkg/server/addjoin.go, line 57 at r6 (raw file):

		if err != nil {
			return err
		} else if row == nil {

len(row) != 1?


pkg/server/addjoin_test.go, line 87 at r6 (raw file):

	// Try consuming the token a second time. This should fail.
	_, err = adminClient.RequestCertBundle(ctx, &cbReq)
	require.Error(t, err)

ditto, please check the specific error is indeed an invalid token error.

@knz
Copy link
Copy Markdown
Contributor

knz commented May 6, 2021

(there's also a CI failure that's trying to tell you something)

Copy link
Copy Markdown
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! I made my changes as an additional commit. Also helps that way as the most recent commit changes some CLI commands around (connect is now connect init, node join is now connect join). If it's good and tests pass, I'll squash everything then re-push it here. Will also add a release note for the CLI changes.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aaron-crl and @knz)


pkg/server/addjoin.go, line 67 at r2 (raw file):

Previously, knz (kena) wrote…

I would recommend a cluster setting to entirely disable these two API endpoints, in case we discover after the fact that they had a security issue.
If not here, please file an issue.

Done. I'm just piggy-backing on the feature flag at the moment.


pkg/server/addjoin.go, line 57 at r6 (raw file):

Previously, knz (kena) wrote…

len(row) != 1?

Done.


pkg/server/addjoin_test.go, line 87 at r6 (raw file):

Previously, knz (kena) wrote…

ditto, please check the specific error is indeed an invalid token error.

The next line does this!


pkg/cli/addjoin.go, line 30 at r2 (raw file):

Previously, knz (kena) wrote…

(1st commit) Could we make this cockroach node join, i.e. a subcommand to node, instead of cockroach node-join
Or alternatively cockroach connect join

Done. I also moved cockroach connect to cockroach connect init.


pkg/cli/addjoin.go, line 31 at r3 (raw file):

Previously, knz (kena) wrote…

r2: this needs to be bumped to 2, as per usage of the args below.

Also update the Use string above to mention the token.

(I see you change this in r5. Do you plan to squash? If yes, you can dismiss the above; if no, we'll want each individual commit to work)

Yes, I'll squash them all.


pkg/cli/addjoin.go, line 34 at r6 (raw file):

Previously, knz (kena) wrote…

This needs to be configurable IMHO. If not here, please file a followup issue and add a link here.

You can perhaps reuse cliCtx.cmdTimeout which is already configurable, and add a reference to the --timeout flags for the new join cmd in flags.go.

Sure, will file a follow-up issue. Looks like cockroach connect also has the same issue.


pkg/cli/addjoin_test.go, line 117 at r6 (raw file):

Previously, knz (kena) wrote…

can you assert a more specific error, so we get more confidence that the server is in fact rejecting the join token (and not failing for an unrelated reason) thanks

Done.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo squash and release note(s)

when you file the followup issues, can you ping this PR from each issue's description, and also link these issues to the meta-issue #60632

Reviewed 8 of 15 files at r10, 4 of 9 files at r11, 9 of 9 files at r12.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aaron-crl and @itsbilal)


pkg/server/addjoin.go, line 40 at r12 (raw file):

	settings := s.server.ClusterSettings()
	if settings == nil {
		return nil, errors.New("could not look up cluster settings")

errors.AssertionFailedf I think?


pkg/server/addjoin.go, line 96 at r12 (raw file):

	settings := s.server.ClusterSettings()
	if settings == nil {
		return nil, errors.New("could not look up cluster settings")

ditto

Copy link
Copy Markdown
Contributor Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! Finished the squash and added release note.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aaron-crl and @knz)


pkg/server/addjoin.go, line 40 at r12 (raw file):

Previously, knz (kena) wrote…

errors.AssertionFailedf I think?

Done.


pkg/server/addjoin.go, line 96 at r12 (raw file):

Previously, knz (kena) wrote…

ditto

Done.

This change adds a new CLI command, `connect join`, that lets a new
node retrieve CA certificates off of an existing secure cluster
and bootstrap its own TLS certificates for joining that cluster.
This is achieved through the consumption of a join token
stored in the join_tokens table. Join tokens can be created
using the `create_join_tokens()` sql builtin function, added
in a previous change.

The previous `connect` command has now been renamed to
`connect init`.

A new set of GRPC endpoints have been created to handle the
server side of this change; to support retrieval of CAs as well
as entire initialization cert bundles. The cert bundles with
CAs and private keys aren't sent over the wire until a
node join token has been verified and consumed. The receiving
node (the one running `connect join`) then stores them in its
SSLCertsDir and bootstraps its own certificates off of it.

This feature is hidden behind a feature flag.

A couple other changes in this commit to clean up related code
in the `security` package:
 - Capitalize CA consistently in protobuf structs and method names
 - Write all CAs/certs from the right places in InitializeFromConfig
   instead of writing the InterNode one everywhere
 - Pass around raw byte certificates in auto_tls_init.go instead of
   doing excess PEM encodings and decodings.
 - Create a `client.node.crt` signed by `ca-client.crt`, otherwise
   nodes wouldn't be able to connect to each other. Fixes cockroachdb#61624.

Release note (cli change): Rename `connect` to `connect init`,
and add `connect join` command to retrieve certificates from
an existing secure cluster and setup a new node to connect with
it.

Co-authored-by: Aaron Blum <aaron@cockroachlabs.com>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
@itsbilal
Copy link
Copy Markdown
Contributor Author

itsbilal commented May 6, 2021

bors r=knz

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 6, 2021

Build succeeded:

@craig craig bot merged commit 0512987 into cockroachdb:master May 6, 2021
@itsbilal
Copy link
Copy Markdown
Contributor Author

itsbilal commented May 7, 2021

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cli: TLS certs generated via connect cause one-way connectivity problem

4 participants