Skip to content

util: Automatically create TLS certificates#24141

Merged
ti-chi-bot merged 8 commits intopingcap:masterfrom
dveeden:automatic_tls
Aug 4, 2021
Merged

util: Automatically create TLS certificates#24141
ti-chi-bot merged 8 commits intopingcap:masterfrom
dveeden:automatic_tls

Conversation

@dveeden
Copy link
Contributor

@dveeden dveeden commented Apr 19, 2021

What problem does this PR solve?

Problem Summary:

Configuring TLS should be as easy as possible and having to use mysql_ssl_rsa_setup from MySQL server should not be needed for simple setups.

What is changed and how it works?

What's Changed:

If no ssl-cert or ssl-key are specified: Create a self signed
cert in the temp storage and use that.

This allows TLS to be used when no user created certificates are
available.

Especially for tiup playground and other simple cases this should be
sufficient.

Note that for caching_sha2_password support we will either need TLS
connections or RSA keypairs. This brings us a step closer in that
direction.

Related

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Run tidb-server without TLS configured and then connect with a modern MySQL client and use \s to see if TLS is used.

Release note

  • A TLS Certificate and Key are automatically generated when TiDB is started without a ssl-cert or ssl-key configured.

@ti-chi-bot ti-chi-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 19, 2021
@morgo morgo added the sig/sql-infra SIG: SQL Infra label Apr 19, 2021
@mjonss
Copy link
Contributor

mjonss commented Apr 19, 2021

/cc @mjonss

@ti-chi-bot ti-chi-bot requested a review from mjonss April 19, 2021 17:21
@morgo
Copy link
Contributor

morgo commented Apr 19, 2021

Getting a test failure:

[2021-04-19T16:57:36.331Z] ----------------------------------------------------------------------

[2021-04-19T16:57:36.331Z] FAIL: tidb_test.go:538: tidbTestSerialSuite.TestTLS

[2021-04-19T16:57:36.331Z] 

[2021-04-19T16:57:36.331Z] tidb_test.go:581:

[2021-04-19T16:57:36.331Z]     c.Assert(err, NotNil)

[2021-04-19T16:57:36.331Z] ... value = nil

[2021-04-19T16:57:36.331Z] 

[2021-04-19T16:57:36.331Z] OOPS: 100 passed, 1 FAILED

[2021-04-19T16:57:36.331Z] --- FAIL: TestT (337.02s)

[2021-04-19T16:57:36.331Z] FAIL

Copy link
Contributor

@mjonss mjonss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some small suggestions.

util/misc.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these 90 days be configurable? Or just documented that if you don't manage any TLS certificates, there will be a temporary self-signed cert created that is valid for 90 days and then you need to restart the TiDB node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Not sure what to do. For now I've added the validity of the cert to the logging, however zap.Duration doesn't format it in the best way possible...

We could:

  1. Make the certificates valid for 3 years and expect people to restart before they reach this. Should this be 1 year? 10 years?
  2. We could re-create the certs every 60 days or so and trigger the code behind ALTER INSTANCE RELOAD TLS
  3. We could put a note in the docs saying that the user has to restart every <90 days

This is related to pingcap/docs#4299

I think option 2 is probably best. What do you think?

Could/should we have a hook or so that we call when the certificates are about to expire? Then this allows integration with ACME and local certificate rotation things. But I also don't want to make this too complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tidb/executor/simple.go

Lines 1354 to 1372 in 59b99ce

func (e *SimpleExec) executeAlterInstance(s *ast.AlterInstanceStmt) error {
if s.ReloadTLS {
logutil.BgLogger().Info("execute reload tls", zap.Bool("NoRollbackOnError", s.NoRollbackOnError))
sm := e.ctx.GetSessionManager()
tlsCfg, err := util.LoadTLSCertificates(
variable.GetSysVar("ssl_ca").Value,
variable.GetSysVar("ssl_key").Value,
variable.GetSysVar("ssl_cert").Value,
)
if err != nil {
if !s.NoRollbackOnError || config.GetGlobalConfig().Security.RequireSecureTransport {
return err
}
logutil.BgLogger().Warn("reload TLS fail but keep working without TLS due to 'no rollback on error'")
}
sm.UpdateTLSConfig(tlsCfg)
}
return nil
}
is probably close to what we need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the code to have a goroutine create new certs every 30 days.

@dveeden dveeden force-pushed the automatic_tls branch 2 times, most recently from f546c24 to 3f22a54 Compare April 20, 2021 07:23
Copy link
Contributor

@mjonss mjonss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the update, I don't see how I can reply to the discussion about certificate life time, so I'm pasting it here again :(
@mjonss mjonss 15 hours ago Member
Should these 90 days be configurable? Or just documented that if you don't manage any TLS certificates, there will be a temporary self-signed cert created that is valid for 90 days and then you need to restart the TiDB node?

@dveeden dveeden 1 hour ago Author Member
Good question. Not sure what to do. For now I've added the validity of the cert to the logging, however zap.Duration doesn't format it in the best way possible...

We could:

Make the certificates valid for 3 years and expect people to restart before they reach this. Should this be 1 year? 10 years?
We could re-create the certs every 60 days or so and trigger the code behind ALTER INSTANCE RELOAD TLS
We could put a note in the docs saying that the user has to restart every <90 days
This is related to pingcap/docs#4299

I think option 2 is probably best. What do you think?

Could/should we have a hook or so that we call when the certificates are about to expire? Then this allows integration with ACME and local certificate rotation things. But I also don't want to make this too complicated.

Mattias>
I agree with option 2. What happens when it is reloaded, will it force current connections to be disconnected or can they just continue until they disconnect?

@dveeden dveeden force-pushed the automatic_tls branch 3 times, most recently from 504ae44 to ee849f6 Compare April 20, 2021 13:27
@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 20, 2021
Copy link
Contributor

@mjonss mjonss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

util/misc.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice solution!

Should we also randomise the reloading over a day, so not all the TiDB nodes reloads at the same time, but spreads it out a bit? (Only a suggestion, patch approved)

@ti-chi-bot
Copy link
Member

@mjonss: /lgtm is only allowed for the reviewers in list.

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@morgo
Copy link
Contributor

morgo commented Apr 20, 2021

/lgtm

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 20, 2021
Copy link
Contributor

@xhebox xhebox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

@dveeden dveeden marked this pull request as draft April 26, 2021 12:01
@ti-chi-bot ti-chi-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2021
@dveeden
Copy link
Contributor Author

dveeden commented Apr 29, 2021

I see some issues with the current state of this PR:

  • It rotates the certificate file, but doesn't refresh the certificate that is used
  • Running ALTER INSTANCE RELOAD TLS periodically works, but isn't user friendly
  • Can we run ALTER INSTANCE RELOAD TLS from the code?
  • What that SQL does is to create a new tls.Config. Not sure how to do that from the code.
  • It looks like Ssl_server_not_after etc are not implemented in TiDB. This makes it harder to verify this PR does the right thing. Also this makes using custom certs and monitoring them harder.
  • Looks like to implement Ssl_server_not_after etc access to tls.Config is needed to get the X.509 Cert etc. This doesn't seem to be available from the TLSConnectionState which is used for Ssl_cipher.

An OpenSSL one-liner to get the dates from the MySQL protocol. This can be used until Ssl_server_not_after is implemented.

$ openssl x509 -in <(openssl s_client -starttls mysql -connect 127.0.0.1:4000 2>/dev/null) -noout -dates
notBefore=Apr  1 14:23:11 2021 GMT
notAfter=Mar 30 14:23:11 2031 GMT

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 29, 2021
@zhouqiang-cl
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

@zhouqiang-cl: /merge is only allowed for the committers, you can assign this pull request to the committer in list by filling /assign @committer in the comment to help merge this pull request.

Details

In response to this:

/merge

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@zhouqiang-cl
Copy link
Contributor

/run-check_dev_2

1 similar comment
@zhouqiang-cl
Copy link
Contributor

/run-check_dev_2

@zhouqiang-cl
Copy link
Contributor

@dveeden Seems this pr make CI much slower #26717 /cc @tiancaiamao

@tiancaiamao
Copy link
Contributor

tiancaiamao commented Jul 29, 2021

After this change, some test case runs slower than before.

I filter the cases that run more than 1s, master branch:

PASS: http_handler_test.go:1419: HTTPHandlerTestSuite.TestDebugZip	1.172s
PASS: http_handler_test.go:686: HTTPHandlerTestSuite.TestTiFlashReplica	2.314s
PASS: statistics_handler_test.go:88: testDumpStatsSuite.TestDumpStatsAPI	1.304s
PASS: tidb_test.go:1400: tidbTestSuite.TestGracefulShutdown	2.749s

and this branch:

PASS: http_handler_test.go:1419: HTTPHandlerTestSuite.TestDebugZip	1.661s
PASS: http_handler_test.go:870: HTTPHandlerTestSuite.TestDecodeColumnValue	1.423s
PASS: http_handler_test.go:1505: HTTPHandlerTestSuite.TestFailpointHandler	4.317s
PASS: http_handler_test.go:934: HTTPHandlerTestSuite.TestGetIndexMVCC	1.103s
PASS: http_handler_test.go:671: HTTPHandlerTestSuite.TestGetMVCCNotFound	2.592s
PASS: http_handler_test.go:370: HTTPHandlerTestSuite.TestGetRegionByIDWithError	1.681s
PASS: http_handler_test.go:1019: HTTPHandlerTestSuite.TestGetSchema	1.172s
PASS: http_handler_test.go:1000: HTTPHandlerTestSuite.TestGetSettings	2.431s
PASS: http_handler_test.go:581: HTTPHandlerTestSuite.TestGetTableMVCC	3.046s
PASS: http_handler_test.go:1410: HTTPHandlerTestSuite.TestHotRegionInfo	1.617s
PASS: http_handler_test.go:346: HTTPHandlerTestSuite.TestListTableRanges	1.174s
PASS: http_handler_test.go:322: HTTPHandlerTestSuite.TestListTableRegions	1.295s
PASS: http_handler_test.go:284: HTTPHandlerTestSuite.TestRangesAPI	1.809s
PASS: http_handler_test.go:229: HTTPHandlerTestSuite.TestRegionsAPI	1.487s
PASS: http_handler_test.go:250: HTTPHandlerTestSuite.TestRegionsAPIForClusterIndex	1.405s
PASS: http_handler_test.go:446: HTTPHandlerTestSuite.TestRegionsFromMeta	1.004s
PASS: http_handler_test.go:1349: HTTPHandlerTestSuite.TestServerInfo	1.056s
PASS: http_handler_test.go:1527: HTTPHandlerTestSuite.TestTestHandler	2.206s
PASS: http_handler_test.go:686: HTTPHandlerTestSuite.TestTiFlashReplica	3.797s
PASS: statistics_handler_test.go:88: testDumpStatsSuite.TestDumpStatsAPI	2.584s
PASS: tidb_test.go:1391: tidbTestSuite.TestGracefulShutdown	3.215s
PASS: tidb_test.go:612: tidbTestSuite.TestOnlySocket	1.116s
PASS: tidb_test.go:425: tidbTestSuite.TestSocket	1.687s
PASS: tidb_test.go:454: tidbTestSuite.TestSocketAndIp	1.222s
PASS: tidb_test.go:398: tidbTestSuite.TestSocketForwarding	1.333s
PASS: tidb_test.go:278: tidbTestSuite.TestStatusAPIWithTLS	3.056s
PASS: tidb_test.go:320: tidbTestSuite.TestStatusAPIWithTLSCNCheck	2.235s
PASS: tidb_test.go:259: tidbTestSuite.TestStatusPort	1.493s
PASS: conn_test.go:374: ConnTestSuite.TestDispatchClientProtocol41	1.452s

You can try this step manually to verify that:

make failpoint-enable
cd server
go test 1>a.log 2>b.log
grep '^PASS:' a.log

And we are preventing the CI to be slower #26575

@dveeden

@zhouqiang-cl
Copy link
Contributor

/merge cancel

@ti-chi-bot
Copy link
Member

@zhouqiang-cl: /merge cancel is only allowed for the PR author and the committers in list.

Details

In response to this:

/merge cancel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@zhouqiang-cl
Copy link
Contributor

/merge cancel

@dveeden
Copy link
Contributor Author

dveeden commented Aug 2, 2021

@tiancaiamao Thanks for the info about the CI testing. The issue is that creating the certificates takes a bit of time and if we include that in many of the tests then that adds up. f7c1a269fdcbf5cf30f09a14fad5251c93723226 should fix that.

This is what I get now:

$ awk '/^PASS:/ { t=substr($NF,1,length($NF)-1); if (t>1) { print $0 }}' stdout.log 
PASS: http_handler_test.go:379: HTTPHandlerTestSuite.TestBinlogRecover	2.641s
PASS: http_handler_test.go:1419: HTTPHandlerTestSuite.TestDebugZip	1.187s
PASS: http_handler_test.go:686: HTTPHandlerTestSuite.TestTiFlashReplica	2.527s
PASS: http_handler_test.go:1444: HTTPHandlerTestSuite.TestZipInfoForSQL	10.176s
PASS: statistics_handler_test.go:88: testDumpStatsSuite.TestDumpStatsAPI	1.388s
PASS: tidb_test.go:1391: tidbTestSuite.TestGracefulShutdown	2.730s
PASS: conn_test.go:587: ConnTestSuite.TestConnExecutionTimeout	12.404s
PASS: conn_test.go:756: ConnTestSuite.TestTiFlashFallback	269.574s
PASS: tidb_test.go:1500: tidbTestTopSQLSuite.TestTopSQLCPUProfile	4.436s

The list is the same as for the master branch

dveeden added 4 commits August 2, 2021 14:43
If no `ssl-cert` or `ssl-key` are specified: Create a self signed
cert in the temp storage and use that.

This allows TLS to be used when no user created certificates are
available.

Especially for `tiup playground` and other simple cases this should be
sufficient.

Note that for `caching_sha2_password` support we will either need TLS
connections or RSA keypairs. This brings us a step closer in that
direction.

The created certificate are valid for 90 days and new certificates are
created every 30 days.

See also:
- "Automatic SSL and RSA File Generation" on https://dev.mysql.com/doc/refman/8.0/en/creating-ssl-rsa-files-using-mysql.html
- https://docs.pingcap.com/tidb/stable/enable-tls-between-clients-and-servers
- pingcap#9411
- pingcap#18084
@tiancaiamao
Copy link
Contributor

/merge

The CI timeout limit is changed to 5s ...
and this time, it PR seems to work, anyway.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

DetailsCommit hash: dc842a5

@dveeden
Copy link
Contributor Author

dveeden commented Aug 4, 2021

/run-check_dev_2

@ti-chi-bot
Copy link
Member

@dveeden: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/config component/expression sig/execution SIG execution sig/sql-infra SIG: SQL Infra size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants