Add certificate rotation functionality to ValKey#1870
Conversation
a27f0a6 to
9a52a30
Compare
Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
9a52a30 to
87444ea
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #1870 +/- ##
============================================
- Coverage 71.06% 70.82% -0.25%
============================================
Files 123 123
Lines 65671 65686 +15
============================================
- Hits 46669 46521 -148
- Misses 19002 19165 +163
🚀 New features to boost your workflow:
|
Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
6656b09 to
05adf5c
Compare
|
@madolson @zuiderkwast @uriyage - would be good to get your feedback on this pull request. |
|
IMO,
|
Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
|
Thanks @sungming2 for the comments. Here are my initial thoughts:
|
|
Thanks for the follow up.
I don’t think e.g., |
|
Just some thoughts on @sungming2's comments:
I think there is no harm in reloading the certificate even if there is no change.
It is possible to configure an already-expired or not-yet-valid certificate using configuration file and CONFIG SET so I think the automatic reloading should behave the same. Should we refuse to load expired certificates? This means CONFIG SET should fail if we try to configure an expired certificate. What about starting with a configuration file with an expired certificate, should the server refuse to start in this case? It would be a behavior change... Maybe we can do it as a separate PR?
We can do it as a future improvement. Inotify is not portable so we still need some fallback to periodic checks. Is every second too often to check, then maybe check less often? 10 sec, 60 sec?
When changing the certificate chains and CA certificates, users need to have an overlap during which both old and new certificates are valid. It's possible to combine multiple CA certificates in one file if I'm not mistaken. This too is no different to updating the certificates manually using CONFIG SET. |
Yes I think we can do. I don't see strong reason to load expired certificate |
I don't disagree, but just pointing out that that issue exists regardless of this change. We can however change the behavior of all ways you can configure the certificate to have that check and not club it with this change. In that change, we will have to answer questions around the behavior of server startup or config set commands as pointed out by @zuiderkwast
Yeah, there has to be significant amount of overlap in cert expiry. If it helps, we can put this information on the release notes of a wiki as well. |
|
So let's defer the cert expiry validation to another PR. It's somewhat orthogonal to this change.
We have documentation: https://valkey.io/topics/encryption/ . This page could need some improvements. It's markdown files in a git repo so it's easy to update it. |
Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
zuiderkwast
left a comment
There was a problem hiding this comment.
Good work. The reloading code looks good. I just have some comments on the tests structure and the config name.
| echo "Generating necessary certificates for TLS rotation testing." | ||
| rm -rf tests/tls tests/tls_1 tests/tls_2 | ||
|
|
||
| utils/gen-test-certs.sh | ||
| mv tests/tls tests/tls_1 | ||
| utils/gen-test-certs.sh | ||
| mv tests/tls tests/tls_2 | ||
| utils/gen-test-certs.sh |
There was a problem hiding this comment.
Generating the certificates is currently a separate step from running the tests. I think we can keep it that way.
We can modify the utils/gen-test-certs.sh script to generate all the certificates we need for the tests including these new tests.
I don't think we should add this new runtest script on the top level to run only this one test suite. We can put these rotation tests in tests/unit/tls.tcl instead.
| proc valkey_client_tls {args} { | ||
| set level 0 | ||
| if {[llength $args] > 0 && [string is integer [lindex $args 0]]} { | ||
| set level [lindex $args 0] | ||
| set args [lrange $args 1 end] | ||
| } | ||
|
|
||
| set tlsoptions "" | ||
| if {[llength $args] > 0 && ![string is integer [lindex $args 0]]} { | ||
| set tlsoptions [lrange $args 0 end] | ||
| } | ||
|
|
||
| # create client that takes in custom tls options | ||
| set client [redis [srv $level "host"] [srv $level "port"] 0 $::tls $tlsoptions] | ||
|
|
||
| # # select the right db and read the response (OK) | ||
| $client select 9 | ||
| return $client | ||
| } |
There was a problem hiding this comment.
This function is a copy of valkey_client with the addition that it takes optional tls options and adds then to the arguments used for creating a client?
Can we avoid the duplicated code by some refactoring?
I think we can pass tlsoptions as a single optional argument to valkey_client.
Duplicated code is a source of errors. I can already see some mistakes in this copy of the function: This code tries to call a function called redis, which should be valkey. The call $client select 9 should not be done if $::singledb is set.
There was a problem hiding this comment.
I think we can put these tests in tests/unit/tls.tcl, alternatively in a new file but it should be under tests/unit/ or tests/integration/. I don't see a need for a new directory just for this test suite.
| # Allow the server to monitor the filesystem and rotate out TLS certificates if | ||
| # they change on disk, defaults to no. | ||
| # | ||
| # tls-rotation no |
There was a problem hiding this comment.
I would say reload instead of rotate. It can be used for rotation but it doesn't do the rotation for you. The server only cares about changed files and reloads them.
How about tls-auto-reload-certs?
There was a problem hiding this comment.
Good call - tls-auto-reload-certs sounds as this more closely matches the functionality
|
|
||
| test {TLS: Clients with outdated credentials cannot connect} { | ||
| catch {set r2 [valkey_client_tls -keyfile "$clientdir1/client.key" -certfile "$clientdir1/client.crt" -require 1 -cafile "$clientdir1/ca.crt"]} e | ||
| assert_no_match {*::redis::redisHandle*} $e |
There was a problem hiding this comment.
redis? It needs to be valkey?
|
|
||
| test {TLS: Check that client with old credentials won't connect} { | ||
| catch {set r2 [valkey_client_tls -keyfile "$clientdir1/client.key" -certfile "$clientdir1/client.crt" -require 1 -cafile "$clientdir2/ca.crt"]} e | ||
| assert_no_match {*::redis::redisHandle*} $e |
| /* Reload the TLS cert if necessary. This effectively rotates the | ||
| * cert if a change has been made on disk, but the ValKey server hasn't | ||
| * been notified. */ | ||
| run_with_period(1000) { | ||
| tlsReload(); | ||
| } | ||
|
|
There was a problem hiding this comment.
We currently require users to explicitly trigger the certificate rotation, why do we think the right approach is to have a cron just that is running and checking periodically? Is this possible to do with a linux job or something?
There was a problem hiding this comment.
Checking periodically is not perfect. A better solution would be to use inotify to watch for file system events, but it's not portable so we'll anyway need some fallback to periodic checking for other OSes.
I guess you could already call valkey-cli CONFIG SET from a system cron job to set one of the TLS configs such as tls-port to the value it already has. This triggers a reload of certs. But you need to know one of the configs, or you need to do CONFIG GET first. There is no command to only reload them.
Some programs reload their configs when you send a SIGHUP to it. Could that be something to consider?
There was a problem hiding this comment.
But you need to know one of the configs, or you need to do CONFIG GET first. There is no command to only reload them.
Yeah, seems like doing CONFIG GET first is a pretty simple solution then you can orchestrate it your own way. You also get explicit error handling since the response back will be synchronous.
There was a problem hiding this comment.
I guess you could already call valkey-cli CONFIG SET from a system cron job to set one of the TLS configs such as tls-port to the value it already has
Also note that refreshing TLS configs can be expensive. Not only is it file system IO on the main thread, but it drops the SSL context which means that the SSL session cache is dropped, meaning any clients using session resumption will feel the impact with more expensive connection establishment.
When writing a sidecar process to do automated rotation, I can imagine many might start with a simple design like:
- Publish certificates to some place for downloading by the servers
- Some cron on the server downloads the certificates every few minutes
- The cron also executes CONFIG SET after redownloading (since it should be idempotent)
Maybe we can verify if the checksum of the file has changed in tlsConfigure and short-circuit if not. I think it would help in either case (whether automatically triggered by certificate refresh in serverCron, or manually triggered by CONFIG SET) to avoid accidental drop of the session cache.
There was a problem hiding this comment.
Fwiw we could do the TLS config refresh in a background thread, to limit the impact on the main thread.
|
Replaced with #3020 |
This PR adds the capability of certificate rotation to ValKey by reloading the certificates from disk if it changes. The feature is disabled out of box and is only enabled if
tls-rotationconfig is true.