Skip to content

Add certificate rotation functionality to ValKey#1870

Closed
ovaiskhansc wants to merge 4 commits into
valkey-io:unstablefrom
ovaiskhansc:ok_tls_cert_rotation
Closed

Add certificate rotation functionality to ValKey#1870
ovaiskhansc wants to merge 4 commits into
valkey-io:unstablefrom
ovaiskhansc:ok_tls_cert_rotation

Conversation

@ovaiskhansc

Copy link
Copy Markdown

This PR adds the capability of certificate rotation to ValKey by reloading the certificates from disk if it changes. The feature is disabled out of box and is only enabled if tls-rotation config is true.

@ovaiskhansc ovaiskhansc force-pushed the ok_tls_cert_rotation branch from a27f0a6 to 9a52a30 Compare March 21, 2025 23:16
Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
@ovaiskhansc ovaiskhansc force-pushed the ok_tls_cert_rotation branch from 9a52a30 to 87444ea Compare March 21, 2025 23:33
@codecov

codecov Bot commented Mar 23, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.82%. Comparing base (d56a1f7) to head (749e39b).
⚠️ Report is 784 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1870      +/-   ##
============================================
- Coverage     71.06%   70.82%   -0.25%     
============================================
  Files           123      123              
  Lines         65671    65686      +15     
============================================
- Hits          46669    46521     -148     
- Misses        19002    19165     +163     
Files with missing lines Coverage Δ
src/config.c 78.39% <ø> (ø)
src/server.c 87.57% <100.00%> (+0.03%) ⬆️
src/server.h 100.00% <ø> (ø)
src/tls.c 100.00% <ø> (ø)

... and 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
@ovaiskhansc ovaiskhansc force-pushed the ok_tls_cert_rotation branch from 6656b09 to 05adf5c Compare March 24, 2025 17:53
@ovaiskhansc ovaiskhansc marked this pull request as ready for review March 24, 2025 17:54
@ovaiskhansc

Copy link
Copy Markdown
Author

@madolson @zuiderkwast @uriyage - would be good to get your feedback on this pull request.

@sungming2

Copy link
Copy Markdown
Contributor

IMO,

  1. Relying solely on the modification time of certificate files to detect changes can be unreliable, as mtime can be changed without actual content change. I think another approach to detect change is to compare the fingerprints of the certificates which could minimize unnecessary reloads.
  2. Before loading a new certificate, it might be necessary to validate the certificate to prevent potential security issues such as notBefore and notAfter fields to ensure that the certificate is currently valid and hasn't expired.
  3. Is it possible to use monitoring filesystem events like inotify instead of reloading the certificate every second?

Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>
@ovaiskhansc

Copy link
Copy Markdown
Author

Thanks @sungming2 for the comments. Here are my initial thoughts:

  1. We can surely do some extra validations to handle the cases where the last modified time changes without actual certificate changes, but it will also come with some overhead and how many times would the time just bump as opposed to the actual certificate changing as well
  2. My understanding is that the tlsConfigure method is already doing some validations and is used currently to do the same on server startup as well as on cert rotation via CONFIG SET command. Why do we want to add more validations elsewhere?
  3. I can look at the feasibility of doing that in a platform agnostic manner, any recommendations on the library that we can look at?

@sungming2

Copy link
Copy Markdown
Contributor

Thanks for the follow up.
Just to clarify, when I mentioned validating the certificate, I wasn’t referring only to expiration checks. I think we also need to consider how this change impacts cluster mode.
If certificate rotation simply means replacing the old certificate with renewed one signed by the same CA, then that's fine. But if we think about other scenarios like when a new certificate is issued by a different CA, it can break the cluster. Valkey frequently needs to re-establish connections between nodes due to timeouts, network issue, or node config changes. If some nodes have reloaded the new certificate which is no longer trusted by others, TLS handshakes will be failing.

My understanding is that the tlsConfigure method is already doing some validations

I don’t think tlsConfigure has explicit certificate checks. It can still load expired certificate without error (It probably needs to be fixed). It throws an error when the client/peer attempts to connect it since handshake fails

e.g.,

29132:M 25 Mar 2025 22:36:15.117 # Error accepting a client connection: error:14094415:SSL routines:ssl3_read_bytes:sslv3 alert certificate expired (addr= laddr=127.0.0.1:6379)

@zuiderkwast

Copy link
Copy Markdown
Contributor

Just some thoughts on @sungming2's comments:

  • Relying solely on the modification time of certificate files to detect changes can be unreliable, as mtime can be changed without actual content change. I think another approach to detect change is to compare the fingerprints of the certificates which could minimize unnecessary reloads.

I think there is no harm in reloading the certificate even if there is no change.

  • Before loading a new certificate, it might be necessary to validate the certificate to prevent potential security issues such as notBefore and notAfter fields to ensure that the certificate is currently valid and hasn't expired.

It is possible to configure an already-expired or not-yet-valid certificate using configuration file and CONFIG SET so I think the automatic reloading should behave the same.

Should we refuse to load expired certificates? This means CONFIG SET should fail if we try to configure an expired certificate. What about starting with a configuration file with an expired certificate, should the server refuse to start in this case? It would be a behavior change... Maybe we can do it as a separate PR?

  • Is it possible to use monitoring filesystem events like inotify instead of reloading the certificate every second?

We can do it as a future improvement. Inotify is not portable so we still need some fallback to periodic checks. Is every second too often to check, then maybe check less often? 10 sec, 60 sec?

Just to clarify, when I mentioned validating the certificate, I wasn’t referring only to expiration checks. I think we also need to consider how this change impacts cluster mode.
If certificate rotation simply means replacing the old certificate with renewed one signed by the same CA, then that's fine. But if we think about other scenarios like when a new certificate is issued by a different CA, it can break the cluster. Valkey frequently needs to re-establish connections between nodes due to timeouts, network issue, or node config changes. If some nodes have reloaded the new certificate which is no longer trusted by others, TLS handshakes will be failing.

When changing the certificate chains and CA certificates, users need to have an overlap during which both old and new certificates are valid. It's possible to combine multiple CA certificates in one file if I'm not mistaken.

This too is no different to updating the certificates manually using CONFIG SET.

@sungming2

Copy link
Copy Markdown
Contributor

Should we refuse to load expired certificates? This means CONFIG SET should fail if we try to configure an expired certificate. What about starting with a configuration file with an expired certificate, should the server refuse to start in this case? It would be a behavior change... Maybe we can do it as a separate PR?

Yes I think we can do. I don't see strong reason to load expired certificate

@ovaiskhansc

Copy link
Copy Markdown
Author

I don't see strong reason to load expired certificate

I don't disagree, but just pointing out that that issue exists regardless of this change. We can however change the behavior of all ways you can configure the certificate to have that check and not club it with this change. In that change, we will have to answer questions around the behavior of server startup or config set commands as pointed out by @zuiderkwast

When changing the certificate chains and CA certificates, users need to have an overlap during which both old and new certificates are valid. It's possible to combine multiple CA certificates in one file if I'm not mistaken.

Yeah, there has to be significant amount of overlap in cert expiry. If it helps, we can put this information on the release notes of a wiki as well.

@zuiderkwast

Copy link
Copy Markdown
Contributor

So let's defer the cert expiry validation to another PR. It's somewhat orthogonal to this change.

When changing the certificate chains and CA certificates, users need to have an overlap during which both old and new certificates are valid. It's possible to combine multiple CA certificates in one file if I'm not mistaken.

Yeah, there has to be significant amount of overlap in cert expiry. If it helps, we can put this information on the release notes of a wiki as well.

We have documentation: https://valkey.io/topics/encryption/ . This page could need some improvements. It's markdown files in a git repo so it's easy to update it.

Signed-off-by: Ovais Khan <ovais.khan@snapchat.com>

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work. The reloading code looks good. I just have some comments on the tests structure and the config name.

Comment thread runtest-rotation
Comment on lines +18 to +25
echo "Generating necessary certificates for TLS rotation testing."
rm -rf tests/tls tests/tls_1 tests/tls_2

utils/gen-test-certs.sh
mv tests/tls tests/tls_1
utils/gen-test-certs.sh
mv tests/tls tests/tls_2
utils/gen-test-certs.sh

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generating the certificates is currently a separate step from running the tests. I think we can keep it that way.

We can modify the utils/gen-test-certs.sh script to generate all the certificates we need for the tests including these new tests.

I don't think we should add this new runtest script on the top level to run only this one test suite. We can put these rotation tests in tests/unit/tls.tcl instead.

Comment thread tests/test_helper.tcl
Comment on lines +252 to +270
proc valkey_client_tls {args} {
set level 0
if {[llength $args] > 0 && [string is integer [lindex $args 0]]} {
set level [lindex $args 0]
set args [lrange $args 1 end]
}

set tlsoptions ""
if {[llength $args] > 0 && ![string is integer [lindex $args 0]]} {
set tlsoptions [lrange $args 0 end]
}

# create client that takes in custom tls options
set client [redis [srv $level "host"] [srv $level "port"] 0 $::tls $tlsoptions]

# # select the right db and read the response (OK)
$client select 9
return $client
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a copy of valkey_client with the addition that it takes optional tls options and adds then to the arguments used for creating a client?

Can we avoid the duplicated code by some refactoring?
I think we can pass tlsoptions as a single optional argument to valkey_client.

Duplicated code is a source of errors. I can already see some mistakes in this copy of the function: This code tries to call a function called redis, which should be valkey. The call $client select 9 should not be done if $::singledb is set.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can put these tests in tests/unit/tls.tcl, alternatively in a new file but it should be under tests/unit/ or tests/integration/. I don't see a need for a new directory just for this test suite.

Comment thread valkey.conf
Comment on lines +303 to +306
# Allow the server to monitor the filesystem and rotate out TLS certificates if
# they change on disk, defaults to no.
#
# tls-rotation no

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say reload instead of rotate. It can be used for rotation but it doesn't do the rotation for you. The server only cares about changed files and reloads them.

How about tls-auto-reload-certs?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call - tls-auto-reload-certs sounds as this more closely matches the functionality


test {TLS: Clients with outdated credentials cannot connect} {
catch {set r2 [valkey_client_tls -keyfile "$clientdir1/client.key" -certfile "$clientdir1/client.crt" -require 1 -cafile "$clientdir1/ca.crt"]} e
assert_no_match {*::redis::redisHandle*} $e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redis? It needs to be valkey?


test {TLS: Check that client with old credentials won't connect} {
catch {set r2 [valkey_client_tls -keyfile "$clientdir1/client.key" -certfile "$clientdir1/client.crt" -require 1 -cafile "$clientdir2/ca.crt"]} e
assert_no_match {*::redis::redisHandle*} $e

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redis?

Comment thread src/server.c
Comment on lines +1673 to +1679
/* Reload the TLS cert if necessary. This effectively rotates the
* cert if a change has been made on disk, but the ValKey server hasn't
* been notified. */
run_with_period(1000) {
tlsReload();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently require users to explicitly trigger the certificate rotation, why do we think the right approach is to have a cron just that is running and checking periodically? Is this possible to do with a linux job or something?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking periodically is not perfect. A better solution would be to use inotify to watch for file system events, but it's not portable so we'll anyway need some fallback to periodic checking for other OSes.

I guess you could already call valkey-cli CONFIG SET from a system cron job to set one of the TLS configs such as tls-port to the value it already has. This triggers a reload of certs. But you need to know one of the configs, or you need to do CONFIG GET first. There is no command to only reload them.

Some programs reload their configs when you send a SIGHUP to it. Could that be something to consider?

@madolson madolson Apr 15, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you need to know one of the configs, or you need to do CONFIG GET first. There is no command to only reload them.

Yeah, seems like doing CONFIG GET first is a pretty simple solution then you can orchestrate it your own way. You also get explicit error handling since the response back will be synchronous.

@murphyjacob4 murphyjacob4 Apr 15, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could already call valkey-cli CONFIG SET from a system cron job to set one of the TLS configs such as tls-port to the value it already has

Also note that refreshing TLS configs can be expensive. Not only is it file system IO on the main thread, but it drops the SSL context which means that the SSL session cache is dropped, meaning any clients using session resumption will feel the impact with more expensive connection establishment.

When writing a sidecar process to do automated rotation, I can imagine many might start with a simple design like:

  1. Publish certificates to some place for downloading by the servers
  2. Some cron on the server downloads the certificates every few minutes
  3. The cron also executes CONFIG SET after redownloading (since it should be idempotent)

Maybe we can verify if the checksum of the file has changed in tlsConfigure and short-circuit if not. I think it would help in either case (whether automatically triggered by certificate refresh in serverCron, or manually triggered by CONFIG SET) to avoid accidental drop of the session cache.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fwiw we could do the TLS config refresh in a background thread, to limit the impact on the main thread.

@madolson

Copy link
Copy Markdown
Member

Replaced with #3020

@madolson madolson closed this May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-pending Major decision pending by TSC team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants