feature: rotate encryption keys #194
ryanuber wants to merge 23 commits into hashicorp:master from ryanuber:f-rotate-key
@ryanuber Thanks for opening a PR so we can discuss this. I've thought about this problem for a while, and I think we do need to change Memberlist to support this safely. Basically, the system needs to support multiple keys in use at once to enable a smooth transition. Otherwise, as you said, messages will be lost or nodes will be dropped from the cluster.

I think the interface we need is something like …

So the primary internal changes are support for multiple keys that can all be used for decryption, an "encrypt" key that is used to encrypt outgoing messages, and a few special internal queries to manage these keys.

For nodes that have been disconnected for a long time, or new nodes joining the cluster, an operator has to manage key distribution outside of Serf. There is no way around this as far as I can tell. Accepting an older key is not an option, as it would open Serf up to a downgrade attack.

The other issue with both my scheme and the one in this PR is that neither offers perfect forward secrecy. If an attacker compromises any key, they effectively compromise all future keys. I'll think more about this problem, but I'm not sure there is an easy answer. |
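A minimal Go sketch of the multi-key idea described above: every installed key can decrypt, but only one designated key encrypts outgoing messages. The `Keyring` type and its `Install`/`Use` methods are hypothetical names chosen for illustration, not Memberlist's API, and AES-GCM is assumed here only to make the example self-contained.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// Keyring is a hypothetical type for illustration; it is not Memberlist's API.
// keys[0] is the "encrypt" key. Every key in the slice may be used to decrypt.
type Keyring struct {
	keys [][]byte
}

// Install adds a key for decryption. The first key installed also becomes
// the encrypt key; later installs do not change it.
func (k *Keyring) Install(key []byte) {
	k.keys = append(k.keys, key)
}

// Use promotes an already-installed key to be the encrypt key.
func (k *Keyring) Use(key []byte) error {
	for i := range k.keys {
		if string(k.keys[i]) == string(key) {
			k.keys[0], k.keys[i] = k.keys[i], k.keys[0]
			return nil
		}
	}
	return errors.New("key is not installed")
}

// newGCM builds an AES-GCM cipher; the key must be 16, 24, or 32 bytes.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Encrypt seals plaintext with the encrypt key and prepends the nonce.
func (k *Keyring) Encrypt(plaintext []byte) ([]byte, error) {
	if len(k.keys) == 0 {
		return nil, errors.New("keyring is empty")
	}
	gcm, err := newGCM(k.keys[0])
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// Decrypt tries every installed key and returns the first successful result.
func (k *Keyring) Decrypt(msg []byte) ([]byte, error) {
	for _, key := range k.keys {
		gcm, err := newGCM(key)
		if err != nil || len(msg) < gcm.NonceSize() {
			continue
		}
		nonce, ct := msg[:gcm.NonceSize()], msg[gcm.NonceSize():]
		if pt, err := gcm.Open(nil, nonce, ct, nil); err == nil {
			return pt, nil
		}
	}
	return nil, errors.New("no installed key could decrypt the message")
}

func main() {
	oldKey := []byte("0123456789abcdef")
	newKey := []byte("fedcba9876543210")

	sender := &Keyring{}
	sender.Install(oldKey)
	receiver := &Keyring{}
	receiver.Install(oldKey)

	// Step 1: install the new key everywhere; it decrypts but does not encrypt yet.
	sender.Install(newKey)
	receiver.Install(newKey)

	// Step 2: one node switches its encrypt key; the receiver still understands it.
	if err := sender.Use(newKey); err != nil {
		panic(err)
	}
	msg, err := sender.Encrypt([]byte("ping"))
	if err != nil {
		panic(err)
	}
	pt, err := receiver.Decrypt(msg)
	fmt.Println(string(pt), err)
}
```

Because installing and switching are separate steps, nodes that switch at slightly different moments can still read each other's messages, which is the smooth transition the comment asks for.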
|
I've tried searching the literature on this problem, and I don't see anything that describes a PFS system for n members without n-to-n communication and key generation. In Serf's case this would not be feasible, as a 1,000-node cluster would need on the order of 1MM rounds of Diffie-Hellman.

However, our security model allows us to assume that each node running Serf is not compromised, so I think there is an opportunity to use an OTP (http://en.wikipedia.org/wiki/One-time_pad) to secure the key exchange. If each node is provided with, say, 16K of OTP material in advance, then up to 1024 key rotations (at 16 bytes per key) can be done before a new pad needs to be distributed to all nodes.

I think it's possible to design the system to work both with and without the pad (all the mechanisms would be the same), but basically provide a giant warning that without it you do NOT get PFS. Due to the nature of OTP, the pad must also be distributed outside of Serf. |
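The two figures in this comment can be checked with a few lines of Go. This is only an arithmetic sketch: the function names are made up for illustration, "16K" is assumed to mean 16,384 bytes, and the key is assumed to be 16 bytes long. Counting each unordered pair of nodes once gives n(n-1)/2 exchanges, while the 1MM figure corresponds to n²; either way the cost grows quadratically with cluster size.

```go
package main

import "fmt"

// pairwiseExchanges returns the number of distinct node pairs, n*(n-1)/2.
func pairwiseExchanges(n int) int {
	return n * (n - 1) / 2
}

// rotationsPerPad returns how many keys of keyLen bytes fit in padLen bytes.
func rotationsPerPad(padLen, keyLen int) int {
	return padLen / keyLen
}

func main() {
	fmt.Println(pairwiseExchanges(1000))      // 499500 unordered pairs
	fmt.Println(1000 * 1000)                  // 1000000, the n^2 upper bound
	fmt.Println(rotationsPerPad(16*1024, 16)) // 1024 rotations per pad
}
```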
|
@armon thanks for the reply. I totally see your point with Diffie-Hellman. Bummer.

So if there was a single 16K OTP pad in use on a given cluster, I am thinking we would need some way of knowing how many times the key had been rotated. The reason being, if new nodes join, they would need to skip ahead to get the same key as the current nodes. Either that, or it would become a requirement that a new pad be distributed any time new nodes are added in order to achieve PFS.

As for rotating the pad once it is depleted, maybe Serf would just need to refuse key rotation if all available "pages" have been used. Automatically regenerating and distributing a new pad might be difficult, since the operator would need to be able to retrieve it to deploy new nodes without rotating the complete pad again.

This sounds pretty good though: the complexity and initial orchestration investment from the end user's perspective is essentially no different than it is today, with the added benefit that you get quite a few key rotations without having to distribute any private data. |
|
@ryanuber Yeah I think the |
|
So I've been thinking about this more, and even had a little time last night to do a basic OTP implementation in Go here. @armon I think I might be missing something that is clear to you, though.

Let's say we go the OTP route. OTP itself defines an encryption algorithm, providing simple methods to encrypt/decrypt, with its strength coming from truly single-use pages. So that is all well and good, but in terms of how we would leverage OTP in Serf, what are we going to encrypt using it? The …

I initially went down the path of just having a simple pre-shared blob of bytes and a pointer to seek around the pre-determined key lengths. If we were to do something like that, it would be super-easy to implement, and all the operator would need to provide to each node would be the byte blob and the pointer, but I am guessing that there are better ways of doing this. I'll keep digging and post again if I find anything promising. As always, thanks for your feedback. |
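The "pre-shared blob plus pointer" idea can be sketched roughly as follows. This is not the implementation linked above; the `Pad` type, its methods, and the 16-byte key length are assumptions made for illustration. The sketch also folds in two points raised earlier in the thread: refusing to rotate once the pad is used up, and letting a new node skip ahead to the cluster's current position.

```go
package main

import (
	"errors"
	"fmt"
)

// keyLen is the assumed key size in bytes (16 bytes, as for AES-128).
const keyLen = 16

// ErrPadExhausted is returned when every chunk of the pad has been used.
var ErrPadExhausted = errors.New("pad exhausted: refusing to rotate")

// Pad is a hypothetical pre-shared blob plus a pointer to the next unused chunk.
type Pad struct {
	blob []byte
	next int // index of the next unused keyLen-sized chunk
}

// Remaining reports how many unused chunks are left.
func (p *Pad) Remaining() int {
	return len(p.blob)/keyLen - p.next
}

// NextKey returns a copy of the next unused chunk and advances the pointer.
func (p *Pad) NextKey() ([]byte, error) {
	if p.Remaining() <= 0 {
		return nil, ErrPadExhausted
	}
	start := p.next * keyLen
	key := make([]byte, keyLen)
	copy(key, p.blob[start:start+keyLen])
	p.next++
	return key, nil
}

// SkipTo moves the pointer forward so a newly started node can catch up with
// the cluster. Moving backward is refused, since that would reuse a chunk.
func (p *Pad) SkipTo(index int) error {
	if index < p.next {
		return errors.New("refusing to move the pointer backward")
	}
	if index > len(p.blob)/keyLen {
		return ErrPadExhausted
	}
	p.next = index
	return nil
}

func main() {
	// All zeros here for brevity; real pad material would have to be random.
	pad := &Pad{blob: make([]byte, 16*1024)}
	fmt.Println("rotations available:", pad.Remaining())

	// A node joining after 1023 rotations skips ahead to the last chunk.
	if err := pad.SkipTo(1023); err != nil {
		panic(err)
	}
	if _, err := pad.NextKey(); err != nil {
		panic(err)
	}
	_, err := pad.NextKey()
	fmt.Println("next rotation:", err)
}
```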
|
@ryanuber Yeah, so the way I imagine it working is that there is an …

So when the …

If there is a Page/Offset provided, then the Value is the new key XOR'd with the OTP material at the correct offset. If there is no Page/Offset, the key is simply the Value (no PFS). The offset is just a counter, since we know each offset represents a 16-byte chunk (must match the key length).

Now there are some key details to this. For one, the snapshot file must be leveraged to remember the last used offset in each page, and we must refuse to install a key that re-uses an older offset. This is simple enough, as we just do something like "otp-offset: ". This is enough for all parties to agree on the key. If they have no matching page, the decrypt cannot happen. It also makes supporting multiple pages pretty simple.

My only concern is that the OTP work does not add much security. Basically we have 2 situations to worry about:
So I guess it's worth thinking about the complexity this adds. Do we realistically need to worry about a compromise of AES? Is a host compromise more likely (in which case, all this fancy crypto is obviated)? |
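For reference, the Page/Offset/Value mechanism described in this comment could look roughly like the Go sketch below. The `RotateMsg` and `Receiver` types are hypothetical (the original message definition is not shown in the thread), and the in-memory `lastUsed` map stands in for the offsets that the snapshot file would persist.

```go
package main

import (
	"errors"
	"fmt"
)

// keyLen is the chunk size in bytes; it must match the key length.
const keyLen = 16

// RotateMsg is a hypothetical message shape using the Page/Offset/Value terms above.
type RotateMsg struct {
	Page   string // empty means no OTP is in use (no PFS)
	Offset int    // counter of keyLen-sized chunks within the page
	Value  []byte // the new key, XOR'd with pad material when Page is set
}

// xorChunk XORs in with the pad chunk at offset. XOR is its own inverse,
// so the same function both wraps and unwraps a key.
func xorChunk(page []byte, offset int, in []byte) ([]byte, error) {
	start := offset * keyLen
	if offset < 0 || len(in) != keyLen || start+keyLen > len(page) {
		return nil, errors.New("offset or key length out of range")
	}
	out := make([]byte, keyLen)
	for i := range out {
		out[i] = in[i] ^ page[start+i]
	}
	return out, nil
}

// Receiver holds the pages a node knows and the last offset used in each.
type Receiver struct {
	pages    map[string][]byte
	lastUsed map[string]int
}

// Unwrap recovers the new key, refusing unknown pages and reused offsets.
func (r *Receiver) Unwrap(m RotateMsg) ([]byte, error) {
	if m.Page == "" {
		return m.Value, nil // no Page/Offset: the Value is the key itself
	}
	page, ok := r.pages[m.Page]
	if !ok {
		return nil, errors.New("no matching page")
	}
	if last, seen := r.lastUsed[m.Page]; seen && m.Offset <= last {
		return nil, errors.New("offset already used")
	}
	key, err := xorChunk(page, m.Offset, m.Value)
	if err != nil {
		return nil, err
	}
	r.lastUsed[m.Page] = m.Offset
	return key, nil
}

func main() {
	// Fixed bytes for the demo only; real pad material would have to be random.
	page := make([]byte, 2*keyLen)
	for i := range page {
		page[i] = byte(i * 7)
	}
	newKey := []byte("fedcba9876543210")

	wrapped, err := xorChunk(page, 0, newKey)
	if err != nil {
		panic(err)
	}
	msg := RotateMsg{Page: "page-1", Offset: 0, Value: wrapped}

	r := &Receiver{
		pages:    map[string][]byte{"page-1": page},
		lastUsed: map[string]int{},
	}
	key, err := r.Unwrap(msg)
	fmt.Println(string(key) == string(newKey), err)

	// Replaying the same offset is refused.
	_, err = r.Unwrap(msg)
	fmt.Println(err)
}
```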
|
@armon, I think you're on the right track with this. A break of AES is probably not where we should focus our efforts. So let's think fresh: if we assume that the cluster is not already compromised, then distributing new encryption keys over the network is not a concern, since they will be encrypted in transit. This seems pretty reasonable to me, at least for a "1.0" ability to rotate keys.

I adjusted my Serf code in another branch to give a few things a try, and the results are pretty good so far. The code adds:
Memberlist code (PR) is here

Here is an example session of what it would accomplish. Assume that a cluster is spun up with encryption enabled using the key …

I ran these commands on a local Serf cluster running 99 nodes and had 0 failed messages, so things are looking better! It would be up to the operator to ensure that installing the new key was successful before trying to run …

Also noteworthy is that a joining node must be started with the current cluster encryption key already in its keyring, so that it can understand the gossip messages as they come in. It will otherwise be rejected almost immediately, but this is by design.

I know you guys are busy, so no rush on getting back to me here. Let me know if you think this is directionally correct. |
|
@ryanuber This is awesome, and I think definitely the right direction for a first attempt. I looked over a lot of the branch and it looks good. Minor feedback items:
This is looking really good! Thanks! |
|
Closing in favor of #199 |
Took a stab at one approach for swapping out the encryption key. It seems to be working so far on clusters of 1-99 nodes running locally.
The main problem is deciding when to swap the key. In this implementation, a small number of messages are likely to be rejected, since not all members in the cluster swap keys at exactly the same instant. This hasn't caused failed member issues for me yet, but I can see how that could happen.
This pull is mainly to solicit some feedback and gather ideas on how best to support this functionality in Serf, which is why it has no tests/docs etc. yet.
It's probably also possible to handle key rotation using queries / events with handlers and SIGHUP'ing the agent, but I think this would face the same problems. I do think we need a good, integrated way of rotating the key, especially if Serf is to be used in production environments.
Another thought would be to allow an AltSecretKey or similar in Memberlist config, and fall back to using that to decrypt messages if SecretKey fails. This would give us a little leniency in getting a new key in place. We could call SwapKeys() or something like that on memberlist, which would allow us to start encrypting with the new key and fall back to decrypting with the old one. Just an idea.

Here is the high-level approach I took:
… serf rotate-key 8M1Sbtz6ElPdDHfDbzbNJg==
… internal_query to all members with the new key contained in the payload.
… serf.config.NewSecretKey, then sends an ack.
… rotate-key event to all members, indicating that they should now swap keys.
… rotate-key, and starts a new goroutine (see 7)
… serf.config.BroadcastTimeout so that Serf has a chance to rebroadcast successfully before the key is swapped. Once the sleep has expired, the serf.config.MemberlistConfig.SecretKey gets swapped out.
… rotate-key event containing the base64-encoded key as the payload, allowing the user to handle patching a config file.

Any and all thoughts/suggestions welcome!
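The stage-then-delayed-swap part of the approach above can be illustrated with a small Go sketch. It is not the code in this branch: the `Agent` type and its methods are hypothetical stand-ins, the two fields loosely mirror `serf.config.NewSecretKey` and `serf.config.MemberlistConfig.SecretKey`, and the delay stands in for `serf.config.BroadcastTimeout`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Agent is a hypothetical stand-in for the agent state touched by a rotation.
type Agent struct {
	mu           sync.Mutex
	secretKey    []byte // key currently in use
	newSecretKey []byte // staged key, received ahead of the swap
}

// Stage stores the new key without using it yet (the step that is acked).
func (a *Agent) Stage(key []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.newSecretKey = key
}

// SwapAfter starts a goroutine that sleeps for delay, giving rebroadcasts a
// chance to finish, and then swaps in the staged key. The returned channel
// is closed once the swap has been attempted.
func (a *Agent) SwapAfter(delay time.Duration) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		time.Sleep(delay)
		a.mu.Lock()
		defer a.mu.Unlock()
		if a.newSecretKey != nil {
			a.secretKey, a.newSecretKey = a.newSecretKey, nil
		}
	}()
	return done
}

// Key returns the key currently in use.
func (a *Agent) Key() []byte {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.secretKey
}

func main() {
	a := &Agent{secretKey: []byte("old-key")}
	a.Stage([]byte("new-key"))

	done := a.SwapAfter(50 * time.Millisecond) // stand-in for BroadcastTimeout
	fmt.Println("during the wait:", string(a.Key()))
	<-done
	fmt.Println("after the swap:", string(a.Key()))
}
```

The sketch also shows where the weakness described above comes from: each node runs its own timer, so for a short window some nodes are already on the new key while others are still on the old one.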