
feature: rotate encryption keys #194

Closed
ryanuber wants to merge 23 commits into hashicorp:master from ryanuber:f-rotate-key

Conversation

@ryanuber
Member

Took a stab at one approach for swapping out the encryption key. It seems to be working so far on locally run clusters of 1-99 nodes.

The main problem is deciding when to swap the key. In this implementation, a small number of messages are likely to be rejected, since cluster members don't all swap keys at exactly the same instant. This hasn't caused failed-member issues for me yet, but I can see how it could.

This pull is mainly to solicit some feedback and gather ideas on how best to support this functionality in Serf, which is why it has no tests/docs etc. yet.

It's probably also possible to handle key rotation using queries / events with handlers and SIGHUP'ing the agent, but I think this would face the same problems. I do think we need a good, integrated way of rotating the key, especially if Serf is to be used in production environments.

Another thought would be to allow an AltSecretKey or similar in the Memberlist config, and fall back to using it to decrypt messages if SecretKey fails. This would give us a little leniency in getting a new key in place. We could call SwapKeys() or something like that on memberlist, which would let us start encrypting with the new key while falling back to decrypting with the old one. Just an idea.

Here is the high-level approach I took:

  1. User initiates key rotation using serf rotate-key 8M1Sbtz6ElPdDHfDbzbNJg==
  2. Serf sends an internal_query to all members with the new key contained in the payload.
  3. Each member receives the query and stores the new key in serf.config.NewSecretKey, then sends an ack.
  4. Serf counts received acks vs. number of total members in the cluster. If any node fails to ack receiving the new key, then key rotation is aborted.
  5. Serf broadcasts a rotate-key event to all members, indicating that they should now swap keys.
  6. Each member receives rotate-key, and starts a new goroutine (see 7)
  7. [in background] Sleep for duration of serf.config.BroadcastTimeout so that Serf has a chance to rebroadcast successfully before the key is swapped. Once the sleep has expired, the serf.config.MemberlistConfig.SecretKey gets swapped out.
  8. Agent receives a rotate-key event containing the base64-encoded key as the payload, allowing the user to handle patching a config file.

Any and all thoughts/suggestions welcome!

ryanuber added 23 commits March 25, 2014 19:00
…ork in the serf layer to handle the actual rotation.
… This allows one to update the encryption key in the serf config using an event handler.
@armon
Member

armon commented Mar 29, 2014

@ryanuber Thanks for opening a PR so we can discuss this. I've thought about this problem for a while, and I think that we do need to change Memberlist to support this in a safe way. Basically, I think the system needs to support multiple keys being used to enable a smooth transition. Otherwise, as you said, there are messages that will be lost, or nodes that will be dropped from the cluster.

I think the interface we need is something like install-key, use-key, remove-key, available-keys. The workflow is that an operator uses install-key to add a new encryption key to the cluster (this would be a query internally). Once a key is installed, any message that uses that key can be decrypted. Then available-keys is a query that requests which keys a node knows about; an operator uses it to check that all nodes have a given key, for example. The use-key query then causes nodes to switch the encryption key they are using. Because the old key is still available, messages continue to decrypt properly while the transition is happening. Once all nodes have switched to the new key, remove-key is used to delete the old one. That prevents any messages using that key from being accepted.

So the primary internal changes are support for multiple keys which can all be used for decryption, an "encrypt" key that is used to encrypt outgoing messages, and a few special internal queries to do management of these keys.

For nodes that have been disconnected a long time, or new nodes joining the cluster, it basically becomes necessary for an operator to manage key distribution outside of Serf. There is no way around this as far as I can tell. Using an older key is not acceptable as it opens up Serf to a downgrade attack by an attacker.

The other issue with both my scheme and this scheme is that neither offers perfect forward secrecy. If an attacker compromises any key, they effectively compromise all future keys. I'll think more about this problem, but I'm not sure there is an easy answer.

@armon
Member

armon commented Mar 29, 2014

I've tried searching the literature on this problem, and I don't really see anything that describes a PFS system for n members without n-to-n communication and key generation. In Serf's case this would not be feasible, as a 1000-node cluster would need to run 1MM rounds of Diffie-Hellman.

However, our security model allows us to assume that each node running Serf is not compromised, so I think there is an opportunity to use an OTP (http://en.wikipedia.org/wiki/One-time_pad) to secure the key exchange. If each node is provided with, say, 16K of OTP material in advance, then up to 1024 key rotations can be done before a new OTP pad needs to be distributed to all nodes.

I think it's possible to design the system to work both with and without the OTP pad (all the mechanisms would be the same), but basically provide a giant warning that without the OTP pad you do NOT get PFS.

Due to the nature of OTP it must also be distributed outside of Serf.

@ryanuber
Member Author

@armon thanks for the reply. I totally see your point with Diffie-Hellman. Bummer.

So if there was a single 16K OTP pad in use on a given cluster, I am thinking we would need to come up with some way of knowing how many times the key had been rotated. The reason being, if new nodes join, they would need to skip ahead to get the same key as the current nodes. Either that, or it would become a requirement that a new OTP pad was distributed any time new nodes were added to achieve PFS.

About rotating the pad once depleted, maybe Serf would just need to refuse key rotation if all available "pages" were used. Automatically regenerating/distributing a new pad might be difficult, since the operator would need to be able to retrieve it to deploy new nodes without again rotating the complete pad.

This sounds pretty good though - the complexity and initial orchestration investment from the end user's perspective is essentially no different than it is today, with the added benefit that you get quite a few key rotations without having to distribute any private data.

@armon
Member

armon commented Mar 31, 2014

@ryanuber Yeah I think the install-key command would identify the OTP page + offset being used. The ID of the page would probably be a hash of the page, and the offset just a counter. Each agent would need to refuse re-use of the same page/counter value. I agree that without any new OTP material Serf just refuses until a new page is added.

@ryanuber
Member Author

ryanuber commented Apr 2, 2014

So I've been thinking about this more, and even had a little time last night to do a basic OTP implementation in Go here.

@armon I think I might be missing something that is clear to you, though. Let's say we go the OTP route. OTP itself defines an encryption algorithm, providing simple methods to encrypt/decrypt, and its strength coming from truly single-use pages. So that is all well and good, but in terms of how we would leverage OTP in Serf, what are we going to encrypt using it? The -encrypt value? It can't be a random value, because each node needs to realize the same encryption key. We could broadcast a new key with each key rotation and then encrypt it with the next OTP page on each node, but I think this would require some kind of -otp-secret-file option or something to store the value. I thought about using the first page of the OTP pad as the key to encrypt, and then encrypting it with subsequent keys, among other things, but all ideas just ended up feeling incorrect to me.

I initially went down the path of just having a simple pre-shared blob of bytes and a pointer to seek around the pre-determined key lengths. If we were to do something like that, it would be super-easy to implement, and all the operator would need to provide to each node would be the byte blob and the pointer, but I am guessing that there are better ways of doing this.

I'll keep digging and post again if I find anything promising. As always, thanks for your feedback.

@armon
Member

armon commented Apr 2, 2014

@ryanuber Yeah, so the way I imagine it working is that there is an -otp-file flag, that can be specified multiple times. Each file specified provides a "page" of OTP material. Each "page" is then identified by the SHA1 of the entire contents.

So when the InstallKey query happens it does something like:

{"Page": "<SHA1 of page>", "Offset": 0..n, "Value": <OTP protected key>}

If there is a Page/Offset provided, then the Value is the new key XOR'd with the OTP material at the correct offset. If there is no Page/Offset the key is simply the value (no PFS). The offset is just a counter, since we know each offset represents a 16 byte chunk (must match the key length).

Now there are some key details to this. For one, the snapshot file must be leveraged to remember the last used offset in each page, and we must refuse to install a key that is re-using an older offset. This is simple enough, as we just do something like "otp-offset: ".

This is enough so that all parties can agree on the key. If they have no matching page, the decrypt cannot happen. It also makes supporting multiple pages pretty simple.

My only concern is that the OTP work does not add much security. Basically we have 2 situations to worry about:

  1. Attacker can read all network traffic. If all messages are encrypted, they must effectively brute-force AES-GCM 128, which is prohibitively expensive. If they can do this, they can currently break all communication (past and future). If we have PFS support, then only the current session is leaked.

  2. Attacker compromises a host. Since the host has the encryption keys and OTP pads, once again all past and future communications are broken, even with PFS. This requires an operator to replace the encryption keys AND OTP pads. This is the same as if the key were compromised without any support for key rotation.

So I guess it's worth thinking about the complexity this adds. Do we realistically need to worry about a compromise of AES? Is a host compromise more likely (in which case, all this fancy crypto is obviated)?

@ryanuber
Member Author

ryanuber commented Apr 5, 2014

@armon, I think you're on the right track with this. AES being broken is probably not where we should focus our efforts.

So let's think fresh: if we assume that the cluster is not already compromised, then distributing new encryption keys over the network is not a concern, since they will be encrypted during transmission. This seems pretty reasonable to me, at least for a "1.0" ability to rotate keys.

I adjusted my Serf code in another branch to give a few things a try, and the results are pretty good so far. The code adds:

  • install-key command
  • use-key command
  • remove-key command
  • -keyring-file config option

Memberlist code (PR) is here
Serf code (branch) is here

Here is an example session of what it would accomplish. Assume that a cluster is spun up with encryption enabled using the key I0fRZhCAejDX62bxogaaIg==. The operator then wants to rotate the encryption key. Here is how it could be done:

$ serf install-key 3heP8bn+cl9RniyUdn3wgQ==
Successfully installed new key

$ serf use-key 3heP8bn+cl9RniyUdn3wgQ==
Successfully changed primary encryption key

$ serf remove-key I0fRZhCAejDX62bxogaaIg==
Successfully removed encryption key

I ran these commands on a local serf cluster running 99 nodes and had 0 failed messages, so things are looking better!

It would be up to the operator to ensure that installing the new key was successful before trying to run use-key. The routines I have here will error out and return an exit code of 1 if any node fails to receive the new key. The install-key command can be run multiple times if some nodes succeed and some fail without negative consequences.

Also noteworthy is that while joining a cluster, the joining node must be started with the current cluster encryption key in its keyring already so that it can understand the gossip messages as they come in. It will otherwise be rejected almost immediately, but this is by design.
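For the -keyring-file mentioned above, the joining node would need the current cluster key on disk before start. One plausible format (an assumption for illustration, not necessarily what this branch implements) is a JSON array of base64-encoded keys, using the keys from the session above:

```json
[
  "3heP8bn+cl9RniyUdn3wgQ==",
  "I0fRZhCAejDX62bxogaaIg=="
]
```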

I know you guys are busy, so no rush on getting back to me here. Let me know if you think this is directionally correct.

@armon
Member

armon commented Apr 7, 2014

@ryanuber This is awesome, and I think definitely the right direction for a first attempt. I looked over a lot of the branch and it looks good. Minor feedback items:

  • I don't think we need an EventRotateKey event
  • With the messageKeyResponseType, let's send the corresponding err back along with each response. Then the CLI can output a nice "node: err" list, which gives the operator a little more insight (e.g. the operator can see which nodes are missing the key). Also, for remove-key run multiple times, the error will be something like "key does not exist", which the operator can ignore.
  • Let's break Serf.ModifyKeyring into multiple functions; it feels dirty to parse a command string, and there are only 3 methods required. It may be even cleaner to add Serf.KeyManager(), which returns a new object that encapsulates AddKey, UseKey, and RemoveKey. That way the Serf API does not get bloated either.
  • I think we can also move all the CLI commands under a single command, so we would do serf keys install or serf keys remove. This also minimizes the number of commands (to avoid overwhelming those new to Serf).

This is looking really good! Thanks!

@ryanuber
Member Author

Closing in favor of #199

ryanuber closed this Apr 11, 2014