tls: add support for client-side session resumption.#4791
tls: add support for client-side session resumption.#4791htuch merged 18 commits intoenvoyproxy:masterfrom
Conversation
*Risk Level*: Low *Testing*: bazel test //test/... *Docs Changes*: Added *Release Notes*: Added Signed-off-by: Piotr Sikora <piotrsikora@google.com>
source/common/ssl/context_impl.cc
Outdated
| if (max_session_keys_ > 0) { | ||
| absl::WriterMutexLock l(&session_keys_mu_); | ||
| if (!session_keys_.empty()) { | ||
| SSL_SESSION* session = session_keys_.front().get(); |
There was a problem hiding this comment.
Note, similarly to @julia-stripe's PR, this assumes a single session store per cluster, not per endpoint.
There was a problem hiding this comment.
@PiotrSikora can you talk a bit about the rationalization for that assumption?
There was a problem hiding this comment.
The assumption is wrong most of the time (i.e. we should save session per endpoint, not per cluster), which is one of the reasons why this was sitting in my local tree until now.
However, the per cluster store works perfectly fine for (a) single-endpoint clusters, (b) deployments using shared cache and/or session tickets, where the same session can be resumed across all endpoints, so it's a stepping stone in the right direction, and per endpoint store can be added a bit later.
There was a problem hiding this comment.
@PiotrSikora can you clarify what the next stones are towards per-endpoint stores? I think @julia-stripe and I were under the impression there was a single ClientContextImpl per endpoint, rather than per-cluster, which it sounds like isn't true. As you point out, this PR as stands won't work well for multi-endpoint clusters without shared session ticket keys, which is our big use-case.
There was a problem hiding this comment.
@bobby-stripe it's either:
- Storing sessions in the endpoint object, so that session's lifetime is the same as endpoint's.
- Adding ability to configure "session cache key", so that sessions stored in
ClientContextImplcan be retrieved by a specific key(s) (e.g.%UPSTREAM_CLUSTER%- which is effectively what we have right now,%UPSTREAM_HOST%or%REQ(:AUTHORITY)%).
I'm leaning towards the second option, since it's much more flexible, but I didn't have time to work on this yet.
There was a problem hiding this comment.
@PiotrSikora how do you want to move forward with this? Merge this PR and then do 1/2 above? I'm guessing you get most of your wins by just doing (1) above.
There was a problem hiding this comment.
I'd leaning towards (2), since it's much more flexible solution.
But yeah, let's definitely merge this PR as-is (feature-wise).
There was a problem hiding this comment.
Can you open up a ticket for the per-endpoint continuation to track?
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
| // for TLSv1.2 and older) to store for the purpose of session resumption. | ||
| // | ||
| // Defaults to 1, setting this to 0 disables session resumption. | ||
| google.protobuf.UInt32Value max_session_keys = 4; |
There was a problem hiding this comment.
Would it make sense (in the future) to have a corresponding server-side setting for how many session tickets a TLS 1.3 server will issue at a time? the RFC says:
Servers that issue tickets SHOULD offer at least as many tickets as the number of connections that a client might use; for example, a web browser using HTTP/1.1 [RFC7230] might open six connections to a server.
There was a problem hiding this comment.
Yes, but it's currently hardcoded to static const int kNumTickets = 2 in BoringSSL (see: https://boringssl.googlesource.com/boringssl/+/c0c9001440db8121bdc1ff1307b3a9aedf26fcd8/ssl/tls13_server.cc#165). cc @davidben
| // for TLSv1.2 and older) to store for the purpose of session resumption. | ||
| // | ||
| // Defaults to 1, setting this to 0 disables session resumption. | ||
| google.protobuf.UInt32Value max_session_keys = 4; |
There was a problem hiding this comment.
Q: Should this feature be disabled or enabled by default?
There was a problem hiding this comment.
Are there any security implications from enabling by default? E.g. are we materially increasing the amount of code that might be subject to compromise in BoringSSL etc. I have zero clue on this, but my inclination would be if there was a tradeoff to sacrifice performance (i.e. the resumption) for improved default security posture.
There was a problem hiding this comment.
There is a case of a "privacy leak" (passive observer being able to correlate connections from the same user by looking at the session that's being resumed, which is sent unencrypted in ClientHello) in TLS versions older than 1.3, but that's mostly a threat to end-users (so also a single-user client-side proxy) and not middle/edge proxies and/or service mesh, so I don't think that it justifies having this off by default in Envoy.
Note: TLSv1.3 sends single-use sessions, so the default of 1 is probably too small. Perhaps we could make it vary by default, i.e. 1 if tls_maximum_protocol_version is smaller than TLSv1.3, and 4(?) otherwise.
There was a problem hiding this comment.
I think the common case is service mesh and middle/edge proxies, so we should optimize for that. It's probably not great to be optimizing for TLS 1.3 quiet yet, I don't think this is universal by far.
There was a problem hiding this comment.
Sorry, I somehow missed this comment earlier.
What I meant regarding TLS v1.3 is basically:
if upstream_tls_context.tls_params.tls_maximum_protocol_version == TLSv1_3:
max_session_keys = 4;
else
max_session_keys = 1;
But I'm fine leaving the default at 1 for the time being, and we can revisit it later, when we enable TLSv1.3 by default.
|
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
…ent_session_reuse Signed-off-by: Piotr Sikora <piotrsikora@google.com>
|
@lizan @ggreenway could you take a look? Thanks! |
source/common/ssl/context_impl.h
Outdated
| const bool allow_renegotiation_; | ||
| const size_t max_session_keys_; | ||
| mutable absl::Mutex session_keys_mu_; | ||
| mutable std::deque<bssl::UniquePtr<SSL_SESSION>> session_keys_ GUARDED_BY(session_keys_mu_); |
There was a problem hiding this comment.
Make this mutable doesn't looks correct, should we just make newSsl non const? SslSocket holds non-const shared pointer so it should be OK.
…ent_session_reuse
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
htuch
left a comment
There was a problem hiding this comment.
I'll step in to do a final pass. Some of my comments will show my ignorance of how BoringSSL works, so if you want to respond to them with more verbose code comments, that would make the life easier for the next neophyte who steps on this code :)
| } | ||
|
|
||
| if (max_session_keys_ > 0) { | ||
| SSL_CTX_set_session_cache_mode(ctx_.get(), SSL_SESS_CACHE_CLIENT); |
There was a problem hiding this comment.
Should we check or ASSERT errors for these BoringSSL calls?
There was a problem hiding this comment.
Not for those. SSL_CTX_set_session_cache_mode() cannot fail and returns previously configured mode, SSL_CTX_sess_set_new_cb() cannot fail and returns void.
source/common/ssl/context_impl.cc
Outdated
| if (max_session_keys_ > 0) { | ||
| absl::WriterMutexLock l(&session_keys_mu_); | ||
| if (!session_keys_.empty()) { | ||
| SSL_SESSION* session = session_keys_.front().get(); |
There was a problem hiding this comment.
Can you open up a ticket for the per-endpoint continuation to track?
source/common/ssl/context_impl.cc
Outdated
| if (max_session_keys_ > 0) { | ||
| absl::WriterMutexLock l(&session_keys_mu_); | ||
| if (!session_keys_.empty()) { | ||
| SSL_SESSION* session = session_keys_.front().get(); |
There was a problem hiding this comment.
Can you add some comments here on why picking the front of the queue is the right thing to do? I.e. why not the third item in the Q?
| return ssl_con; | ||
| } | ||
|
|
||
| int ClientContextImpl::newSessionKey(SSL_SESSION* session) { |
There was a problem hiding this comment.
I think we are safe for multi cert work, given that the client contexts will continue to have a single cert, but could you comment here on whether in the future, if we support multiple client certs, whether anything needs to change?
There was a problem hiding this comment.
I don't think that we'll ever support multiple client certificates that can affect sessions, since client certificates are not revalidated during session resumption by the server.
In any case, this would be covered by #5073.
source/common/ssl/context_impl.cc
Outdated
| } | ||
|
|
||
| if (max_session_keys_ > 0) { | ||
| absl::WriterMutexLock l(&session_keys_mu_); |
There was a problem hiding this comment.
Sad that we need to take a writer mutex on a data path operation here. I assume that we're not that concerned because we expect connections to be relatively long lived. Is there a case for being able to take a reader mutex on the common path?
There was a problem hiding this comment.
Well, we don't really need to take it. The alternative is to use per-worker lock-less session cache, but that would result in bigger memory usage and much lower hit rate, so I think that using shared cache is a good trade-off.
In theory, we only need write/write locks when we store single-use session keys (TLS 1.3), so I've added an optimization to use read/write locks for the other cases.
Thanks!
There was a problem hiding this comment.
Thanks for the explanation and switching to reader mutex. I think this is fine for scalability, since this only impacts initial connection latency, not per request and also we'll eventually move to a shared connection pool model for scalability.
…ent_session_reuse
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
htuch
left a comment
There was a problem hiding this comment.
LGTM, some final nits and we can ship, thanks.
test/common/ssl/ssl_socket_test.cc
Outdated
| true); | ||
| NiceMock<Network::MockListenerCallbacks> callbacks; | ||
| Network::MockConnectionHandler connection_handler; | ||
| DangerousDeprecatedTestTime test_time; |
There was a problem hiding this comment.
Do we need this? Can we use the simulated time_system above?
There was a problem hiding this comment.
No, it just shows how old the code really is. Thanks!
test/common/ssl/ssl_socket_test.cc
Outdated
| client_connection->connect(); | ||
|
|
||
| size_t connect_count = 0; | ||
| auto connectSecondTime = [&]() { |
There was a problem hiding this comment.
Nit: slight preference for explicit capture list here; I don't mind & in tests if it really makes them a lot less verbose (as opposed to regular code, where we should avoid wildcard), but here it doesn't help much.
There was a problem hiding this comment.
Done. I updated those functions, as well as other lambdas in this file, but the list is a bit ridiculous, to be honest...
test/common/ssl/ssl_socket_test.cc
Outdated
| client_connection->connect(); | ||
|
|
||
| size_t connect_count = 0; | ||
| auto connectSecondTime = [&]() { |
There was a problem hiding this comment.
Nit: this should be connect_second_time, as it's still a variable, albeit a lambda.
There was a problem hiding this comment.
It's also a function, and connectSecondTime() looks nicer than connect_second_time(), IMHO.
But I changed it anyway.
test/common/ssl/ssl_socket_test.cc
Outdated
| } | ||
| }; | ||
|
|
||
| EXPECT_CALL(client_connection_callbacks, onEvent(Network::ConnectionEvent::Connected)) |
There was a problem hiding this comment.
Do you think we could InSequence these?
There was a problem hiding this comment.
Done, but due to the ordering of events that depends on the version of the TLS protocol and whether or not the session was successfully resumed, this resulted in quite a lot of extra code.
See: 4aeefe4
| testClientSessionResumption(server_ctx_yaml, client_ctx_yaml, true, GetParam()); | ||
| } | ||
|
|
||
| TEST_P(SslSocketTest, ClientSessionResumptionEnabledTls13) { |
There was a problem hiding this comment.
Nit: prefer a // one liner explaining in plain text what all these tests do.
…ent_session_reuse Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
test/common/ssl/ssl_socket_test.cc
Outdated
| auto stopSecondTime = [&]() { | ||
| if (++counter == 2) { | ||
| size_t connect_count = 0; | ||
| auto connect_second_time = [&connect_count, &dispatcher, &server_connection, &client_connection, |
There was a problem hiding this comment.
Actually, I agree it's a bit ridiculous to have explicit capture lists in tests when it gets this long. My rule of thumb is in production code, make it explicit (avoid unintended mistakes that can creep in) and in test code, make the capture list explicit when it's short, otherwise you can wildcard it. I had though in the example I pointed at it would only be two items long, but I guess that's not the case across this file.
There was a problem hiding this comment.
Should I revert it or leave it as-is, then? I'm fine with either.
There was a problem hiding this comment.
Yeah, if you could revert back the ones that are actually ridiculous (and leave the shorter ones as is) that would be great.
|
@PiotrSikora coverage fail seem legitimate, not sure why this pure virtual method call is happening, can you take a look? |
|
It's not legitimate, if you look at results of coverage just before |
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
|
@PiotrSikora thanks for looking into the failure. I will keep an eye out for these being on Envoy maintainer duty this week. |
Signed-off-by: Piotr Sikora <piotrsikora@google.com>
…ent_session_reuse
|
@PiotrSikora I think we missed the release note for this. Do you mind doing a follow up PR with a release note? Thank you. |
Risk Level: Low Testing: bazel test //test/... Signed-off-by: Piotr Sikora <piotrsikora@google.com> Signed-off-by: Fred Douglas <fredlas@google.com>
Risk Level: Low
Testing: bazel test //test/...
Docs Changes: Added
Release Notes: Added
Signed-off-by: Piotr Sikora piotrsikora@google.com