Massive performance loss due to locking issue around use of ssl_update_counter() / ssl_read_counter() #2776
Description
Problem:
When redoing some tests on haproxy+awslc using TLS tickets, it appeared to my surprise that awslc was even slower than OpenSSL (1/4-1/5 of the perf), without eating CPU, as shown in the table below comparing the performance of the different libs in resumptions per second (and percent relative to 1.1.1):
| lib | TLS 1.2 resumptions/s | TLS 1.3 resumptions/s |
|---|---|---|
| openssl-1.1.1 | 208k (100%) | 144k (100%) |
| openssl-3.5.4 | 172k (83%) | 58k (40%) |
| openssl-3.6.0 | 178k (86%) | ~11-63k (44%) |
| aws-lc-1.62.0 | 49k (24%) | 30k (21%) |
The machine (a 64-core EPYC) was 87% idle, suggesting lock waiting. I rebuilt haproxy with USE_PTHREAD_EMULATION=1 to replace the slow pthread rwlocks, and aws-lc's performance jumped to 91k/s at 100% CPU, clearly indicating a misuse of a pthread rwlock, which perf top also shows in this case:
```
Overhead  Shared Object  Symbol
  43.34%  haproxy        [.] pthread_rwlock_wrlock
   8.61%  [kernel]       [k] native_queued_spin_lock_slowpath
   2.40%  haproxy        [.] pthread_rwlock_rdlock
   1.94%  haproxy        [.] sha512_block_data_order_nohw
```
By setting a breakpoint in gdb on the lock, I easily tracked it down to `ssl_update_counter()` called from `ssl_process_ticket()`, and its sibling `ssl_read_counter()`, which take the lock only to read or increment an integer (here the session hit counter)! This is not even a matter of consistency with other call sites, since it's the only place where this counter is manipulated.
Worse, looking closer, I noticed that the `ssl_read_counter()` function is wrong: the counter is read in the caller outside of any lock, then passed to the function, which takes the lock and simply returns it:
```cpp
static int ssl_read_counter(const SSL_CTX *ctx, int counter) {
  MutexReadLock lock(const_cast<CRYPTO_MUTEX *>(&ctx->lock));
  return counter;
}
...
int SSL_CTX_sess_hits(const SSL_CTX *ctx) {
  return ssl_read_counter(ctx, ctx->stats.sess_hit);
}
```
This issue appeared in version 1.4.0 with commit fc60976 ("Add back ssl stat counters (#731)").
Based on `ssl_update_counter()`, which takes the counter as a reference, it's easy to see how the mistake happened: the counter in the reader was expected to be passed by reference as well, not by value, but references make the difference invisible at the call sites. Since these counters are only used for stats reporting, it's easy to understand how this mistake remained unnoticed until now.
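To illustrate the pitfall in isolation, here is a minimal stand-alone sketch (hypothetical `Ctx` type and `std::mutex` standing in for `SSL_CTX`/`CRYPTO_MUTEX`): both signatures look identical at the call site, but only the by-reference version actually reads under the lock.

```cpp
#include <mutex>

// Hypothetical stand-ins for SSL_CTX and CRYPTO_MUTEX, just to show
// the by-value vs by-reference difference; not AWS-LC's real types.
struct Ctx {
  std::mutex lock;
  int sess_hit = 0;
};

// Buggy shape: `counter` is copied at the call site *before* the
// function-local lock is taken, so the lock protects nothing.
static int read_counter_by_value(Ctx *ctx, int counter) {
  std::lock_guard<std::mutex> guard(ctx->lock);
  return counter;  // returns the stale pre-lock copy
}

// Intended shape: the reference is dereferenced under the lock,
// so the read itself happens while holding ctx->lock.
static int read_counter_by_ref(Ctx *ctx, const int &counter) {
  std::lock_guard<std::mutex> guard(ctx->lock);
  return counter;
}
```

Single-threaded, both return the same value; the difference only matters when another thread increments the counter concurrently, which is exactly the stats-reporting case.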
The reproducer consists of running the latest haproxy built against aws-lc with this simple config:
```
global
    tune.ssl.cachesize 0
    stats socket /tmp/haproxy.stat level admin mode 666
    stats timeout 1h
    cpu-policy performance

frontend test
    mode http
    timeout client 5s
    bind 0.0.0.0:1202 ssl crt ecdsa256.pem force-tlsv12 shards by-thread
    bind 0.0.0.0:1203 ssl crt ecdsa256.pem force-tlsv13 shards by-thread
    http-request redirect location /
```
Then send resumed connections to it from multiple load generators until the performance stops increasing. The performance can be monitored in real time using `socat /tmp/haproxy.stat - <<< "show info" | grep -i rate`, particularly the "SslRate" line (in connections per second).
Solution:
The solution to these two intertwined problems is simple and fits in a single patch:
- use `const int &counter` to declare the argument in `ssl_read_counter()`
- use `__atomic_add_fetch()` to increment the counter and `__atomic_load_n()` to read it
This gives the patch below:
```diff
diff --git a/ssl/ssl_lib.cc b/ssl/ssl_lib.cc
index 7503c0f6d..3dcfa437d 100644
--- a/ssl/ssl_lib.cc
+++ b/ssl/ssl_lib.cc
@@ -484,17 +484,11 @@ bool SSL_get_traffic_secrets(const SSL *ssl,
 }
 
 void ssl_update_counter(SSL_CTX *ctx, int &counter, bool lock) {
-  if (lock) {
-    MutexWriteLock ctx_lock(&ctx->lock);
-    counter++;
-  } else {
-    counter++;
-  }
+  __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);
 }
 
-static int ssl_read_counter(const SSL_CTX *ctx, int counter) {
-  MutexReadLock lock(const_cast<CRYPTO_MUTEX *>(&ctx->lock));
-  return counter;
+static int ssl_read_counter(const SSL_CTX *ctx, const int &counter) {
+  return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
 }
 
 void SSL_CTX_set_aes_hw_override_for_testing(SSL_CTX *ctx,
```

Now the performance looks like this:
| lib | TLS 1.2 resumptions/s | TLS 1.3 resumptions/s |
|---|---|---|
| openssl-1.1.1 | 208k (100%) | 144k (100%) |
| aws-lc-1.62.0 | 49k (24%) | 30k (21%) |
| aws-lc-1.62.0-fixed | 246k (118%) | 218k (151%) |
So the fix increases the performance by 5x in TLSv1.2 and 7x in TLSv1.3.
These stats counters are used by tickets (which is how I noticed the problem) and by a few other places, all using these functions to increment counters, though for some reason the other call sites did not seem to be as impacted by the lock (at least they were not the ones I was seeing while trying to spot the culprit).
Requirements / Acceptance Criteria:
What must a solution address in order to solve the problem? How do we know the solution is complete?
- Fix the massive performance loss compared to openssl
- Restore proper use of atomics when reading the value, even if on x86 this will not be noticed
- RFC links: none
- Related Issues: none
- Will the Usage Guide or other documentation need to be updated?: no
- Testing: How will this change be tested? Call out new integration tests, functional tests, or particularly interesting/important unit tests. => only using performance tests
- Will this change trigger AWS LibCrypto Formal Verification changes? Changes to the implementation of verified algorithms will require additional changes. => no
- Should this change be fuzz tested? Will it handle untrusted input? Create a separate issue to track the fuzzing work. => no
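The atomics requirement can be exercised in isolation; a minimal sketch using the same GCC/Clang `__atomic` builtins as the patch, applied to a plain int (the function names here are illustrative, not AWS-LC's):

```cpp
// Sketch of the lock-free counter handling the patch introduces,
// on a plain int rather than the SSL_CTX stats fields.
static void update_counter(int &counter) {
  // Atomic read-modify-write: no mutex needed for a simple increment.
  __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);
}

static int read_counter(const int &counter) {
  // Atomic load; __ATOMIC_ACQUIRE is sufficient for a stats read.
  return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}
```

On x86 a plain aligned int load is already atomic at the hardware level, which is why the by-value read bug was invisible there; the builtins make the intent explicit and keep the compiler from tearing or reordering the accesses.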
Out of scope:
Is there anything the solution will intentionally NOT address? => not particularly; maybe only the fact that the `lock` parameter of `ssl_update_counter()` becomes useless and could be cleaned up in a separate patch.