Massive performance loss due to locking issue around use of ssl_update_counter() / ssl_read_counter() #2776
Description
Problem:
When redoing some tests on haproxy+awslc using TLS tickets, it appeared to my surprise that awslc was even slower than OpenSSL (1/4-1/5 of the perf), without eating CPU, as shown in the table below comparing the performance of the different libs in resumptions per second (and percent relative to 1.1.1):
| lib | TLS 1.2 resumptions/s | TLS 1.3 resumptions/s |
|---|---|---|
| openssl-1.1.1 | 208k (100%) | 144k (100%) |
| openssl-3.5.4 | 172k (83%) | 58k (40%) |
| openssl-3.6.0 | 178k (86%) | ~11-63k (44%) |
| aws-lc-1.62.0 | 49k (24%) | 30k (21%) |
The machine (a 64-core EPYC) was 87% idle, suggesting lock waiting. I rebuilt haproxy with USE_PTHREAD_EMULATION=1 to replace the slow pthread rwlocks, and aws-lc's performance jumped to 91k/s at 100% CPU, clearly indicating a misuse of a pthread rwlock, which perf top also shows in this case:
```
Overhead  Shared Object  Symbol
  43.34%  haproxy        [.] pthread_rwlock_wrlock
   8.61%  [kernel]       [k] native_queued_spin_lock_slowpath
   2.40%  haproxy        [.] pthread_rwlock_rdlock
   1.94%  haproxy        [.] sha512_block_data_order_nohw
```
By setting a breakpoint in gdb on the lock, I easily tracked it down to `ssl_update_counter()` called from `ssl_process_ticket()`, and its sibling `ssl_read_counter()`, which take the lock only to read or increment an integer (here the session hit counter)! This is not even a matter of consistency with other call sites, since it's the only place where this counter is manipulated.
Worse, looking closer, I noticed that the `ssl_read_counter()` function is wrong: the counter is read in the caller outside of any lock, then passed to the function, which takes the lock and simply returns it:
```cpp
static int ssl_read_counter(const SSL_CTX *ctx, int counter) {
  MutexReadLock lock(const_cast<CRYPTO_MUTEX *>(&ctx->lock));
  return counter;
}
...
int SSL_CTX_sess_hits(const SSL_CTX *ctx) {
  return ssl_read_counter(ctx, ctx->stats.sess_hit);
}
```
This issue appeared in version 1.4.0 with commit fc60976 ("Add back ssl stat counters (#731)").
Based on `ssl_update_counter()`, which takes the counter as a reference, it's easy to see how the mistake happened: the counter in the reader was expected to be passed by reference as well, not by value, but references make the difference invisible at the call sites. Since these counters are only used for stats reporting, it's easy to understand how this mistake remained unnoticed until now.
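To illustrate the pitfall in isolation, here is a minimal stand-alone sketch (hypothetical `Ctx` type and `std::mutex` standing in for `SSL_CTX`/`CRYPTO_MUTEX`): both signatures look identical at the call site, but only the by-reference version actually reads under the lock.

```cpp
#include <mutex>

// Hypothetical stand-ins for SSL_CTX and CRYPTO_MUTEX, just to show
// the by-value vs by-reference difference; not AWS-LC's real types.
struct Ctx {
  std::mutex lock;
  int sess_hit = 0;
};

// Buggy shape: `counter` is copied at the call site *before* the
// function-local lock is taken, so the lock protects nothing.
static int read_counter_by_value(Ctx *ctx, int counter) {
  std::lock_guard<std::mutex> guard(ctx->lock);
  return counter;  // returns the stale pre-lock copy
}

// Intended shape: the reference is dereferenced under the lock,
// so the read itself happens while holding ctx->lock.
static int read_counter_by_ref(Ctx *ctx, const int &counter) {
  std::lock_guard<std::mutex> guard(ctx->lock);
  return counter;
}
```

Single-threaded, both return the same value; the difference only matters when another thread increments the counter concurrently, which is exactly the stats-reporting case.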
The reproducer consists of running the latest haproxy built against aws-lc with this simple config:
```
global
    tune.ssl.cachesize 0
    stats socket /tmp/haproxy.stat level admin mode 666
    stats timeout 1h
    cpu-policy performance

frontend test
    mode http
    timeout client 5s
    bind 0.0.0.0:1202 ssl crt ecdsa256.pem force-tlsv12 shards by-thread
    bind 0.0.0.0:1203 ssl crt ecdsa256.pem force-tlsv13 shards by-thread
    http-request redirect location /
```
Then send resumed connections to it from multiple load generators until the performance stops increasing. The performance can be monitored in real time using `socat /tmp/haproxy.stat - <<< "show info" | grep -i rate`, particularly the "SslRate" line (in connections per second).
Solution:
The solution to these two intertwined problems is simple and fits in a single patch:
- use `const int &counter` to declare the argument in `ssl_read_counter()`
- use `__atomic_add_fetch()` to increment the counter and `__atomic_load_n()` to read it
This gives the patch below:
```diff
diff --git a/ssl/ssl_lib.cc b/ssl/ssl_lib.cc
index 7503c0f6d..3dcfa437d 100644
--- a/ssl/ssl_lib.cc
+++ b/ssl/ssl_lib.cc
@@ -484,17 +484,11 @@ bool SSL_get_traffic_secrets(const SSL *ssl,
 }
 
 void ssl_update_counter(SSL_CTX *ctx, int &counter, bool lock) {
-  if (lock) {
-    MutexWriteLock ctx_lock(&ctx->lock);
-    counter++;
-  } else {
-    counter++;
-  }
+  __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);
 }
 
-static int ssl_read_counter(const SSL_CTX *ctx, int counter) {
-  MutexReadLock lock(const_cast<CRYPTO_MUTEX *>(&ctx->lock));
-  return counter;
+static int ssl_read_counter(const SSL_CTX *ctx, const int &counter) {
+  return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
 }
 
 void SSL_CTX_set_aes_hw_override_for_testing(SSL_CTX *ctx,
```

Now the performance looks like this:
| lib | TLS 1.2 resumptions/s | TLS 1.3 resumptions/s |
|---|---|---|
| openssl-1.1.1 | 208k (100%) | 144k (100%) |
| aws-lc-1.62.0 | 49k (24%) | 30k (21%) |
| aws-lc-1.62.0-fixed | 246k (118%) | 218k (151%) |
So the fix increases the performance by 5x in TLSv1.2 and 7x in TLSv1.3.
These stats counters are used by tickets (which is how I noticed the problem) and by a few other places, all using these functions to increment counters, though for some reason the other call sites did not seem to be as impacted by the lock (at least they were not the ones I was seeing while trying to spot the culprit).
Requirements / Acceptance Criteria:
What must a solution address in order to solve the problem? How do we know the solution is complete?
- Fix the massive performance loss compared to openssl
- Restore proper use of atomics when reading the value, even if on x86 this will not be noticed
- RFC links: none
- Related Issues: none
- Will the Usage Guide or other documentation need to be updated?: no
- Testing: How will this change be tested? Call out new integration tests, functional tests, or particularly interesting/important unit tests. => only using performance tests
- Will this change trigger AWS LibCrypto Formal Verification changes? Changes to the implementation of verified algorithms will require additional changes. => no
- Should this change be fuzz tested? Will it handle untrusted input? Create a separate issue to track the fuzzing work. => no
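The atomics requirement can be exercised in isolation; a minimal sketch using the same GCC/Clang `__atomic` builtins as the patch, applied to a plain int (the function names here are illustrative, not AWS-LC's):

```cpp
// Sketch of the lock-free counter handling the patch introduces,
// on a plain int rather than the SSL_CTX stats fields.
static void update_counter(int &counter) {
  // Atomic read-modify-write: no mutex needed for a simple increment.
  __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);
}

static int read_counter(const int &counter) {
  // Atomic load; __ATOMIC_ACQUIRE is sufficient for a stats read.
  return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}
```

On x86 a plain aligned int load is already atomic at the hardware level, which is why the by-value read bug was invisible there; the builtins make the intent explicit and keep the compiler from tearing or reordering the accesses.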
Out of scope:
Is there anything the solution will intentionally NOT address? => not particularly; maybe only the fact that the `lock` parameter of `ssl_update_counter()` becomes useless and could be cleaned up in a separate patch.