Skip to content

Fix leak of SSL caches for auxiliary threads#4801

Merged
renecannao merged 1 commit intov2.7from
v2.7-fix_aux_threads_ssl_leaks
Jan 24, 2025
Merged

Fix leak of SSL caches for auxiliary threads#4801
renecannao merged 1 commit intov2.7from
v2.7-fix_aux_threads_ssl_leaks

Conversation

@JavierJF
Copy link
Collaborator

Issue Description

This PR fixes a memory leak that can be triggered when SSL parameters are used for backend servers and the workload imposes the creation of auxiliary threads. Examples of this are:

  • ProxySQL requires to kill many backend connections; this spawns many kill_query_thread.
  • Monitor can't keep-up with the monitoring queue due to a large number of backend servers. This spawns auxiliary threads.

Reproduction

kill_query_thread

There are several ways of achieving reproduction, one for kill_query_thread and other for auxiliary
monitoring threads. For the former we require to have a backend server configured with SSL enabled and
having set mysql-ssl_p2s_% config variables. The following test should be executed in a loop:

/**
 * @file test_conn_err_unkwn_db-t.cpp
 * @brief Memory leak reproduction.
 */

#include <cstring>
#include <stdio.h>

#include "mysql.h"

#include "tap.h"
#include "command_line.h"

const uint32_t DEF_CONN_COUNT { 1000 };

int main(int argc, char** argv) {
	CommandLine cl;

	if (cl.getEnv()) {
		diag("Failed to get the required environmental variables.");
		return EXIT_FAILURE;
	}

	uint32_t CONN_COUNT { argc >= 2 ? std::atoi(argv[1]) : DEF_CONN_COUNT };
	diag("Creating faulty backend connections   CONN_COUNT=%d", CONN_COUNT);

	while (CONN_COUNT) {
		CONN_COUNT -= 1;

		MYSQL* proxy = mysql_init(NULL);

		if (!mysql_real_connect(proxy, cl.host, cl.username, cl.password, NULL, cl.port, NULL, 0)) {
			fprintf(stderr, "File %s, line %d, Error: %s\n", __FILE__, __LINE__, mysql_error(proxy));
			return EXIT_FAILURE;
		}

		mysql_query(proxy, "DO 1");
		mysql_select_db(proxy, "``");
		mysql_query(proxy, "DO 1");

		mysql_close(proxy);
	}

	return exit_status();
}

For the leak to be noticeable, a high number of backend connections should be created, a sensible number for
the test is 100000. This also depends on the size of the SSL certificates in use.

Monitor Auxiliary Threads

For making the leak noticeable (faster) in this scenario we need to create a decent number of auxiliary
threads for monitoring. The first step for this is configuring ProxySQL with a high number of backend
servers and reduce the time for monitoring intervals to the minimum:

DELETE FROM mysql_servers;

WITH RECURSIVE cnt(x) AS (
   SELECT 1 UNION ALL SELECT x + 1 FROM cnt LIMIT 200
)
INSERT INTO
    mysql_servers(hostgroup_id,hostname,port,status,use_ssl)
SELECT
    x - 1 AS val, ('127.0.0.' || x), 13306, 'ONLINE', 1
FROM
    cnt

SET mysql-monitor_connect_interval=5000;
SET mysql-monitor_ping_interval=5000;
SET mysql-monitor_threads_min=2;
SET mysql-monitor_threads_max=4;

LOAD MYSQL VARIABLES TO RUNTIME;
LOAD MYSQL SERVERS TO RUNTIME;

-- As in the previous test, configure SSL variables 'mysql-ssl_p2s_%'

Even with this, ProxySQL is too conservative in the creation of auxiliary threads to clearly notice the
leak. So, we will slightly tweak ProxySQL code itself, to exaggerate thread creation and prevent reuse. The
following patch will be sufficient:

diff --git a/lib/MySQL_Monitor.cpp b/lib/MySQL_Monitor.cpp
index dfd67dec..bdb7abd5 100644
--- a/lib/MySQL_Monitor.cpp
+++ b/lib/MySQL_Monitor.cpp
@@ -79,6 +79,8 @@ class ConsumerThread : public Thread {
        // Remove 1 item at a time and process it. Blocks if no items are
        // available to process.
        for (int i = 0; (thrn ? i < thrn : 1); i++) {
+           if (m_queue.size() == 0) { break; }
+
            //VALGRIND_DISABLE_ERROR_REPORTING;
            WorkItem<T>* item = (WorkItem<T>*)m_queue.remove();
            //VALGRIND_ENABLE_ERROR_REPORTING;
@@ -5173,17 +5175,20 @@ __monitor_run:
                    qsize = threads_max * 2 - num_threads;
                }
                if (qsize > 0) {
+                   qsize = 100;
                    proxy_info("Monitor is starting %d helper threads\n", qsize);
                    ConsumerThread<MySQL_Monitor_State_Data> **threads_aux= (ConsumerThread<MySQL_Monitor_State_Data> **)malloc(sizeof(ConsumerThread<MySQL_Monitor_State_Data> *)*qsize);
                    aux_threads = qsize;
                    started_threads += aux_threads;
                    for (unsigned int i=0; i<qsize; i++) {
-                       threads_aux[i] = new ConsumerThread<MySQL_Monitor_State_Data>(*queue, 245, "MyMonStateData");
+                       threads_aux[i] = new ConsumerThread<MySQL_Monitor_State_Data>(*queue, 0, "MyMonStateData");
                        threads_aux[i]->start(2048,false);
                    }
                    for (unsigned int i=0; i<qsize; i++) {
                        threads_aux[i]->join();
                        delete threads_aux[i];
                    }
                    free(threads_aux);
                    aux_threads = 0;

Memory Analysis

This leak is more tricky to detect using memory analysis tools that focus on the heap, or even jemalloc
profiler itself. This is because the "lost references" are from thread-local variables that were lost on
exit threads. This prevents the profilers from tracing this memory as they would normally do with references
lost to heap-allocated resources.

This can be seeing in the two attached dumps. In report-no-leak.pdf we can see that when jemalloc focus on
in-use space, it's unable to report any leaks, and it reports almost no memory usage, despite the process
retaining several hundred MB of (resident) memory:

ps aux | grep "[.]/src/proxysql" | awk '{ print ($5/1024),($6/1024),$11 }'
712.367 544.781 ./src/proxysql

This is expected, as the references were lost on thread exit. In contrast, we need to focus on alloc_space
(total memory allocated) and how much of that memory jemalloc is able to trace back. When checking
report-alloc.pdf, we can see that ma_tls_set_certs is responsible for an unusually large amount of
allocated memory (due to connection creation), and more importantly, jemalloc haven't traced back that
memory to the allocator.

If we compare the previous reports with the new ones after the fix, we can see that report-fix-inuse.pdf
holds no difference, and in report-fix-alloc.pdf is shown that jemalloc is now able to trace back the
memory allocated for ma_tls_set_certs back to the allocator. Process resident memory is now stable:

ps aux | grep "[.]/src/proxysql" | awk '{ print ($5/1024),($6/1024),$11 }'
235.832 129.488 ./src/proxysql

The dumps an ProxySQL binary used for the analysis have been attached.

report-no-leak.pdf
report-alloc.pdf
report-fix-alloc.pdf
report-fix-inuse.pdf

memory-dumps-01.tar.gz
memory-dumps-00.tar.gz

Whenever a thread creates MySQL connections using SSL params the
thread-local caches are populated. These thread-local resources need to
be free on thread exit via 'mysql_thread_end', otherwise these objects
will be leak. This is not relevant for the fixed size thread-pools, but
it's for workloads making use of temporary/auxiliary threads.
@renecannao
Copy link
Contributor

Can one of the admins verify this patch?

@renecannao renecannao merged commit f8d0542 into v2.7 Jan 24, 2025
19 of 98 checks passed
renecannao added a commit that referenced this pull request Feb 3, 2025
Fix leak of SSL caches for auxiliary threads - Port of #4801
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants