Skip to content

Fix usage of wait_io_or_timeout from TLS layers#17

Merged
alexey-milovidov merged 1 commit intoClickHouse:3.1from
azat-ch:ch/wait_io_or_timeout-fixes
Jul 1, 2024
Merged

Fix usage of wait_io_or_timeout from TLS layers#17
alexey-milovidov merged 1 commit intoClickHouse:3.1from
azat-ch:ch/wait_io_or_timeout-fixes

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jul 1, 2024

wait_io_or_timeout() accepts milliseconds, while options has seconds, since this is just plain MYSQL_OPT_READ_TIMEOUT/...

The one that has milliseconds are pvio->timeout[PVIO_*_TIMEOUT], so use them.

Usually this is not a problem, but, in case of interrupt (i.e. signal - EINTR) SSL_read() will return SSL_ERROR_WANT_READ/SSL_ERROR_WANT_WRITE, and then wait_io_or_timeout() will be called, and timeout will be wrong, may cause a failure.

P.S. this will fix 02479_mysql_connect_to_self failure, I will write better description in the PR updating submodules in ClickHouse repo.

Upstream PR: mariadb-corporation#250

wait_io_or_timeout() accepts milliseconds, while options has seconds,
since this is just plain MYSQL_OPT_READ_TIMEOUT/...

The one that has milliseconds are pvio->timeout[PVIO_*_TIMEOUT], so use
them.

Usually this is not a problem, but, in case of interrupt (i.e. signal -
EINTR) SSL_read() will return SSL_ERROR_WANT_READ/SSL_ERROR_WANT_WRITE,
and then wait_io_or_timeout() will be called, and timeout will be wrong,
may cause a failure.
@alexey-milovidov alexey-milovidov merged commit 607ff32 into ClickHouse:3.1 Jul 1, 2024
@azat azat deleted the ch/wait_io_or_timeout-fixes branch July 1, 2024 16:32
azat added a commit to azat/ClickHouse that referenced this pull request Jul 1, 2024
Occasionally, 02479_mysql_connect_to_self fails on CI [1].

  [1]: ClickHouse#50911

The problem was indeed query profiler and EINTR, but not in a way you
may think.

For such failures you may see the following trace in trace_log:

    contrib/openssl/crypto/bio/bss_sock.c:127::sock_read
    contrib/openssl/crypto/bio/bio_meth.c:121::bread_conv
    contrib/openssl/crypto/bio/bio_lib.c:285::bio_read_intern
    contrib/openssl/crypto/bio/bio_lib.c:311::BIO_read
    contrib/openssl/ssl/record/methods/tls_common.c:398::tls_default_read_n
    contrib/openssl/ssl/record/methods/tls_common.c:575::tls_get_more_records
    contrib/openssl/ssl/record/methods/tls_common.c:1122::tls_read_record
    contrib/openssl/ssl/record/rec_layer_s3.c:645::ssl3_read_bytes
    contrib/openssl/ssl/s3_lib.c:4527::ssl3_read_internal
    contrib/openssl/ssl/s3_lib.c:4551::ssl3_read
    contrib/openssl/ssl/ssl_lib.c:2343::ssl_read_internal
    contrib/openssl/ssl/ssl_lib.c:2357::SSL_read
    contrib/mariadb-connector-c/libmariadb/secure/openssl.c:729::ma_tls_read
    contrib/mariadb-connector-c/libmariadb/ma_tls.c:90::ma_pvio_tls_read
    contrib/mariadb-connector-c/libmariadb/ma_pvio.c:250::ma_pvio_read
    contrib/mariadb-connector-c/libmariadb/ma_pvio.c:297::ma_pvio_cache_read
    contrib/mariadb-connector-c/libmariadb/ma_net.c:373::ma_real_read
    contrib/mariadb-connector-c/libmariadb/ma_net.c:427::ma_net_read
    contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:192::ma_net_safe_read
    contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2138::mthd_my_read_query_result
    contrib/mariadb-connector-c/libmariadb/mariadb_lib.c:2212::mysql_real_query
    src/Common/mysqlxx/Query.cpp:56::mysqlxx::Query::executeImpl()
    src/Common/mysqlxx/Query.cpp:73::mysqlxx::Query::use()
    src/Processors/Sources/MySQLSource.cpp:50::DB::MySQLSource::Connection::Connection()

After which the connection will fail with:

    Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)

But, if you will take a look at ma_tls_read() you will see that it has
proper retries for SSL_ERROR_WANT_READ (and EINTR is just a special case
of it), but still, for some reason it fails.

And the reason is the units of the read/write timeout, ma_tls_read()
calls poll(read_timeout) in case of SSL_ERROR_WANT_READ, but, it
incorrectly assume that the timeout is in milliseconds, but that timeout
was in seconds, this bug had been fixed in [2], and now it works like a
charm!

  [2]: ClickHouse/mariadb-connector-c#17

I've verified it with patching openssl library:

    diff --git a/crypto/bio/bss_sock.c b/crypto/bio/bss_sock.c
    index 82f7be85ae..3d2f3926a0 100644
    --- a/crypto/bio/bss_sock.c
    +++ b/crypto/bio/bss_sock.c
    @@ -124,7 +124,24 @@ static int sock_read(BIO *b, char *out, int outl)
                 ret = ktls_read_record(b->num, out, outl);
             else
     # endif
    -            ret = readsocket(b->num, out, outl);
    +        {
    +            static int i = 0;
    +            static int j = 0;
    +            if (!(++j % 5))
    +            {
    +                fprintf(stderr, "sock_read: inject EAGAIN with ret=0\n");
    +                ret = 0;
    +                errno = EAGAIN;
    +            }
    +            else if (!(++i % 3))
    +            {
    +                fprintf(stderr, "sock_read: inject EAGAIN with ret=-1\n");
    +                ret = -1;
    +                errno = EAGAIN;
    +            }
    +            else
    +                ret = readsocket(b->num, out, outl);
    +        }
             BIO_clear_retry_flags(b);
             if (ret <= 0) {
                 if (BIO_sock_should_retry(ret))

And before this patch (well, not the patch itself, but the referenced
patch in mariadb-connector-c) if you will pass read_write_timeout=1 it
will fail:

    SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)
    Code: 1000. DB::Exception: Received from localhost:9000. DB::Exception: mysqlxx::ConnectionLost: Lost connection to MySQL server during query (127.0.0.1:9004). (POCO_EXCEPTION)

But after, it always works:

    $ ch benchmark -c10 -q "SELECT * FROM mysql('127.0.0.1:9004', system, one, 'default', '', SETTINGS connection_pool_size=1, connect_timeout = 100, connection_wait_timeout = 100, read_write_timeout=1)"
    ^CStopping launch of queries. SIGINT received.

    Queries executed: 478.

    localhost:9000, queries: 478, QPS: 120.171, RPS: 120.171, MiB/s: 0.001, result RPS: 120.171, result MiB/s: 0.001.

    0.000%          0.014 sec.
    10.000%         0.058 sec.
    20.000%         0.065 sec.
    30.000%         0.073 sec.
    40.000%         0.079 sec.
    50.000%         0.087 sec.
    60.000%         0.089 sec.
    70.000%         0.091 sec.
    80.000%         0.095 sec.
    90.000%         0.100 sec.
    95.000%         0.102 sec.
    99.000%         0.105 sec.
    99.900%         0.140 sec.
    99.990%         0.140 sec.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@alexey-milovidov alexey-milovidov self-assigned this Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants