Skip to content

Fix possible data-race StorageKafka with statistics_interval_ms>0#66311

Merged
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:rdkafka-fix-data-race
Aug 8, 2024
Merged

Fix possible data-race StorageKafka with statistics_interval_ms>0#66311
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:rdkafka-fix-data-race

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jul 10, 2024

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible data-race StorageKafka with statistics_interval_ms>0

The problem here is that ignorelist did not work by some reason, if I will look at the ignored functions it should not contain any TSan interseption code, while it does:

$ lldb-13 clickhouse
(lldb) target create "clickhouse"
disas -n rd_avg_rollover
Current executable set to '/home/azat/ch/tmp/tsan-test/clickhouse' (x86_64).
(lldb) disas -n rd_avg_rollover
clickhouse`rd_kafka_stats_emit_avg:
clickhouse[0x1cbf84a7] <+39>:  leaq   0x30(%r15), %r12
clickhouse[0x1cbf84ab] <+43>:  movq   %r12, %rdi
clickhouse[0x1cbf84ae] <+46>:  callq  0x1ccdad40                ; rdk_thread_mutex_lock at tinycthread.c:111
clickhouse[0x1cbf84b3] <+51>:  leaq   0x58(%r15), %rdi
clickhouse[0x1cbf84b7] <+55>:  callq  0x71b5390                 ; __tsan_read4
clickhouse[0x1cbf84bc] <+60>:  cmpl   $0x0, 0x58(%r15)
clickhouse[0x1cbf84c1] <+65>:  je     0x1cbf8595                ; <+277> [inlined] rd_avg_rollover + 238 at rdavg.h
clickhouse[0x1cbf84c7] <+71>:  leaq   -0xc8(%rbp), %rdi
clickhouse[0x1cbf84ce] <+78>:  xorl   %esi, %esi
clickhouse[0x1cbf84d0] <+80>:  callq  0x1ccdac80                ; rdk_thread_mutex_init at tinycthread.c:62
clickhouse[0x1cbf84d5] <+85>:  leaq   0x5c(%r15), %rdi
clickhouse[0x1cbf84d9] <+89>:  callq  0x71b5390                 ; __tsan_read4

(lldb) disas -n rd_avg_calc
clickhouse`rd_kafka_broker_ops_io_serve:
clickhouse[0x1cbdf086] <+1990>: leaq   0x5a4(%rbx), %rdi
clickhouse[0x1cbdf08d] <+1997>: callq  0x71b5390                 ; __tsan_read4
clickhouse[0x1cbdf092] <+2002>: cmpl   $0x0, 0x5a4(%rbx)
clickhouse[0x1cbdf099] <+2009>: je     0x1cbdf12b                ; <+2155> [inlined] rd_kafka_broker_timeout_scan + 719 at rdkafka_broker.c

I guess the reason is that they had been inlined

So now rd_avg_calc() guarded with a mutex.

Refs: ClickHouse/librdkafka#11
Fixes: #60939
Follow-up for: #50999
Cc: @antaljanosbenjamin

@robot-clickhouse robot-clickhouse added pr-not-for-changelog This PR should not be mentioned in the changelog submodule changed At least one submodule changed in this PR. labels Jul 10, 2024
@robot-clickhouse-ci-2
Copy link
Copy Markdown
Contributor

robot-clickhouse-ci-2 commented Jul 10, 2024

This is an automated comment for commit 301ac5d with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Successful checks
Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@azat azat force-pushed the rdkafka-fix-data-race branch from aac1db2 to 076d2bb Compare July 10, 2024 14:30
@azat azat marked this pull request as ready for review July 10, 2024 14:31
@antaljanosbenjamin antaljanosbenjamin self-assigned this Jul 10, 2024
@alexey-milovidov
Copy link
Copy Markdown
Member

@azat, we have fixed a bunch of tests, could you please merge with master?

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 14, 2024

Stateless tests (debug) [2/2] — Timeout 10800 exceeded
Stateless tests (debug, s3 storage) [2/2] — Timeout 10800 exceeded

Likely part of the problem is gdb:

warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)
warning: (Internal error: pc 0x1 in read in psymtab, but not in symtab.)
warning: (Internal error: pc 0x2 in read in psymtab, but not in symtab.)
warning: (Internal error: pc 0x37e in read in psymtab, but not in symtab.)
warning: (Internal error: pc 0x37e in read in psymtab, but not in symtab.)

And another part is CI wrappers - #66036 (comment)

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 14, 2024

Performance Comparison (release) [4/4] — 2 errors, 1 slower, 2 unstable

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 14, 2024

Stateless tests (ubsan) — Tests are not finished, fail: 0, passed: 6552, skipped: 54

"Fixed" in ba176a9

@azat azat force-pushed the rdkafka-fix-data-race branch from 076d2bb to d8865a6 Compare July 14, 2024 06:17
@alexey-milovidov
Copy link
Copy Markdown
Member

@azat, there is MSan failure in GRPC that has to be fixed.

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 22, 2024

It has been fixed in #66509

@alexey-milovidov
Copy link
Copy Markdown
Member

@azat, merge with the master branch to incorporate these fixes into this PR.

@azat azat force-pushed the rdkafka-fix-data-race branch from d8865a6 to 4adca97 Compare July 22, 2024 19:43
@azat azat force-pushed the rdkafka-fix-data-race branch from 4adca97 to 316dd79 Compare August 3, 2024 21:54
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 4, 2024

@alexey-milovidov
Copy link
Copy Markdown
Member

Unfortunately, it blocks merge.

@alexey-milovidov
Copy link
Copy Markdown
Member

Thank you for the fix, we will try to merge this dependent PR asap.

@azat azat force-pushed the rdkafka-fix-data-race branch from 316dd79 to c2e8f3a Compare August 5, 2024 18:25
@alexey-milovidov
Copy link
Copy Markdown
Member

@azat There was a mistake in the CI infrastructure related to retries.
Could you please merge with master to make this PR green?

@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 7, 2024

That is likely fixed in #67738. OK. Rebased.

@azat azat force-pushed the rdkafka-fix-data-race branch from c2e8f3a to 1e37ae2 Compare August 7, 2024 21:08
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 8, 2024

freebsd build got broken - #68014

The problem here is that ignorelist did not work by some reason, if I
will look at the ignored functions it should not contain any TSan
interseption code, while it does:

    $ lldb-13 clickhouse
    (lldb) target create "clickhouse"
    disas -n rd_avg_rollover
    Current executable set to '/home/azat/ch/tmp/tsan-test/clickhouse' (x86_64).
    (lldb) disas -n rd_avg_rollover
    clickhouse`rd_kafka_stats_emit_avg:
    clickhouse[0x1cbf84a7] <+39>:  leaq   0x30(%r15), %r12
    clickhouse[0x1cbf84ab] <+43>:  movq   %r12, %rdi
    clickhouse[0x1cbf84ae] <+46>:  callq  0x1ccdad40                ; rdk_thread_mutex_lock at tinycthread.c:111
    clickhouse[0x1cbf84b3] <+51>:  leaq   0x58(%r15), %rdi
    clickhouse[0x1cbf84b7] <+55>:  callq  0x71b5390                 ; __tsan_read4
    clickhouse[0x1cbf84bc] <+60>:  cmpl   $0x0, 0x58(%r15)
    clickhouse[0x1cbf84c1] <+65>:  je     0x1cbf8595                ; <+277> [inlined] rd_avg_rollover + 238 at rdavg.h
    clickhouse[0x1cbf84c7] <+71>:  leaq   -0xc8(%rbp), %rdi
    clickhouse[0x1cbf84ce] <+78>:  xorl   %esi, %esi
    clickhouse[0x1cbf84d0] <+80>:  callq  0x1ccdac80                ; rdk_thread_mutex_init at tinycthread.c:62
    clickhouse[0x1cbf84d5] <+85>:  leaq   0x5c(%r15), %rdi
    clickhouse[0x1cbf84d9] <+89>:  callq  0x71b5390                 ; __tsan_read4

    (lldb) disas -n rd_avg_calc
    clickhouse`rd_kafka_broker_ops_io_serve:
    clickhouse[0x1cbdf086] <+1990>: leaq   0x5a4(%rbx), %rdi
    clickhouse[0x1cbdf08d] <+1997>: callq  0x71b5390                 ; __tsan_read4
    clickhouse[0x1cbdf092] <+2002>: cmpl   $0x0, 0x5a4(%rbx)
    clickhouse[0x1cbdf099] <+2009>: je     0x1cbdf12b                ; <+2155> [inlined] rd_kafka_broker_timeout_scan + 719 at rdkafka_broker.c

I guess the reason is that they had been inlined

So now rd_avg_calc() guarded with a mutex.

Refs: ClickHouse/librdkafka#11
Fixes: ClickHouse#60939
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@azat azat force-pushed the rdkafka-fix-data-race branch from 1e37ae2 to 301ac5d Compare August 8, 2024 05:43
@antaljanosbenjamin
Copy link
Copy Markdown
Member

antaljanosbenjamin commented Aug 8, 2024

00974_query_profiler is not connected to this PR in my opinion, because

  • it has failed twice already on master and in another PR.
  • I doubt this PR has anything to do with how CPU profilers work.
    I tried to to check the test, but I couldn't figure out what could cause the issue.

I will manually set the mergeable check and merge this PR.

PS: here are the two other failures in the past 10 days.

@alexey-milovidov alexey-milovidov merged commit 2e6ecf6 into ClickHouse:master Aug 8, 2024
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 8, 2024
@azat azat deleted the rdkafka-fix-data-race branch August 10, 2024 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TSAN race in librdkafka: rd_avg_calc vs rd_avg_rollover

6 participants