tests: get back test_distributed_queries_stress without flakiness #44537

Closed
azat wants to merge 2 commits into ClickHouse:master from azat:tests/fix-test_distributed_queries_stress

Conversation

Member

@azat azat commented Dec 23, 2022

Sometimes one of the containers got KILLed:

2022-11-20 15:06:43 [ 317 ] DEBUG : run container_id:roottestdistributedqueriesstress_node1_r1_1 detach:False nothrow:False cmd: ['bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:1745, exec_in_container)
2022-11-20 15:06:43 [ 317 ] DEBUG : Command:['docker', 'exec', 'roottestdistributedqueriesstress_node1_r1_1', 'bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:95, run_and_check)
2022-11-20 15:08:48 [ 317 ] DEBUG : Stderr:Loaded 5 queries. (cluster.py:105, run_and_check)
2022-11-20 15:08:48 [ 317 ] DEBUG : Exitcode:137 (cluster.py:107, run_and_check)

(Note: exit code 137 is 128 + 9, where 9 is the signal number of SIGKILL.)
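The exit-code arithmetic above can be checked with a short sketch. It assumes the usual shell convention that a status above 128 means "terminated by signal (status - 128)"; the helper name `describe_exit_code` is hypothetical and not part of the test framework.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status.

    Assumption: statuses above 128 follow the shell convention
    128 + signal number.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On Linux, signal 9 is SIGKILL, which is also the signal the kernel OOM killer sends.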

parallel1_0_dockerd.log:time="2022-11-20T15:08:48.244758252Z" level=debug msg="Revoking external connectivity on endpoint roottestdistributedqueriesstress_node1_r1_1 (82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51)"
parallel1_0_dockerd.log:time="2022-11-20T15:08:48.445809392Z" level=debug msg="82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51 (1c6863e).deleteSvcRecords(roottestdistributedqueriesstress_node1_r1_1, 172.16.8.2, <nil>, true) updateSvcRecord sid:82dfd051d3798
parallel1_0_dockerd.log:time="2022-11-20T15:08:48.526045522Z" level=debug msg="Releasing addresses for endpoint roottestdistributedqueriesstress_node1_r1_1's interface on network roottestdistributedqueriesstress_default"

The problem is likely OOM, which happens only under ASan with lots of threads.
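One way to test the OOM hypothesis, sketched here under assumptions: `docker inspect` reports `State.OOMKilled` and `State.ExitCode` for a container, and those fields can be read from its JSON output. The helper names below are hypothetical. The flag may stay false when the kernel kills a process other than the container's main one, so a false value does not rule out OOM.

```python
import json
import subprocess


def parse_oom_state(inspect_json: str) -> dict:
    """Extract OOM-related fields from `docker inspect <container>` output.

    `docker inspect` prints a JSON array with one object per container.
    """
    state = json.loads(inspect_json)[0]["State"]
    return {
        "oom_killed": bool(state.get("OOMKilled", False)),
        "exit_code": state.get("ExitCode"),
    }


def container_oom_state(container: str) -> dict:
    """Run `docker inspect` for one container and parse the result."""
    result = subprocess.run(
        ["docker", "inspect", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_oom_state(result.stdout)
```

For example, `container_oom_state("roottestdistributedqueriesstress_node1_r1_1")` would have to run on the host that owns the container, before the container is removed.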

Fixes: #41776
Supersedes: #44573

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-not-for-changelog label Dec 23, 2022
@azat azat force-pushed the tests/fix-test_distributed_queries_stress branch from e51887b to 010682d Compare December 23, 2022 18:06
@azat azat marked this pull request as draft December 24, 2022 06:27
@azat azat force-pushed the tests/fix-test_distributed_queries_stress branch from 1cfa1ee to 05ad7da Compare December 26, 2022 20:58
@azat azat marked this pull request as ready for review December 26, 2022 20:58
@azat azat force-pushed the tests/fix-test_distributed_queries_stress branch from 05ad7da to 74ff54b Compare December 27, 2022 09:15
@azat azat force-pushed the tests/fix-test_distributed_queries_stress branch 2 times, most recently from daf99b3 to 5b6392f Compare December 27, 2022 11:35
@azat azat changed the title from "tests: fix test_distributed_queries_stress flakiness (likely due to OOM)" to "tests: get back test_distributed_queries_stress without flakiness" Dec 27, 2022
azat added 2 commits December 27, 2022 15:54
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
Sometimes one of the containers got KILLed:

    2022-11-20 15:06:43 [ 317 ] DEBUG : run container_id:roottestdistributedqueriesstress_node1_r1_1 detach:False nothrow:False cmd: ['bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:1745, exec_in_container)
    2022-11-20 15:06:43 [ 317 ] DEBUG : Command:['docker', 'exec', 'roottestdistributedqueriesstress_node1_r1_1', 'bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:95, run_and_check)
    2022-11-20 15:08:48 [ 317 ] DEBUG : Stderr:Loaded 5 queries. (cluster.py:105, run_and_check)
    2022-11-20 15:08:48 [ 317 ] DEBUG : Exitcode:137 (cluster.py:107, run_and_check)

(Note: exit code 137 is 128 + 9, where 9 is the signal number of SIGKILL.)

    parallel1_0_dockerd.log:time="2022-11-20T15:08:48.244758252Z" level=debug msg="Revoking external connectivity on endpoint roottestdistributedqueriesstress_node1_r1_1 (82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51)"
    parallel1_0_dockerd.log:time="2022-11-20T15:08:48.445809392Z" level=debug msg="82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51 (1c6863e).deleteSvcRecords(roottestdistributedqueriesstress_node1_r1_1, 172.16.8.2, <nil>, true) updateSvcRecord sid:82dfd051d3798
    parallel1_0_dockerd.log:time="2022-11-20T15:08:48.526045522Z" level=debug msg="Releasing addresses for endpoint roottestdistributedqueriesstress_node1_r1_1's interface on network roottestdistributedqueriesstress_default"

The problem is likely OOM, which happens only under ASan with lots
of threads.

v2: tests: decrease concurrency for test_distributed_queries_stress
v3: increase timeout for internal command execution
v4: rebase on top of ClickHouse#44573
Fixes: ClickHouse#41776
Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@azat azat force-pushed the tests/fix-test_distributed_queries_stress branch from 5b6392f to c5a2d3f Compare December 27, 2022 14:54
Member

@alesapin alesapin left a comment

I'm strongly against getting this test back. Integration tests are not designed for any stressful workloads. Your changes in the config for system tables and background pools are absolutely unreliable; we will add new system logs and new pools.

The complexity of this test proves that the integration tests framework is not designed for this.

@azat azat closed this Dec 27, 2022
@azat azat deleted the tests/fix-test_distributed_queries_stress branch July 27, 2023 17:15
Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Development

Successfully merging this pull request may close these issues.

test_distributed_queries_stress is flaky

3 participants