
Tune test_storage_rabbitmq flaky test#75656

Merged
pamarcos merged 21 commits into master from fix-rabbitmq-flaky-test
Feb 24, 2025

Conversation

@pamarcos
Member

@pamarcos pamarcos commented Feb 6, 2025

Tune test_storage_rabbitmq to try to fix #71049:

  • Increase max memory usage for RabbitMQ from 2GB to 4GB
  • Split the problematic tests into a separate test that runs by itself without any other test in parallel
  • Fix test_attach_broken_table and test_rabbitmq_nack_failed_insert so they can be run multiple times. The latter now properly restores the original configuration

Close #71049
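The fix for re-runnability described above usually comes down to restoring the original configuration in a `finally` block, so a failing assertion can't leave state behind for the next run. A minimal, self-contained sketch of that pattern (the config store and helpers below are hypothetical stand-ins, not the actual test_storage_rabbitmq code):

```python
# Sketch: make a test idempotent by always restoring the original
# configuration, even when the test body raises. The helpers below are
# illustrative stand-ins, not the repository's real functions.

_config = {"retries": 3}  # pretend broker/table configuration

def get_config():
    return dict(_config)

def set_config(new):
    _config.clear()
    _config.update(new)

def run_test_once():
    original = get_config()          # snapshot the current settings
    try:
        set_config({"retries": 0})   # mutate config for the failure scenario
        # ... exercise the insert-and-nack path here ...
    finally:
        set_config(original)         # always restore, so a rerun starts clean
```

Because the restore lives in `finally`, calling `run_test_once` any number of times leaves the configuration unchanged.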

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing)

All builds in Builds_1 and Builds_2 stages are always mandatory and will run independently of the checks below:

  • Only: Stateless tests
  • Only: Integration tests
  • Only: Performance tests

  • Skip: Style check
  • Skip: Fast test

  • Run all checks ignoring all possible failures (Resource-intensive. All test jobs execute in parallel).
  • Disable CI cache

All callers of kill_rabbitmq are already calling revive_rabbitmq,
so it's redundant to call that directly (without any sleep).
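One way to enforce the kill/revive pairing the commit message describes is a context manager, so every kill is matched by a revive even if the test body raises. A toy sketch (the two functions here only record events for illustration; they are not the repository's helpers):

```python
from contextlib import contextmanager

events = []  # records the kill/revive order for illustration

def kill_rabbitmq():
    events.append("kill")     # real helper would stop the broker container

def revive_rabbitmq():
    events.append("revive")   # real helper would start it again

@contextmanager
def rabbitmq_down():
    kill_rabbitmq()
    try:
        yield
    finally:
        revive_rabbitmq()     # always runs, even if the block raises

with rabbitmq_down():
    pass  # test body that needs the broker to be down
```

With this shape, forgetting the revive call becomes impossible by construction.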
This is an attempt to address the rabbitmq flaky test issue.
@pamarcos pamarcos force-pushed the fix-rabbitmq-flaky-test branch from 60e8c35 to d8e7b45 on February 6, 2025 09:52
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented Feb 6, 2025

This is an automated comment for commit d8e7b45 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name | Description | Status
Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ❌ failure

Successful checks
Check name | Description | Status
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success

@azat azat self-assigned this Feb 6, 2025
Member

@azat azat left a comment


I guess you are trying to fix the following (test got killed) - #71049 (comment)

But it is unclear how test groups work, so it is only a matter of time before it fails again

How about adding this test into tests/integration/parallel_skip.json?
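For reference, `tests/integration/parallel_skip.json` holds the identifiers of tests the runner excludes from parallel execution. An entry for the problematic tests might look roughly like this (illustrative, not the file's exact contents):

```json
[
  "test_storage_rabbitmq/test.py::test_attach_broken_table",
  "test_storage_rabbitmq/test.py::test_rabbitmq_nack_failed_insert"
]
```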

And also I'm pretty sure that RabbitMQ can be tuned somehow to make it less memory-greedy (if it is RabbitMQ after all). It is actually already configured to use no more than 2GiB, and with 64GiB total for ClickHouse, it should be OK. Of course it is unpredictable with parallel tests, though I'm pretty sure other ClickHouse servers can eat ~2GiB without anything being killed

So, TL;DR: let's forbid parallel runs for this test

@qoega
Member

qoega commented Feb 6, 2025

If we block connectivity via the partition manager, it is usually good to not run in parallel. We just don't want to have too many non-parallel tests
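The partition-manager pattern mentioned above is typically a context manager that cuts connectivity on entry and restores it on exit. A self-contained toy version of the shape (not the real integration-test helper, which would install and remove firewall rules):

```python
class PartitionManager:
    """Toy stand-in: tracks whether the simulated network partition is active."""

    def __init__(self):
        self.partitioned = False

    def __enter__(self):
        self.partitioned = True    # real helper would install iptables rules
        return self

    def __exit__(self, exc_type, exc, tb):
        self.partitioned = False   # rules are removed even if the test fails
        return False               # never swallow exceptions from the test body
```

The exit hook is what makes this safe to combine with parallel or flaky tests: connectivity is restored no matter how the block terminates.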

@pamarcos
Member Author

pamarcos commented Feb 7, 2025

I don't know what the culprit of the issue is, but memory does not seem to be to blame.

But it is unclear how tests groups works, so it is only a matter of time when it will fail again

Yep, I'm not sure how batches are created, but AFAIK they're not split deterministically. I also agree that increasing the number of batches does not guarantee the tests will run with more resources, so I think it's worth trying to run these tests by themselves. Thanks for the tip.

and it is actually already configured to use not more then 2GiB

Where do you see that 2GiB limit?

EDIT: Answering myself:

vm_memory_high_watermark.absolute = 2GB
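That setting lives in the broker's `rabbitmq.conf`. Raising it to the 4GB mentioned in the PR description would look like this (a config sketch, assuming the absolute-watermark form stays in use):

```ini
# rabbitmq.conf — absolute memory watermark: once RabbitMQ's memory use
# crosses this threshold, it raises a memory alarm and blocks publishers.
vm_memory_high_watermark.absolute = 4GB
```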

@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 7, 2025

Workflow [PR], commit [585bb54]

@azat azat added the 🍃 green ci 🌿 Fixing flaky tests in CI label Feb 7, 2025
@pamarcos pamarcos marked this pull request as ready for review February 20, 2025 09:04
@pamarcos
Member Author

pamarcos commented Feb 20, 2025

The CI found a known flaky test, a new LOGICAL_ERROR, and 2 new flaky tests that I've created issues for:

As for the rabbitmq tests, the report is green. I'm not 100% sure this is going to fix them for good, but it's definitely much better than it was. Please @azat take another look whenever you can.

@pamarcos pamarcos requested a review from azat February 20, 2025 09:06
@pamarcos pamarcos changed the title from Fix rabbitmq flaky test to Fix test_storage_rabbitmq flaky test Feb 20, 2025
@pamarcos pamarcos requested a review from azat February 20, 2025 11:46
Member

@azat azat left a comment


LGTM

Though I would update the title to something like Tune test_storage_rabbitmq, since it is unclear whether it will fix the test or not

@pamarcos pamarcos changed the title from Fix test_storage_rabbitmq flaky test to Tune test_storage_rabbitmq flaky test Feb 20, 2025
@azat
Member

azat commented Feb 20, 2025

Integration tests (asan, old analyzer, 1/6) — Job timeout expired, fail: 100, passed: 443

Interesting: why is it a job timeout, even though e.g. test_storage_azure_blob_storage/test_cluster.py::test_union_all (for which there are no results in the report) completed? https://s3.amazonaws.com/clickhouse-test-reports/PRs/75656/da96286f8f3ef66d0c2beb80bc6ae367d5c82695//integration_tests_asan_old_analyzer_1_6/integration_run_parallel4_0.log

@azat
Member

azat commented Feb 21, 2025

@pamarcos pamarcos force-pushed the fix-rabbitmq-flaky-test branch from 906348a to 585bb54 on February 21, 2025 13:47
@pamarcos
Member Author

I've reverted the changes I made to use thread as timeout_method because getting that to work was not as quick as I thought. I'll address it in a separate PR so that we can merge this one ASAP and get rid of the disturbing flaky tests
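For context, the `timeout_method` being discussed is the pytest-timeout plugin's setting: it supports a signal-based method (SIGALRM, per-test cancellation) and a thread-based one (works where signals don't, but dumps stacks and aborts the whole run on timeout). The reverted change would have been roughly an ini tweak like this (a sketch, assuming pytest-timeout is the plugin in use; the timeout value is illustrative):

```ini
# pytest.ini — hypothetical: switch pytest-timeout to its thread-based
# method instead of the default signal-based one.
[pytest]
timeout = 900
timeout_method = thread
```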

@pamarcos pamarcos added this pull request to the merge queue Feb 24, 2025
Merged via the queue into master with commit 6f6ff76 Feb 24, 2025
124 of 126 checks passed
@pamarcos pamarcos deleted the fix-rabbitmq-flaky-test branch February 24, 2025 15:29
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 24, 2025

Labels

🍃 green ci 🌿 Fixing flaky tests in CI pr-ci pr-synced-to-cloud The PR is synced to the cloud repo


Development

Successfully merging this pull request may close these issues.

Flaky integration test test_storage_rabbitmq

7 participants