
Tune test_storage_rabbitmq flaky test#75656

Merged
pamarcos merged 21 commits into master from fix-rabbitmq-flaky-test
Feb 24, 2025

Conversation

@pamarcos
Member

@pamarcos pamarcos commented Feb 6, 2025

Tune test_storage_rabbitmq to try to fix #71049:

  • Increase max memory usage for RabbitMQ from 2GB to 4GB
  • Split the problematic tests into a separate test that runs by itself without any other test in parallel
  • Fix test_attach_broken_table and test_rabbitmq_nack_failed_insert so they can be run multiple times. The latter now properly restores the original configuration

Close #71049
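The fix for re-runnability described above usually comes down to restoring the original configuration in a `finally` block, so a failing assertion can't leave state behind for the next run. A minimal, self-contained sketch of that pattern (the config store and helpers below are hypothetical stand-ins, not the actual test_storage_rabbitmq code):

```python
# Sketch: make a test idempotent by always restoring the original
# configuration, even when the test body raises. The helpers below are
# illustrative stand-ins, not the repository's real functions.

_config = {"retries": 3}  # pretend broker/table configuration

def get_config():
    return dict(_config)

def set_config(new):
    _config.clear()
    _config.update(new)

def run_test_once():
    original = get_config()          # snapshot the current settings
    try:
        set_config({"retries": 0})   # mutate config for the failure scenario
        # ... exercise the insert-and-nack path here ...
    finally:
        set_config(original)         # always restore, so a rerun starts clean
```

Because the restore lives in `finally`, calling `run_test_once` any number of times leaves the configuration unchanged.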

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing)

All builds in Builds_1 and Builds_2 stages are always mandatory and will run independently of the checks below:

  • Only: Stateless tests
  • Only: Integration tests
  • Only: Performance tests

  • Skip: Style check
  • Skip: Fast test

  • Run all checks ignoring all possible failures (Resource-intensive. All test jobs execute in parallel).
  • Disable CI cache

All callers of kill_rabbitmq are already calling revive_rabbitmq,
so it's redundant to call that directly (without any sleep).
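One way to enforce the kill/revive pairing the commit message describes is a context manager, so every kill is matched by a revive even if the test body raises. A toy sketch (the two functions here only record events for illustration; they are not the repository's helpers):

```python
from contextlib import contextmanager

events = []  # records the kill/revive order for illustration

def kill_rabbitmq():
    events.append("kill")     # real helper would stop the broker container

def revive_rabbitmq():
    events.append("revive")   # real helper would start it again

@contextmanager
def rabbitmq_down():
    kill_rabbitmq()
    try:
        yield
    finally:
        revive_rabbitmq()     # always runs, even if the block raises

with rabbitmq_down():
    pass  # test body that needs the broker to be down
```

With this shape, forgetting the revive call becomes impossible by construction.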
This is an attempt to address the rabbitmq flaky test issue.
@pamarcos pamarcos force-pushed the fix-rabbitmq-flaky-test branch from 60e8c35 to d8e7b45 on February 6, 2025 09:52
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented Feb 6, 2025

This is an automated comment for commit d8e7b45 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name | Description | Status
Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ❌ failure

Successful checks
Check name | Description | Status
Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success

@azat azat self-assigned this Feb 6, 2025
Member

@azat azat left a comment


I guess you are trying to fix the following (test got killed) - #71049 (comment)

But it is unclear how test groups work, so it is only a matter of time before it fails again

How about adding this test into tests/integration/parallel_skip.json?
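For reference, `tests/integration/parallel_skip.json` holds the identifiers of tests the runner excludes from parallel execution. An entry for the problematic tests might look roughly like this (illustrative, not the file's exact contents):

```json
[
  "test_storage_rabbitmq/test.py::test_attach_broken_table",
  "test_storage_rabbitmq/test.py::test_rabbitmq_nack_failed_insert"
]
```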

And also I'm pretty sure that RabbitMQ can be tuned somehow to make it less memory-greedy (if it is RabbitMQ after all). It is actually already configured to use no more than 2GiB, and with 64GiB total for ClickHouse, it should be OK. Of course it is unpredictable with parallel tests, though I'm pretty sure other ClickHouse servers can eat ~2GiB without anything being killed

So, TL;DR: let's forbid parallel runs for this test

@qoega
Member

qoega commented Feb 6, 2025

If we block connectivity via the partition manager, it is usually good to not run in parallel. We just don't want to have too many non-parallel tests
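The partition-manager pattern mentioned above is typically a context manager that cuts connectivity on entry and restores it on exit. A self-contained toy version of the shape (not the real integration-test helper, which would install and remove firewall rules):

```python
class PartitionManager:
    """Toy stand-in: tracks whether the simulated network partition is active."""

    def __init__(self):
        self.partitioned = False

    def __enter__(self):
        self.partitioned = True    # real helper would install iptables rules
        return self

    def __exit__(self, exc_type, exc, tb):
        self.partitioned = False   # rules are removed even if the test fails
        return False               # never swallow exceptions from the test body
```

The exit hook is what makes this safe to combine with parallel or flaky tests: connectivity is restored no matter how the block terminates.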

@pamarcos
Member Author

pamarcos commented Feb 7, 2025

I don't know what the culprit of the issue is, but memory does not seem to be to blame.

But it is unclear how tests groups works, so it is only a matter of time when it will fail again

Yep, I'm not sure how batches are created, but AFAIK they're not split deterministically. I also agree that increasing the number of batches does not guarantee the tests will run with more resources, so I think it's worth trying to run these tests by themselves. Thanks for the tip.

and it is actually already configured to use not more then 2GiB

Where do you see that 2GiB limit?

EDIT: Answering myself:

vm_memory_high_watermark.absolute = 2GB
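That setting lives in the broker's `rabbitmq.conf`. Raising it to the 4GB mentioned in the PR description would look like this (a config sketch, assuming the absolute-watermark form stays in use):

```ini
# rabbitmq.conf — absolute memory watermark: once RabbitMQ's memory use
# crosses this threshold, it raises a memory alarm and blocks publishers.
vm_memory_high_watermark.absolute = 4GB
```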

@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 7, 2025

Workflow [PR], commit [585bb54]

@azat azat added the 🍃 green ci 🌿 Fixing flaky tests in CI label Feb 7, 2025
@pamarcos pamarcos marked this pull request as ready for review February 20, 2025 09:04
@pamarcos
Member Author

pamarcos commented Feb 20, 2025

The CI found a known flaky test, a new LOGICAL_ERROR, and 2 new flaky tests that I've created issues for:

As for the rabbitmq tests, the report is green. I'm not 100% sure this is going to fix them for good, but it's definitely much better than it was. Please @azat take another look whenever you can.

@pamarcos pamarcos requested a review from azat February 20, 2025 09:06
@pamarcos pamarcos changed the title from Fix rabbitmq flaky test to Fix test_storage_rabbitmq flaky test Feb 20, 2025
@pamarcos pamarcos requested a review from azat February 20, 2025 11:46
Member

@azat azat left a comment


LGTM

Though I would update the title to something like Tune test_storage_rabbitmq, since it is unclear whether it will fix the test or not

@pamarcos pamarcos changed the title from Fix test_storage_rabbitmq flaky test to Tune test_storage_rabbitmq flaky test Feb 20, 2025
@azat
Member

azat commented Feb 20, 2025

Integration tests (asan, old analyzer, 1/6) — Job timeout expired, fail: 100, passed: 443

Interesting: why is it a job timeout, even though e.g. test_storage_azure_blob_storage/test_cluster.py::test_union_all (for which there are no results in the report) completed? https://s3.amazonaws.com/clickhouse-test-reports/PRs/75656/da96286f8f3ef66d0c2beb80bc6ae367d5c82695//integration_tests_asan_old_analyzer_1_6/integration_run_parallel4_0.log

@azat
Member

azat commented Feb 21, 2025

@pamarcos pamarcos force-pushed the fix-rabbitmq-flaky-test branch from 906348a to 585bb54 on February 21, 2025 13:47
@pamarcos
Member Author

I've reverted the changes I made to use thread as timeout_method because getting that to work was not as quick as I thought. I'll address it in a separate PR so that we can merge this one ASAP and get rid of the disturbing flaky tests
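For context, the `timeout_method` being discussed is the pytest-timeout plugin's setting: it supports a signal-based method (SIGALRM, per-test cancellation) and a thread-based one (works where signals don't, but dumps stacks and aborts the whole run on timeout). The reverted change would have been roughly an ini tweak like this (a sketch, assuming pytest-timeout is the plugin in use; the timeout value is illustrative):

```ini
# pytest.ini — hypothetical: switch pytest-timeout to its thread-based
# method instead of the default signal-based one.
[pytest]
timeout = 900
timeout_method = thread
```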

@pamarcos pamarcos added this pull request to the merge queue Feb 24, 2025
Merged via the queue into master with commit 6f6ff76 Feb 24, 2025
124 of 126 checks passed
@pamarcos pamarcos deleted the fix-rabbitmq-flaky-test branch February 24, 2025 15:29
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 24, 2025

Labels

🍃 green ci 🌿 Fixing flaky tests in CI pr-ci pr-synced-to-cloud The PR is synced to the cloud repo


Development

Successfully merging this pull request may close these issues.

Flaky integration test test_storage_rabbitmq

7 participants