Tune test_storage_rabbitmq flaky test #75656
Conversation
All callers of kill_rabbitmq are already calling revive_rabbitmq, so it's redundant to call that directly (without any sleep).
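The pattern under discussion can be sketched as a helper that always revives the broker after killing it, with an optional settle delay. This is a minimal illustrative sketch: `kill_rabbitmq`/`revive_rabbitmq` here are stand-ins for the real integration-test helpers, not the PR's actual code.

```python
import time
from contextlib import contextmanager

# Stand-ins for the real test helpers; in the actual suite these
# stop/start the RabbitMQ container.
def kill_rabbitmq(instance):
    instance["alive"] = False

def revive_rabbitmq(instance):
    instance["alive"] = True

@contextmanager
def rabbitmq_killed(instance, settle_seconds=0.0):
    """Kill RabbitMQ for the duration of the block, always reviving
    it afterwards, optionally sleeping so the broker can settle."""
    kill_rabbitmq(instance)
    try:
        yield instance
    finally:
        revive_rabbitmq(instance)
        time.sleep(settle_seconds)

instance = {"alive": True}
with rabbitmq_killed(instance):
    assert not instance["alive"]   # broker is down inside the block
assert instance["alive"]           # always revived afterwards
```

Wrapping kill/revive in a single context manager makes it impossible for a test to forget the revive step, which is what makes a bare `kill_rabbitmq` call redundant and risky.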
This is an attempt to address the rabbitmq flaky test issue.
Force-pushed from 60e8c35 to d8e7b45
This is an automated comment for commit d8e7b45 with a description of existing statuses, updated for the latest CI run. ❌
azat left a comment
I guess you are trying to fix the following (test got killed) - #71049 (comment)
But it is unclear how test groups work, so it is only a matter of time before it fails again.
How about adding this test into tests/integration/parallel_skip.json?
Also, I'm pretty sure that RabbitMQ can be tuned to be less memory-greedy (if it is RabbitMQ after all). It is actually already configured to use no more than 2GiB, and with 64GiB total for ClickHouse it should be OK (of course it is unpredictable with parallel tests, though I'm pretty sure other ClickHouse servers can eat ~2GiB each, yet nothing gets killed).
So, TL;DR: let's forbid parallel runs for this test.
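The suggestion would amount to adding entries along these lines to tests/integration/parallel_skip.json. The entry format is assumed from how that file lists tests; the exact test names affected by this PR may differ.

```json
[
  "test_storage_rabbitmq/test.py::test_attach_broken_table",
  "test_storage_rabbitmq/test.py::test_rabbitmq_nack_failed_insert"
]
```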
If we block connectivity via the partition manager, it is usually good not to run in parallel. We just do not need to have too many non-parallel tests.
I don't know what the culprit of the issue is, but memory does not seem to be to blame.
Yep, I'm not sure how batches are created, but AFAIK they're not deterministically split. I also agree that increasing the number of batches does not deterministically guarantee the tests will run with more resources, so I think it's worth trying to run these tests by themselves. Thanks for the tip.
Where do you see that 2GiB limit? EDIT: Answering myself:
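For reference, an absolute broker memory limit like the 2GiB mentioned above is typically set in rabbitmq.conf via the memory high watermark. This is a generic RabbitMQ example, not necessarily the exact configuration the test suite uses.

```ini
# rabbitmq.conf — cap the broker's memory at an absolute value;
# RabbitMQ blocks publishers once the watermark is reached.
vm_memory_high_watermark.absolute = 2GB
```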
Run them sequentially to avoid other parallel tests messing with them.
It's more reliable according to the documentation (https://pypi.org/project/pytest-timeout/), at the expense of presenting worse stack traces. I've tested it, and the stack is good enough to pinpoint where it got stuck.
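The timeout method being discussed can be selected session-wide in pytest configuration using pytest-timeout's standard options; the timeout value here is illustrative.

```ini
# pytest.ini — use the thread-based timeout method instead of the
# default SIGALRM one: more reliable, but worse stack traces.
[pytest]
timeout = 300
timeout_method = thread
```

A per-test override is also possible with `@pytest.mark.timeout(60, method="thread")`.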
This one calls rabbitmq_debuginfo, which provides extra information about the status of RabbitMQ's Docker instance.
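A helper like rabbitmq_debuginfo typically boils down to running diagnostic commands inside the broker container via `docker exec`. A minimal sketch follows — the command list and helper shape are illustrative, not the PR's actual implementation; the real helper would execute these commands and log their output.

```python
def rabbitmq_debuginfo(container_name):
    """Build the `docker exec` command lines a debug helper would run
    to dump extra status from the RabbitMQ container."""
    diagnostics = [
        "rabbitmqctl status",
        "rabbitmqctl list_queues",
        "rabbitmqctl list_connections",
    ]
    return [
        ["docker", "exec", container_name, *cmd.split()]
        for cmd in diagnostics
    ]

commands = rabbitmq_debuginfo("rabbitmq1")
# Each entry looks like:
# ['docker', 'exec', 'rabbitmq1', 'rabbitmqctl', 'status']
```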
The CI found a known flaky test, a new LOGICAL_ERROR, and 2 new flaky tests that I've created issues for:
As far as the rabbitmq tests are concerned, everything is green. I'm not 100% sure this will fix them for good, but it's definitely much better than it was. Please @azat take another look whenever you can.
tests/integration/test_storage_rabbitmq_failed_connection/configs/users.xml (outdated, resolved)
azat left a comment
LGTM
Though I would update the description to something like "Tune test_storage_rabbitmq", since it is unclear whether it will fix the test or not.
Interesting, why it is
Force-pushed from 906348a to 585bb54
I've reverted the changes that used thread as timeout_method, because getting that to work was not as quick as I thought. I'll address it in a separate PR so that we can merge this one ASAP and get rid of the disturbing flaky tests.
Tune test_storage_rabbitmq to try to fix #71049:
Fixed test_attach_broken_table and test_rabbitmq_nack_failed_insert to be able to run them multiple times. The latter now properly restores the original configuration.
Close #71049
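The "restores the original configuration" part can be sketched as a try/finally (or pytest yield-fixture) pattern: snapshot the settings, apply the test's overrides, and always restore the snapshot so the test is repeatable. This is a generic illustration with a dummy settings dict, not the PR's actual code.

```python
import copy

def run_with_temporary_settings(settings, overrides, body):
    """Apply overrides, run the test body, and always restore the
    original settings so the test can run multiple times."""
    original = copy.deepcopy(settings)
    settings.update(overrides)
    try:
        return body(settings)
    finally:
        settings.clear()
        settings.update(original)

settings = {"retries": 1}
run_with_temporary_settings(settings, {"retries": 5}, lambda s: s["retries"])
assert settings == {"retries": 1}  # original configuration restored
```

The same idea as a pytest fixture: `yield` after applying the overrides, restore in the code after the `yield`.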
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
...
Documentation entry for user-facing changes
CI Settings (Only check the boxes if you know what you are doing)
All builds in Builds_1 and Builds_2 stages are always mandatory and will run independently of the checks below: