
Fix rabbitmq test by starting RabbitMQ from scratch every test#78186

Merged
pamarcos merged 2 commits into master from fix-rabbitmq-test-once-again
Mar 26, 2025

@pamarcos
Member

@pamarcos pamarcos commented Mar 24, 2025

Use rabbitmqctl to stop and start instead of killing the docker instance
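A rough sketch of what "stop and start via rabbitmqctl" could look like from a Python test helper, for readers unfamiliar with the tool. This is not the code from this PR: the helper names and the container name `rabbitmq1` are hypothetical, and the choice of `stop_app` / `reset` / `start_app` is only an assumption about how one might bring the broker back to a clean state while the container keeps running.

```python
import subprocess
from typing import Callable, List


def rabbitmqctl_cmd(container: str, *args: str) -> List[str]:
    """Build a `docker exec` command that runs rabbitmqctl inside a container."""
    return ["docker", "exec", container, "rabbitmqctl", *args]


def restart_from_scratch_cmds(container: str) -> List[List[str]]:
    """Commands that stop the RabbitMQ application, wipe its state, and start it again.

    Assumption: `reset` is acceptable here because the tests do not need any
    queues, exchanges, or messages to survive between runs.
    """
    return [
        rabbitmqctl_cmd(container, "stop_app"),   # stop the broker, keep the Erlang VM up
        rabbitmqctl_cmd(container, "reset"),      # drop the node's stored state
        rabbitmqctl_cmd(container, "start_app"),  # start the broker again
    ]


def restart_from_scratch(container: str, run: Callable = subprocess.run) -> None:
    """Run the commands in order, failing fast if any of them exits non-zero."""
    for cmd in restart_from_scratch_cmds(container):
        run(cmd, check=True)
```

Calling `restart_from_scratch("rabbitmq1")` at the start of a test would need a running container with that (hypothetical) name. The `run` parameter exists only so the sequence can be exercised without Docker.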

Closes #71049

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 24, 2025

Workflow [PR], commit [00ffc43]

@clickhouse-gh clickhouse-gh bot added the pr-ci label Mar 24, 2025
@nikitamikhaylov
Member

Almost! Except test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2 is failing.

@pamarcos pamarcos changed the title Fix rabbitmq test by starting from scratch every test Fix rabbitmq test by starting RabbitMQ from scratch every test Mar 25, 2025
@pamarcos
Member Author

> Almost! Except test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2 is failing.

Yep, curious how the flaky check went okay, but the single test execution failed 😏.
I've run all the tests thousands of times on my local dev without issues, for what it's worth.

At least the test where it failed shows something quite clear and not an obscure error due to some weird RabbitMQ server thing. I'll keep investigating 🧐

@nikitamikhaylov
Member

I found something interesting in the logs:

2025-03-24 19:54:32.495582+00:00 [warning] <0.1422.0> memory resource limit alarm set on node rabbit@rabbitmq1.
2025-03-24 19:54:32.495582+00:00 [warning] <0.1422.0>
2025-03-24 19:54:32.495582+00:00 [warning] <0.1422.0> **********************************************************
2025-03-24 19:54:32.495582+00:00 [warning] <0.1422.0> *** Publishers will be blocked until this alarm clears ***
2025-03-24 19:54:32.495582+00:00 [warning] <0.1422.0> **********************************************************
2025-03-24 19:54:32.495582+00:00 [warning] <0.1422.0>
2025-03-24 19:54:48.436350+00:00 [info] <0.2194.0> vm_memory_high_watermark clear. Memory used:503021568 allowed:4000000000
2025-03-24 19:54:48.436560+00:00 [warning] <0.2192.0> memory resource limit alarm cleared on node rabbit@rabbitmq1
2025-03-24 19:54:48.436630+00:00 [warning] <0.2192.0> memory resource limit alarm cleared across the cluster

And also Rabbit was doing nothing for 2 minutes:

2025-03-24 19:55:42.488308+00:00 [debug] <0.2556.0> Will stop virtual host process reconciliation after 12 runs
2025-03-24 19:57:44.332872+00:00 [debug] <0.2622.0> Consistent hashing exchange: removing binding from exchange exchange 'consumer_reconnect_test_consumer_reconnect' in vhost '/' to destination queue '1_test_consumer_reconnect' in vhost '/' with routing key '1'
2025-03-24 19:57:44.333978+00:00 [warning] <0.2436.0> closing AMQP connection <0.2436.0> (172.16.1.5:50766 -> 172.16.1.2:5672, vhost: '/', user: 'root', duration: '2M, 57s'):
2025-03-24 19:57:44.333978+00:00 [warning] <0.2436.0> client unexpectedly closed TCP connection

And these two minutes were exactly the time during which we tried to read the messages from RabbitMQ, and we gave up at the end:

2025-03-24 19:57:43 [ 672 ] DEBUG : Result: 148591 / 150000 (test_failed_connection.py:252, test_rabbitmq_restore_failed_connection_without_losses_2)
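The timings quoted above can be double-checked with a few lines of Python. This is only a sketch that reuses the timestamps and counters copied from the log excerpts in this comment; the variable and function names are made up for illustration.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two RabbitMQ log timestamps (ISO format with UTC offset)."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


# Memory alarm set -> memory alarm cleared.
alarm_seconds = seconds_between(
    "2025-03-24 19:54:32.495582+00:00",
    "2025-03-24 19:54:48.436560+00:00",
)

# Last log line before the silence -> first log line after it.
idle_seconds = seconds_between(
    "2025-03-24 19:55:42.488308+00:00",
    "2025-03-24 19:57:44.332872+00:00",
)

# "Result: 148591 / 150000" from the test log.
received, expected = 148591, 150000
missing = expected - received

print(f"memory alarm active for {alarm_seconds:.1f} s")
print(f"broker log silent for {idle_seconds:.1f} s")
print(f"{missing} of {expected} messages never arrived")
```

With these inputs the alarm lasted about 16 seconds, the silent window is a little over two minutes, and 1409 messages are unaccounted for.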

@nikitamikhaylov
Member

Also:

rabbitmq1-1  | 2025-03-24 19:53:53.942113+00:00 [info] <0.291.0> Memory high watermark set to 3814 MiB (4000000000 bytes) of 63258 MiB (66330923008 bytes) total
rabbitmq1-1  | 2025-03-24 19:53:53.944444+00:00 [info] <0.293.0> Enabling free disk space monitoring (disk free space: 108880531456, total memory: 66330923008)
rabbitmq1-1  | 2025-03-24 19:53:53.944541+00:00 [info] <0.293.0> Disk free limit set to 50MB

Do we really have 64Gb RAM in the RabbitMQ container? Let's use more of that then.
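To put the figures from the startup log in proportion, here is a small sketch that parses the watermark line and compares it with the total memory the node reports. The log text is copied from the excerpts in this thread; the regular expression and variable names are illustrative assumptions, not part of RabbitMQ or the test suite.

```python
import re

# Message part of the "Memory high watermark" line quoted above.
LOG_LINE = (
    "Memory high watermark set to 3814 MiB (4000000000 bytes) "
    "of 63258 MiB (66330923008 bytes) total"
)

# Pull out the two byte counts: the watermark and the total memory seen by the node.
match = re.search(r"\((\d+) bytes\) of \d+ MiB \((\d+) bytes\) total", LOG_LINE)
watermark_bytes, total_bytes = (int(group) for group in match.groups())

MIB = 1024 * 1024
watermark_mib = watermark_bytes // MIB  # whole MiB, matching the values in the log
total_mib = total_bytes // MIB
watermark_share = watermark_bytes / total_bytes

# "Memory used:503021568 allowed:4000000000" from the alarm-clear line.
used_at_clear = 503021568
used_share_of_watermark = used_at_clear / watermark_bytes

print(f"watermark: {watermark_mib} MiB of {total_mib} MiB ({watermark_share:.1%})")
print(f"usage when the alarm cleared: {used_share_of_watermark:.1%} of the watermark")
```

In other words, the configured watermark is roughly 6% of the memory the node believes it has, and usage had dropped to about an eighth of the watermark by the time the alarm cleared.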

@nikitamikhaylov nikitamikhaylov self-assigned this Mar 25, 2025
pamarcos and others added 2 commits March 26, 2025 01:56
Use rabbitmqctl to stop and start instead of killing the docker instance
@nikitamikhaylov nikitamikhaylov force-pushed the fix-rabbitmq-test-once-again branch from 68d2057 to 00ffc43 Compare March 26, 2025 00:56
@pamarcos pamarcos marked this pull request as ready for review March 26, 2025 08:24
@pamarcos
Member Author

pamarcos commented Mar 26, 2025

> Do we really have 64Gb RAM in the RabbitMQ container?

Well, not exactly. My understanding is that we set 64GB for the outer Docker runner that orchestrates everything. Then we run the tests, along with the rest of the Docker containers such as RabbitMQ (DoD, or Docker on Docker), within those limits.

Thanks @nikitamikhaylov 🙏
I already increased the memory used by RabbitMQ from 2GB to 4GB in the prior PR. I was checking what could have changed, because there is a clear increase in the number of times this test has failed. Before, it didn't fail that much even with 2GB 🤔

Anyhoo, let's merge this and I'll keep monitoring it

@pamarcos pamarcos enabled auto-merge March 26, 2025 08:38
@pamarcos pamarcos added this pull request to the merge queue Mar 26, 2025
Merged via the queue into master with commit 7cd5024 Mar 26, 2025
119 checks passed
@pamarcos pamarcos deleted the fix-rabbitmq-test-once-again branch March 26, 2025 08:47
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 26, 2025
@pamarcos
Member Author

Still happening for test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2 😭

https://s3.amazonaws.com/clickhouse-test-reports/REFs/master/8e682a936336fd64055217548b60fcfacbac588e//integration_tests_release_3_4/integration_run_test_storage_rabbitmq_test_failed_connection_py_0.log

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Development

Successfully merging this pull request may close these issues.

Flaky integration test test_storage_rabbitmq

3 participants