tests: fix flakiness of integration tests due to too poor logs checks #87719

Merged
azat merged 15 commits into ClickHouse:master from azat:ci/flakiness-due-to-small-look_behind_lines on Oct 1, 2025

@azat
Member

@azat azat commented Sep 26, 2025

CI found a failure again [1]:

And after #86030 was added, I can see that the logs contain rows starting from:

2025.09.26 16:33:49.268044 [ 965 ] {} <Trace> test_database.postgresql_replica_4 (938f7e82-5e0e-4f88-ad88-a5ac82b289c3): Trying to reserve 1.00 MiB using storage policy from min volume index 0

While the line we were looking for was logged slightly earlier:

2025.09.26 16:33:49.267221 [ 965 ] {BgSchPool::2fe592b6-5cb1-45f2-becc-27ceea4a4e98} <Warning> PostgreSQLReplicaConsumer(postgres_database): Table postgresql_replica_1 is skipped from replication stream because its structure has changes. Please detach this table and reattach to resume the replication (relation id: 16619)

This is a typical problem with log checks in tests. Let's simply increase look_behind_lines, not in the test but globally, to avoid any further flakiness; 100 or 10000 lines is not a big deal for grep -F.
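To illustrate why a small look-behind window is flaky, here is a minimal sketch. It is not the actual test helper: `contains_in_tail` and the sample log lines are hypothetical, and the fixed-string substring match only imitates what `grep -F` does on the tail of a log.

```python
def contains_in_tail(log_lines, needle, look_behind_lines):
    """Return True if `needle` (a fixed string) occurs in the last
    `look_behind_lines` lines of the log."""
    tail = log_lines[-look_behind_lines:] if look_behind_lines > 0 else []
    return any(needle in line for line in tail)


# Hypothetical log: the warning we care about, followed by 150 unrelated lines.
log = ["<Warning> Table postgresql_replica_1 is skipped from replication stream"]
log += [f"<Trace> Trying to reserve 1.00 MiB (iteration {i})" for i in range(150)]

needle = "is skipped from replication stream"

# With a window of 100 lines the warning has already scrolled out of view.
small_window = contains_in_tail(log, needle, 100)
# With a window of 10000 lines it is still covered.
large_window = contains_in_tail(log, needle, 10000)

print(small_window, large_window)
```

With a busy server the number of lines written between the event and the check is not bounded by the test, which is why a larger global default is more robust than tuning each test.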

But after this, the Kafka tests failed, so I decided to fix them as well.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Fixes: #86185

… way)

CI found a failure again [1]:

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=87584&sha=97e95fa30f48eb6414c18ffc59e2c70c38a437d1&name_0=PR&name_1=Integration%20tests%20%28amd_binary%2C%204%2F5%29

And after ClickHouse#86030 was added, I can see that the logs contain rows starting from:

    2025.09.26 16:33:49.268044 [ 965 ] {} <Trace> test_database.postgresql_replica_4 (938f7e82-5e0e-4f88-ad88-a5ac82b289c3): Trying to reserve 1.00 MiB using storage policy from min volume index 0

While the line we were looking for was logged slightly earlier:

    2025.09.26 16:33:49.267221 [ 965 ] {BgSchPool::2fe592b6-5cb1-45f2-becc-27ceea4a4e98} <Warning> PostgreSQLReplicaConsumer(postgres_database): Table postgresql_replica_1 is skipped from replication stream because its structure has changes. Please detach this table and reattach to resume the replication (relation id: 16619)

This is a typical problem with log checks in tests. Let's simply increase
look_behind_lines, not in the test but globally, to avoid any further
flakiness; 100 or 10000 lines is not a big deal for `grep -F`.
@azat azat added the 🍃 green ci 🌿 Fixing flaky tests in CI label Sep 26, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Sep 26, 2025

Workflow [PR], commit [6f64561]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Integration tests (amd_asan, flaky check) | | failure | | |
| | Job Timeout Expired | FAIL | | |

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Sep 26, 2025
@george-larionov george-larionov self-assigned this Sep 26, 2025
Member

@george-larionov george-larionov left a comment

LGTM

@azat azat changed the title from "tests: fix test_postgresql_replica_database_engine flakiness (general way)" to "tests: fix flakiness of integration tests due to too low context for looking pattern in logs" Sep 27, 2025
@azat azat force-pushed the ci/flakiness-due-to-small-look_behind_lines branch from 474b31a to 0db69a7 Compare September 27, 2025 16:54
@azat
Member Author

azat commented Sep 28, 2025

test_cgroup_limit/test.py::test_cgroup_cpu_limit

Status: Downloaded newer image for ubuntu:22.04
docker: Error response from daemon: No such image: ubuntu:22.04

Integration tests (arm_binary, distributed plan, 3/4)

test_merge_tree_s3/test.py::test_merge_canceled_by_s3_errors[node-broken_s3_always_multi_part]

Is flaky as well

@azat
Member Author

azat commented Sep 28, 2025

Kafka tests are very poor: they have instance.wait_for_log_line("kafka.*Stalled") while nothing prevents them from looking into some old logs...
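One common way to avoid matching stale lines is to remember how long the log was before triggering the action and to search only what was appended afterwards. The sketch below shows that idea under stated assumptions: `LogWatcher` and its methods are hypothetical, a Python list stands in for the server log, and this is not necessarily how the tests in this PR were fixed.

```python
import re


class LogWatcher:
    """Hypothetical helper that searches only lines appended after a mark."""

    def __init__(self, lines):
        self._lines = lines  # shared, growing list standing in for a log file
        self._mark = 0

    def mark(self):
        """Remember the current end of the log."""
        self._mark = len(self._lines)

    def seen_since_mark(self, pattern):
        """Return True if `pattern` matches any line written after the mark."""
        regex = re.compile(pattern)
        return any(regex.search(line) for line in self._lines[self._mark:])


# A line left over from an earlier test or an earlier flaky-check iteration.
log = ["rdkafka: Stalled (from a previous iteration)"]

watcher = LogWatcher(log)
watcher.mark()  # taken right before the action under test

stale_match = watcher.seen_since_mark("kafka.*Stalled")

log.append("rdkafka: Stalled (current iteration)")
fresh_match = watcher.seen_since_mark("kafka.*Stalled")

print(stale_match, fresh_match)
```

The mark has to be taken before the action that produces the line, otherwise a fast server could log the line before the test starts looking.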

@azat azat changed the title from "tests: fix flakiness of integration tests due to too low context for looking pattern in logs" to "tests: fix flakiness of integration tests due to too poor logs checks" Sep 29, 2025
@azat
Member Author

azat commented Sep 29, 2025

Kafka tests are very poor: they have instance.wait_for_log_line("kafka.*Stalled") while nothing prevents them from looking into some old logs...

@antaljanosbenjamin FYI, I have to fix all Kafka tests (at least in test_storage_kafka/test_batch_fast.py) to make this work reliably, and maybe this will also fix some of the flakiness in these tests.

@antaljanosbenjamin
Member

while nothing prevents them from looking into some old logs...

In flaky checks there might be repeated log lines, which I think can cause issues in some tests; that's why I was very conservative with the log lines.

But I trust you and the CI.

@azat azat enabled auto-merge September 29, 2025 15:46
@azat
Member Author

azat commented Sep 30, 2025

test_suggestions/test.py::test_suggestions_backwards_compatibility_for_multiple_suggestions_prefix

@azat
Member Author

azat commented Sep 30, 2025

The flaky check will not pass in this PR because I've touched too many Kafka tests, and it is not able to run all the changed tests within the timeout.

@azat
Member Author

azat commented Sep 30, 2025

CI:

azat added a commit to azat/ClickHouse that referenced this pull request Sep 30, 2025
…dentifier generated

In one of the CI runs the generated name consisted entirely of digits [1],
which is not a valid identifier, so the test failed.

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=87888&sha=edc7f7ddbe1ac2910244def7f77cf1858a9e5af7&name_0=PR&name_1=Integration%20tests%20%28amd_asan%2C%20old%20analyzer%2C%202%2F6%29

I've looked through all other usages of `gg choice.*string.digits` and
this is mostly the only one (except for kafka, which I will fix in ClickHouse#87719)
In the flaky check, previous iterations may take more than 200 seconds, and
in this case the DNS cache will be updated in the middle.
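
The all-digits name failure described in the referenced commit can be avoided by forcing the first character to be a letter. The sketch below is only an illustration of that idea: `random_identifier` is a hypothetical helper, and the referenced commit's actual fix is not shown in this conversation.

```python
import random
import string


def random_identifier(length=8, rng=random):
    """Return a random name that starts with a letter, so it can never
    consist of digits only."""
    if length < 1:
        raise ValueError("length must be at least 1")
    first = rng.choice(string.ascii_lowercase)
    rest = "".join(
        rng.choice(string.ascii_lowercase + string.digits)
        for _ in range(length - 1)
    )
    return first + rest


name = random_identifier()
print(name)
```

Drawing every character from letters plus digits makes an all-digits name rare but possible, which is exactly the kind of failure that only shows up occasionally in CI.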
@azat azat added this pull request to the merge queue Oct 1, 2025
Merged via the queue into ClickHouse:master with commit 23bfe37 Oct 1, 2025
120 of 123 checks passed
@azat azat deleted the ci/flakiness-due-to-small-look_behind_lines branch October 1, 2025 05:43
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 1, 2025

Labels

🍃 green ci 🌿 Fixing flaky tests in CI pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Development

Successfully merging this pull request may close these issues.

Test test_postgresql_replica_database_engine/test_0.py::test_table_schema_changes is flaky

4 participants